We are trying to run distributed training with Torchtitan on a GB200 NVL72 cluster, but using more than 10 trays (40 GPUs) fails with an NCCL segmentation fault. Using 10 or fewer trays works fine. We ...
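In case it helps, here is roughly how a failing run can be launched with NCCL debug logging turned on; the head-node name, port, and training entrypoint below are placeholders, not our exact job:

```bash
# Verbose NCCL logging, to capture init/transport state around the segfault
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH

# 11 trays x 4 GPUs = 44 GPUs, i.e. the first failing size.
# Rendezvous endpoint and entrypoint are placeholders for our actual job.
torchrun \
  --nnodes=11 \
  --nproc_per_node=4 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=<head-node>:29500 \
  train.py
```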
Sanity tests:

- Running the GDR perf test (ib_send_bw --use_cuda -d mlx5_0) reaches 393.69 Gb/s.
- Running the nccl-tests all-reduce on a single host reaches up to 479.70 GB/s out-of-place bus bandwidth.

Here are the ...
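As a further isolation step beyond those single-host checks, a multi-node nccl-tests sweep can show whether plain NCCL also crashes past the 10-tray mark; the hostfile, process mapping, and message-size range here are assumptions, not a verified configuration:

```bash
# All-reduce sweep from 8 B to 8 GB across 11 trays (44 GPUs),
# one rank per GPU; hosts.txt lists the 11 tray hostnames.
mpirun -np 44 --hostfile hosts.txt --map-by ppr:4:node \
  -x NCCL_DEBUG=INFO \
  ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
```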