When running all_reduce_perf across 2 nodes / 4 GH200 GPUs per node I run into trouble: nccl-tests tells me the GPUs are busy, although they are not. I am running the following SLURM job:
#SBATCH --ntasks-per-node=4 --gpus-per-node=4 --nodes=2
srun ./build/all_reduce_perf -b 8 -e 128M -f 2
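For context, the workaround pattern I have seen suggested (but have not verified on this system) is to restrict each task to its node-local GPU before the binary starts, e.g. via a small hypothetical wrapper; the only assumptions are the standard CUDA_VISIBLE_DEVICES and SLURM_LOCALID variables:

```shell
# Hypothetical helper (not part of nccl-tests): run a command so that it
# sees only the single GPU whose index equals the task's node-local rank.
# SLURM_LOCALID is set by srun to the task's index within the node (0..3 here).
launch_on_local_gpu() {
  CUDA_VISIBLE_DEVICES="$SLURM_LOCALID" "$@"
}

# Intended use from the batch script (after sourcing this file):
#   srun bash -c '. ./bind_gpu.sh; launch_on_local_gpu ./build/all_reduce_perf -b 8 -e 128M -f 2'
```

With each rank seeing exactly one device, no two processes should ever open a context on the same GPU, which is what Exclusive-process mode forbids.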
The output looks good at the start, but then the program fails:
# nccl-tests version 2.17.8 nccl-headers=22606 nccl-library=22606
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 3617715 on gpu-1-1 device 0 [0009:01:00] NVIDIA GH200 120GB
# Rank 1 Group 0 Pid 3617716 on gpu-1-1 device 1 [0019:01:00] NVIDIA GH200 120GB
# Rank 2 Group 0 Pid 3617717 on gpu-1-1 device 2 [0029:01:00] NVIDIA GH200 120GB
# Rank 3 Group 0 Pid 3617718 on gpu-1-1 device 3 [0039:01:00] NVIDIA GH200 120GB
# Rank 4 Group 0 Pid 3230358 on gpu-1-7 device 0 [0009:01:00] NVIDIA GH200 120GB
# Rank 5 Group 0 Pid 3230359 on gpu-1-7 device 1 [0019:01:00] NVIDIA GH200 120GB
# Rank 6 Group 0 Pid 3230360 on gpu-1-7 device 2 [0029:01:00] NVIDIA GH200 120GB
# Rank 7 Group 0 Pid 3230361 on gpu-1-7 device 3 [0039:01:00] NVIDIA GH200 120GB
gpu-1-1: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
.. gpu-1-1 pid 3617715: Test failure common.cu:1189
gpu-1-7: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
.. gpu-1-7 pid 3230359: Test failure common.cu:1189
gpu-1-7: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
.. gpu-1-7 pid 3230361: Test failure common.cu:1189
gpu-1-7: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
.. gpu-1-7 pid 3230358: Test failure common.cu:1189
gpu-1-7: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
.. gpu-1-7 pid 3230360: Test failure common.cu:1189
gpu-1-1: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
.. gpu-1-1 pid 3617718: Test failure common.cu:1189
Now, my GPUs are configured with Exclusive process, so multiple processes cannot use one GPU. I suspect this is what nccl-tests runs into, because when I change the compute mode to Default:
nvidia-smi -i 0,1,2,3 -c 0
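For anyone reproducing this: the compute mode can be inspected, and later restored, with nvidia-smi (these invocations require the NVIDIA driver, so they only run on the GPU nodes themselves):

```shell
# Show the current compute mode of each GPU.
nvidia-smi --query-gpu=index,compute_mode --format=csv

# Switch back to Exclusive-process mode afterwards
# (-c values: 0 = DEFAULT, 3 = EXCLUSIVE_PROCESS).
nvidia-smi -i 0,1,2,3 -c 3
```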
the test runs through. Inspecting with nvidia-smi shows that, indeed, several processes are using GPU 0:
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3203007 C ...ests-2.17.8/./build/all_reduce_perf 788MiB |
| 0 N/A N/A 3203008 C ...ests-2.17.8/./build/all_reduce_perf 556MiB |
| 0 N/A N/A 3203009 C ...ests-2.17.8/./build/all_reduce_perf 556MiB |
| 0 N/A N/A 3203010 C ...ests-2.17.8/./build/all_reduce_perf 556MiB |
| 1 N/A N/A 3203008 C ...ests-2.17.8/./build/all_reduce_perf 790MiB |
| 2 N/A N/A 3203009 C ...ests-2.17.8/./build/all_reduce_perf 786MiB |
| 3 N/A N/A 3203010 C ...ests-2.17.8/./build/all_reduce_perf 784MiB |
+-----------------------------------------------------------------------------------------+
More precisely, the three processes that should use only GPUs 1, 2, and 3 each additionally hold a context on GPU 0.
Is it possible to run nccl-tests with Exclusive process? Or do I have to consider changing that setting on the system? Also, do you know whether this is only an nccl-tests limitation, or a general NCCL issue?