When running all_reduce_perf across 2 nodes / 4 GH200 GPUs per node I run into trouble: nccl-tests tells me the GPUs are busy, although they are not. I am running the following SLURM job:
#SBATCH --ntasks-per-node=4 --gpus-per-node=4 --nodes=2
srun ./build/all_reduce_perf -b 8 -e 128M -f 2
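For context, the workaround pattern I have seen suggested (but have not verified on this system) is to restrict each task to its node-local GPU before the binary starts, e.g. via a small hypothetical wrapper; the only assumptions are the standard CUDA_VISIBLE_DEVICES and SLURM_LOCALID variables:

```shell
# Hypothetical helper (not part of nccl-tests): run a command so that it
# sees only the single GPU whose index equals the task's node-local rank.
# SLURM_LOCALID is set by srun to the task's index within the node (0..3 here).
launch_on_local_gpu() {
  CUDA_VISIBLE_DEVICES="$SLURM_LOCALID" "$@"
}

# Intended use from the batch script (after sourcing this file):
#   srun bash -c '. ./bind_gpu.sh; launch_on_local_gpu ./build/all_reduce_perf -b 8 -e 128M -f 2'
```

With each rank seeing exactly one device, no two processes should ever open a context on the same GPU, which is what Exclusive-process mode forbids.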
The output looks good at the start, but then the program fails:
# nccl-tests version 2.17.8 nccl-headers=22606 nccl-library=22606
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 3617715 on gpu-1-1 device 0 [0009:01:00] NVIDIA GH200 120GB
# Rank 1 Group 0 Pid 3617716 on gpu-1-1 device 1 [0019:01:00] NVIDIA GH200 120GB
# Rank 2 Group 0 Pid 3617717 on gpu-1-1 device 2 [0029:01:00] NVIDIA GH200 120GB
# Rank 3 Group 0 Pid 3617718 on gpu-1-1 device 3 [0039:01:00] NVIDIA GH200 120GB
# Rank 4 Group 0 Pid 3230358 on gpu-1-7 device 0 [0009:01:00] NVIDIA GH200 120GB
# Rank 5 Group 0 Pid 3230359 on gpu-1-7 device 1 [0019:01:00] NVIDIA GH200 120GB
# Rank 6 Group 0 Pid 3230360 on gpu-1-7 device 2 [0029:01:00] NVIDIA GH200 120GB
# Rank 7 Group 0 Pid 3230361 on gpu-1-7 device 3 [0039:01:00] NVIDIA GH200 120GB
gpu-1-1: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
.. gpu-1-1 pid 3617715: Test failure common.cu:1189
gpu-1-7: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
.. gpu-1-7 pid 3230359: Test failure common.cu:1189
gpu-1-7: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
.. gpu-1-7 pid 3230361: Test failure common.cu:1189
gpu-1-7: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
.. gpu-1-7 pid 3230358: Test failure common.cu:1189
gpu-1-7: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
.. gpu-1-7 pid 3230360: Test failure common.cu:1189
gpu-1-1: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
.. gpu-1-1 pid 3617718: Test failure common.cu:1189
Now, my GPUs are configured with Exclusive process, so multiple processes cannot use one GPU. I suspect this is what nccl-tests runs into, because when I change the compute mode to Default:
nvidia-smi -i 0,1,2,3 -c 0
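For anyone reproducing this: the compute mode can be inspected, and later restored, with nvidia-smi (these invocations require the NVIDIA driver, so they only run on the GPU nodes themselves):

```shell
# Show the current compute mode of each GPU.
nvidia-smi --query-gpu=index,compute_mode --format=csv

# Switch back to Exclusive-process mode afterwards
# (-c values: 0 = DEFAULT, 3 = EXCLUSIVE_PROCESS).
nvidia-smi -i 0,1,2,3 -c 3
```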
the test runs through. Inspecting with nvidia-smi shows that, indeed, several processes are using GPU 0:
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3203007 C ...ests-2.17.8/./build/all_reduce_perf 788MiB |
| 0 N/A N/A 3203008 C ...ests-2.17.8/./build/all_reduce_perf 556MiB |
| 0 N/A N/A 3203009 C ...ests-2.17.8/./build/all_reduce_perf 556MiB |
| 0 N/A N/A 3203010 C ...ests-2.17.8/./build/all_reduce_perf 556MiB |
| 1 N/A N/A 3203008 C ...ests-2.17.8/./build/all_reduce_perf 790MiB |
| 2 N/A N/A 3203009 C ...ests-2.17.8/./build/all_reduce_perf 786MiB |
| 3 N/A N/A 3203010 C ...ests-2.17.8/./build/all_reduce_perf 784MiB |
+-----------------------------------------------------------------------------------------+
More precisely, the three processes that should use only GPUs 1, 2, and 3 each additionally hold a context on GPU 0.
Is it possible to run nccl-tests with Exclusive process? Or do I have to consider changing that setting on the system? Also, do you know whether this is only an nccl-tests limitation, or a general NCCL issue?