running multi-node with Exclusive Process #369

@angainor

When running all_reduce_perf across 2 nodes with 4 GH200 GPUs per node, I run into trouble: nccl-tests reports that the GPUs are busy, although they are not. I use the following SLURM job:

#SBATCH --ntasks-per-node=4 --gpus-per-node=4 --nodes=2
srun ./build/all_reduce_perf -b 8 -e 128M -f 2

The output looks fine at first, but then the program fails:

# nccl-tests version 2.17.8 nccl-headers=22606 nccl-library=22606
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 3617715 on    gpu-1-1 device  0 [0009:01:00] NVIDIA GH200 120GB
#  Rank  1 Group  0 Pid 3617716 on    gpu-1-1 device  1 [0019:01:00] NVIDIA GH200 120GB
#  Rank  2 Group  0 Pid 3617717 on    gpu-1-1 device  2 [0029:01:00] NVIDIA GH200 120GB
#  Rank  3 Group  0 Pid 3617718 on    gpu-1-1 device  3 [0039:01:00] NVIDIA GH200 120GB
#  Rank  4 Group  0 Pid 3230358 on    gpu-1-7 device  0 [0009:01:00] NVIDIA GH200 120GB
#  Rank  5 Group  0 Pid 3230359 on    gpu-1-7 device  1 [0019:01:00] NVIDIA GH200 120GB
#  Rank  6 Group  0 Pid 3230360 on    gpu-1-7 device  2 [0029:01:00] NVIDIA GH200 120GB
#  Rank  7 Group  0 Pid 3230361 on    gpu-1-7 device  3 [0039:01:00] NVIDIA GH200 120GB
gpu-1-1: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
 .. gpu-1-1 pid 3617715: Test failure common.cu:1189
gpu-1-7: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
 .. gpu-1-7 pid 3230359: Test failure common.cu:1189
gpu-1-7: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
 .. gpu-1-7 pid 3230361: Test failure common.cu:1189
gpu-1-7: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
 .. gpu-1-7 pid 3230358: Test failure common.cu:1189
gpu-1-7: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
 .. gpu-1-7 pid 3230360: Test failure common.cu:1189
gpu-1-1: Test CUDA failure common.cu:1304 'CUDA-capable device(s) is/are busy or unavailable'
 .. gpu-1-1 pid 3617718: Test failure common.cu:1189

My GPUs are configured in Exclusive Process compute mode, so only one process at a time can use a given GPU. I suspect this is what nccl-tests runs into, because when I switch to Default mode:

nvidia-smi -i 0,1,2,3 -c 0

the test runs through. Inspecting with nvidia-smi confirms that several processes are indeed using GPU 0:

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   3203007      C   ...ests-2.17.8/./build/all_reduce_perf        788MiB |
|    0   N/A  N/A   3203008      C   ...ests-2.17.8/./build/all_reduce_perf        556MiB |
|    0   N/A  N/A   3203009      C   ...ests-2.17.8/./build/all_reduce_perf        556MiB |
|    0   N/A  N/A   3203010      C   ...ests-2.17.8/./build/all_reduce_perf        556MiB |
|    1   N/A  N/A   3203008      C   ...ests-2.17.8/./build/all_reduce_perf        790MiB |
|    2   N/A  N/A   3203009      C   ...ests-2.17.8/./build/all_reduce_perf        786MiB |
|    3   N/A  N/A   3203010      C   ...ests-2.17.8/./build/all_reduce_perf        784MiB |
+-----------------------------------------------------------------------------------------+

More precisely, each of the 3 processes that should use only GPU 1, 2, or 3 also shows up as a process on GPU 0.
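
The same observation can be made without reading the table by eye. Below is a minimal sketch that counts compute processes per GPU and prints the GPUs hosting more than one. It assumes input lines of the form `gpu_uuid, pid`, which is what `nvidia-smi --query-compute-apps=gpu_uuid,pid --format=csv,noheader` should produce; the function names `count_procs_per_gpu` and `shared_gpus` are made up for this illustration.

```shell
#!/bin/sh
# Count compute processes per GPU from "gpu_uuid, pid" CSV lines on stdin.
# Prints "<gpu_uuid> <count>" for each GPU, sorted by UUID.
count_procs_per_gpu() {
    awk -F', *' 'NF >= 2 { n[$1]++ } END { for (g in n) print g, n[g] }' | sort
}

# Print only the GPUs that host more than one compute process.
shared_gpus() {
    count_procs_per_gpu | awk '$2 > 1 { print $1 }'
}

# Usage on a compute node (requires nvidia-smi):
#   nvidia-smi --query-compute-apps=gpu_uuid,pid --format=csv,noheader | shared_gpus
```

With Exclusive Process enabled, any GPU printed by `shared_gpus` would be one where a second process was expected to fail with the "busy or unavailable" error.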

Is it possible to run nccl-tests with Exclusive Process? Or do I have to consider changing that setting on the system? Also, do you know whether this is a limitation of nccl-tests only, or a general NCCL issue?
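
If the setting does have to change, one option might be to relax it only for the duration of a test and restore it afterwards. The sketch below builds the corresponding `nvidia-smi -c` commands, using the numeric values nvidia-smi accepts for compute modes (0 for Default, 2 for Prohibited, 3 for Exclusive Process). The helper names `compute_mode_value` and `set_mode_cmd` are hypothetical, and the helpers only print the command rather than run it, since changing the mode normally needs administrator rights and would have to be done on every node of the job.

```shell
#!/bin/sh
# Map a compute-mode name to the numeric value accepted by `nvidia-smi -c`.
compute_mode_value() {
    case "$1" in
        DEFAULT)           echo 0 ;;
        PROHIBITED)        echo 2 ;;
        EXCLUSIVE_PROCESS) echo 3 ;;
        *) echo "unknown compute mode: $1" >&2; return 1 ;;
    esac
}

# Print the nvidia-smi command that would switch the given GPUs to a mode.
# $1: mode name, $2: comma-separated GPU indices (hypothetical helper)
set_mode_cmd() {
    value=$(compute_mode_value "$1") || return 1
    echo "nvidia-smi -i $2 -c $value"
}

# Example: show the commands to relax the mode and to restore it later.
#   set_mode_cmd DEFAULT 0,1,2,3
#   set_mode_cmd EXCLUSIVE_PROCESS 0,1,2,3
```

Calling `set_mode_cmd DEFAULT 0,1,2,3` prints the same command I used above to switch the four GPUs to Default.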
