I have a question. For NVIDIA servers with SHARP support (for example, B300), when running a single-node AllReduce with the NVLS algorithm, should the factor between algorithm bandwidth and bus bandwidth be 1 instead of 2(n-1)/n? However, even if the factor is 1, the algorithm bandwidth still seems to fall short of the NVLink bandwidth.
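To make the two candidate factors concrete, here is a small sketch of the arithmetic, assuming the usual convention that bus bandwidth = algorithm bandwidth × factor. The function names and the 400 GB/s figure are hypothetical and only for illustration; they are not measurements and not part of any NCCL API.

```python
def ring_allreduce_factor(n: int) -> float:
    """The 2(n-1)/n factor from the question, for n ranks."""
    return 2 * (n - 1) / n


def busbw_from_algbw(algbw_gbps: float, factor: float) -> float:
    """Bus bandwidth under the assumed convention busbw = algbw * factor."""
    return algbw_gbps * factor


if __name__ == "__main__":
    n = 8            # assumed number of GPUs in the node
    algbw = 400.0    # hypothetical algorithm bandwidth in GB/s, not a measurement

    with_ring_factor = busbw_from_algbw(algbw, ring_allreduce_factor(n))
    with_unit_factor = busbw_from_algbw(algbw, 1.0)

    print(f"factor 2(n-1)/n = {ring_allreduce_factor(n):.3f} -> busbw {with_ring_factor:.1f} GB/s")
    print(f"factor 1        = 1.000 -> busbw {with_unit_factor:.1f} GB/s")
```

With a factor of 1 the reported bus bandwidth would simply equal the algorithm bandwidth, which is why the choice of factor matters when comparing against the NVLink bandwidth.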