all_reduce_perf with NVLS on a single node (8 GPUs)

I'd like to ask a question. For NVIDIA servers equipped with SHARP (for example, B300), when running a single-node AllReduce with the NVLS algorithm, should the factor between algorithm bandwidth and bus bandwidth be 1 instead of 2(n-1)/n? However, even in that case, the algorithm bandwidth still seems to fall short of the NVLink bandwidth.
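
For reference, here is a minimal sketch of the conversion I am asking about, assuming the nccl-tests definition where algorithm bandwidth is message size divided by time and the AllReduce bus bandwidth is that value multiplied by 2(n-1)/n. The function names and the sample size and time are hypothetical, chosen only for illustration; they are not measurements from any machine.

```python
def algbw_gbps(size_bytes: float, time_s: float) -> float:
    """Algorithm bandwidth in GB/s: message size divided by elapsed time."""
    return size_bytes / time_s / 1e9


def allreduce_bus_factor(n_ranks: int) -> float:
    """Ring-style AllReduce correction factor 2(n-1)/n."""
    return 2.0 * (n_ranks - 1) / n_ranks


def busbw_gbps(size_bytes: float, time_s: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s: algorithm bandwidth scaled by the factor."""
    return algbw_gbps(size_bytes, time_s) * allreduce_bus_factor(n_ranks)


if __name__ == "__main__":
    n = 8                # GPUs in the node
    size = 1e9           # hypothetical message size in bytes
    elapsed = 0.004      # hypothetical time per AllReduce in seconds
    print(f"factor = {allreduce_bus_factor(n):.3f}")
    print(f"algbw  = {algbw_gbps(size, elapsed):.1f} GB/s")
    print(f"busbw  = {busbw_gbps(size, elapsed, n):.1f} GB/s")
```

With n = 8 the factor is 1.75, so the question is whether a reported bus bandwidth computed this way is still the right number to compare against NVLink bandwidth when NVLS is used, or whether the algorithm bandwidth (factor 1) should be compared instead.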