Validating Multi-Node GPU Clusters with NCCL Tests

How to run NCCL all_reduce benchmarks to verify your GPU cluster’s interconnect performance before running production training.

You’ve provisioned a multi-node GPU cluster with InfiniBand. Before you spend GPU-hours on training, run a 2-minute NCCL benchmark to confirm your interconnect is actually performing at spec.

This is the difference between “is it working?” (ibstat shows Active) and “is it working well?” (measured bandwidth matches hardware capability). Discovering a misconfigured cluster 4 hours into a training run is expensive.

Prerequisites: InfiniBand Cluster

This guide assumes you already have a multi-node GPU cluster with InfiniBand connectivity. If you're still setting that up, start with the InfiniBand field guide and come back once the fabric is up.

Before running NCCL tests, verify the basics:

# Check InfiniBand interface is active
ibstat | grep -A5 "Port 1"

You should see State: Active and Rate: 400 Gb/sec (for NDR). If not, see the troubleshooting section in the InfiniBand field guide.

Expected Performance

NCCL’s all_reduce_perf test reports “bus bandwidth,” which represents the effective throughput accounting for the all-reduce algorithm. Here’s what to expect for large message sizes (8GB+):

GPU Type    Interconnect          Expected Bus Bandwidth
H100 SXM    NVLink + IB NDR       400-450 GB/s
H200 SXM    NVLink + IB NDR       400-450 GB/s
A100 SXM    NVLink + IB HDR       250-300 GB/s
L40S        IB NDR (no NVLink)    40-50 GB/s

If your measured bandwidth is significantly below these numbers, something is misconfigured.

Setting Up Cross-Node SSH

NCCL tests use MPI to launch processes across nodes. MPI requires passwordless SSH between all nodes in the cluster.

Generate a key on your local machine (if you don't already have one) and distribute the public key to every node:

# Generate key if needed (on your local machine)
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""

# Copy to each node
ssh-copy-id -i ~/.ssh/id_ed25519.pub ubuntu@<node-ip>

Then copy the private key to each node so they can SSH to each other:

scp ~/.ssh/id_ed25519 ubuntu@<node-ip>:~/.ssh/

Add an SSH config on each node to skip host key checking (these are ephemeral training nodes):

cat >> ~/.ssh/config << 'EOF'
Host *
    IdentityFile ~/.ssh/id_ed25519
    IdentitiesOnly yes
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
EOF
chmod 600 ~/.ssh/config
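
Before moving on, confirm that passwordless SSH actually works. From any node, the following should print the remote hostname without prompting for a password (replace <node-ip> with another node's address):

# Should return immediately, with no password or host-key prompt
ssh ubuntu@<node-ip> hostname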

Creating the Hostfile

NCCL tests need a hostfile listing the private IPs of all nodes (the IPs on the InfiniBand network). Create this file on one of the nodes:

# hostfile - one private IP per line
10.0.0.2
10.0.0.3
10.0.0.4
10.0.0.5

If you provisioned with Terraform, you can extract these from the output:

terraform output -json training_nodes | jq -r '.[].private_ip' > hostfile

Or query Crusoe directly:

crusoe compute vms list --project-id <project-id> -f json | \
  jq -r '.[] | select(.type | contains("h100")) | .network_interfaces[0].ips[0].private_ipv4.address'
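
With the hostfile in place, a quick sanity loop catches SSH or InfiniBand problems before you involve MPI. This is a sketch that assumes the ubuntu user and the same ibstat check shown earlier:

# Confirm SSH works and InfiniBand is Active on every node in the hostfile
while read -r ip; do
  echo "== $ip =="
  ssh "ubuntu@$ip" 'hostname; ibstat | grep -m1 "State:"'
done < hostfile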

Running the NCCL Test

Crusoe’s GPU images include NCCL tests pre-built at /opt/nccl-tests/build/. SSH to any node and run:

#!/bin/bash
HOSTFILE='./hostfile'
GPUS_PER_NODE=8
NP=$(( $(grep -c -v '^$' "$HOSTFILE") * GPUS_PER_NODE ))

echo "Nodes: $(grep -c -v '^$' "$HOSTFILE")"
echo "GPUs per node: $GPUS_PER_NODE"
echo "Total processes: $NP"

mpirun \
    -x LD_LIBRARY_PATH \
    --mca plm_rsh_agent "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null" \
    --mca coll ^hcoll \
    -np $NP \
    -N $GPUS_PER_NODE \
    --hostfile $HOSTFILE \
    /opt/nccl-tests/build/all_reduce_perf -b 2G -e 32G -f 2 -g 1
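
Assuming you saved the block above as run_nccl_test.sh on the node that holds the hostfile (the filename is arbitrary), launch it directly:

chmod +x run_nccl_test.sh
./run_nccl_test.sh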

The key parameters:

Parameter    Meaning
-b 2G        Start with 2GB messages
-e 32G       End with 32GB messages
-f 2         Double the message size each iteration
-g 1         One GPU per process (8 processes per node)

The test takes about 60-90 seconds for a 4-node cluster.
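
If you want a point of comparison that excludes the InfiniBand fabric entirely, the same binary can also run on a single node without MPI. This is a sketch that assumes the same pre-built path and 8 GPUs per node; it exercises only NVLink within the node:

# Single-node baseline: one process driving all 8 local GPUs over NVLink
/opt/nccl-tests/build/all_reduce_perf -b 2G -e 8G -f 2 -g 8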

Reading the Output

#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
  2147483648     536870912     float     sum      -1   5765.2  372.56  698.55      0   5762.8  372.72  698.85      0
  4294967296    1073741824     float     sum      -1  11443.3  375.30  703.69      0  11438.2  375.47  703.99      0
  8589934592    2147483648     float     sum      -1  22815.1  376.53  706.00      0  22808.4  376.64  706.20      0
 17179869184    4294967296     float     sum      -1  45553.2  377.13  707.12      0  45542.1  377.23  707.31      0
 34359738368    8589934592     float     sum      -1  91027.4  377.47  707.76      0  91012.3  377.53  707.87      0

The columns that matter:

Column    What it means
size      Message size in bytes
time      Time to complete the all-reduce (microseconds)
algbw     Algorithm bandwidth: size / time
busbw     Bus bandwidth: accounts for data movement in the all-reduce algorithm
#wrong    Verification errors (should always be 0)

Focus on busbw for the largest message sizes. This represents the effective interconnect throughput. For H100/H200 clusters with NVLink and InfiniBand NDR, you should see 400-450 GB/s.
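
The two bandwidth columns are linked by the all-reduce algorithm: nccl-tests scales algbw by 2(n-1)/n, where n is the total number of ranks, to get busbw. You can check your own output the same way; the sketch below plugs in the sample algbw above (377.47 GB/s) with n = 16 ranks, which is what the sample rows are consistent with:

# busbw = algbw * 2*(n-1)/n; with n=16 the factor is 1.875
n=16; echo "scale=3; 377.47 * 2 * ($n - 1) / $n" | bc   # 707.756, matching the busbw column within rounding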

What Bad Looks Like

Low Bandwidth (50-100 GB/s instead of 400+ GB/s)

NCCL is probably falling back to Ethernet instead of InfiniBand. Check:

# Run with debug output exported to every rank
mpirun -x NCCL_DEBUG=INFO ... /opt/nccl-tests/build/all_reduce_perf ...

Look for these lines:

# Good - using InfiniBand
NCCL INFO Using network IB

# Bad - falling back to sockets
NCCL INFO NET/Socket : Using [0]eth0:10.0.0.5<0>

Common causes:

  • Nodes are in different InfiniBand partitions
  • Missing host_channel_adapters in Terraform config
  • Instance type doesn’t end in -ib
  • NCCL_IB_DISABLE=1 is set in environment
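
Two quick commands cover the last two causes above. This is a sketch and assumes the standard RDMA tooling (ibv_devinfo from rdma-core) is installed on the nodes:

# Look for NCCL overrides such as NCCL_IB_DISABLE=1 in the environment
env | grep -i nccl
# Confirm the HCAs are visible and their ports are PORT_ACTIVE
ibv_devinfo | grep -E "hca_id|state"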

Inconsistent Bandwidth Across Runs

Could indicate network congestion or a hardware issue. Run the test multiple times:

for i in {1..5}; do
  echo "Run $i"
  mpirun ... /opt/nccl-tests/build/all_reduce_perf -b 8G -e 8G -g 1
done

If results vary by more than 10%, investigate the physical network or check if other workloads are sharing the InfiniBand partition.
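
To quantify the spread, collect the "Avg bus bandwidth" value from each run and compare the extremes. A minimal sketch, assuming you have redirected one value per line into results.txt (a hypothetical file name):

# Percent spread between the fastest and slowest run
awk 'NR==1{min=max=$1} $1<min{min=$1} $1>max{max=$1} END{printf "spread: %.1f%%\n", 100*(max-min)/max}' results.txt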

Verification Errors in Output

Data corruption during transfer. This is rare but serious. Check:

  • GPU memory errors: nvidia-smi -q -d ECC
  • InfiniBand link errors: perfquery
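
Both checks produce a lot of output; the error counters are what matter. A sketch, assuming perfquery from infiniband-diags is available on the node:

# Non-zero volatile ECC counts point at a failing GPU
nvidia-smi -q -d ECC | grep -A2 -i "volatile"
# Non-zero error or discard counters point at a bad link or cable
perfquery | grep -iE "err|discard"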

Integrating Into Your Workflow

Run NCCL tests:

  1. After provisioning - Validate the cluster before handing it to users
  2. Before long training runs - 2 minutes of testing can save hours of debugging
  3. After any infrastructure changes - New nodes, partition changes, driver updates

For automated validation, add a check to your provisioning pipeline:

# Fail if the average bus bandwidth is below threshold
RESULT=$(mpirun ... /opt/nccl-tests/build/all_reduce_perf -b 8G -e 8G -g 1 | grep "Avg bus bandwidth" | awk '{print $6}')
if (( $(echo "$RESULT < 350" | bc -l) )); then
  echo "FAIL: Bus bandwidth $RESULT GB/s is below 350 GB/s threshold"
  exit 1
fi

Resources