Validating Multi-Node GPU Clusters with NCCL Tests

How to run NCCL all_reduce benchmarks to verify your GPU cluster’s interconnect performance before running production training.

You’ve provisioned a multi-node GPU cluster with InfiniBand. Before you spend GPU-hours on training, run a 2-minute NCCL benchmark to confirm your interconnect is actually performing at spec.

This is the difference between “is it working?” (ibstat shows Active) and “is it working well?” (measured bandwidth matches hardware capability). Discovering a misconfigured cluster 4 hours into a training run is expensive.

Prerequisites: InfiniBand Cluster

This guide assumes you already have a multi-node GPU cluster with InfiniBand connectivity. If you're still setting that up, start with the InfiniBand field guide and come back once the fabric is up.

Before running NCCL tests, verify the basics:

# Check InfiniBand interface is active
ibstat | grep -A5 "Port 1"

You should see State: Active and Rate: 400 Gb/sec (for NDR). If not, see the troubleshooting section in the InfiniBand field guide.

Expected Performance

NCCL’s all_reduce_perf test reports “bus bandwidth,” which represents the effective throughput accounting for the all-reduce algorithm. Here’s what to expect for large message sizes (8GB+):

GPU Type    Interconnect          Expected Bus Bandwidth
H100 SXM    NVLink + IB NDR       400-450 GB/s
H200 SXM    NVLink + IB NDR       400-450 GB/s
A100 SXM    NVLink + IB HDR       250-300 GB/s
L40S        IB NDR (no NVLink)    40-50 GB/s

If your measured bandwidth is significantly below these numbers, something is misconfigured.

Setting Up Cross-Node SSH

NCCL tests use MPI to launch processes across nodes. MPI requires passwordless SSH between all nodes in the cluster.

Generate a key on your local machine (if you don't already have one) and distribute the public key to every node:

# Generate key if needed (on your local machine)
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""

# Copy to each node
ssh-copy-id -i ~/.ssh/id_ed25519.pub ubuntu@<node-ip>

Then copy the private key to each node so they can SSH to each other:

scp ~/.ssh/id_ed25519 ubuntu@<node-ip>:~/.ssh/

Add an SSH config on each node to skip host key checking (these are ephemeral training nodes):

cat >> ~/.ssh/config << 'EOF'
Host *
    IdentityFile ~/.ssh/id_ed25519
    IdentitiesOnly yes
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
EOF
chmod 600 ~/.ssh/config
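
Before moving on, confirm that passwordless SSH actually works. From any node, the following should print the remote hostname without prompting for a password (replace <node-ip> with another node's address):

# Should return immediately, with no password or host-key prompt
ssh ubuntu@<node-ip> hostname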

Creating the Hostfile

NCCL tests need a hostfile listing the private IPs of all nodes (the IPs on the InfiniBand network). Create this file on one of the nodes:

# hostfile - one private IP per line
10.0.0.2
10.0.0.3
10.0.0.4
10.0.0.5

If you provisioned with Terraform, you can extract these from the output:

terraform output -json training_nodes | jq -r '.[].private_ip' > hostfile

Or query Crusoe directly:

crusoe compute vms list --project-id <project-id> -f json | \
  jq -r '.[] | select(.type | contains("h100")) | .network_interfaces[0].ips[0].private_ipv4.address'
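
With the hostfile in place, a quick sanity loop catches SSH or InfiniBand problems before you involve MPI. This is a sketch that assumes the ubuntu user and the same ibstat check shown earlier:

# Confirm SSH works and InfiniBand is Active on every node in the hostfile
while read -r ip; do
  echo "== $ip =="
  ssh "ubuntu@$ip" 'hostname; ibstat | grep -m1 "State:"'
done < hostfile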

Running the NCCL Test

Crusoe’s GPU images include NCCL tests pre-built at /opt/nccl-tests/build/. SSH to any node and run:

#!/bin/bash
HOSTFILE='./hostfile'
GPUS_PER_NODE=8
NP=$(( $(grep -c -v '^$' "$HOSTFILE") * GPUS_PER_NODE ))

echo "Nodes: $(grep -c -v '^$' "$HOSTFILE")"
echo "GPUs per node: $GPUS_PER_NODE"
echo "Total processes: $NP"

mpirun \
    -x LD_LIBRARY_PATH \
    --mca plm_rsh_agent "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null" \
    --mca coll ^hcoll \
    -np $NP \
    -N $GPUS_PER_NODE \
    --hostfile $HOSTFILE \
    /opt/nccl-tests/build/all_reduce_perf -b 2G -e 32G -f 2 -g 1
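
Assuming you saved the block above as run_nccl_test.sh on the node that holds the hostfile (the filename is arbitrary), launch it directly:

chmod +x run_nccl_test.sh
./run_nccl_test.sh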

The key parameters:

Parameter    Meaning
-b 2G        Start with 2GB messages
-e 32G       End with 32GB messages
-f 2         Double the message size each iteration
-g 1         One GPU per process (8 processes per node)

The test takes about 60-90 seconds for a 4-node cluster.
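
If you want a point of comparison that excludes the InfiniBand fabric entirely, the same binary can also run on a single node without MPI. This is a sketch that assumes the same pre-built path and 8 GPUs per node; it exercises only NVLink within the node:

# Single-node baseline: one process driving all 8 local GPUs over NVLink
/opt/nccl-tests/build/all_reduce_perf -b 2G -e 8G -f 2 -g 8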

Reading the Output

#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
  2147483648     536870912     float     sum      -1   5765.2  372.56  698.55      0   5762.8  372.72  698.85      0
  4294967296    1073741824     float     sum      -1  11443.3  375.30  703.69      0  11438.2  375.47  703.99      0
  8589934592    2147483648     float     sum      -1  22815.1  376.53  706.00      0  22808.4  376.64  706.20      0
 17179869184    4294967296     float     sum      -1  45553.2  377.13  707.12      0  45542.1  377.23  707.31      0
 34359738368    8589934592     float     sum      -1  91027.4  377.47  707.76      0  91012.3  377.53  707.87      0

The columns that matter:

Column    What it means
size      Message size in bytes
time      Time to complete the all-reduce (microseconds)
algbw     Algorithm bandwidth: size / time
busbw     Bus bandwidth: accounts for data movement in the all-reduce algorithm
#wrong    Verification errors (should always be 0)

Focus on busbw for the largest message sizes. This represents the effective interconnect throughput. For H100/H200 clusters with NVLink and InfiniBand NDR, you should see 400-450 GB/s.
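
The two bandwidth columns are linked by the all-reduce algorithm: nccl-tests scales algbw by 2(n-1)/n, where n is the total number of ranks, to get busbw. You can check your own output the same way; the sketch below plugs in the sample algbw above (377.47 GB/s) with n = 16 ranks, which is what the sample rows are consistent with:

# busbw = algbw * 2*(n-1)/n; with n=16 the factor is 1.875
n=16; echo "scale=3; 377.47 * 2 * ($n - 1) / $n" | bc   # 707.756, matching the busbw column within rounding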

What Bad Looks Like

Low Bandwidth (50-100 GB/s instead of 400+ GB/s)

NCCL is probably falling back to Ethernet instead of InfiniBand. Check:

# Run with debug output exported to every rank
mpirun -x NCCL_DEBUG=INFO ... /opt/nccl-tests/build/all_reduce_perf ...

Look for these lines:

# Good - using InfiniBand
NCCL INFO Using network IB

# Bad - falling back to sockets
NCCL INFO NET/Socket : Using [0]eth0:10.0.0.5<0>

Common causes:

  • Nodes are in different InfiniBand partitions
  • Missing host_channel_adapters in Terraform config
  • Instance type doesn’t end in -ib
  • NCCL_IB_DISABLE=1 is set in environment
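
Two quick commands cover the last two causes above. This is a sketch and assumes the standard RDMA tooling (ibv_devinfo from rdma-core) is installed on the nodes:

# Look for NCCL overrides such as NCCL_IB_DISABLE=1 in the environment
env | grep -i nccl
# Confirm the HCAs are visible and their ports are PORT_ACTIVE
ibv_devinfo | grep -E "hca_id|state"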

Inconsistent Bandwidth Across Runs

Could indicate network congestion or a hardware issue. Run the test multiple times:

for i in {1..5}; do
  echo "Run $i"
  mpirun ... /opt/nccl-tests/build/all_reduce_perf -b 8G -e 8G -g 1
done

If results vary by more than 10%, investigate the physical network or check if other workloads are sharing the InfiniBand partition.
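
To quantify the spread, collect the "Avg bus bandwidth" value from each run and compare the extremes. A minimal sketch, assuming you have redirected one value per line into results.txt (a hypothetical file name):

# Percent spread between the fastest and slowest run
awk 'NR==1{min=max=$1} $1<min{min=$1} $1>max{max=$1} END{printf "spread: %.1f%%\n", 100*(max-min)/max}' results.txt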

Verification Errors in Output

Data corruption during transfer. This is rare but serious. Check:

  • GPU memory errors: nvidia-smi -q -d ECC
  • InfiniBand link errors: perfquery
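
Both checks produce a lot of output; the error counters are what matter. A sketch, assuming perfquery from infiniband-diags is available on the node:

# Non-zero volatile ECC counts point at a failing GPU
nvidia-smi -q -d ECC | grep -A2 -i "volatile"
# Non-zero error or discard counters point at a bad link or cable
perfquery | grep -iE "err|discard"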

Integrating Into Your Workflow

Run NCCL tests:

  1. After provisioning - Validate the cluster before handing it to users
  2. Before long training runs - 2 minutes of testing can save hours of debugging
  3. After any infrastructure changes - New nodes, partition changes, driver updates

For automated validation, add a check to your provisioning pipeline:

# Fail if the average bus bandwidth is below threshold
RESULT=$(mpirun ... /opt/nccl-tests/build/all_reduce_perf -b 8G -e 8G -g 1 | grep "Avg bus bandwidth" | awk '{print $6}')
if (( $(echo "$RESULT < 350" | bc -l) )); then
  echo "FAIL: Bus bandwidth $RESULT GB/s is below 350 GB/s threshold"
  exit 1
fi

Resources