Validating Multi-Node GPU Clusters with NCCL Tests

You’ve provisioned a multi-node GPU cluster with InfiniBand. Before you spend GPU-hours on training, run a 2-minute NCCL benchmark to confirm your interconnect is actually performing at spec.
This is the difference between “is it working?” (ibstat shows Active) and “is it working well?” (measured bandwidth matches hardware capability). Discovering a misconfigured cluster 4 hours into a training run is expensive.
Comparing GPU cloud providers?
Download our GPU Cloud Comparison Report analyzing 17 providers across pricing, InfiniBand networking, storage, and enterprise readiness. Includes detailed Crusoe profile with infrastructure specs and use case recommendations.
Prerequisites: InfiniBand Cluster
This guide assumes you already have a multi-node GPU cluster with InfiniBand connectivity. If you’re still setting that up:
- A Field Guide to Crusoe InfiniBand with Terraform covers capacity checking, partition configuration, and troubleshooting
- Multi-Node GPU Training Infrastructure on Crusoe walks through a complete Terraform configuration
Before running NCCL tests, verify the basics:
# Check InfiniBand interface is active
ibstat | grep -A5 "Port 1"
You should see State: Active and Rate: 400 Gb/sec (for NDR). If not, see the troubleshooting section in the InfiniBand field guide.
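For a slightly broader check across all HCAs on a node, here is a quick sketch (assuming the standard rdma-core tools are installed; adjust the expected rate if you're not on NDR):
# Every port should report PORT_ACTIVE with link_layer InfiniBand
ibv_devinfo | grep -E 'hca_id|state|link_layer'
# Count the ports ibstat reports at 400 Gb/sec
ibstat | grep -c "Rate: 400"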
Expected Performance
NCCL’s all_reduce_perf test reports “bus bandwidth,” a normalized figure that accounts for the extra data movement the all-reduce algorithm performs, so it can be compared directly against the hardware’s link bandwidth. Here’s what to expect at large message sizes (8GB+):
| GPU Type | Interconnect | Expected Bus Bandwidth |
|---|---|---|
| H100 SXM | NVLink + IB NDR | 400-450 GB/s |
| H200 SXM | NVLink + IB NDR | 400-450 GB/s |
| A100 SXM | NVLink + IB HDR | 250-300 GB/s |
| L40S | IB NDR (no NVLink) | 40-50 GB/s |
If your measured bandwidth is significantly below these numbers, something is misconfigured.
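For context on where these numbers come from: for all-reduce, nccl-tests derives bus bandwidth from algorithm bandwidth using the factor 2(n-1)/n, where n is the total number of GPU ranks. A back-of-the-envelope sketch (the figures below are illustrative, not measurements):
# busbw = algbw * 2*(n-1)/n for all-reduce, n = total GPU ranks
n=32        # e.g. 4 nodes x 8 GPUs
algbw=225   # GB/s, as reported in the algbw column
echo "$algbw * 2 * ($n - 1) / $n" | bc -l   # ~436 GB/s bus bandwidth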
Setting Up Cross-Node SSH
NCCL tests use MPI to launch processes across nodes. MPI requires passwordless SSH between all nodes in the cluster.
On each node, ensure the SSH key is distributed:
# Generate key if needed (on your local machine)
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""
# Copy to each node
ssh-copy-id -i ~/.ssh/id_ed25519.pub ubuntu@<node-ip>
Then copy the private key to each node so the nodes can SSH to each other:
scp ~/.ssh/id_ed25519 ubuntu@<node-ip>:~/.ssh/
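With more than a couple of nodes, a small loop saves repetition. A minimal sketch combining the two steps above, assuming your node IPs are listed one per line in a file named nodes.txt (the filename is arbitrary):
# Push the public key and the shared private key to every node
while read -r node; do
  ssh-copy-id -i ~/.ssh/id_ed25519.pub "ubuntu@$node"
  scp ~/.ssh/id_ed25519 "ubuntu@$node:~/.ssh/"
  ssh "ubuntu@$node" 'chmod 600 ~/.ssh/id_ed25519'
done < nodes.txt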
Add an SSH config on each node to skip host key checking (these are ephemeral training nodes):
cat >> ~/.ssh/config << 'EOF'
Host *
IdentityFile ~/.ssh/id_ed25519
IdentitiesOnly yes
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
EOF
chmod 600 ~/.ssh/config
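Before involving MPI, confirm that key-based SSH actually works node to node. A quick check (10.0.0.3 stands in for another node's private IP):
# BatchMode fails immediately instead of prompting if key auth is broken
ssh -o BatchMode=yes ubuntu@10.0.0.3 hostname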
Creating the Hostfile
NCCL tests need a hostfile listing the private IPs of all nodes (the IPs on the InfiniBand network). Create this file on one of the nodes:
# hostfile - one private IP per line
10.0.0.2
10.0.0.3
10.0.0.4
10.0.0.5
If you provisioned with Terraform, you can extract these from the output:
terraform output -json training_nodes | jq -r '.[].private_ip' > hostfile
Or query Crusoe directly:
crusoe compute vms list --project-id <project-id> -f json | \
jq -r '.[] | select(.type | contains("h100")) | .network_interfaces[0].ips[0].private_ipv4.address'
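However you generate it, it's worth confirming that every entry is reachable before handing the file to mpirun. A simple sketch:
# Expect an OK line for each node in the hostfile
while read -r ip; do
  ping -c 1 -W 1 "$ip" > /dev/null && echo "OK   $ip" || echo "FAIL $ip"
done < hostfile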
Running the NCCL Test
Crusoe’s GPU images include NCCL tests pre-built at /opt/nccl-tests/build/. SSH to any node and run:
#!/bin/bash
HOSTFILE='./hostfile'
GPUS_PER_NODE=8
NP=$(( $(grep -c -v '^$' "$HOSTFILE") * GPUS_PER_NODE ))
echo "Nodes: $(grep -c -v '^$' "$HOSTFILE")"
echo "GPUs per node: $GPUS_PER_NODE"
echo "Total processes: $NP"
mpirun \
-x LD_LIBRARY_PATH \
--mca plm_rsh_agent "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null" \
--mca coll ^hcoll \
-np $NP \
-N $GPUS_PER_NODE \
--hostfile $HOSTFILE \
/opt/nccl-tests/build/all_reduce_perf -b 2G -e 32G -f 2 -g 1
The key parameters:
| Parameter | Meaning |
|---|---|
| -b 2G | Start with 2GB messages |
| -e 32G | End with 32GB messages |
| -f 2 | Double the message size each iteration |
| -g 1 | One GPU per process (8 processes per node) |
The test takes about 60-90 seconds for a 4-node cluster.
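If you expect to re-run the benchmark after driver updates or node swaps, keep a copy of each run for comparison. One way, assuming the script above is saved as run_nccl_test.sh (both names are arbitrary):
# Keep a timestamped log of each benchmark run
mkdir -p ~/nccl-results
./run_nccl_test.sh | tee ~/nccl-results/all_reduce_$(date +%Y%m%d_%H%M%S).log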
If you'd rather not manage cluster validation and distributed training infrastructure yourself, Saturn Cloud handles multi-node coordination, environment setup, and performance monitoring out of the box. Chat with our team to learn more.
Reading the Output
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
2147483648 536870912 float sum -1 5765.2 372.56 698.55 0 5762.8 372.72 698.85 0
4294967296 1073741824 float sum -1 11443.3 375.30 703.69 0 11438.2 375.47 703.99 0
8589934592 2147483648 float sum -1 22815.1 376.53 706.00 0 22808.4 376.64 706.20 0
17179869184 4294967296 float sum -1 45553.2 377.13 707.12 0 45542.1 377.23 707.31 0
34359738368 8589934592 float sum -1 91027.4 377.47 707.76 0 91012.3 377.53 707.87 0
The columns that matter:
| Column | What it means |
|---|---|
| size | Message size in bytes |
| time | Time to complete the all-reduce (microseconds) |
| algbw | Algorithm bandwidth: size / time |
| busbw | Bus bandwidth: accounts for data movement in the all-reduce algorithm |
| #wrong | Verification errors (should always be 0) |
Focus on busbw for the largest message sizes. This represents the effective interconnect throughput. For H100/H200 clusters with NVLink and InfiniBand NDR, you should see 400-450 GB/s.
What Bad Looks Like
Low Bandwidth (50-100 GB/s instead of 400+ GB/s)
NCCL is probably falling back to Ethernet instead of InfiniBand. Check:
# Run with debug output
NCCL_DEBUG=INFO mpirun ... /opt/nccl-tests/build/all_reduce_perf ...
Look for these lines:
# Good - using InfiniBand
NCCL INFO Using network IB
# Bad - falling back to sockets
NCCL INFO NET/Socket : Using [0]eth0:10.0.0.5<0>
Common causes:
- Nodes are in different InfiniBand partitions
- Missing host_channel_adapters in Terraform config
- Instance type doesn’t end in -ib
- NCCL_IB_DISABLE=1 is set in the environment
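The last two causes are quick to rule out from a shell on any node; a sketch:
# NCCL_IB_DISABLE=1 forces the socket fallback; this should print nothing or =0
env | grep NCCL_IB_DISABLE
# The IB devices NCCL can see; an empty directory means it will fall back to sockets
ls /sys/class/infiniband/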
Inconsistent Bandwidth Across Runs
Could indicate network congestion or a hardware issue. Run the test multiple times:
for i in {1..5}; do
echo "Run $i"
mpirun ... /opt/nccl-tests/build/all_reduce_perf -b 8G -e 8G -g 1
done
If results vary by more than 10%, investigate the physical network or check if other workloads are sharing the InfiniBand partition.
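To compare runs numerically rather than by eye, pull the summary line nccl-tests prints at the end of each run. A sketch, reusing the mpirun invocation from above:
# One bus-bandwidth number per run, from the "Avg bus bandwidth" summary line
for i in {1..5}; do
  bw=$(mpirun ... /opt/nccl-tests/build/all_reduce_perf -b 8G -e 8G -g 1 \
    | awk '/Avg bus bandwidth/ {print $6}')
  echo "Run $i: $bw GB/s"
done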
Verification Errors in Output
Data corruption during transfer. This is rare but serious. Check:
- GPU memory errors: nvidia-smi -q -d ECC
- InfiniBand link errors: perfquery
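To run both checks across every node at once, here is a sketch that reuses the hostfile (the counter names vary slightly by driver and firmware version):
# Look for non-zero ECC counters and IB error/discard counters on each node
while read -r ip; do
  echo "== $ip =="
  ssh "ubuntu@$ip" 'nvidia-smi -q -d ECC | grep -A 3 "Aggregate"'
  ssh "ubuntu@$ip" 'perfquery | grep -Ei "error|discard"'
done < hostfile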
Integrating Into Your Workflow
Run NCCL tests:
- After provisioning - Validate the cluster before handing it to users
- Before long training runs - 2 minutes of testing can save hours of debugging
- After any infrastructure changes - New nodes, partition changes, driver updates
For automated validation, add a check to your provisioning pipeline:
# Fail if bus bandwidth is below threshold
RESULT=$(mpirun ... /opt/nccl-tests/build/all_reduce_perf -b 8G -e 8G -g 1 | awk '/Avg bus bandwidth/ {print $6}')
if (( $(echo "$RESULT < 350" | bc -l) )); then
echo "FAIL: Bus bandwidth $RESULT GB/s is below 350 GB/s threshold"
exit 1
fi
Resources
- A Field Guide to Crusoe InfiniBand - provisioning and troubleshooting
- Multi-Node GPU Training Infrastructure on Crusoe - complete Terraform setup
- NCCL Tests GitHub - source and additional test options
- NCCL Documentation - environment variables and tuning
Saturn Cloud provides customizable, ready-to-use cloud environments for collaborative data teams.
Try Saturn Cloud and join thousands of users moving to the cloud without having to switch tools.