Multi-node training
Distributed training that runs correctly on first launch
Multi-node training is the workload teams buy GPUs to run and the one most clusters get wrong. The job hangs at NCCL init, falls back to TCP at a fraction of the expected speed, or never schedules across nodes at all. We fix the whole path, from the fabric and operators to NCCL topology and the launcher, and verify it with benchmarks before we call it done.
What we deliver
The full path from fabric to launcher
Distributed training fails at the seams between layers: the RDMA fabric not exposed to pods, NCCL falling back to the wrong interface, the launcher missing rendezvous variables, or the pod gang never scheduling together. We own all of it end to end.
RDMA fabric into the pods
InfiniBand or RoCE exposed to containers via the RDMA shared device plugin or SR-IOV, MOFED on the nodes, and the NVIDIA Network Operator managing the stack. GPUDirect RDMA enabled so traffic moves between NIC and GPU without passing through host memory.
NCCL on the fast path
NCCL configured with correct topology, HCA selection, and GPUDirect settings. Verified with nccl-tests all-reduce at the bandwidth the hardware should deliver. A training job that quietly fell back to TCP is not a working training job.
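As a rough sketch of that check, the snippet below converts a measured all-reduce time into bus bandwidth using the 2(n-1)/n correction factor that nccl-tests documents for all-reduce, then compares it with the figure expected for the fabric. The function names and the 80% threshold are illustrative assumptions, not part of nccl-tests.

```python
def allreduce_bus_bandwidth_gbs(size_bytes: int, seconds: float, ranks: int) -> float:
    """Bus bandwidth in GB/s for one all-reduce, following the nccl-tests convention."""
    if ranks < 2 or seconds <= 0:
        raise ValueError("need at least 2 ranks and a positive duration")
    alg_bw = size_bytes / seconds / 1e9        # algorithm bandwidth in GB/s
    return alg_bw * 2 * (ranks - 1) / ranks    # all-reduce correction factor


def on_fast_path(bus_bw_gbs: float, expected_gbs: float, min_fraction: float = 0.8) -> bool:
    """True if the measurement reaches an assumed fraction of the expected bandwidth."""
    return bus_bw_gbs >= expected_gbs * min_fraction
```

The expected figure has to come from the actual fabric (link speed and NICs per node); the threshold is a judgment call to agree on up front.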
Gang scheduling
A distributed job needs all its pods running simultaneously. We configure gang scheduling (Kueue, Volcano, or the Kubeflow training operators) so a multi-node job gets its full allocation or waits, rather than scheduling half a gang and deadlocking.
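The all-or-nothing rule can be illustrated with a toy admission loop. This is a simplified model for explanation only, not how Kueue or Volcano are implemented; the `Job` class and `admit_all_or_nothing` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    pods: int          # number of pods in the gang
    gpus_per_pod: int


def admit_all_or_nothing(queue, free_gpus):
    """Admit a job only if its whole gang fits; otherwise it stays pending.

    Returns (admitted names, pending names, GPUs still free).
    """
    admitted, pending = [], []
    for job in queue:
        need = job.pods * job.gpus_per_pod
        if need <= free_gpus:
            free_gpus -= need          # the full gang gets its GPUs at once
            admitted.append(job.name)
        else:
            pending.append(job.name)   # holds no GPUs while it waits
    return admitted, pending, free_gpus
```

The point of the model is that a pending job holds nothing, so two jobs can never each hold half of what the other needs.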
Launcher environment variables
We have the scheduler populate MASTER_ADDR, WORLD_SIZE, RANK, LOCAL_RANK, and the relevant NCCL variables in every pod, so torchrun, DeepSpeed, and Megatron receive a correct environment on every node without a hand-rolled launch script.
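A minimal sketch of a pre-flight check that a training entrypoint could run before initializing its process group. The function name `read_rendezvous_env` is hypothetical, and MASTER_PORT is included as an assumption because PyTorch's env:// initialization reads it alongside MASTER_ADDR.

```python
import os

# MASTER_PORT is an assumption added for this sketch (see the note above).
REQUIRED = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "LOCAL_RANK")


def read_rendezvous_env(env=None):
    """Validate the launcher variables and return them as a typed dict."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED if key not in env]
    if missing:
        raise RuntimeError("missing launcher variables: " + ", ".join(missing))
    world_size = int(env["WORLD_SIZE"])
    rank = int(env["RANK"])
    if not 0 <= rank < world_size:
        raise RuntimeError(f"RANK {rank} is outside WORLD_SIZE {world_size}")
    return {
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "world_size": world_size,
        "rank": rank,
        "local_rank": int(env["LOCAL_RANK"]),
    }
```

Failing fast here turns a silent rendezvous hang into an immediate, readable error.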
Checkpointing and restart
Long training runs encounter preemptions and node failures. We set up checkpoint storage and a restart path so a run does not start over from scratch because one node went down.
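One way to sketch the restart path, using only the standard library: write each checkpoint to a temporary file and rename it into place, so a node that dies mid-write never leaves a truncated checkpoint, then resume from the newest complete file. The file naming scheme and JSON payload are assumptions for illustration; a real run would save model and optimizer state with its framework's own serializer.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(directory, step, state):
    """Write a checkpoint atomically: temp file first, then rename into place."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    final = directory / f"step_{step:08d}.json"
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump({"step": step, "state": state}, handle)
        handle.flush()
        os.fsync(handle.fileno())     # make sure the bytes reach storage
    os.replace(tmp, final)            # atomic when both paths share a filesystem
    return final


def load_latest_checkpoint(directory):
    """Return the newest complete checkpoint, or None if there is none."""
    directory = Path(directory)
    if not directory.is_dir():
        return None
    files = sorted(directory.glob("step_*.json"))  # zero-padded names sort by step
    if not files:
        return None
    return json.loads(files[-1].read_text())
```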
A reproducible launch path
The end state: a researcher describes a job (image, nodes, GPUs per node, command) and gets a correctly configured distributed run every time, without needing to understand the fabric or the launcher internals.
Where multi-node training breaks
The failure modes we fix
The silent TCP fallback
The job runs and loss decreases, but it is running at around a tenth of the hardware's capability because NCCL fell back to TCP over the management network. This failure looks like success and is caught by benchmarks, not by watching training curves.
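Alongside a bandwidth benchmark, a quick log scan can flag the fallback. The sketch below assumes the job ran with NCCL_DEBUG=INFO and that the log contains the substrings "NET/IB" for the InfiniBand/RoCE transport and "NET/Socket" for the TCP transport; the exact log format varies between NCCL versions, so treat the markers as assumptions to confirm against your own logs. The function names are hypothetical.

```python
# Assumed markers; confirm against the output of your NCCL version.
DEFAULT_MARKERS = (("NET/IB", "ib"), ("NET/Socket", "socket"))


def network_transports(log_lines, markers=DEFAULT_MARKERS):
    """Return the set of transport labels whose marker appears in the log."""
    found = set()
    for line in log_lines:
        for needle, label in markers:
            if needle in line:
                found.add(label)
    return found


def looks_like_tcp_fallback(log_lines, markers=DEFAULT_MARKERS):
    """Flag a run whose log mentions the socket transport but never the IB one."""
    transports = network_transports(log_lines, markers)
    return "socket" in transports and "ib" not in transports
```

A clean result here is only a hint; the bandwidth number remains the acceptance test.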
NCCL init hangs
Pods come up, then the job sits indefinitely at NCCL initialization. Usually the RDMA interface is not visible inside the pod, the topology configuration is wrong, or a network policy is blocking the traffic. Each cause has a different fix.
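The first of those causes is quick to rule out from inside the container. A minimal sketch, assuming a Linux pod where RDMA devices (InfiniBand or RoCE) appear under /sys/class/infiniband; the function name is hypothetical and the path is a parameter so the check can be pointed elsewhere.

```python
from pathlib import Path


def visible_rdma_devices(sysfs_root="/sys/class/infiniband"):
    """List RDMA device names visible to this process; empty if none are exposed."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    return sorted(entry.name for entry in root.iterdir())


if __name__ == "__main__":
    devices = visible_rdma_devices()
    print("RDMA devices:", ", ".join(devices) if devices else "none visible")
```

An empty list points at the device plugin or SR-IOV configuration rather than at NCCL itself.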
Half-scheduled gang deadlock
Half the pods schedule and hold GPUs. The other half wait for GPUs the first half is holding. Gang scheduling makes allocation all-or-nothing so the cluster does not deadlock itself.
Brittle launch scripts
Every team that hand-rolls its launcher ends up with bash that hardcodes node addresses and breaks when the scheduler places pods differently. Replacing it with scheduler-populated environment variables makes the launch path reproducible.
How we define done
Acceptance criteria, agreed before we start
| Check | How we verify it |
|---|---|
| RDMA is in use | nccl-tests all-reduce bandwidth at or near line rate for your fabric. If the number is the TCP number, the job is not done. |
| Scaling is close to linear | A training job timed on 1, 2, 4, and 8 nodes. Throughput scaling tracked and any efficiency drop explained. |
| Gang scheduling works under contention | A multi-node job gets its full allocation or stays pending, demonstrated with competing jobs in the queue. |
| A user can launch without help | Someone who has not touched the fabric submits a distributed job using the documented path and it runs correctly. |
| Restart from checkpoint works | A run is killed mid-training and resumes from checkpoint without manual intervention. |
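
For the scaling check, efficiency can be computed from the timed runs as measured throughput divided by ideal linear throughput. A minimal sketch, assuming throughput is recorded in a consistent unit such as samples per second and that the smallest node count serves as the baseline; the function name is hypothetical.

```python
def scaling_efficiency(throughput_by_nodes):
    """Map node count -> efficiency relative to linear scaling from the smallest run."""
    if not throughput_by_nodes:
        raise ValueError("need at least one measurement")
    base_nodes = min(throughput_by_nodes)
    per_node_baseline = throughput_by_nodes[base_nodes] / base_nodes
    return {
        nodes: throughput / (nodes * per_node_baseline)
        for nodes, throughput in sorted(throughput_by_nodes.items())
    }
```

A value of 1.0 means perfectly linear scaling; anything noticeably lower is the efficiency drop that needs an explanation.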