Multi-node training

Distributed training that runs correctly on first launch

Multi-node training is the workload teams buy GPUs to run and the one most clusters get wrong. The job hangs at NCCL init, falls back to TCP at a fraction of the expected speed, or never schedules across nodes at all. We fix the whole path: fabric, operators, NCCL topology, and launcher, and verify with benchmarks before we call it done.

Talk to an engineer All services

What we deliver

The full path from fabric to launcher

Distributed training fails at the seams between layers: the RDMA fabric not exposed to pods, NCCL falling back to the wrong interface, the launcher missing rendezvous variables, or the pod gang never scheduling together. We own all of it end to end.

RDMA fabric into the pods

InfiniBand or RoCE exposed to containers via the RDMA shared device plugin or SR-IOV, MOFED on the nodes, and the NVIDIA network operator managing the stack. GPUDirect RDMA so traffic goes NIC-to-GPU without passing through host memory.

NCCL on the fast path

NCCL configured with correct topology, HCA selection, and GPUDirect settings. Verified with nccl-tests all-reduce at the bandwidth the hardware should deliver. A training job that quietly fell back to TCP is not a working training job.

Gang scheduling

A distributed job needs all its pods running simultaneously. We configure gang scheduling (Kueue, Volcano, or the Kubeflow training operators) so a multi-node job gets its full allocation or waits, rather than scheduling half a gang and deadlocking.

Launcher environment variables

We wire MASTER_ADDR, WORLD_SIZE, RANK, LOCAL_RANK, and the relevant NCCL variables into the scheduler so torchrun, DeepSpeed, and Megatron receive a correct environment on every node, without a hand-rolled launch script.

Checkpointing and restart

Long training runs encounter preemptions and node failures. We set up checkpoint storage and a restart path so a run does not start over from scratch because one node went down.

A reproducible launch path

The end state: a researcher describes a job (image, nodes, GPUs per node, command) and gets a correctly configured distributed run every time, without needing to understand the fabric or the launcher internals.

Where multi-node training breaks

The failure modes we fix

Launcher torchrun / DeepSpeed / Megatron · rendezvous env vars set by the scheduler

▼

Collective communications NCCL using RDMA interfaces · GPUDirect · correct topology and HCA selection

▼

Fabric InfiniBand or RoCE · MOFED · NVIDIA network operator · RDMA device plugin / SR-IOV

The silent TCP fallback

The job runs, loss decreases, and it is running at a tenth of the hardware's capability because NCCL fell back to TCP over the management network. This failure looks like success and is caught by benchmarks, not by watching training curves.

NCCL init hangs

Pods come up, the job sits indefinitely at NCCL initialization. Usually the RDMA interface is not visible inside the pod, the topology configuration is wrong, or a network policy is blocking the traffic. Each cause has a different fix.

Half-scheduled gang deadlock

Half the pods schedule and hold GPUs. The other half wait for GPUs the first half is holding. Gang scheduling makes allocation all-or-nothing so the cluster does not deadlock itself.

Brittle launch scripts

Every team that hand-rolls this ends up with bash that hardcodes node addresses and breaks when the scheduler places pods differently. Replacing it with scheduler-populated env vars makes the launch path reproducible.

How we define done

Acceptance criteria, agreed before we start

Check	How we verify it
RDMA is in use	`nccl-tests` all-reduce bandwidth at or near line rate for your fabric. If the number is the TCP number, the job is not done.
Scaling is close to linear	A training job timed on 1, 2, 4, and 8 nodes. Throughput scaling tracked and any efficiency drop explained.
Gang scheduling works under contention	A multi-node job gets its full allocation or stays pending, demonstrated with competing jobs in the queue.
A user can launch without help	Someone who has not touched the fabric submits a distributed job using the documented path and it runs correctly.
Restart from checkpoint works	A run is killed mid-training and resumes from checkpoint without manual intervention.