Choosing the wrong distributed training strategy is one of the most expensive mistakes you can make when training large language models. Pick DDP when you need FSDP, and you'll hit GPU memory walls before your job completes. Use DeepSpeed when PyTorch's native FSDP would have been simpler, and you'll spend days debugging config files instead of training models.
This guide covers what DDP, FSDP, and DeepSpeed actually do, when each one makes sense, and how to set them up for LLM training on multi-GPU clusters.
The short answer: which one should you use?
| Your situation | Use this | Why |
|---|---|---|
| Fine-tuning a model that fits on 1–2 GPUs | DDP | Simplest setup, lowest overhead for small jobs |
| Full fine-tuning of 7B–70B models across multiple GPUs | FSDP | Native PyTorch, good memory efficiency, no extra dependencies |
| Training 70B+ models where memory is the hard constraint | DeepSpeed ZeRO-3 | Offloads optimizer state + gradients to CPU; most aggressive memory reduction |
| QLoRA or LoRA fine-tuning of any size model | DDP or FSDP | Adapters are small; DeepSpeed overhead rarely worth it for PEFT |
| Pre-training from scratch at scale (100B+ parameters) | DeepSpeed ZeRO-3 or Megatron-LM | Extreme memory pressure requires aggressive sharding + offloading |
DDP (Distributed Data Parallel): the baseline
DDP is PyTorch's standard approach to distributed training. It replicates the full model across all GPUs, splits the data across them, computes gradients independently on each GPU, and then synchronizes the gradients via AllReduce before the optimizer step.
How DDP works
Each GPU holds a complete copy of the model weights. Forward and backward passes run in parallel across GPUs on different batches. After the backward pass, gradients are averaged across all GPUs using ring-AllReduce. Every GPU ends up with identical gradients and applies the same weight update.
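To make the synchronization step concrete, here is a toy pure-Python sketch of the quantity AllReduce computes. Real DDP does this with `torch.distributed` ring-AllReduce on GPU tensors; the list-based `all_reduce_mean` helper below is a hypothetical stand-in.

```python
def all_reduce_mean(grads_per_gpu):
    """Average per-parameter gradients across all replicas, so every
    replica ends up with identical gradients for the optimizer step."""
    num_gpus = len(grads_per_gpu)
    num_params = len(grads_per_gpu[0])
    return [
        sum(g[i] for g in grads_per_gpu) / num_gpus
        for i in range(num_params)
    ]

# Two "GPUs", each with gradients for the same 3 parameters
local_grads = [
    [0.25, -0.5, 1.0],  # gradients computed on GPU 0's batch
    [0.75,  0.0, 0.0],  # gradients computed on GPU 1's batch
]
print(all_reduce_mean(local_grads))  # [0.5, -0.25, 0.5]
```

After this step, both replicas apply the same weight update, which is what keeps the model copies in sync.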
When DDP works well
DDP is the right choice when:
- Your model fits entirely in the VRAM of a single GPU
- You're fine-tuning models up to ~7B parameters with standard precision
- You're running LoRA or QLoRA (adapter weights are small, so there's no memory pressure)
- You want the simplest setup with minimal configuration overhead
For workloads that don't strictly require memory sharding, DDP is the lowest-friction distributed setup.
Where DDP breaks down
DDP's fundamental constraint is that every GPU must hold the full model in memory. For a 70B model in FP16, that's roughly 140 GB of weights per GPU, which exceeds a single H100 (80 GB) and leaves no training headroom on an H200 (141 GB) once gradients and optimizer states are added. DDP can't help you here. You need a strategy that shards the model, not just the data.
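The 140 GB figure is simple arithmetic: parameter count times bytes per parameter. A back-of-envelope helper (illustrative, not a profiler; it counts weights only):

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """GB needed for model weights alone; FP16/BF16 use 2 bytes per parameter."""
    return num_params * bytes_per_param / 1e9

print(weight_memory_gb(70e9))  # 140.0 -- weights alone exceed one 80 GB H100
print(weight_memory_gb(8e9))   # 16.0  -- an 8B model fits comfortably
```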
DDP setup (PyTorch)
```bash
torchrun --nproc_per_node=4 train.py \
  --model_name meta-llama/Llama-3-8B \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 2
```
Saturn Cloud tip: Use torchrun for DDP on multi-GPU Saturn Cloud resources. Set --nproc_per_node to match the number of GPUs on your instance.
FSDP (Fully Sharded Data Parallel): the PyTorch-native choice for large models
FSDP shards model weights, gradients, and optimizer states across all GPUs in your training job. Each GPU holds only a fraction of the total model at any one time. When a layer is needed for computation, FSDP collects the shards via AllGather, runs the forward or backward pass, then discards the gathered weights to free memory. This is the "shard, gather, discard" cycle.
The result: you can train models that would never fit on a single GPU. Each GPU needs roughly the total memory footprint divided by the number of GPUs, plus some overhead for communication buffers and the temporarily gathered layers.
FSDP sharding strategies
| Strategy | What it shards | Best for |
|---|---|---|
| FULL_SHARD | Weights + gradients + optimizer states (maximum savings) | 70B+ models, memory-constrained setups |
| SHARD_GRAD_OP | Gradients + optimizer states only (weights replicated) | Mid-size models where weights fit, but optimizer states donât |
| NO_SHARD | Nothing (equivalent to DDP) | Testing or baseline comparison |
| HYBRID_SHARD | Full sharding within nodes; replication across nodes | Multi-node jobs where inter-node bandwidth is the bottleneck |
When FSDP is the right choice
Native PyTorch FSDP is the right choice when:
- Full fine-tuning of 7B–70B models across 4–8 GPUs
- You want native PyTorch without additional library dependencies
- Your cluster has fast NVLink interconnects (H100 SXM, H200 SXM)
- You're already using HuggingFace Accelerate or TRL; both have first-class FSDP support
- You want FSDP + Flash Attention 2 for memory-efficient attention on long sequences
FSDP gives you strong multi-GPU memory efficiency without adding a third-party dependency to maintain.
FSDP config for Llama 3 70B
```yaml
# accelerate_config.yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
num_processes: 8
mixed_precision: bf16
```
FSDP memory requirements at a glance
| Model | Precision | Total VRAM | GPUs needed (H100) | GPUs needed (H200) |
|---|---|---|---|---|
| Llama 3 8B | BF16 | ~16 GB | 1 | 1 |
| Llama 3 70B | BF16 | ~140 GB | 2–4 | 1–2 |
| Llama 3 405B | BF16 | ~810 GB | 12+ | 8 |
VRAM estimates are for model weights only. Training requires additional memory for gradients, optimizer states, and activations. Use gradient checkpointing to reduce memory usage for activations.
Saturn Cloud supports multi-node H100 and H200 clusters with NVLink. FSDP's AllGather operations benefit directly from NVLink's 900 GB/s bandwidth, significantly reducing the communication overhead that otherwise limits FSDP scaling efficiency.
DeepSpeed: maximum memory efficiency at the cost of complexity
DeepSpeed is a training library from Microsoft that implements the ZeRO (Zero Redundancy Optimizer) family of memory optimizations. Where FSDP shards weights, gradients, and optimizer states across GPUs, DeepSpeed ZeRO goes further by optionally offloading them to CPU or NVMe storage, allowing you to train models that won't fit in GPU memory at all.
The three ZeRO stages
| Stage | What it shards | Memory savings vs baseline |
|---|---|---|
| ZeRO-1 | Optimizer states only | Up to ~4x total memory reduction (optimizer states dominate) |
| ZeRO-2 | Optimizer states + gradients | ~8x reduction vs baseline DDP |
| ZeRO-3 | Optimizer states + gradients + model weights | Linear reduction with GPU count; near-unlimited with CPU offload |
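These stages follow from the standard mixed-precision Adam accounting: 2 bytes of weights + 2 bytes of gradients + 12 bytes of optimizer states per parameter, with each stage sharding more of that total across GPUs. A rough per-GPU estimator (an illustrative sketch following the ZeRO paper's accounting; activations and communication buffers are excluded):

```python
def zero_per_gpu_gb(num_params, num_gpus, stage):
    """Approximate per-GPU training-state memory (GB) under ZeRO with
    mixed-precision Adam: 2 B weights + 2 B grads + 12 B optimizer
    states per parameter."""
    weights, grads, optim = 2, 2, 12
    if stage == 0:    # baseline DDP: everything replicated on every GPU
        per_param = weights + grads + optim
    elif stage == 1:  # shard optimizer states only
        per_param = weights + grads + optim / num_gpus
    elif stage == 2:  # shard optimizer states + gradients
        per_param = weights + (grads + optim) / num_gpus
    else:             # stage 3: shard everything
        per_param = (weights + grads + optim) / num_gpus
    return num_params * per_param / 1e9

# 7B parameters across 8 GPUs
print(zero_per_gpu_gb(7e9, 8, 0))  # 112.0 -- baseline: impossible on one H100
print(zero_per_gpu_gb(7e9, 8, 1))  # 38.5
print(zero_per_gpu_gb(7e9, 8, 3))  # 14.0
```

This is why even a 7B full fine-tune with Adam blows past a single 80 GB GPU at baseline, while ZeRO-3 brings the training state down to a manageable shard per device.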
ZeRO offloading: training beyond GPU memory limits
DeepSpeed's most powerful feature is CPU and NVMe offloading. With ZeRO-3 + CPU offload enabled, optimizer states and gradients are kept in CPU RAM during the optimizer step and streamed back to the GPU as needed. This lets you train models substantially larger than your total GPU VRAM, at the cost of slower training due to PCIe transfer latency.
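The slowdown is easy to estimate: offloaded state has to cross PCIe on every optimizer step. A back-of-envelope comparison (the 32 GB/s and 900 GB/s figures are nominal link speeds for PCIe Gen4 x16 and NVLink 4.0, not measured throughput):

```python
def transfer_seconds(num_params, bytes_per_param, bandwidth_gb_s):
    """Seconds to move a training-state tensor across an interconnect."""
    return num_params * bytes_per_param / (bandwidth_gb_s * 1e9)

# Adam optimizer states for a 7B model: ~12 bytes/param = 84 GB
print(transfer_seconds(7e9, 12, 32))   # 2.625   -- over PCIe Gen4 x16
print(transfer_seconds(7e9, 12, 900))  # ~0.093  -- over NVLink 4.0
```

Seconds of PCIe traffic per optimizer step, every step, is where the 20–40% throughput penalty quoted below comes from.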
When DeepSpeed is the right choice
Reach for DeepSpeed ZeRO-3 when:
- You're training 70B+ models, and FSDP still runs out of memory
- You need CPU offloading to handle optimizer states that won't fit on the GPU
- You're pre-training from scratch, where communication overhead is tolerable
- Your team already has a DeepSpeed configuration from a prior project
When a model's training state exceeds the combined VRAM of your cluster, ZeRO-3 with offloading is effectively the only option.
When DeepSpeed is not worth the overhead
Skip DeepSpeed when:
- You're fine-tuning with LoRA or QLoRA; FSDP or DDP handles this more simply
- Your model fits comfortably with FSDP FULL_SHARD, so there's no need for CPU offload
- You're iterating quickly and don't want to maintain a separate ds_config.json
- You're using HuggingFace TRL, whose FSDP integration is more stable and better documented
In these cases DeepSpeed adds configuration complexity without a payoff, and pure-GPU FSDP trains faster.
DeepSpeed ZeRO-3 config
```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6,
    "sub_group_size": 1e9
  },
  "bf16": { "enabled": true },
  "gradient_clipping": 1.0
}
```
CPU offload trades memory for speed. Expect 20–40% slower training throughput with ZeRO-3 + CPU offload compared to pure GPU FSDP. Use it when you have no other option, not as a default.
Direct comparison: FSDP vs DeepSpeed ZeRO-3
For the workload most teams are actually running (full fine-tuning of 70B models across 4–8 H100s or H200s), FSDP and DeepSpeed ZeRO-3 are the two realistic options. Here's how they differ in practice.
| | FSDP | DeepSpeed ZeRO-3 |
|---|---|---|
| Memory efficiency | High: full sharding of weights, gradients, optimizer states | Highest: adds CPU/NVMe offload option |
| Training throughput | Higher: no CPU transfer overhead | Lower with CPU offload (20–40% slower) |
| Setup complexity | Moderate: accelerate config YAML | Higher: ds_config.json + launcher integration |
| HuggingFace integration | First-class (Accelerate, TRL, Trainer) | Supported but more brittle; known compatibility issues with newer models |
| Multi-node scaling | Good; HYBRID_SHARD helps with inter-node bandwidth | Good: communication-compute overlap built in |
| Flash Attention 2 support | Native; no extra config | Supported but requires manual setup in some configs |
| Best for | 70B fine-tuning on 4–8 H100s / 2–4 H200s | 100B+ pre-training; when FSDP runs OOM and CPU offload is acceptable |
How to decide: a practical decision tree
Step 1: Does the model fit on one GPU?
If yes: use DDP. If no, continue to step 2.
Step 2: Are you using LoRA or QLoRA?
If yes: use DDP or FSDP SHARD_GRAD_OP. Adapter weights are small, so you don't need full-weight sharding. If no (full fine-tuning), continue to step 3.
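To see why adapters sidestep the memory problem, count the trainable parameters LoRA adds: two low-rank factors per adapted matrix. A hypothetical helper (the 4096x4096 projection size is illustrative of a ~7B model's attention layers):

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one weight matrix:
    factor A is (d_in x rank), factor B is (rank x d_out)."""
    return rank * (d_in + d_out)

full = 4096 * 4096                       # one full-rank projection matrix
adapter = lora_params(4096, 4096, 16)    # rank-16 LoRA on the same matrix
print(adapter, full, adapter / full)     # 131072 16777216 0.0078125
```

At rank 16 the adapter is under 1% of the matrix it adapts, which is why LoRA gradients and optimizer states fit easily even when the frozen base model is large.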
Step 3: Does FSDP FULL_SHARD fit within your GPU VRAM budget?
If yes: use FSDP. It's simpler than DeepSpeed and has better HuggingFace integration. If no, continue to step 4.
Step 4: Can you add more GPUs to make FSDP work?
If yes: scale your Saturn Cloud cluster and use FSDP FULL_SHARD + HYBRID_SHARD across nodes. If no (hard GPU budget constraint), use DeepSpeed ZeRO-3 with CPU offload.
Step 5: Are you pre-training from scratch at 100B+ scale?
Use DeepSpeed ZeRO-3 or consider Megatron-LM with tensor + pipeline parallelism for maximum scaling efficiency.
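The five steps above collapse into a small function, useful as a checklist (the argument names and return strings are illustrative, not a real API):

```python
def pick_strategy(fits_on_one_gpu, using_peft, fsdp_fits, can_add_gpus,
                  pretraining_at_scale=False):
    """Encode the decision tree above: check each constraint in order."""
    if pretraining_at_scale:                       # step 5
        return "DeepSpeed ZeRO-3 or Megatron-LM"
    if fits_on_one_gpu:                            # step 1
        return "DDP"
    if using_peft:                                 # step 2
        return "DDP or FSDP SHARD_GRAD_OP"
    if fsdp_fits:                                  # step 3
        return "FSDP FULL_SHARD"
    if can_add_gpus:                               # step 4
        return "FSDP FULL_SHARD + HYBRID_SHARD (more GPUs)"
    return "DeepSpeed ZeRO-3 with CPU offload"

# Full fine-tune of a multi-GPU model that fits under FSDP sharding:
print(pick_strategy(False, False, True, True))  # FSDP FULL_SHARD
```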
Running distributed training on Saturn Cloud
Saturn Cloud supports multi-node H100 and H200 GPU clusters with NVLink for DDP, FSDP, and DeepSpeed workloads. You provision the cluster from the dashboard, and no manual node configuration is required.
Recommended configurations by training strategy
| Strategy | Model size | Saturn Cloud setup | Est. cost/hr |
|---|---|---|---|
| DDP | Up to 7B | 1–2x H100 instance | $2.95–$5.90 |
| FSDP FULL_SHARD | 7B–70B | 4x or 8x H100 multi-GPU | $11.80–$23.60 |
| FSDP FULL_SHARD | 70B (fewer GPUs) | 2x H200 (141 GB each) | $5.90 |
| DeepSpeed ZeRO-3 | 70B+ full precision | 4–8x H200 multi-node | $11.80–$23.60 |
Based on Saturn Cloud's published rate of $2.95/hr per H100 or H200 GPU. Actual throughput and total job cost vary by model architecture, dataset size, and sequence length.
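Estimating a job's total cost at that flat per-GPU rate is then a one-liner (illustrative helper; actual cost depends on how long the job runs):

```python
def job_cost(num_gpus, hours, rate_per_gpu_hr=2.95):
    """Estimated cluster cost in dollars at a flat per-GPU hourly rate."""
    return round(num_gpus * rate_per_gpu_hr * hours, 2)

print(job_cost(8, 1))   # 23.6  -- 8x H100 for one hour
print(job_cost(4, 12))  # 141.6 -- a 4-GPU FSDP job running 12 hours
```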
Saturn Cloud handles the infrastructure side:
- H100 and H200 SXM instances include NVLink 4.0 (900 GB/s), critical for efficient FSDP AllGather operations
- Multi-node clusters are provisioned directly from the Saturn Cloud dashboard
- Jupyter and VS Code environments are available on every resource for interactive debugging before scaling
- Run on AWS, GCP, Azure, Nebius, or Crusoe with no vendor lock-in
Together, these features shorten the path from local prototyping to distributed training.
For most fine-tuning work on models ranging from 7B to 70B, FSDP is the default. It's native PyTorch, integrates cleanly with HuggingFace, and delivers the memory efficiency you need without the configuration overhead of DeepSpeed. DDP handles anything smaller. DeepSpeed ZeRO-3 earns its complexity only when you're pushing the boundaries of what fits in GPU memory at all.
The decision doesn't need to be permanent. Saturn Cloud lets you switch GPU configurations without reprovisioning your environment: run a QLoRA job on a single H100, then scale to an 8-GPU FSDP cluster when you're ready for full fine-tuning. Get started on Saturn Cloud →