Fine-tuning Llama 3 is one of the most common workloads on GPU cloud platforms today. Whether you’re adapting Llama 3 8B for a domain-specific use case or running full fine-tuning on the 70B variant, the setup decisions you make before training starts, like the GPU selection, parallelism strategy, and quantization approach, have a larger impact on your total cost and iteration speed than almost anything else.
This guide covers everything you need to fine-tune Llama 3 on Saturn Cloud, including which GPU to use for which job, how to choose between LoRA, QLoRA, and FSDP, and how to get your first run off the ground quickly.
Which Llama 3 model are you fine-tuning?
The Llama 3 family (including the 3.1 architecture update) comes in three primary sizes: 8B, 70B, and 405B. The optimal GPU and training strategy depend almost entirely on which parameter tier you’re targeting.
| Model | Parameters | Recommended GPU | Recommended Approach |
|---|---|---|---|
| Llama 3 8B | 8 billion | H100 (single GPU) | QLoRA (4-bit) |
| Llama 3 70B | 70 billion | H100 or H200 (multi-GPU) | QLoRA or LoRA; FSDP for full fine-tune |
| Llama 3 405B | 405 billion | H200 multi-node (8+ GPUs) | FSDP with FP8 or BF16 |
H100 or H200: which GPU should you use?
The H200 is not strictly better than the H100 for fine-tuning. Which GPU makes sense depends on your model size and whether memory is your actual bottleneck.
Use an H100 when:
- You’re fine-tuning Llama 3 8B or 70B with QLoRA or LoRA
- Your model fits within 80 GB of VRAM at your target precision
- You want the best cost-per-run on standard fine-tuning workloads
- You’re running frequent short experiments, and iteration speed matters more than peak throughput
With QLoRA, a 70B model is quantized to 4 bits, reducing VRAM requirements to roughly 35–40 GB – well within an H100’s 80 GB. In this case, the H200’s extra memory is unused headroom. The H100 completes the same job at a lower hourly rate.
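The 4-bit arithmetic above can be sketched as a back-of-envelope estimate. The 0.5 bytes/param and fixed overhead figures below are illustrative assumptions (NF4 weights plus adapters, optimizer state, and activations), not measured values:

```python
def estimate_qlora_vram_gb(num_params: float, bytes_per_param: float = 0.5,
                           overhead_gb: float = 4.0) -> float:
    """Rough VRAM estimate for a 4-bit QLoRA base model.

    bytes_per_param ~0.5 covers NF4-quantized weights; overhead_gb is an
    assumed allowance for LoRA adapters, optimizer state, and activations.
    """
    return num_params * bytes_per_param / 1e9 + overhead_gb

# Llama 3 70B at 4-bit: ~39 GB, comfortably within an H100's 80 GB
print(round(estimate_qlora_vram_gb(70e9), 1))
# Llama 3 8B at 4-bit: ~8 GB
print(round(estimate_qlora_vram_gb(8e9), 1))
```

Real usage varies with sequence length and batch size, so treat this as a sanity check rather than a capacity plan.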
Use an H200 when:
- You’re fine-tuning Llama 3 70B at full precision (FP16 or BF16)
- You’re working with Llama 3 405B in any configuration
- You need to reduce the GPU count on multi-node jobs (fewer nodes = less communication overhead)
- Your workload involves long context windows, where the KV cache memory becomes a constraint
The H200 carries 141 GB of HBM3e memory, which is 76% more than the H100’s 80 GB, and 4.8 TB/s of memory bandwidth. For workloads that would require 12 H100s, the H200 can handle the same job on 8 GPUs. Fewer GPUs means simpler orchestration and less inter-GPU communication latency. Both H100 and H200 are available on Saturn Cloud.
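The GPU-count comparison comes down to a ceiling division over total memory. A minimal sketch, ignoring activation headroom and communication buffers (which push real counts higher):

```python
import math

def min_gpus(total_vram_gb: float, per_gpu_gb: float) -> int:
    """Smallest GPU count whose combined VRAM covers the requirement.

    Hypothetical capacity-only calculation: real jobs also need room
    for activations, KV cache, and framework buffers.
    """
    return math.ceil(total_vram_gb / per_gpu_gb)

# A hypothetical job needing ~900 GB of sharded model state:
print(min_gpus(900, 80))   # H100, 80 GB each
print(min_gpus(900, 141))  # H200, 141 GB each
```

In practice you would pad the requirement and round up to a convenient node size (multiples of 8), which is why H200 counts often land at 8 rather than the bare minimum.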
Saturn Cloud tip: Both GPUs use the same Hopper architecture, so your existing PyTorch and HuggingFace code runs on either without changes.
LoRA, QLoRA, or FSDP: which training approach to use
QLoRA (4-bit quantization + LoRA)
QLoRA is the right default for most Llama 3 fine-tuning jobs. It quantizes the base model to 4-bit, then trains only the low-rank adapter weights. This dramatically reduces VRAM requirements without a meaningful accuracy penalty on most tasks.
- Best for: Llama 3 8B and 70B on a single H100 or H200
- VRAM: ~8–10 GB for 8B; ~35–40 GB for 70B at 4-bit
- Libraries: Unsloth, bitsandbytes + PEFT, or HuggingFace TRL
- Tradeoff: Slightly slower than full fine-tuning but usable on a single GPU
This aggressive reduction in memory footprint makes QLoRA the most cost-effective path for rapid, single-GPU fine-tuning runs.
LoRA (full precision + low-rank adapters)
LoRA trains in FP16 or BF16 without base model quantization. It requires more VRAM than QLoRA but produces cleaner gradients and is better suited for tasks where output quality is critical.
- Best for: Llama 3 8B on H100; Llama 3 70B on H200 (or multi-H100)
- VRAM: ~24–32 GB for 8B at FP16; ~140 GB for 70B at FP16 – requires H200 or multi-GPU
- Libraries: HuggingFace PEFT, axolotl
Standard LoRA strikes a balance: it preserves full base-model precision while avoiding the multi-node overhead of full fine-tuning.
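Why are low-rank adapters so cheap to train? Each targeted weight matrix gets two small factors, A (r × d_in) and B (d_out × r), so the trainable-parameter count grows linearly with rank rather than with the matrix size. A quick sketch (the 4096 hidden size matches Llama 3 8B; the per-matrix arithmetic is general):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one weight matrix:
    A is (r x d_in) and B is (d_out x r), so r * (d_in + d_out) total."""
    return r * (d_in + d_out)

# One 4096x4096 attention projection at r=16:
print(lora_params(4096, 4096, 16))  # 131072 trainable vs ~16.8M frozen
```

At r=16, the adapter for this matrix is under 1% of the frozen weights it modifies, which is why both LoRA and QLoRA fit where full fine-tuning cannot.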
FSDP (Fully Sharded Data Parallel)
FSDP shards model weights, gradients, and optimizer states across GPUs. It’s the standard approach for full fine-tuning of 70B and 405B models across multiple nodes.
- Best for: Llama 3 70B full fine-tuning (4+ GPUs); Llama 3 405B (8+ H200s)
- Requires: Multi-GPU setup; NVLink for efficient intra-node communication, and InfiniBand/RoCE for inter-node scaling
- Libraries: PyTorch FSDP (native), often combined with Flash Attention 2
- Tradeoff: More infrastructure setup than QLoRA, but necessary for full-precision large model training
Sharding state this way keeps per-GPU memory manageable, at the cost of extra inter-GPU communication – which is why fast interconnects matter for FSDP throughput.
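The memory math behind full sharding is worth seeing concretely. Assuming BF16 weights and gradients (2 + 2 bytes/param) plus FP32 Adam state (4 + 4 + 4 bytes/param) – roughly 16 bytes per parameter before activations – FSDP divides that total evenly across GPUs. The figures below are illustrative; CPU offload (fsdp_offload_params) or different optimizer choices change them:

```python
def fsdp_per_gpu_gb(num_params: float, num_gpus: int,
                    bytes_per_param: float = 16.0) -> float:
    """Per-GPU memory for fully sharded training state.

    ~16 bytes/param assumes BF16 weights + gradients (2 + 2) and FP32
    Adam state (4 + 4 + 4). Activations are extra and not counted here.
    """
    return num_params * bytes_per_param / num_gpus / 1e9

# Llama 3 70B full fine-tune: ~1120 GB of total training state
print(round(fsdp_per_gpu_gb(70e9, 8), 1))   # per GPU on 8 GPUs
print(round(fsdp_per_gpu_gb(70e9, 16), 1))  # per GPU on 16 GPUs
```

At 8 GPUs the sharded state alone approaches an H200’s 141 GB, which is why full fine-tuning at small GPU counts leans on CPU offload or gradient checkpointing.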
Fine-tuning Llama 3 8B on Saturn Cloud: step by step
The following walkthrough uses QLoRA with Unsloth on a single H100 on Saturn Cloud. This setup works well for domain adaptation, instruction tuning, and classification tasks on Llama 3 8B and 3.1 8B.
1. Spin up a GPU resource
From the Saturn Cloud dashboard, go to Resources → New Python Server. Select your H100 instance, set disk space, and add your pip dependencies. For QLoRA with Unsloth:
unsloth[colab-new]
trl
peft
accelerate
bitsandbytes
datasets
2. Load the model in 4-bit
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
max_seq_length = 2048,
load_in_4bit = True,
)
3. Attach LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha = 16,
lora_dropout = 0,
bias = "none",
use_gradient_checkpointing = "unsloth",
)
4. Run training with TRL SFTTrainer
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
dataset_text_field = "text",
max_seq_length = 2048,
args = TrainingArguments(
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
num_train_epochs = 3,
learning_rate = 2e-4,
fp16 = True,
output_dir = "/outputs",
),
)
trainer.train()
Saturn Cloud mounts /outputs automatically. Checkpoints saved here persist after the job ends.
Fine-tuning Llama 3 70B: multi-GPU setup with FSDP
For full fine-tuning of Llama 3 70B, you’ll need at least 4 H100s or 2 H200s. Saturn Cloud supports multi-node GPU clusters – you can provision and configure these directly from the resource dashboard.
FSDP config for 70B
fsdp: "full_shard auto_wrap"
fsdp_config:
fsdp_offload_params: false
fsdp_state_dict_type: FULL_STATE_DICT
fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
fsdp_backward_prefetch_policy: BACKWARD_PRE
fsdp_cpu_ram_efficient_loading: true
Enable Flash Attention 2 for significant throughput gains on H100/H200. Add attn_implementation="flash_attention_2" to your model config. This is supported natively in HuggingFace Transformers for Llama architectures.
Cost estimates
These are approximate figures based on Saturn Cloud’s published pricing at $2.95/hr for H100 and H200 instances. Actual training time varies by dataset size, batch size, and number of epochs.
| Job | GPU Config | Approx. Time | Est. Cost | Approach |
|---|---|---|---|---|
| Llama 3 8B fine-tune (10K samples) | 1x H100 | 1–2 hours | $3–6 | QLoRA |
| Llama 3 70B fine-tune (10K samples) | 4x H100 | 4–8 hours | $47–94 | QLoRA |
| Llama 3 70B full fine-tune (10K samples) | 4x H100 or 2x H200 | 8–14 hours | $94–164 | FSDP |
Estimates assume standard training configuration. Run a short profiling job first to calibrate your actual throughput.
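The table rows reduce to a simple product of wall-clock hours, GPU count, and the $2.95/hr rate quoted above – useful for calibrating after a short profiling run:

```python
def run_cost(hours: float, gpus: int, rate_per_gpu_hr: float = 2.95) -> float:
    """Estimated job cost: wall-clock hours x GPU count x hourly rate."""
    return hours * gpus * rate_per_gpu_hr

# 4x H100 QLoRA run from the table, at the 4- and 8-hour bounds:
print(round(run_cost(4, 4), 2))  # 47.2
print(round(run_cost(8, 4), 2))  # 94.4
```

Once a profiling job gives you samples/second, substitute your measured hours into the same formula for a per-epoch cost.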
Running on Saturn Cloud
Saturn Cloud provides instant access to H100, H200, and B200 GPU instances without reservation queues. You can provision a single-GPU fine-tuning environment or a multi-node FSDP cluster from the same dashboard. There’s no infrastructure setup; Python environments, dependencies, and storage are managed for you.
- No reservation queues – GPU instances spin up on demand
- Persistent storage mounted automatically at /outputs
- Multi-node cluster support for FSDP training across H100 or H200 nodes
- Run on AWS, GCP, Azure, Nebius, or Crusoe – your account, your cloud
- Jupyter and VS Code environments are available on every resource
Removing that infrastructure layer lets your engineering team focus on dataset curation and hyperparameter tuning.
For most Llama 3 fine-tuning jobs, the practical answer is: start with an H100 and QLoRA. It’s fast to set up, cost-effective, and produces strong results on 8B and 70B models. Move to H200 or multi-node FSDP when memory becomes the constraint, typically when you’re running full-precision training on 70B or above.
Saturn Cloud now offers H100, H200, B200, and B300 instances. Start a fine-tuning job on Saturn Cloud →