Fine-tuning Llama 3 is one of the most common workloads on GPU cloud platforms today. Whether you’re adapting Llama 3 8B for a domain-specific use case or running full fine-tuning on the 70B variant, the setup decisions you make before training starts, like the GPU selection, parallelism strategy, and quantization approach, have a larger impact on your total cost and iteration speed than almost anything else.
This guide covers everything you need to fine-tune Llama 3 on Saturn Cloud, including which GPU to use for which job, how to choose between LoRA, QLoRA, and FSDP, and how to get your first run off the ground quickly.
Which Llama 3 model are you fine-tuning?
The Llama 3 family (including the 3.1 architecture update) comes in three primary sizes: 8B, 70B, and 405B. The optimal GPU and training strategy depend almost entirely on which parameter tier you’re targeting.
| Model | Parameters | Recommended GPU | Recommended Approach |
|---|---|---|---|
| Llama 3 8B | 8 billion | H100 (single GPU) | QLoRA (4-bit) |
| Llama 3 70B | 70 billion | H100 or H200 (multi-GPU) | QLoRA or LoRA; FSDP for full fine-tune |
| Llama 3 405B | 405 billion | H200 multi-node (8+ GPUs) | FSDP with FP8 or BF16 |
H100 or H200: which GPU should you use?
The H200 is not strictly better than the H100 for fine-tuning. Which GPU makes sense depends on your model size and whether memory is your actual bottleneck.
Use an H100 when:
- You’re fine-tuning Llama 3 8B or 70B with QLoRA or LoRA
- Your model fits within 80 GB of VRAM at your target precision
- You want the best cost-per-run on standard fine-tuning workloads
- You’re running frequent short experiments, and iteration speed matters more than peak throughput
With QLoRA, a 70B model is quantized to 4 bits, reducing VRAM requirements to roughly 35–40 GB – well within an H100’s 80 GB. In this case, the H200’s extra memory is unused headroom. The H100 completes the same job at a lower hourly rate.
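The 4-bit arithmetic above can be sketched as a back-of-envelope estimate. The 0.5 bytes/param and fixed overhead figures below are illustrative assumptions (NF4 weights plus adapters, optimizer state, and activations), not measured values:

```python
def estimate_qlora_vram_gb(num_params: float, bytes_per_param: float = 0.5,
                           overhead_gb: float = 4.0) -> float:
    """Rough VRAM estimate for a 4-bit QLoRA base model.

    bytes_per_param ~0.5 covers NF4-quantized weights; overhead_gb is an
    assumed allowance for LoRA adapters, optimizer state, and activations.
    """
    return num_params * bytes_per_param / 1e9 + overhead_gb

# Llama 3 70B at 4-bit: ~39 GB, comfortably within an H100's 80 GB
print(round(estimate_qlora_vram_gb(70e9), 1))
# Llama 3 8B at 4-bit: ~8 GB
print(round(estimate_qlora_vram_gb(8e9), 1))
```

Real usage varies with sequence length and batch size, so treat this as a sanity check rather than a capacity plan.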
Use an H200 when:
- You’re fine-tuning Llama 3 70B at full precision (FP16 or BF16)
- You’re working with Llama 3 405B in any configuration
- You need to reduce the GPU count on multi-node jobs (fewer nodes = less communication overhead)
- Your workload involves long context windows, where the KV cache memory becomes a constraint
The H200 carries 141 GB of HBM3e memory, which is 76% more than the H100’s 80 GB, and 4.8 TB/s of memory bandwidth. For workloads that would require 12 H100s, the H200 can handle the same job on 8 GPUs. Fewer GPUs means simpler orchestration and less inter-GPU communication latency. Both H100 and H200 are available on Saturn Cloud.
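The GPU-count comparison comes down to a ceiling division over total memory. A minimal sketch, ignoring activation headroom and communication buffers (which push real counts higher):

```python
import math

def min_gpus(total_vram_gb: float, per_gpu_gb: float) -> int:
    """Smallest GPU count whose combined VRAM covers the requirement.

    Hypothetical capacity-only calculation: real jobs also need room
    for activations, KV cache, and framework buffers.
    """
    return math.ceil(total_vram_gb / per_gpu_gb)

# A hypothetical job needing ~900 GB of sharded model state:
print(min_gpus(900, 80))   # H100, 80 GB each
print(min_gpus(900, 141))  # H200, 141 GB each
```

In practice you would pad the requirement and round up to a convenient node size (multiples of 8), which is why H200 counts often land at 8 rather than the bare minimum.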
Saturn Cloud tip: Both GPUs use the same Hopper architecture, so your existing PyTorch and HuggingFace code runs on either without changes.
LoRA, QLoRA, or FSDP: which training approach to use
QLoRA (4-bit quantization + LoRA)
QLoRA is the right default for most Llama 3 fine-tuning jobs. It quantizes the base model to 4-bit, then trains only the low-rank adapter weights. This dramatically reduces VRAM requirements without a meaningful accuracy penalty on most tasks.
- Best for: Llama 3 8B and 70B on a single H100 or H200
- VRAM: ~8–10 GB for 8B; ~35–40 GB for 70B at 4-bit
- Libraries: Unsloth, bitsandbytes + PEFT, or HuggingFace TRL
- Tradeoff: Slightly slower than full fine-tuning but usable on a single GPU
This aggressive reduction in memory footprint makes QLoRA the most cost-effective path for rapid, single-GPU fine-tuning runs.
LoRA (full precision + low-rank adapters)
LoRA trains in FP16 or BF16 without base model quantization. It requires more VRAM than QLoRA but produces cleaner gradients and is better suited for tasks where output quality is critical.
- Best for: Llama 3 8B on H100; Llama 3 70B on H200 (or multi-H100)
- VRAM: ~24–32 GB for 8B at FP16; ~140 GB for 70B at FP16 – requires H200 or multi-GPU
- Libraries: HuggingFace PEFT, axolotl
Standard LoRA strikes a balance: it preserves full base-model precision while avoiding the multi-node overhead of full fine-tuning.
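Why are low-rank adapters so cheap to train? Each targeted weight matrix gets two small factors, A (r × d_in) and B (d_out × r), so the trainable-parameter count grows linearly with rank rather than with the matrix size. A quick sketch (the 4096 hidden size matches Llama 3 8B; the per-matrix arithmetic is general):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one weight matrix:
    A is (r x d_in) and B is (d_out x r), so r * (d_in + d_out) total."""
    return r * (d_in + d_out)

# One 4096x4096 attention projection at r=16:
print(lora_params(4096, 4096, 16))  # 131072 trainable vs ~16.8M frozen
```

At r=16, the adapter for this matrix is under 1% of the frozen weights it modifies, which is why both LoRA and QLoRA fit where full fine-tuning cannot.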
FSDP (Fully Sharded Data Parallel)
FSDP shards model weights, gradients, and optimizer states across GPUs. It’s the standard approach for full fine-tuning of 70B and 405B models across multiple nodes.
- Best for: Llama 3 70B full fine-tuning (4+ GPUs); Llama 3 405B (8+ H200s)
- Requires: Multi-GPU setup; NVLink for efficient intra-node communication, and InfiniBand/RoCE for inter-node scaling
- Libraries: PyTorch FSDP (native), often combined with Flash Attention 2
- Tradeoff: More infrastructure setup than QLoRA, but necessary for full-precision large model training
Sharding state this way keeps per-GPU memory manageable, at the cost of extra inter-GPU communication – which is why fast interconnects matter for FSDP throughput.
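The memory math behind full sharding is worth seeing concretely. Assuming BF16 weights and gradients (2 + 2 bytes/param) plus FP32 Adam state (4 + 4 + 4 bytes/param) – roughly 16 bytes per parameter before activations – FSDP divides that total evenly across GPUs. The figures below are illustrative; CPU offload (fsdp_offload_params) or different optimizer choices change them:

```python
def fsdp_per_gpu_gb(num_params: float, num_gpus: int,
                    bytes_per_param: float = 16.0) -> float:
    """Per-GPU memory for fully sharded training state.

    ~16 bytes/param assumes BF16 weights + gradients (2 + 2) and FP32
    Adam state (4 + 4 + 4). Activations are extra and not counted here.
    """
    return num_params * bytes_per_param / num_gpus / 1e9

# Llama 3 70B full fine-tune: ~1120 GB of total training state
print(round(fsdp_per_gpu_gb(70e9, 8), 1))   # per GPU on 8 GPUs
print(round(fsdp_per_gpu_gb(70e9, 16), 1))  # per GPU on 16 GPUs
```

At 8 GPUs the sharded state alone approaches an H200’s 141 GB, which is why full fine-tuning at small GPU counts leans on CPU offload or gradient checkpointing.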
Fine-tuning Llama 3 8B on Saturn Cloud: step by step
The following walkthrough uses QLoRA with Unsloth on a single H100 on Saturn Cloud. This setup works well for domain adaptation, instruction tuning, and classification tasks on Llama 3 8B and 3.1 8B.
1. Spin up a GPU resource
From the Saturn Cloud dashboard, go to Resources → New Python Server. Select your H100 instance, set disk space, and add your pip dependencies. For QLoRA with Unsloth:
unsloth[colab-new]
trl
peft
accelerate
bitsandbytes
datasets
2. Load the model in 4-bit
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
max_seq_length = 2048,
load_in_4bit = True,
)
3. Attach LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha = 16,
lora_dropout = 0,
bias = "none",
use_gradient_checkpointing = "unsloth",
)
4. Run training with TRL SFTTrainer
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
dataset_text_field = "text",
max_seq_length = 2048,
args = TrainingArguments(
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
num_train_epochs = 3,
learning_rate = 2e-4,
fp16 = True,
output_dir = "/outputs",
),
)
trainer.train()
Saturn Cloud mounts /outputs automatically. Checkpoints saved here persist after the job ends.
Fine-tuning Llama 3 70B: multi-GPU setup with FSDP
For full fine-tuning of Llama 3 70B, you’ll need at least 4 H100s or 2 H200s. Saturn Cloud supports multi-node GPU clusters – you can provision and configure these directly from the resource dashboard.
FSDP config for 70B
fsdp: "full_shard auto_wrap"
fsdp_config:
fsdp_offload_params: false
fsdp_state_dict_type: FULL_STATE_DICT
fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
fsdp_backward_prefetch_policy: BACKWARD_PRE
fsdp_cpu_ram_efficient_loading: true
Enable Flash Attention 2 for significant throughput gains on H100/H200. Add attn_implementation="flash_attention_2" to your model config. This is supported natively in HuggingFace Transformers for Llama architectures.
Cost estimates
These are approximate figures based on Saturn Cloud’s published pricing at $2.95/hr for H100 and H200 instances. Actual training time varies by dataset size, batch size, and number of epochs.
| Job | GPU Config | Approx. Time | Est. Cost | Approach |
|---|---|---|---|---|
| Llama 3 8B fine-tune (10K samples) | 1x H100 | 1–2 hours | $3–6 | QLoRA |
| Llama 3 70B fine-tune (10K samples) | 4x H100 | 4–8 hours | $47–94 | QLoRA |
| Llama 3 70B full fine-tune (10K samples) | 4x H100 or 2x H200 | 8–14 hours | $94–164 | FSDP |
Estimates assume standard training configuration. Run a short profiling job first to calibrate your actual throughput.
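The table rows reduce to a simple product of wall-clock hours, GPU count, and the $2.95/hr rate quoted above – useful for calibrating after a short profiling run:

```python
def run_cost(hours: float, gpus: int, rate_per_gpu_hr: float = 2.95) -> float:
    """Estimated job cost: wall-clock hours x GPU count x hourly rate."""
    return hours * gpus * rate_per_gpu_hr

# 4x H100 QLoRA run from the table, at the 4- and 8-hour bounds:
print(round(run_cost(4, 4), 2))  # 47.2
print(round(run_cost(8, 4), 2))  # 94.4
```

Once a profiling job gives you samples/second, substitute your measured hours into the same formula for a per-epoch cost.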
Running on Saturn Cloud
Saturn Cloud provides instant access to H100, H200, and B200 GPU instances without reservation queues. You can provision a single-GPU fine-tuning environment or a multi-node FSDP cluster from the same dashboard. There’s no infrastructure setup; Python environments, dependencies, and storage are managed for you.
- No reservation queues – GPU instances spin up on demand
- Persistent storage mounted automatically at /outputs
- Multi-node cluster support for FSDP training across H100 or H200 nodes
- Run on AWS, GCP, Azure, Nebius, or Crusoe – your account, your cloud
- Jupyter and VS Code environments are available on every resource
Removing that infrastructure layer lets your engineering team focus on dataset curation and hyperparameter tuning.
For most Llama 3 fine-tuning jobs, the practical answer is: start with an H100 and QLoRA. It’s fast to set up, cost-effective, and produces strong results on 8B and 70B models. Move to H200 or multi-node FSDP when memory becomes the constraint, typically when you’re running full-precision training on 70B or above.
Saturn Cloud now offers H100, H200, B200, and B300 instances. Start a fine-tuning job on Saturn Cloud →