📣 From $2.95/Hr H100, H200, B200s, and B300s: train, fine-tune, and scale ML models affordably, without having to DIY the infrastructure   📣 Run Saturn Cloud on AWS, GCP, Azure, Nebius, Crusoe, or on-prem.

Running NVIDIA NIM on Saturn Cloud

How to deploy NVIDIA NIM inference microservices on GPU infrastructure, including what NIM is, what it actually does to throughput, and a complete setup walkthrough on Saturn Cloud.


Deploying a large language model to production used to mean weeks of work: selecting an inference engine, writing custom serving code, tuning batching parameters, and benchmarking until latency was acceptable. NVIDIA NIM compresses most of that into a single container pull.

This guide covers what NVIDIA NIM actually is, what it does under the hood, how it performs on H100 infrastructure, and how to run it on Saturn Cloud – from pulling the container to serving your first request.

What is NVIDIA NIM?

NVIDIA NIM (NVIDIA Inference Microservices) is a set of pre-built, pre-optimized container images that package a foundation model with the inference engine, runtime dependencies, and a serving API, ready for deployment on any NVIDIA GPU infrastructure.

Each NIM container ships with the model, an optimized inference backend (TensorRT-LLM, vLLM, or SGLang depending on the model and hardware), and an OpenAI-compatible REST API. You pull the container, point it at your GPU, and it handles the rest: engine selection, quantization, batching configuration, and memory management.

What NIM handles automatically

  • Selects the best inference engine for your specific GPU at startup (TensorRT-LLM vs vLLM vs SGLang)
  • Downloads and caches the optimized model from NVIDIA’s NGC registry
  • Applies hardware-specific optimizations, including FP8 precision on H100, kernel fusion, and paged KV cache
  • Exposes an OpenAI-compatible REST API on port 8000, making it a drop-in replacement for existing applications
  • Scales from a single GPU to multi-GPU serving without code changes

What NIM is not

NIM is an inference serving solution, not a training or fine-tuning framework. It’s designed for teams deploying models to production endpoints, not for training runs, distributed fine-tuning, or experimentation workflows. For training, you still need FSDP, DeepSpeed, or a LoRA/QLoRA setup. NIM comes after that work is done.

Why NIM: throughput and latency numbers

The clearest reason to use NIM over a self-assembled inference stack is the performance gap. On a single H100 SXM serving Llama 3.1 8B at FP8 precision with 200 concurrent requests:

Configuration                      | Throughput (tokens/sec) | Inter-token latency
NIM ON (TensorRT-LLM, FP8)         | 1,201                   | 32 ms
NIM OFF (standard deployment, FP8) | 613                     | 37 ms
NIM improvement                    | ~2x                     | 14% lower

Source: NVIDIA published benchmark. Configuration: Llama 3.1 8B Instruct, 1x H100 SXM, 200 concurrent requests.

The throughput gap stems from TensorRT-LLM optimizations – continuous batching, paged attention, and CUDA graph capture – that NIM applies automatically, without manual configuration. A standard vLLM deployment gets you part of the way there; NIM’s TensorRT-LLM backend takes it further.
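The relative figures in the table can be sanity-checked directly from the published numbers:

```python
# Sanity-check the benchmark deltas from the table above
# (Llama 3.1 8B Instruct, 1x H100 SXM, 200 concurrent requests).
nim_tps, baseline_tps = 1201, 613        # tokens/sec, NIM on vs off
nim_itl_ms, baseline_itl_ms = 32, 37     # inter-token latency, ms

speedup = nim_tps / baseline_tps
latency_drop = (baseline_itl_ms - nim_itl_ms) / baseline_itl_ms

print(f"throughput speedup: {speedup:.2f}x")            # -> 1.96x, i.e. ~2x
print(f"inter-token latency reduction: {latency_drop:.0%}")  # -> 14%
```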

Supported models

NIM supports the models most enterprise AI teams are actually deploying. The catalog includes:

  • Llama family: Llama 3.1 8B, 70B, 405B; Llama 3.2 multimodal variants
  • Mistral variants: Mistral 7B, Mixtral 8x7B, Mixtral 8x22B
  • NVIDIA Nemotron: Nemotron-4 340B and domain-specific variants
  • DeepSeek-R1: Added as a preview microservice in January 2025
  • Custom fine-tuned models: Via NIM’s multi-LLM container, which supports LoRA adapters trained with HuggingFace PEFT or NVIDIA NeMo

NIM automatically selects the optimized engine version for your specific GPU at container startup. On H100 SXM, it defaults to TensorRT-LLM with FP8 quantization, which is the configuration that produced the benchmark numbers above.

Setting up NVIDIA NIM on Saturn Cloud

The following walkthrough deploys Llama 3.1 8B Instruct via NIM on a Saturn Cloud H100 instance. The same steps apply to any NIM-supported model.

Prerequisites

  • A Saturn Cloud H100, H200, or B200 instance (NIM requires an NVIDIA GPU with sufficient VRAM – 8B models need ~20 GB)
  • An NGC API key from the NVIDIA Developer Program (free to register at developer.nvidia.com)
  • Docker installed on your Saturn Cloud resource (available by default)
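The ~20 GB figure for 8B models follows from a back-of-the-envelope estimate: FP8 weights take roughly 1 byte per parameter, plus headroom for the KV cache, activations, and runtime. The overhead factor and KV-cache budget below are illustrative assumptions, not NIM's actual allocator:

```python
def estimate_vram_gb(params_billion: float,
                     bytes_per_param: float = 1.0,   # FP8 weights: ~1 byte/param
                     overhead_factor: float = 1.25,  # assumed margin for activations/runtime
                     kv_cache_gb: float = 10.0) -> float:
    """Rough VRAM estimate for serving an LLM: weights at the given
    precision, an overhead margin, and a fixed KV-cache budget."""
    weights_gb = params_billion * bytes_per_param
    return weights_gb * overhead_factor + kv_cache_gb

print(estimate_vram_gb(8))   # -> 20.0, in line with the ~20 GB guidance for 8B models
```

By the same arithmetic, a 70B model at FP8 exceeds a single H100's 80 GB, which is why larger models need tensor parallelism (covered below).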

Step 1: Authenticate with NGC

export NGC_API_KEY=<your-ngc-api-key>

echo $NGC_API_KEY | docker login nvcr.io \
  --username '$oauthtoken' \
  --password-stdin

Step 2: Set a local cache directory

NIM downloads optimized model artifacts on the first run and caches them locally. Point this at Saturn Cloud’s persistent storage so you don’t re-download on every container restart.

export LOCAL_NIM_CACHE=/outputs/nim-cache

mkdir -p $LOCAL_NIM_CACHE

Step 3: Pull and run the NIM container

export IMG_NAME=nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

docker run -it --rm \
  --name nim-llama \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

On first run, NIM inspects the available GPU, selects the optimal model version from the registry, and downloads it. On H100 SXM, this pulls the TensorRT-LLM FP8 engine. Expect 5–10 minutes on first launch; subsequent starts use the local cache and take under 30 seconds.

Step 4: Verify the container is serving

curl http://localhost:8000/v1/models

You should see a JSON response listing the loaded model. Once this returns, the endpoint is ready.
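In a deployment script you usually want to block until the endpoint answers rather than poll by hand. A minimal sketch, assuming the default port 8000; the probe is injectable so the wait loop can be exercised without a live server:

```python
import time
import urllib.request
import urllib.error

def wait_until_serving(url: str = "http://localhost:8000/v1/models",
                       timeout_s: float = 600.0,
                       interval_s: float = 5.0,
                       probe=None) -> bool:
    """Poll the NIM endpoint until it answers, or give up after timeout_s.

    `probe` defaults to an HTTP GET returning True on a 200 response;
    pass a custom callable to test the loop offline.
    """
    if probe is None:
        def probe(u):
            try:
                with urllib.request.urlopen(u, timeout=5) as resp:
                    return resp.status == 200
            except (urllib.error.URLError, OSError):
                return False
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval_s)
    return False

# wait_until_serving()  # first launch can take 5-10 minutes, so keep the timeout generous
```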

Step 5: Send an inference request

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
      {"role": "user", "content": "Explain attention mechanisms in transformers."}
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'

The API is OpenAI-compatible – any application that calls the OpenAI Python SDK or REST API works with NIM by changing the base URL to your Saturn Cloud endpoint.
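The same request as the curl example above, in Python using only the standard library. The payload builder is separated from the network call so the request shape can be checked without a running endpoint; the helper names here are illustrative, not part of any SDK:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str,
                       base_url: str = "http://localhost:8000",
                       max_tokens: int = 512,
                       temperature: float = 0.7):
    """Build the (url, payload) pair for an OpenAI-style chat completion."""
    url = f"{base_url}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return url, payload

def chat(model: str, prompt: str, **kwargs) -> str:
    """Send the request and return the assistant's reply text."""
    url, payload = build_chat_request(model, prompt, **kwargs)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# chat("meta/llama-3.1-8b-instruct", "Explain attention mechanisms in transformers.")
```

Applications already using the OpenAI Python SDK need only point `base_url` at the Saturn Cloud endpoint.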

Serving a custom fine-tuned model with NIM

If you’ve fine-tuned a model using LoRA or QLoRA on Saturn Cloud, NIM can serve your adapter weights alongside the base model without requiring a separate deployment pipeline.

Using LoRA adapters with NIM

NIM’s multi-LLM container supports LoRA adapters trained with HuggingFace PEFT or NVIDIA NeMo. You mount your adapter weights into the container, and NIM loads them on top of the base model at startup.

docker run -it --rm \
  --name nim-custom \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -v "/outputs/my-lora-adapter:/lora" \
  -e NIM_PEFT_SOURCE=/lora \
  -u $(id -u) \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

NIM loads the base model and applies the LoRA adapter at startup. The endpoint behaves identically to the standard NIM API – the adapter is applied transparently.
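In NIM's multi-LLM setup, a mounted adapter is typically listed in /v1/models under its own name (derived from the adapter directory), and you select it per request by changing only the model field. The adapter name below is hypothetical, standing in for whatever your /lora mount contains:

```python
# Hypothetical names: BASE_MODEL is from the container above; ADAPTER is an
# assumed adapter name matching the directory mounted at /lora.
BASE_MODEL = "meta/llama-3.1-8b-instruct"
ADAPTER = "my-lora-adapter"

def chat_payload(model: str, prompt: str) -> dict:
    """Same OpenAI-style payload either way; only `model` selects base vs adapter."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }

base_req = chat_payload(BASE_MODEL, "Summarize this ticket.")
lora_req = chat_payload(ADAPTER, "Summarize this ticket.")
```

Check /v1/models after startup to confirm the exact name the adapter was registered under.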

Scaling NIM across multiple GPUs

For higher throughput or larger models (70B+), NIM supports tensor parallelism across multiple GPUs on the same node. Pass the GPU count via the NIM_TENSOR_PARALLEL_SIZE environment variable.

docker run -it --rm \
  --name nim-70b \
  --runtime=nvidia \
  --gpus all \
  --shm-size=32GB \
  -e NGC_API_KEY \
  -e NIM_TENSOR_PARALLEL_SIZE=4 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-70b-instruct:latest

Model          | Min. GPU config (H100) | Min. GPU config (H200) | Recommended for
Llama 3.1 8B   | 1x H100                | 1x H200                | Low-latency single-user or batch serving
Llama 3.1 70B  | 2x H100 (TP=2)         | 1x H200                | Production multi-user serving
Llama 3.1 405B | 8x H100 (TP=8)         | 4x H200 (TP=4)         | High-throughput enterprise serving

TP = tensor parallel size. H200 requires fewer GPUs due to its 141 GB of VRAM vs. H100’s 80 GB.
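The minimum configurations in the table follow from weight size at FP8 versus pooled per-GPU VRAM. A rough sketch: the 0.75 headroom factor (fraction of VRAM assumed usable for weights, the rest going to KV cache and runtime) is an illustrative assumption, and real tensor-parallel configs have additional constraints:

```python
H100_GB, H200_GB = 80, 141

def min_gpus(params_billion: float, vram_gb: float,
             bytes_per_param: float = 1.0,   # FP8
             headroom: float = 0.75) -> int:
    """Smallest power-of-two GPU count whose pooled usable VRAM fits the
    weights. Rough estimate only; ignores exact KV-cache sizing."""
    weights_gb = params_billion * bytes_per_param
    n = 1
    while n * vram_gb * headroom < weights_gb:
        n *= 2
    return n

for size in (8, 70, 405):
    print(size, min_gpus(size, H100_GB), min_gpus(size, H200_GB))
# -> 8 1 1
#    70 2 1
#    405 8 4
```

The output reproduces the table: 70B fits a single 141 GB H200 but needs two 80 GB H100s, and 405B needs TP=8 on H100 versus TP=4 on H200.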

Why run NIM on Saturn Cloud

NIM requires NVIDIA GPUs – specifically, hardware that supports the TensorRT-LLM optimizations that drive its performance advantage. Saturn Cloud provides H100, H200, and B200 instances on demand, without reservation queues or long-term commitments.

  • H100 SXM instances deliver the TensorRT-LLM FP8 engine NIM defaults to for optimal throughput
  • Multi-GPU instances for tensor-parallel 70B and 405B serving, provisioned from the same dashboard
  • Persistent storage at /outputs for NIM model cache, no re-download on restart
  • No infrastructure setup – Docker is available on every Saturn Cloud resource by default
  • Run on AWS, GCP, Azure, Nebius, or Crusoe. NIM containers are portable across all of them
  • Saturn Cloud’s secrets manager stores NGC API keys securely without shell variable exposure

NIM removes most of the work between having a model and serving it at production throughput. The 2x performance gap over standard deployment isn’t due to exotic configuration – it’s from TensorRT-LLM optimizations that NIM automatically applies on the right hardware. On Saturn Cloud H100s, that hardware is available immediately.

If you’ve fine-tuned a Llama 3 model on Saturn Cloud and want to move it to a production endpoint, NIM is the most direct path from trained adapter weights to a live, OpenAI-compatible API.

Deploy NVIDIA NIM on Saturn Cloud →
