Deploying a large language model to production used to mean weeks of work: selecting an inference engine, writing custom serving code, tuning batching parameters, and benchmarking until latency was acceptable. NVIDIA NIM compresses most of that into a single container pull.
This guide covers what NVIDIA NIM actually is, what it does under the hood, how it performs on H100 infrastructure, and how to run it on Saturn Cloud – from pulling the container to serving your first request.
What is NVIDIA NIM?
NVIDIA NIM (NVIDIA Inference Microservices) is a set of pre-built, pre-optimized container images that package a foundation model with the inference engine, runtime dependencies, and a serving API, ready for deployment on any NVIDIA GPU infrastructure.
Each NIM container ships with the model, an optimized inference backend (TensorRT-LLM, vLLM, or SGLang depending on the model and hardware), and an OpenAI-compatible REST API. You pull the container, point it at your GPU, and it handles the rest: engine selection, quantization, batching configuration, and memory management.
What NIM handles automatically
- Selects the best inference engine for your specific GPU at startup (TensorRT-LLM vs vLLM vs SGLang)
- Downloads and caches the optimized model from NVIDIA’s NGC registry
- Applies hardware-specific optimizations, including FP8 precision on H100, kernel fusion, and a paged KV cache
- Exposes an OpenAI-compatible API on port 8000 that works as a drop-in replacement for existing applications
- Scales from a single GPU to multi-GPU serving without code changes
What NIM is not
NIM is an inference serving solution, not a training or fine-tuning framework. It’s designed for teams deploying models to production endpoints, not for training runs, distributed fine-tuning, or experimentation workflows. For training, you still need FSDP, DeepSpeed, or a LoRA/QLoRA setup. NIM comes after that work is done.
Why NIM: throughput and latency numbers
The clearest reason to use NIM over a self-assembled inference stack is the performance gap. On a single H100 SXM serving Llama 3.1 8B at FP8 precision with 200 concurrent requests:
| Configuration | Throughput (tokens/sec) | Inter-token latency |
|---|---|---|
| NIM ON (TensorRT-LLM, FP8) | 1,201 | 32ms |
| NIM OFF (standard deployment, FP8) | 613 | 37ms |
| NIM improvement | ~2x | 14% lower latency |
Source: NVIDIA published benchmark. Configuration: Llama 3.1 8B Instruct, 1x H100 SXM, 200 concurrent requests.
The throughput gap stems from TensorRT-LLM’s engine-level optimizations (continuous batching, paged attention, and CUDA graph capture) that NIM applies automatically, with no manual configuration. A standard vLLM deployment gets you part of the way there; NIM’s TensorRT-LLM backend takes it further.
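As a sanity check, the "~2x" and "14%" figures in the table follow directly from the raw benchmark numbers:

```python
# Benchmark numbers from the table above (Llama 3.1 8B, 1x H100 SXM, FP8).
nim_tps, baseline_tps = 1201, 613   # throughput, tokens/sec
nim_itl, baseline_itl = 32, 37      # inter-token latency, ms

speedup = nim_tps / baseline_tps                        # ~1.96x throughput
latency_drop = (baseline_itl - nim_itl) / baseline_itl  # ~13.5% lower latency

print(f"throughput: {speedup:.2f}x, latency: {latency_drop:.1%} lower")
# -> throughput: 1.96x, latency: 13.5% lower
```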
Supported models
NIM supports the models most enterprise AI teams are actually deploying. The catalog includes:
- Llama family: Llama 3.1 8B, 70B, 405B; Llama 3.2 multimodal variants
- Mistral variants: Mistral 7B, Mixtral 8x7B, Mixtral 8x22B
- NVIDIA Nemotron: Nemotron-4 340B and domain-specific variants
- DeepSeek-R1: Added as a preview microservice in January 2025
- Custom fine-tuned models: Via NIM’s multi-LLM container, which supports LoRA adapters trained with HuggingFace PEFT or NVIDIA NeMo
NIM automatically selects the optimized engine version for your specific GPU at container startup. On H100 SXM, it defaults to TensorRT-LLM with FP8 quantization, which is the configuration that produced the benchmark numbers above.
Setting up NVIDIA NIM on Saturn Cloud
The following walkthrough deploys Llama 3.1 8B Instruct via NIM on a Saturn Cloud H100 instance. The same steps apply to any NIM-supported model.
Prerequisites
- A Saturn Cloud H100, H200, or B200 instance (NIM requires an NVIDIA GPU with sufficient VRAM – 8B models need ~20 GB)
- An NGC API key from the NVIDIA Developer Program (free to register at developer.nvidia.com)
- Docker installed on your Saturn Cloud resource (available by default)
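The ~20 GB figure is easy to sanity-check with back-of-the-envelope arithmetic. This is a rough sketch only; actual usage depends on context length, batch size, and runtime overhead:

```python
params = 8e9         # Llama 3.1 8B parameter count
bytes_per_param = 1  # FP8 weights store one byte per parameter

weights_gb = params * bytes_per_param / 1e9  # ~8 GB for weights alone
overhead_gb = 12  # rough allowance for KV cache, activations, CUDA context

print(f"~{weights_gb + overhead_gb:.0f} GB total")  # -> ~20 GB total
```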
Step 1: Authenticate with NGC
```bash
export NGC_API_KEY=<your-ngc-api-key>
echo $NGC_API_KEY | docker login nvcr.io \
  --username '$oauthtoken' \
  --password-stdin
```
Step 2: Set a local cache directory
NIM downloads optimized model artifacts on the first run and caches them locally. Point this at Saturn Cloud’s persistent storage so you don’t re-download on every container restart.
```bash
export LOCAL_NIM_CACHE=/outputs/nim-cache
mkdir -p $LOCAL_NIM_CACHE
```
Step 3: Pull and run the NIM container
```bash
export IMG_NAME=nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
docker run -it --rm \
  --name nim-llama \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
```
On first run, NIM inspects the available GPU, selects the optimal model version from the registry, and downloads it. On H100 SXM, this pulls the TensorRT-LLM FP8 engine. Expect 5–10 minutes on first launch; subsequent starts use the local cache and take under 30 seconds.
Step 4: Verify the container is serving
```bash
curl http://localhost:8000/v1/models
```
You should see a JSON response listing the loaded model. Once this returns, the endpoint is ready.
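If you’re scripting the readiness check, the same response can be parsed from Python. The sample payload below mirrors the OpenAI-style model list shape (`data` entries with `id` fields); no fields beyond that standard structure are assumed:

```python
import json

def loaded_models(models_response: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response."""
    return [m["id"] for m in models_response.get("data", [])]

# Sample payload shaped like the /v1/models response from a NIM container
sample = json.loads(
    '{"object": "list", "data": [{"id": "meta/llama-3.1-8b-instruct", "object": "model"}]}'
)
print(loaded_models(sample))  # -> ['meta/llama-3.1-8b-instruct']
```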
Step 5: Send an inference request
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
      {"role": "user", "content": "Explain attention mechanisms in transformers."}
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'
```
The API is OpenAI-compatible – any application that calls the OpenAI Python SDK or REST API works with NIM by changing the base URL to your Saturn Cloud endpoint.
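The same request works from any HTTP client. A minimal stdlib sketch, assuming the local container from Step 3 (swap the host for a remote Saturn Cloud endpoint):

```python
import json
import urllib.request

# Local container from Step 3; replace the host for a remote endpoint.
NIM_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(prompt, model="meta/llama-3.1-8b-instruct",
                       max_tokens=512, temperature=0.7):
    """Build the OpenAI-style chat payload that NIM accepts."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(prompt: str) -> str:
    """POST a chat completion to NIM and return the generated text."""
    req = urllib.request.Request(
        NIM_URL,
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Applications already on the OpenAI Python SDK need even less: construct the client with `base_url` pointing at your endpoint and the rest of the calling code is unchanged.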
Serving a custom fine-tuned model with NIM
If you’ve fine-tuned a model using LoRA or QLoRA on Saturn Cloud, NIM can serve your adapter weights alongside the base model without requiring a separate deployment pipeline.
Using LoRA adapters with NIM
NIM’s multi-LLM container supports LoRA adapters trained with HuggingFace PEFT or NVIDIA NeMo. You mount your adapter weights into the container, and NIM loads them on top of the base model at startup.
```bash
docker run -it --rm \
  --name nim-custom \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -v "/outputs/my-lora-adapter:/lora" \
  -e NIM_PEFT_SOURCE=/lora \
  -u $(id -u) \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
```
NIM loads the base model and registers the LoRA adapter at startup. The endpoint behaves identically to the standard NIM API; the adapter appears alongside the base model and is selected per request by name.
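With the multi-LLM container, the adapter is typically selected per request through the standard `model` field, using the adapter’s directory name as the model ID. The name `my-lora-adapter` below is hypothetical; check `/v1/models` for the names your container actually registered:

```python
import json

# "my-lora-adapter" is a hypothetical name matching the mounted adapter
# directory; the real name comes from your /v1/models listing.
payload = {
    "model": "my-lora-adapter",
    "messages": [{"role": "user", "content": "Summarize our Q3 support tickets."}],
    "max_tokens": 256,
}
print(json.dumps(payload, indent=2))
```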
Scaling NIM across multiple GPUs
For higher throughput or larger models (70B+), NIM supports tensor parallelism across multiple GPUs on the same node. Pass the GPU count via the NIM_TENSOR_PARALLEL_SIZE environment variable.
```bash
docker run -it --rm \
  --name nim-70b \
  --runtime=nvidia \
  --gpus all \
  --shm-size=32GB \
  -e NGC_API_KEY \
  -e NIM_TENSOR_PARALLEL_SIZE=4 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-70b-instruct:latest
```
| Model | Min. GPU config (H100) | Min. GPU config (H200) | Recommended for |
|---|---|---|---|
| Llama 3.1 8B | 1x H100 | 1x H200 | Low-latency single-user or batch serving |
| Llama 3.1 70B | 2x H100 (TP=2) | 1x H200 | Production multi-user serving |
| Llama 3.1 405B | 8x H100 (TP=8) | 4x H200 (TP=4) | High-throughput enterprise serving |
TP = tensor parallel size. H200 requires fewer GPUs thanks to its 141 GB of VRAM vs. the H100’s 80 GB.
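The table’s minimums follow from per-GPU weight memory: FP8 weights take roughly one byte per parameter, split across the tensor-parallel group, and whatever VRAM remains must hold the KV cache. A rough sketch, ignoring activation and runtime overhead:

```python
def weight_gb_per_gpu(params_b: float, tp: int, bytes_per_param: float = 1.0) -> float:
    """Approximate FP8 weight memory per GPU under tensor parallelism."""
    return params_b * bytes_per_param / tp

# 70B FP8 on H100 (80 GB): one GPU leaves only ~10 GB for KV cache, hence TP=2.
print(weight_gb_per_gpu(70, 1))   # 70.0 GB  -> too tight for a single 80 GB H100
print(weight_gb_per_gpu(70, 2))   # 35.0 GB per GPU with TP=2

# 405B FP8: TP=8 on H100, TP=4 on H200 (141 GB).
print(weight_gb_per_gpu(405, 8))  # 50.625 GB per H100
print(weight_gb_per_gpu(405, 4))  # 101.25 GB per H200
```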
Why run NIM on Saturn Cloud
NIM requires NVIDIA GPUs, specifically hardware that supports the TensorRT-LLM optimizations behind its performance advantage. Saturn Cloud provides H100, H200, and B200 instances on demand, without reservation queues or long-term commitments.
- H100 SXM instances run the FP8 TensorRT-LLM engine that NIM defaults to for optimal throughput
- Multi-GPU instances for tensor-parallel 70B and 405B serving are provisioned from the same dashboard
- Persistent storage at /outputs holds the NIM model cache, so nothing is re-downloaded on restart
- No infrastructure setup – Docker is available on every Saturn Cloud resource by default
- Run on AWS, GCP, Azure, Nebius, or Crusoe. NIM containers are portable across all of them
- Saturn Cloud’s secrets manager stores NGC API keys securely without shell variable exposure
NIM removes most of the work between having a model and serving it at production throughput. The 2x performance gap over standard deployment isn’t due to exotic configuration – it’s from TensorRT-LLM optimizations that NIM automatically applies on the right hardware. On Saturn Cloud H100s, that hardware is available immediately.
If you’ve fine-tuned a Llama 3 model on Saturn Cloud and want to move it to a production endpoint, NIM is the most direct path from trained adapter weights to a live, OpenAI-compatible API.