Deploying a large language model to production used to mean weeks of work: selecting an inference engine, writing custom serving code, tuning batching parameters, and benchmarking until latency was acceptable. NVIDIA NIM compresses most of that into a single container pull.
This guide covers what NVIDIA NIM actually is, what it does under the hood, how it performs on H100 infrastructure, and how to run it on Saturn Cloud – from pulling the container to serving your first request.
What is NVIDIA NIM?
NVIDIA NIM (NVIDIA Inference Microservices) is a set of pre-built, pre-optimized container images that package a foundation model with the inference engine, runtime dependencies, and a serving API, ready for deployment on any NVIDIA GPU infrastructure.
Each NIM container ships with the model, an optimized inference backend (TensorRT-LLM, vLLM, or SGLang depending on the model and hardware), and an OpenAI-compatible REST API. You pull the container, point it at your GPU, and it handles the rest: engine selection, quantization, batching configuration, and memory management.
What NIM handles automatically
- Selects the best inference engine for your specific GPU at startup (TensorRT-LLM vs vLLM vs SGLang)
- Downloads and caches the optimized model from NVIDIA’s NGC registry
- Applies hardware-specific optimizations, including FP8 precision on H100, kernel fusion, and a paged KV cache
- Exposes an OpenAI-compatible API on port 8000 that works as a drop-in replacement for existing applications
- Scales from a single GPU to multi-GPU serving without code changes
What NIM is not
NIM is an inference serving solution, not a training or fine-tuning framework. It’s designed for teams deploying models to production endpoints, not for training runs, distributed fine-tuning, or experimentation workflows. For training, you still need FSDP, DeepSpeed, or a LoRA/QLoRA setup. NIM comes after that work is done.
Why NIM: throughput and latency numbers
The clearest reason to use NIM over a self-assembled inference stack is the performance gap. On a single H100 SXM serving Llama 3.1 8B at FP8 precision with 200 concurrent requests:
| Configuration | Throughput (tokens/sec) | Inter-token latency |
|---|---|---|
| NIM ON (TensorRT-LLM, FP8) | 1,201 | 32ms |
| NIM OFF (standard deployment, FP8) | 613 | 37ms |
| NIM improvement | ~2x | 14% lower latency |
Source: NVIDIA published benchmark. Configuration: Llama 3.1 8B Instruct, 1x H100 SXM, 200 concurrent requests.
The throughput gap stems from TensorRT-LLM’s engine-level optimizations (continuous batching, paged attention, and CUDA graph capture) that NIM applies automatically, with no manual configuration. A standard vLLM deployment gets you part of the way there; NIM’s TensorRT-LLM backend takes it further.
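As a sanity check, the "~2x" and "14%" figures in the table follow directly from the raw benchmark numbers:

```python
# Benchmark numbers from the table above (Llama 3.1 8B, 1x H100 SXM, FP8).
nim_tps, baseline_tps = 1201, 613   # throughput, tokens/sec
nim_itl, baseline_itl = 32, 37      # inter-token latency, ms

speedup = nim_tps / baseline_tps                        # ~1.96x throughput
latency_drop = (baseline_itl - nim_itl) / baseline_itl  # ~13.5% lower latency

print(f"throughput: {speedup:.2f}x, latency: {latency_drop:.1%} lower")
# -> throughput: 1.96x, latency: 13.5% lower
```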
Supported models
NIM supports the models most enterprise AI teams are actually deploying. The catalog includes:
- Llama family: Llama 3.1 8B, 70B, 405B; Llama 3.2 multimodal variants
- Mistral variants: Mistral 7B, Mixtral 8x7B, Mixtral 8x22B
- NVIDIA Nemotron: Nemotron-4 340B and domain-specific variants
- DeepSeek-R1: Added as a preview microservice in January 2025
- Custom fine-tuned models: Via NIM’s multi-LLM container, which supports LoRA adapters trained with HuggingFace PEFT or NVIDIA NeMo
NIM automatically selects the optimized engine version for your specific GPU at container startup. On H100 SXM, it defaults to TensorRT-LLM with FP8 quantization, which is the configuration that produced the benchmark numbers above.
Setting up NVIDIA NIM on Saturn Cloud
The following walkthrough deploys Llama 3.1 8B Instruct via NIM on a Saturn Cloud H100 instance. The same steps apply to any NIM-supported model.
Prerequisites
- A Saturn Cloud H100, H200, or B200 instance (NIM requires an NVIDIA GPU with sufficient VRAM – 8B models need ~20 GB)
- An NGC API key from the NVIDIA Developer Program (free to register at developer.nvidia.com)
- Docker installed on your Saturn Cloud resource (available by default)
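The ~20 GB figure is easy to sanity-check with back-of-the-envelope arithmetic. This is a rough sketch only; actual usage depends on context length, batch size, and runtime overhead:

```python
params = 8e9         # Llama 3.1 8B parameter count
bytes_per_param = 1  # FP8 weights store one byte per parameter

weights_gb = params * bytes_per_param / 1e9  # ~8 GB for weights alone
overhead_gb = 12  # rough allowance for KV cache, activations, CUDA context

print(f"~{weights_gb + overhead_gb:.0f} GB total")  # -> ~20 GB total
```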
Step 1: Authenticate with NGC
```bash
export NGC_API_KEY=<your-ngc-api-key>
echo $NGC_API_KEY | docker login nvcr.io \
  --username '$oauthtoken' \
  --password-stdin
```
Step 2: Set a local cache directory
NIM downloads optimized model artifacts on the first run and caches them locally. Point this at Saturn Cloud’s persistent storage so you don’t re-download on every container restart.
```bash
export LOCAL_NIM_CACHE=/outputs/nim-cache
mkdir -p $LOCAL_NIM_CACHE
```
Step 3: Pull and run the NIM container
```bash
export IMG_NAME=nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
docker run -it --rm \
  --name nim-llama \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
```
On first run, NIM inspects the available GPU, selects the optimal model version from the registry, and downloads it. On H100 SXM, this pulls the TensorRT-LLM FP8 engine. Expect 5–10 minutes on first launch; subsequent starts use the local cache and take under 30 seconds.
Step 4: Verify the container is serving
```bash
curl http://localhost:8000/v1/models
```
You should see a JSON response listing the loaded model. Once this returns, the endpoint is ready.
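If you’re scripting the readiness check, the same response can be parsed from Python. The sample payload below mirrors the OpenAI-style model list shape (`data` entries with `id` fields); no fields beyond that standard structure are assumed:

```python
import json

def loaded_models(models_response: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response."""
    return [m["id"] for m in models_response.get("data", [])]

# Sample payload shaped like the /v1/models response from a NIM container
sample = json.loads(
    '{"object": "list", "data": [{"id": "meta/llama-3.1-8b-instruct", "object": "model"}]}'
)
print(loaded_models(sample))  # -> ['meta/llama-3.1-8b-instruct']
```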
Step 5: Send an inference request
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
      {"role": "user", "content": "Explain attention mechanisms in transformers."}
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'
```
The API is OpenAI-compatible – any application that calls the OpenAI Python SDK or REST API works with NIM by changing the base URL to your Saturn Cloud endpoint.
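The same request works from any HTTP client. A minimal stdlib sketch, assuming the local container from Step 3 (swap the host for a remote Saturn Cloud endpoint):

```python
import json
import urllib.request

# Local container from Step 3; replace the host for a remote endpoint.
NIM_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(prompt, model="meta/llama-3.1-8b-instruct",
                       max_tokens=512, temperature=0.7):
    """Build the OpenAI-style chat payload that NIM accepts."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(prompt: str) -> str:
    """POST a chat completion to NIM and return the generated text."""
    req = urllib.request.Request(
        NIM_URL,
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Applications already on the OpenAI Python SDK need even less: construct the client with `base_url` pointing at your endpoint and the rest of the calling code is unchanged.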
Serving a custom fine-tuned model with NIM
If you’ve fine-tuned a model using LoRA or QLoRA on Saturn Cloud, NIM can serve your adapter weights alongside the base model without requiring a separate deployment pipeline.
Using LoRA adapters with NIM
NIM’s multi-LLM container supports LoRA adapters trained with HuggingFace PEFT or NVIDIA NeMo. You mount your adapter weights into the container, and NIM loads them on top of the base model at startup.
```bash
docker run -it --rm \
  --name nim-custom \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -v "/outputs/my-lora-adapter:/lora" \
  -e NIM_PEFT_SOURCE=/lora \
  -u $(id -u) \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
```
NIM loads the base model and registers the LoRA adapter at startup. The endpoint behaves identically to the standard NIM API; the adapter appears alongside the base model and is selected per request by name.
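With the multi-LLM container, the adapter is typically selected per request through the standard `model` field, using the adapter’s directory name as the model ID. The name `my-lora-adapter` below is hypothetical; check `/v1/models` for the names your container actually registered:

```python
import json

# "my-lora-adapter" is a hypothetical name matching the mounted adapter
# directory; the real name comes from your /v1/models listing.
payload = {
    "model": "my-lora-adapter",
    "messages": [{"role": "user", "content": "Summarize our Q3 support tickets."}],
    "max_tokens": 256,
}
print(json.dumps(payload, indent=2))
```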
Scaling NIM across multiple GPUs
For higher throughput or larger models (70B+), NIM supports tensor parallelism across multiple GPUs on the same node. Pass the GPU count via the NIM_TENSOR_PARALLEL_SIZE environment variable.
```bash
docker run -it --rm \
  --name nim-70b \
  --runtime=nvidia \
  --gpus all \
  --shm-size=32GB \
  -e NGC_API_KEY \
  -e NIM_TENSOR_PARALLEL_SIZE=4 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-70b-instruct:latest
```
| Model | Min. GPU config (H100) | Min. GPU config (H200) | Recommended for |
|---|---|---|---|
| Llama 3.1 8B | 1x H100 | 1x H200 | Low-latency single-user or batch serving |
| Llama 3.1 70B | 2x H100 (TP=2) | 1x H200 | Production multi-user serving |
| Llama 3.1 405B | 8x H100 (TP=8) | 4x H200 (TP=4) | High-throughput enterprise serving |
TP = tensor parallel size. H200 requires fewer GPUs thanks to its 141 GB of VRAM vs. the H100’s 80 GB.
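The table’s minimums follow from per-GPU weight memory: FP8 weights take roughly one byte per parameter, split across the tensor-parallel group, and whatever VRAM remains must hold the KV cache. A rough sketch, ignoring activation and runtime overhead:

```python
def weight_gb_per_gpu(params_b: float, tp: int, bytes_per_param: float = 1.0) -> float:
    """Approximate FP8 weight memory per GPU under tensor parallelism."""
    return params_b * bytes_per_param / tp

# 70B FP8 on H100 (80 GB): one GPU leaves only ~10 GB for KV cache, hence TP=2.
print(weight_gb_per_gpu(70, 1))   # 70.0 GB  -> too tight for a single 80 GB H100
print(weight_gb_per_gpu(70, 2))   # 35.0 GB per GPU with TP=2

# 405B FP8: TP=8 on H100, TP=4 on H200 (141 GB).
print(weight_gb_per_gpu(405, 8))  # 50.625 GB per H100
print(weight_gb_per_gpu(405, 4))  # 101.25 GB per H200
```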
Why run NIM on Saturn Cloud
NIM requires NVIDIA GPUs, specifically hardware that supports the TensorRT-LLM optimizations behind its performance advantage. Saturn Cloud provides H100, H200, and B200 instances on demand, without reservation queues or long-term commitments.
- H100 SXM instances run the FP8 TensorRT-LLM engine that NIM defaults to for optimal throughput
- Multi-GPU instances for tensor-parallel 70B and 405B serving are provisioned from the same dashboard
- Persistent storage at /outputs holds the NIM model cache, so nothing is re-downloaded on restart
- No infrastructure setup – Docker is available on every Saturn Cloud resource by default
- Run on AWS, GCP, Azure, Nebius, or Crusoe. NIM containers are portable across all of them
- Saturn Cloud’s secrets manager stores NGC API keys securely without shell variable exposure
NIM removes most of the work between having a model and serving it at production throughput. The 2x performance gap over standard deployment isn’t due to exotic configuration – it’s from TensorRT-LLM optimizations that NIM automatically applies on the right hardware. On Saturn Cloud H100s, that hardware is available immediately.
If you’ve fine-tuned a Llama 3 model on Saturn Cloud and want to move it to a production endpoint, NIM is the most direct path from trained adapter weights to a live, OpenAI-compatible API.