Deploying NVIDIA NIM on Saturn Cloud

NVIDIA NIM packages LLM inference into containers that run on any NVIDIA GPU infrastructure. Saturn Cloud makes deploying these containers straightforward so you get a running inference endpoint without managing Kubernetes, container registries, or GPU scheduling yourself.
This guide covers what NIM is, why it matters for production inference, and how to get a model running on Saturn Cloud.
What is NVIDIA NIM?
NIM (NVIDIA Inference Microservices) is a set of containerized inference services for AI models. Each NIM container includes:
- The model weights
- An optimized inference engine (TensorRT-LLM, vLLM, or SGLang, depending on the model)
- A REST API that follows OpenAI-compatible conventions
- GPU memory management and batching logic
You pull a container, point it at GPUs, and get an API endpoint. NIM handles the inference optimization, continuous batching, KV cache management, and tensor parallelism configuration, so you don’t have to tune these yourself.
NIM supports a range of models: Llama 3, Mistral, Mixtral, NVIDIA Nemotron, and several domain-specific models for biology and chemistry (AlphaFold2, MolMIM, DiffDock). The full catalog is available at build.nvidia.com.
Why use NIM instead of rolling your own inference?
You could deploy vLLM or TensorRT-LLM directly. NIM saves you from a few headaches:
Optimized defaults. NVIDIA tunes batch sizes, memory allocation, and parallelism settings for specific GPU configurations. Their benchmarks show NIM delivering 2-3x higher throughput than untuned vLLM deployments on the same hardware.
Consistent API. Every NIM exposes the same OpenAI-compatible API regardless of the underlying model or inference engine. If you switch models, your application code doesn’t change.
Enterprise licensing. NIM containers include validated, supported builds. This matters if you need to demonstrate to auditors that your inference stack has a support contract and a security update process.
The tradeoff: NIM containers are large (often 20-50GB), and you’re constrained to NVIDIA’s supported configurations. If you need custom model architectures or specific batching behavior, you’ll want direct control over the inference engine.
Deploying NIM on Saturn Cloud
Saturn Cloud installs directly into your cloud account: AWS, GCP, Azure, or neoclouds like Nebius and Crusoe. This matters for NIM deployments because GPU availability and pricing vary significantly across providers.
If you’re hitting GPU capacity limits or high costs on hyperscalers, deploying Saturn Cloud into a neocloud account gives you access to H100s and H200s at lower prices without changing your workflow. Your NIM deployment, notebooks, and pipelines all work the same way regardless of which cloud is underneath.
NIM containers deploy as Saturn Cloud deployments: long-running services with persistent endpoints.
Prerequisites
- An NGC API key from NVIDIA (free at ngc.nvidia.com)
- Saturn Cloud with NIM enabled (contact support@saturncloud.io if you don’t see the NIM templates)
- GPU instances available in your cloud account (H100 or A100 recommended for larger models)
Step 1: Add your NGC API key
Go to Secrets in the Saturn Cloud sidebar and add your NGC API key. This authenticates container pulls from NVIDIA’s registry.
Step 2: Create a deployment from a NIM template
Saturn Cloud provides templates for common NIM configurations. Select one (Llama 3 8B is a good starting point), name your deployment, and click create.
The template sets:
- The container image location
- GPU requirements (type and count)
- Memory limits
- Port configuration for the API endpoint
Step 3: Start the deployment
Click Start. The first launch takes several minutes - the container image is large and needs to download model weights on startup.
Once running, you’ll see the endpoint URL in the Saturn Cloud console.
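Before sending chat requests, it's worth confirming the service is actually serving. One quick check is listing the available models, since NIM exposes an OpenAI-compatible API that includes a models endpoint. A minimal sketch, assuming the endpoint URL from the console and a Saturn Cloud API token (both placeholders):

```python
import requests

# Placeholders: use the endpoint URL shown in the Saturn Cloud console
# and an API token from your Saturn Cloud account.
BASE_URL = "https://your-deployment-url"
TOKEN = "YOUR_SATURN_API_TOKEN"

# /v1/models is part of the OpenAI-compatible API surface; a 200 response
# listing your model indicates the server has finished loading weights.
resp = requests.get(
    f"{BASE_URL}/v1/models",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])
```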
Step 4: Test the endpoint
```python
import requests

url = "https://your-deployment-url/v1/chat/completions"
headers = {
    "Authorization": "Bearer YOUR_SATURN_API_TOKEN",
    "Content-Type": "application/json",
}
payload = {
    "model": "meta/llama3-8b-instruct",
    "messages": [
        {"role": "user", "content": "Explain transformer attention in two sentences."}
    ],
    "max_tokens": 100,
}

response = requests.post(url, headers=headers, json=payload)
print(response.json()["choices"][0]["message"]["content"])
```
The API follows OpenAI conventions, so any OpenAI-compatible client library works.
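For example, with the official openai Python package you only change the base URL and the model name; the rest looks like a call to OpenAI's hosted API. The deployment URL and token below are placeholders:

```python
from openai import OpenAI

# Point the standard OpenAI client at your NIM deployment (placeholder values).
client = OpenAI(
    base_url="https://your-deployment-url/v1",
    api_key="YOUR_SATURN_API_TOKEN",
)

completion = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[
        {"role": "user", "content": "Explain transformer attention in two sentences."}
    ],
    max_tokens=100,
)
print(completion.choices[0].message.content)
```

Switching models only means changing the model string and pointing at the corresponding deployment's URL; the application code stays the same.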
Controlling access
By default, deployments are accessible to anyone in your Saturn Cloud organization. To restrict access:
- Open the deployment’s Networking tab
- Change visibility from “Anyone in the organization” to “Only the owner”
- Add specific users or groups in the Viewers section
For external access, you can expose the endpoint publicly or route through your own API gateway.
Configuring GPU resources
NIM performance depends on having enough GPU memory. General guidelines:
| Model size | Minimum GPUs | Recommended |
|---|---|---|
| 7-8B parameters | 1x A100 80GB or H100 | 1x H100 |
| 13B parameters | 1x A100 80GB or H100 | 2x A100/H100 |
| 70B parameters | 4x A100 80GB or 2x H100 | 4x H100 |
Larger configurations improve throughput under load. Two H100s can serve a 70B model, but response latency increases significantly compared to a 4-GPU setup with tensor parallelism.
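The minimums in the table follow from a simple weights-only estimate: at FP16, each parameter takes 2 bytes, and the KV cache and activations need headroom on top of that. A rough sketch of the arithmetic, assuming 80 GB GPUs and FP16 weights (quantized NIM builds need less):

```python
import math

def fp16_weights_gb(params_billion: float) -> float:
    """FP16 stores 2 bytes per parameter, so an N-billion-parameter model
    needs roughly 2*N GB just for weights (KV cache and activations are extra)."""
    return params_billion * 2

for size in (8, 13, 70):
    weights = fp16_weights_gb(size)
    # Minimum count of 80 GB GPUs needed to hold the weights alone; real
    # deployments add headroom for KV cache, hence the higher recommendations.
    gpus = math.ceil(weights / 80)
    print(f"{size}B model: ~{weights:.0f} GB of weights -> at least {gpus}x 80GB GPU")
```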
Saturn Cloud templates default to tested configurations, but you can customize GPU type and count in the deployment settings.
Choosing your cloud for NIM workloads
Since Saturn Cloud deploys into your own cloud account, you can run NIM wherever GPUs are available and priced right for your workload.
Hyperscalers (AWS, GCP, Azure): Good if your data and existing infrastructure are already there. GPU availability can be constrained, and on-demand H100 pricing is typically $3-4/hour per GPU.
Neoclouds (Nebius, Crusoe): Often have better H100 availability and lower pricing - sometimes 40-60% less than hyperscaler on-demand rates. Nebius also includes NVIDIA AI Enterprise licensing, which covers NIM usage without a separate license. If GPU cost or availability is a bottleneck, deploying Saturn Cloud into a neocloud account is worth evaluating.
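To put the difference in concrete terms, here is a rough monthly comparison for a 4x H100 deployment running around the clock. The hourly rates and the 50% discount are illustrative assumptions drawn from the ranges above, not quotes:

```python
# Illustrative monthly cost for a 4x H100 NIM deployment running 24/7.
# Rates are assumptions based on the ranges mentioned above, not actual quotes.
gpus = 4
hours_per_month = 730

hyperscaler_rate = 3.50   # $/GPU-hour, within the $3-4 on-demand range above
neocloud_rate = 1.75      # assuming ~50% less, per the 40-60% figure above

print(f"Hyperscaler: ${gpus * hours_per_month * hyperscaler_rate:,.0f}/month")
print(f"Neocloud:    ${gpus * hours_per_month * neocloud_rate:,.0f}/month")
```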
The deployment process is the same either way. You install Saturn Cloud into your cloud account, and NIM containers run on whatever GPUs you provision.
Summary
NVIDIA NIM provides optimized LLM inference containers. Saturn Cloud handles the deployment infrastructure (GPU scheduling, networking, secrets management) so you can focus on your application rather than Kubernetes manifests.
To get started, add your NGC API key, create a deployment from a NIM template, and start serving inference requests in minutes.
For specific model requirements or custom NIM configurations, contact support@saturncloud.io.
Frequently Asked Questions
How much does NVIDIA NIM cost?
NIM itself is free for development and testing through the NVIDIA Developer Program. For production use, you need an NVIDIA AI Enterprise license, which is typically priced per GPU per year. Some cloud providers like Nebius include NVIDIA AI Enterprise licensing in their GPU pricing, so you don’t pay separately.
What’s the difference between NVIDIA NIM and vLLM?
vLLM is an open-source inference engine. NIM is a packaged container that can use vLLM (or TensorRT-LLM or SGLang) under the hood, along with pre-tuned configurations, model weights, and enterprise support. NIM saves setup time but gives you less control. vLLM gives you full control but requires more work to optimize and maintain.
What models does NVIDIA NIM support?
NIM supports Llama 3 (8B and 70B), Mistral, Mixtral, NVIDIA Nemotron, and several domain-specific models for biology and drug discovery including AlphaFold2, MolMIM, DiffDock, and others. The full list is at build.nvidia.com.
Can I run NVIDIA NIM on AWS?
Yes. You can run NIM on any cloud with NVIDIA GPUs. Saturn Cloud deploys into your AWS account (or GCP, Azure, or neoclouds), and NIM containers run on whatever GPU instances you provision.
What GPU do I need for NVIDIA NIM?
It depends on the model size. For 7-8B parameter models, a single A100 80GB or H100 works. For 70B parameter models, you’ll need 4x A100 80GB or 2x H100 minimum. More GPUs improve throughput under concurrent load.
How do I deploy Llama 3 with NVIDIA NIM?
Get an NGC API key from ngc.nvidia.com, add it to your Saturn Cloud secrets, create a deployment from the Llama 3 NIM template, and start it. The endpoint will be ready in a few minutes. See the step-by-step instructions above.
Is NVIDIA NIM faster than vLLM?
NIM using the same underlying engine (vLLM or TensorRT-LLM) with pre-tuned settings typically shows 2-3x better throughput than default vLLM configurations. The performance gain comes from optimized batch sizes, memory allocation, and parallelism settings for specific GPU configurations.
Can I use NVIDIA NIM for fine-tuned models?
Yes. NIM supports LoRA adapters trained with HuggingFace or NVIDIA NeMo through the multi-LLM container. You can deploy your fine-tuned models without rebuilding the container.
What’s the difference between NVIDIA NIM and NVIDIA Triton?
Triton Inference Server is a general-purpose model serving platform that supports multiple frameworks and model types. NIM is specifically optimized for generative AI models like LLMs and provides higher-level abstractions with pre-configured containers. NIM can use Triton under the hood for some deployments.
How do I scale NVIDIA NIM for production traffic?
Saturn Cloud handles autoscaling for NIM deployments. You configure minimum and maximum replicas, and the platform scales based on request load. Each replica runs the full NIM container on its own GPU allocation.