How to Run Open-Source LLM Inference on Crusoe from Saturn Cloud

A guide to running open-source LLM inference – Llama 3.3, DeepSeek, Qwen, and more – from Saturn Cloud using Crusoe’s Managed Inference API. Covers how Crusoe’s MemoryAlloy engine uses a cluster-wide KV cache to reduce time-to-first-token and cut redundant compute on prefix-heavy workloads, with working Python code for chat completions, streaming, document QA, multi-turn conversations, and batch jobs.

Crusoe’s Managed Inference service runs open-source LLMs on a proprietary inference engine powered by MemoryAlloy – a cluster-wide KV cache that shares computed context across GPUs instead of keeping it isolated per node. The result is faster time-to-first-token (up to 9.9x faster than standard vLLM) and higher throughput (up to 5x) for workloads in which prompts share common prefixes.

Since Saturn Cloud runs natively on Crusoe Cloud, you can call these inference endpoints directly from your Saturn Cloud workspace – notebooks, jobs, or deployments – with no extra networking or infrastructure setup. The API is OpenAI-compatible, so if you’ve used the OpenAI SDK before, you already know how to use it.

This guide covers what MemoryAlloy does, why it matters for production inference, and how to call Crusoe’s inference API from Saturn Cloud with working Python code.

What MemoryAlloy Does

When you send a prompt to an LLM, the model processes the entire input before generating the first token. This prefill stage scales linearly with prompt length and is the primary contributor to time-to-first-token (TTFT). In practice, many workloads repeat large portions of their prompts – multi-turn chat appends to the same history, document QA repeatedly references the same source material, and agentic workflows cycle over stable system prompts. Standard inference engines maintain a local KV cache on each GPU to avoid recomputing shared prefixes, but that cache is limited by the individual GPU’s VRAM; once it fills, older entries are evicted (typically least recently used) and must be recomputed on the next request.
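
To build intuition for why prefix reuse matters, here’s a back-of-the-envelope sketch (a toy model of our own, not Crusoe’s actual cost function) that treats prefill cost as linear in the number of tokens the engine still has to compute:

```python
# Toy model: prefill cost ~ number of prompt tokens the engine must
# actually compute. Real TTFT also includes routing, queuing, and
# decoding the first token, so treat this as intuition only.
def prefill_tokens_needed(prompt_tokens: int, cached_prefix_tokens: int) -> int:
    """Tokens still requiring prefill after reusing a cached prefix."""
    return max(prompt_tokens - min(cached_prefix_tokens, prompt_tokens), 0)

# A 20,000-token document prompt with its first 19,500 tokens already cached:
remaining = prefill_tokens_needed(20_000, 19_500)
print(remaining)            # 500
print(remaining / 20_000)   # 0.025 -> ~2.5% of the original prefill work
```

The bigger the shared prefix relative to the full prompt, the closer the remaining prefill work gets to zero – which is exactly the regime document QA and multi-turn chat live in.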

MemoryAlloy creates a shared memory layer across the entire GPU cluster instead. Any model instance on any node can retrieve precomputed KV segments from any other node with latency close to a local GPU memory read. Crusoe’s gateway then routes each request to the node most likely to already have the relevant KV segments cached, picking the one that can return the first token fastest.

The practical impact is that your API calls return faster, and cost per token drops because the system isn’t redundantly recomputing the same prefixes.

Available Models

Crusoe Managed Inference currently supports several open-source models through the Intelligence Foundry:

  • meta-llama/Llama-3.3-70B-Instruct — 128K context, Meta’s Llama 3.3 Community License
  • deepseek-ai/DeepSeek-R1-0528 — 160K context, MIT License
  • deepseek-ai/DeepSeek-V3-0324 — 160K context, MIT License
  • deepseek-ai/DeepSeek-V3.1 — 160K context, MIT License
  • Qwen/Qwen3-235B-A22B — 131K context, Apache 2.0
  • openai/gpt-oss-120b — 128K context, Apache 2.0
  • google/gemma-3-12b-it — 128K context
  • moonshotai/Kimi-K2-Thinking — 131K context

All models are served through an OpenAI-compatible API at api.crusoe.ai.

Getting Your API Key

Before writing any code, you need a Crusoe Inference API key:

  1. Log in to the Crusoe Cloud console
  2. Click the Personal Settings icon in the top-right corner of the dashboard
  3. In the left sidebar, scroll to the Personal Settings section
  4. Click Inference API Keys
  5. Click Create Inference API Key
  6. Enter an alias for the key
  7. Click Create
  8. Save the key – you’ll need it for the examples below

Set up in Saturn Cloud

In your Saturn Cloud workspace (Jupyter, VS Code, or PyCharm), install the OpenAI SDK if it’s not already available:

pip install openai

Store your Crusoe API key as an environment variable. In Saturn Cloud, you can add this as a secret through the workspace settings so it persists across restarts:

export CRUSOE_API_KEY="your-api-key-here"
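
In Python it’s worth failing fast with a clear message when the key is missing, rather than hitting a KeyError deep inside the SDK. A minimal sketch (the helper name is our own):

```python
import os

def get_crusoe_key() -> str:
    """Return the Crusoe API key, or fail fast with a helpful message."""
    key = os.environ.get("CRUSOE_API_KEY")
    if not key:
        raise RuntimeError(
            "CRUSOE_API_KEY is not set. Add it as a Saturn Cloud secret "
            "so it persists across workspace restarts."
        )
    return key
```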

Basic Inference: Chat Completion

The simplest use case: send a prompt and get a response. Because Crusoe’s API is OpenAI-compatible, you just point the OpenAI client at api.crusoe.ai:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["CRUSOE_API_KEY"],
    base_url="https://api.crusoe.ai/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the difference between data parallelism and model parallelism in distributed training."},
    ],
    max_tokens=512,
    temperature=0.7,
)

print(response.choices[0].message.content)

This works exactly like calling the OpenAI API. The difference is that, behind the scenes, MemoryAlloy caches the KV segments from your system prompt and shared context. If you send another request with the same system prompt prefix, the engine skips the prefill for that portion entirely.
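
One practical consequence: structure prompts so everything that repeats across requests sits at the front, and only the part that changes comes last. Prefix caches match from the start of the prompt, so a stable prefix maximizes reuse. A small illustrative helper (the function name is our own):

```python
def build_messages(shared_context: str, question: str) -> list:
    """Cache-friendly ordering: the stable, repeated content (system
    prompt plus shared context) goes first; the per-request question
    goes last, so the cached prefix is as long as possible."""
    return [
        {"role": "system", "content": shared_context},
        {"role": "user", "content": question},
    ]

# Both requests share the same prefix, so the second benefits from the cache:
m1 = build_messages("You are a financial analyst. Report: ...", "What changed in Q3?")
m2 = build_messages("You are a financial analyst. Report: ...", "Any one-off charges?")
print(m1[0] == m2[0])  # True
```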

Streaming Responses

For interactive applications or notebooks where you want to see output as it’s generated:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["CRUSOE_API_KEY"],
    base_url="https://api.crusoe.ai/v1",
)

stream = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Python function that implements gradient checkpointing for a transformer model."},
    ],
    max_tokens=1024,
    temperature=0.7,
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    choice = chunk.choices[0]

    if choice.delta.content is not None:
        print(choice.delta.content, end="", flush=True)

    if choice.finish_reason is not None:
        print(f"\n\n[Stream ended. Reason: {choice.finish_reason}]")
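
If you also need the full text after streaming – for logging, or to append to a conversation history – collect the deltas as they arrive. A small sketch, shown here with stub chunk objects so it runs without an API call:

```python
from types import SimpleNamespace

def collect_stream(stream) -> str:
    """Concatenate the text deltas from a chat-completions stream."""
    parts = []
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content is not None:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)

# Stub chunks shaped like the SDK's stream events, for demonstration:
fake_stream = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="Hello, "))]),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="world"))]),
    SimpleNamespace(choices=[]),  # some events carry no choices
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=None))]),
]
print(collect_stream(fake_stream))  # Hello, world
```

Pass the real `stream` object from the example above in place of `fake_stream`.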

Document QA: Where MemoryAlloy Pays Off

MemoryAlloy’s performance advantage is most evident when you repeatedly query the same context. Document QA is a textbook case – you have a large document and multiple questions about it. Here’s an example that loads a document and asks multiple questions against it. Notice how the system prompt (which contains the document) stays the same across requests. MemoryAlloy caches the KV segments for that shared prefix, so the second and subsequent questions skip the expensive prefill:

import os
import time
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["CRUSOE_API_KEY"],
    base_url="https://api.crusoe.ai/v1",
)

# Simulate a long document as the shared context
document = """
[Your document content here — financial report, research paper, codebase, etc.
The longer this is, the more MemoryAlloy helps on subsequent queries,
because the prefill for this prefix is computed once and reused.]
"""

system_prompt = f"""You are an analyst. Answer questions based only on the following document.
If the answer is not in the document, say so.

Document:
{document}
"""

questions = [
    "What were the key findings?",
    "What methodology was used?",
    "What are the limitations mentioned?",
    "Summarize the conclusions in three sentences.",
]

for i, question in enumerate(questions):
    start = time.time()

    response = client.chat.completions.create(
        model="Qwen/Qwen3-235B-A22B-Instruct-2507",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        max_tokens=512,
        temperature=0.3,
    )

    elapsed = time.time() - start
    print(f"\nQuestion {i+1}: {question}")
    print(f"Response time: {elapsed:.2f}s")
    print(f"Answer: {response.choices[0].message.content[:200]}...")

After the first question processes the full document prefix, subsequent questions should return noticeably faster. The exact speedup depends on the document length – Crusoe’s benchmarks show the gap grows with longer contexts.

Multi-Turn Chat

Another pattern where MemoryAlloy helps: conversations that build up history. Each turn appends to the same prefix, and MemoryAlloy keeps the previously computed KV segments cached:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["CRUSOE_API_KEY"],
    base_url="https://api.crusoe.ai/v1",
)

conversation = [
    {"role": "system", "content": "You are a senior ML engineer helping debug distributed training issues."},
]

def chat(user_message):
    conversation.append({"role": "user", "content": user_message})

    response = client.chat.completions.create(
        model="Qwen/Qwen3-235B-A22B-Instruct-2507",
        messages=conversation,
        max_tokens=512,
        temperature=0.7,
    )

    assistant_message = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": assistant_message})
    return assistant_message

# First turn — full prefill
print(chat("My training job hangs after the first epoch when using 4 GPUs with DDP. No error messages."))

# Subsequent turns — MemoryAlloy reuses the cached prefix
print(chat("I'm using PyTorch 2.1 with NCCL backend. The hang happens at the gradient sync step."))

print(chat("Would switching to FSDP help, or is this likely a NCCL configuration issue?"))
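
Long conversations eventually approach the model’s context window, so you’ll want to trim history. One cache-friendly approach is to keep the system message (so the front of the prompt – the cacheable prefix – stays stable) and drop the oldest user/assistant pairs. A rough sketch of our own, using character count as a crude proxy for tokens:

```python
def trim_history(messages: list, max_chars: int = 24_000) -> list:
    """Drop the oldest user/assistant pairs until the conversation fits a
    rough character budget. The system message (index 0) is always kept,
    so the start of the prompt stays stable and cacheable. Character
    count is a crude proxy for tokens; use a real tokenizer if you need
    precision."""
    system, turns = messages[:1], messages[1:]
    while turns and sum(len(m["content"]) for m in system + turns) > max_chars:
        turns = turns[2:]  # drop the oldest user/assistant pair
    return system + turns
```

Call `trim_history(conversation)` before each request once conversations grow long.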

Running Inference as a Saturn Cloud Job

For batch workloads – processing a dataset, evaluating a model across test cases, generating synthetic data – you can run the inference code as a scheduled or on-demand Saturn Cloud job. This keeps your notebook free and lets you process larger volumes.

Here’s a script you might run as a Saturn Cloud job to process a batch of prompts:

import os
import json
import time
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["CRUSOE_API_KEY"],
    base_url="https://api.crusoe.ai/v1",
)

# Load your batch of prompts (from a file, database, S3, etc.)
prompts = [
    "Summarize the key risks in this quarterly filing.",
    "Extract all numerical claims from this research abstract.",
    "Classify this support ticket as billing, technical, or account.",
]

results = []

for i, prompt in enumerate(prompts):
    start = time.time()

    response = client.chat.completions.create(
        model="Qwen/Qwen3-235B-A22B-Instruct-2507",
        messages=[
            {"role": "system", "content": "You are a helpful assistant. Be concise."},
            {"role": "user", "content": prompt},
        ],
        max_tokens=256,
        temperature=0.3,
    )

    elapsed = time.time() - start
    results.append({
        "prompt": prompt,
        "response": response.choices[0].message.content,
        "latency_seconds": round(elapsed, 3),
        "model": response.model,
        "usage": {
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens,
        },
    })

    print(f"Processed {i+1}/{len(prompts)} in {elapsed:.2f}s")

# Save results
with open("inference_results.json", "w") as f:
    json.dump(results, f, indent=2)

print(f"\nDone. Processed {len(results)} prompts.")
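
The sequential loop above is simple but leaves throughput on the table. For larger batches you can fan requests out across threads; here’s a minimal sketch where the worker is any callable, shown with a stub so it runs without an API key:

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(call_model, prompts, max_workers=4):
    """Fan prompts out across worker threads, preserving input order.
    `call_model` is any function taking a prompt string and returning a
    response string -- in practice, a thin wrapper around
    client.chat.completions.create."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_model, prompts))

# Stub worker for demonstration; swap in the real API call.
print(run_batch(lambda p: p.upper(), ["a", "b", "c"]))  # ['A', 'B', 'C']
```

Keep `max_workers` modest to stay within whatever rate limits apply to your key.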

Using DeepSeek R1 for Reasoning Tasks

For tasks that benefit from chain-of-thought reasoning – code generation, math, and multi-step analysis – DeepSeek R1 is available on Crusoe with a 160K context window:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["CRUSOE_API_KEY"],
    base_url="https://api.crusoe.ai/v1",
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-0528",
    messages=[
        {"role": "user", "content": """I have a PyTorch training loop that processes 1000 samples/second on a single H100.
When I scale to 8 H100s with DDP, I only get 5200 samples/second instead of the expected ~8000.

What are the most likely bottlenecks, and how would you diagnose each one?
Include specific commands or code I can run to identify the issue."""},
    ],
    max_tokens=2048,
    temperature=0.6,
)

print(response.choices[0].message.content)
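
DeepSeek R1 typically emits its chain-of-thought wrapped in <think>...</think> tags inside the returned content (the exact format can vary by deployment). If you only want the final answer, a small parsing sketch of our own:

```python
import re

def split_reasoning(text: str):
    """Split <think>...</think> reasoning from the final answer.
    Returns (reasoning, answer); reasoning is "" when no tags are present."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    reasoning = m.group(1).strip()
    answer = (text[:m.start()] + text[m.end():]).strip()
    return reasoning, answer

reasoning, answer = split_reasoning("<think>Check NCCL first.</think>Start with nccl-tests.")
print(answer)  # Start with nccl-tests.
```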

When to Use Crusoe Managed Inference vs. Self-Hosted

Crusoe Managed Inference is a good fit when you want to call open-source models without managing the serving infrastructure yourself. You get the performance benefits of MemoryAlloy without setting up vLLM, TensorRT-LLM, or any other serving stack. If you need to serve a fine-tuned model that isn’t in Crusoe’s model catalog, or you need full control over the serving configuration (batch sizes, quantization settings, custom tokenizers), self-hosting on Saturn Cloud with vLLM or NVIDIA NIM is the better path. Saturn Cloud supports both; you can use Managed Inference for off-the-shelf models and self-host your custom models on the same platform.

Pricing

Crusoe Managed Inference uses per-token pricing. You can check current rates for each model on the Intelligence Foundry model cards. For production workloads with predictable volume, Crusoe also offers provisioned throughput pricing with reserved capacity.
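
Since the API returns token counts in response.usage, you can estimate per-request cost directly. A sketch with placeholder rates – substitute the real per-million-token prices from the model card:

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_rate: float, output_rate: float) -> float:
    """Dollar cost given per-million-token rates. The rates used in the
    example below are placeholders, not Crusoe's actual pricing."""
    return (prompt_tokens * input_rate + completion_tokens * output_rate) / 1_000_000

# 120k prompt tokens + 8k completion tokens at hypothetical $0.60 / $2.40 per 1M:
print(round(estimate_cost(120_000, 8_000, 0.60, 2.40), 4))  # 0.0912
```

Note that prefix caching reduces latency and compute, but billed prompt tokens follow whatever Crusoe’s pricing terms specify.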

Getting Started

If your team is already running Saturn Cloud on Crusoe, you can start using Managed Inference immediately by generating an API key and pointing the OpenAI SDK at api.crusoe.ai. If you’re on Saturn Cloud with a different cloud provider, the API is still accessible over the public internet, though you’ll get the lowest latency from a Crusoe-hosted Saturn Cloud deployment. To set up Saturn Cloud on Crusoe, contact support@saturncloud.io with your Crusoe project ID and requirements.