Saturn Cloud on Crusoe: Platform Architecture

How to deploy Saturn Cloud on Crusoe for teams that need H100, H200, and GB200 GPUs without hyperscaler quota constraints.

The Platform Engineer’s Problem

If you’re running platform engineering for an AI team, you likely have a backlog of projects: integrating the training pipeline with your feature store, building compliance automation for your industry’s data retention requirements, setting up model drift monitoring specific to your domain, optimizing the data loader for your particular workloads.

Before you can work on that backlog, you’re frontline support for issues that eat up your week:

  • IAM/RBAC: explaining permissions, debugging access issues
  • Cost tracking: manual usage reports, idle instance cleanup
  • Certificate management: TLS renewal, cert-manager debugging
  • Image building: Docker troubleshooting, registry management, CVE patching
  • Resource quotas: constant adjustment requests
  • Network debugging: service-to-service communication, VPN issues
  • Kubernetes upgrades: compatibility testing, deprecated API fixes
  • Logging: pipeline maintenance, retention management
  • User support: access setup, answering the same questions

Building this infrastructure takes a few weeks. The problem is maintaining it. Two years of edge cases, user requests, security patches, and daily operational interrupts.

Crusoe provides H100 GPUs at $3.90/hour (or $1.60/hour spot with 7-day advance notice) without the quota approvals and multi-month waitlists common on hyperscalers. Saturn Cloud handles the operational baseline on top, so AI/ML engineers get workspaces, jobs, and deployments without your team building and maintaining that infrastructure.

The Stack

Layer | Provider | What’s Included
GPU Compute | Crusoe | H100, H200, B200, GB200, AMD MI300X/MI355X. InfiniBand NDR 400 Gb/s. Spot with 7-day notice.
Kubernetes | Crusoe | Managed Kubernetes (CMK), control plane, node autoscaling, GPU drivers.
Block Storage | Crusoe | Persistent disks (Lightbits Labs), shared filesystem (VAST Data) for multi-node access.
Platform | Saturn Cloud | Workspaces, jobs, deployments, SSO, cost tracking, multi-node training coordination.
Support | Saturn Cloud | AI/ML engineers contact Saturn Cloud directly for workspace and job issues.

All compute, storage, and networking stays in your Crusoe account. Saturn Cloud only accesses the Kubernetes API to manage the platform components.
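
If you want to verify that boundary yourself, Kubernetes can list exactly what a service account is allowed to do. A minimal sketch; the namespace and service account names below are placeholders for whatever the installer actually creates:

# Hedged sketch: audit what the Saturn Cloud operator can touch in your cluster.
# "saturn" and "saturn-operator" are placeholder names; substitute the real ones.
kubectl auth can-i --list \
    --as=system:serviceaccount:saturn:saturn-operator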

Getting Started

Option 1: Managed Installation

Saturn Cloud runs the Terraform and operator deployment for you.

  1. Have a Crusoe account with a project. Note your project ID.
  2. Email support@saturncloud.io with your organization name, Crusoe project ID, and requirements (GPU types, region).
  3. Provide Saturn Cloud a service account with resource creation permissions.
  4. Saturn Cloud provisions the cluster (15-30 minutes).
  5. Receive your Saturn Cloud URL and admin credentials.

Option 2: Self-Service Installation

Run the reference Terraform yourself.

1. Register your organization:

curl -X POST https://manager.saturnenterprise.io/api/v2/customers/register \
    -H "Content-Type: application/json" \
    -d '{
      "name": "your-organization-name",
      "email": "your-email@example.com",
      "cloud": "crusoe"
    }'

2. Activate your account. You’ll receive an activation email with a token. Either click the activation link or run:

curl -X POST https://manager.saturnenterprise.io/v2/activate \
    -H "Content-Type: application/json" \
    -d '{"token": "YOUR_ACTIVATION_TOKEN"}'

After activation, you’ll receive a terraform.tfvars file containing a 4-hour bootstrap token. If the token expires before you deploy, regenerate it:

curl -X POST https://manager.saturnenterprise.io/v2/resend-setup \
    -H "Content-Type: application/json" \
    -d '{"name": "your-organization-name", "email": "your-email@example.com"}'

3. Clone the reference Terraform:

git clone https://github.com/saturncloud/saturncloud-reference-terraform.git
cd saturncloud-reference-terraform/crusoe

4. Configure and deploy. Copy your terraform.tfvars into the directory, then:

terraform init
terraform plan
terraform apply

The Terraform provisions the CMK cluster, node groups, InfiniBand partitions, and installs the Saturn Cloud operator.

5. Verify:

export KUBECONFIG=./kubeconfig
kubectl get nodes
kubectl get pods -A

GPU nodes scale from zero and appear when users create GPU workloads. The installation typically completes in 15-30 minutes.
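
Because GPU node groups start at zero, a freshly installed cluster shows only the non-GPU system nodes. A quick way to watch the first scale-up, using standard kubectl:

# Watch nodes join as the autoscaler reacts to GPU workloads
kubectl get nodes -w

# Pods waiting on a scale-up show up as Pending
kubectl get pods -A --field-selector=status.phase=Pending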

Architecture

Infrastructure Layer

The reference Terraform provisions a Crusoe Managed Kubernetes (CMK) cluster with separate node groups for different workload types. A VPC network and subnet are created for cluster networking.

For multi-node GPU workloads, Terraform creates InfiniBand partitions. Crusoe’s crusoe_ib_networks data source exposes IB network capacity through the API, so you can check availability before provisioning rather than discovering capacity issues mid-deployment. Each 8-GPU instance consumes 8 slices from the IB network; the Terraform filters for networks with sufficient capacity and fails fast at plan time if none exist.

data "crusoe_ib_networks" "available" {}

locals {
  required_slices = var.vm_count * 8
  suitable_networks = [
    for net in data.crusoe_ib_networks.available.ib_networks :
    net if net.location == var.location && anytrue([
      for cap in net.capacities :
      cap.slice_type == var.slice_type && cap.quantity >= local.required_slices
    ])
  ]
}

resource "crusoe_ib_partition" "training" {
  name          = "training-partition"
  ib_network_id = local.suitable_networks[0].id
}

GPU instances attach to IB partitions via the host_channel_adapters block. All instances in the same partition can communicate over InfiniBand at 400 Gb/s. Node groups carry node.saturncloud.io/role labels for scheduling, and GPU nodes use Crusoe’s ubuntu22.04-nvidia-sxm-docker image with pre-installed drivers.
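
To see these labels and the advertised GPUs on a live cluster, standard kubectl queries are enough. A small sketch; only the node.saturncloud.io/role label key is documented above, so treat the node name as a placeholder:

# Show the Saturn role label for each node
kubectl get nodes -L node.saturncloud.io/role

# Confirm a GPU node advertises its GPUs to the scheduler
# (replace <gpu-node-name> with a node from the previous command)
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu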

Platform Layer

Saturn Cloud installs via a Kubernetes operator that manages platform components as custom resources. The saturn-helm-operator follows the standard operator pattern: it watches CRDs and reconciles Helm releases every 2 minutes. It ships as a Helm chart from oci://ghcr.io/saturncloud/charts/saturn-helm-operator-crusoe.
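
Since the chart lives in a public OCI registry, you can inspect it before Terraform installs it. A sketch using standard Helm commands; add --version if the registry requires an explicit tag:

# Inspect the operator chart without installing it
helm show chart oci://ghcr.io/saturncloud/charts/saturn-helm-operator-crusoe

# Render the default manifests locally for review
# (may need values the installer normally supplies, e.g. the bootstrap token)
helm template saturn-operator oci://ghcr.io/saturncloud/charts/saturn-helm-operator-crusoe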

The bootstrap process:

  1. Terraform provisions the CMK cluster and node groups
  2. Terraform installs the saturn-helm-operator via Helm with a short-lived bootstrap token
  3. The operator exchanges this bootstrap token for a long-lived token and stores it in cluster secrets
  4. The operator creates custom resources for each Saturn component
  5. The operator reconciles those CRs into Helm releases, installing all services
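
Once reconciliation finishes, the result is visible with ordinary kubectl. A rough sketch; the saturn namespace is an assumption, so adjust to what the reference Terraform actually creates:

# Saturn's custom resource definitions registered by the operator
kubectl get crds | grep -i saturn

# The operator and the components it reconciled (namespace name may differ)
kubectl get pods -n saturn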

Core services:

Component | Function
Atlas | API server and PostgreSQL-backed database for resource management
Auth-server | RS256 JWT tokens for user sessions and API access
Traefik | Ingress routing to workspaces, jobs, deployments, and UI
SSH-proxy | Gateway for IDE connections (VS Code, PyCharm, Cursor)
Cluster-autoscaler | Scales Crusoe node groups based on pending pods

Infrastructure services include cert-manager for TLS, Fluent Bit for logging, and Prometheus for metrics.

What Saturn Cloud Handles

Resource | Purpose | Access Methods
Workspaces | Development environments with persistent home directories | JupyterLab, RStudio, SSH (VS Code/PyCharm/Cursor)
Jobs | Training runs from Git, scheduled or on-demand | Single-node or multi-node distributed
Deployments | Long-running services behind authenticated endpoints | Model APIs, dashboards

Multi-node distributed training: Each worker needs its rank, the world size, and the leader address injected before the script starts. Workers must land on the same InfiniBand partition, or communication bottlenecks at 25 Gb/s instead of 400 Gb/s. NCCL needs the correct environment variables. When training fails, you need correlated logs from all nodes. Saturn Cloud handles the coordination: automatic environment setup (SATURN_JOB_LEADER, SATURN_JOB_RANK, worker DNS), IB partition scheduling, NCCL configuration, and log aggregation. You handle checkpointing and recovery logic.
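
As a rough illustration of what the injected environment replaces, a per-node launch command can consume those variables directly. A minimal sketch assuming PyTorch's torchrun; NUM_NODES, GPUS_PER_NODE, the port, and the script path are placeholders, not documented Saturn names:

# Hedged sketch: launch one node of a DDP job from Saturn-injected values.
# SATURN_JOB_LEADER and SATURN_JOB_RANK come from Saturn Cloud; everything
# else here is a placeholder you would set or derive in your own job config.
torchrun \
    --nnodes="${NUM_NODES}" \
    --node_rank="${SATURN_JOB_RANK}" \
    --nproc_per_node="${GPUS_PER_NODE:-8}" \
    --master_addr="${SATURN_JOB_LEADER}" \
    --master_port=29500 \
    train.py --checkpoint-dir /shared/checkpoints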

IAM/RBAC: User management, groups, project-based access, SSO integration.

Cost tracking: Per-user and per-project usage reports, not just cluster-level costs. Integrates with your internal cost allocation.

Certificate management: Automated TLS provisioning and renewal via cert-manager.

Image management: Pre-built images with NVIDIA libraries (CUDA, NeMo, RAPIDS), or bring your own from any registry.

Platform upgrades: Kubernetes compatibility, operator updates, security patches. Typically every 6 months, causing 1-2 minutes of UI/API downtime. User workloads continue running during upgrades.

AI/ML engineers contact Saturn Cloud support directly for workspace and job questions. Platform engineers work on their actual project backlog.

What You Handle

Training code and checkpointing: Saturn Cloud coordinates multi-node jobs; you write the training logic and decide checkpointing strategy.

Data pipelines: Saturn Cloud doesn’t manage your feature stores, ETL, or data versioning. Your existing tools (Prefect, Dagster, Airflow) run alongside Saturn Cloud in the same cluster.

Compliance specifics: Saturn Cloud provides the platform primitives (SSO, audit logs, network isolation). Industry-specific compliance automation is yours.

Object storage: Crusoe doesn’t have native managed object storage. Deploy MinIO on Crusoe, use third-party S3-compatible services (Backblaze B2, Cloudflare R2), or access existing hyperscaler storage during hybrid migration.
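
As one example, a self-hosted MinIO is a single Helm install into the same cluster. A sketch assuming the upstream MinIO chart; the release name, namespace, and sizing are illustrative:

# Hedged sketch: deploy MinIO for S3-compatible object storage in-cluster
helm repo add minio https://charts.min.io/
helm install object-store minio/minio \
    --namespace minio --create-namespace \
    --set mode=distributed \
    --set persistence.size=500Gi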

GPU Options and Pricing

Crusoe operates in US (Virginia, Texas), Canada (Alberta), and EU (Iceland, Norway).

GPU | Memory | On-Demand | Spot | Notes
H100 SXM | 80GB HBM3 | $3.90/GPU-hr | $1.60/GPU-hr | 8-GPU instances, InfiniBand
H200 SXM | 141GB HBM3e | $4.29/GPU-hr | - | 8-GPU instances, InfiniBand
B200 | 192GB HBM3e | Contact Crusoe | - | Blackwell architecture
GB200 | Blackwell | Contact Crusoe | - | Grace-Blackwell
MI300X | 192GB HBM3 | $3.45/GPU-hr | $0.95/GPU-hr | AMD, ROCm

Spot instances include 7-day advance notice before interruption, which is enough time to complete most training runs or migrate to on-demand. Reserved capacity contracts offer up to 81% savings on 3-year terms.

For comparison: an 8-GPU H100 instance runs $31.20/hour on Crusoe versus $88-98/hour on Azure/GCP.

Customization and Portability

Customization: Saturn Cloud runs your Docker containers with code from your Git repositories. Node pool configurations are customizable via Terraform. You can deploy additional services (Prefect, Dagster, ClickHouse, Datadog) into the same cluster and Saturn Cloud workloads can connect to them.
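
Connectivity works through ordinary Kubernetes service DNS: a Saturn workspace or job reaches anything else in the cluster by service name. A sketch with placeholder service and namespace names; the ports are the upstream defaults for Prefect and ClickHouse:

# From inside a Saturn workspace or job; names are placeholders for your deployments
curl http://prefect-server.prefect.svc.cluster.local:4200/api/health
curl http://clickhouse.analytics.svc.cluster.local:8123/ping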

Portability: Your workloads are standard Kubernetes pods. Resource configurations export as YAML recipes via CLI or API. All data stays in your Crusoe account. If you stop using Saturn Cloud, containers run on standard Kubernetes without modification.

Failure modes: If the saturn-helm-operator stops reconciling, existing workloads continue running, but new workloads fail after 12 hours when registry credentials expire, and SSL certificates lapse after 3 months. Saturn Cloud support receives automatic alerts when the operator stops.

When This Isn’t the Right Fit

You need direct pod spec control: Saturn Cloud is opinionated about how workspaces, jobs, and deployments work. You can’t use custom scheduler plugins or modify operator internals. If you need that level of control, build on raw Kubernetes.

You’re spending under $10k/month on GPU: Migration effort may not pay off at smaller scale.

You need regions Crusoe doesn’t cover: Crusoe operates in US, Canada, and EU. No Asia-Pacific, Middle East, or Latin America yet.

Saturn Cloud doesn’t have to be your entire platform. It handles GPU workspaces and training jobs; your existing orchestration, databases, and internal tools run next to it.

Resources