Saturn Cloud on Crusoe: Platform Architecture

How to deploy Saturn Cloud on Crusoe for teams that need H100, H200, and GB200 GPUs without hyperscaler quota constraints.

The Platform Engineer’s Problem

If you’re running platform engineering for an AI team, you likely have a backlog of projects: integrating the training pipeline with your feature store, building compliance automation for your industry’s data retention requirements, setting up model drift monitoring specific to your domain, optimizing the data loader for your particular workloads.

Before you can work on that backlog, you’re frontline support for issues that eat up your week:

  • IAM/RBAC: explaining permissions, debugging access issues
  • Cost tracking: manual usage reports, idle instance cleanup
  • Certificate management: TLS renewal, cert-manager debugging
  • Image building: Docker troubleshooting, registry management, CVE patching
  • Resource quotas: constant adjustment requests
  • Network debugging: service-to-service communication, VPN issues
  • Kubernetes upgrades: compatibility testing, deprecated API fixes
  • Logging: pipeline maintenance, retention management
  • User support: access setup, answering the same questions

Building this infrastructure takes a few weeks. The problem is maintaining it. Two years of edge cases, user requests, security patches, and daily operational interrupts.

Crusoe provides H100 GPUs at $3.90/hour (or $1.60/hour spot with 7-day advance notice) without the quota approvals and multi-month waitlists common on hyperscalers. Saturn Cloud handles the operational baseline on top, so AI/ML engineers get workspaces, jobs, and deployments without your team building and maintaining that infrastructure.

The Stack

Layer | Provider | What’s Included
GPU Compute | Crusoe | H100, H200, B200, GB200, AMD MI300X/MI355X. InfiniBand NDR 400 Gb/s. Spot with 7-day notice.
Kubernetes | Crusoe | Managed Kubernetes (CMK), control plane, node autoscaling, GPU drivers.
Block Storage | Crusoe | Persistent disks (Lightbits Labs), shared filesystem (VAST Data) for multi-node access.
Platform | Saturn Cloud | Workspaces, jobs, deployments, SSO, cost tracking, multi-node training coordination.
Support | Saturn Cloud | AI/ML engineers contact Saturn Cloud directly for workspace and job issues.

All compute, storage, and networking stays in your Crusoe account. Saturn Cloud only accesses the Kubernetes API to manage the platform components.
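
If you want to verify that boundary yourself, Kubernetes can list exactly what a service account is allowed to do. A minimal sketch; the namespace and service account names below are placeholders for whatever the installer actually creates:

# Hedged sketch: audit what the Saturn Cloud operator can touch in your cluster.
# "saturn" and "saturn-operator" are placeholder names; substitute the real ones.
kubectl auth can-i --list \
    --as=system:serviceaccount:saturn:saturn-operator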

Getting Started

Option 1: Managed Installation

Saturn Cloud runs the Terraform and operator deployment for you.

  1. Have a Crusoe account with a project. Note your project ID.
  2. Email support@saturncloud.io with your organization name, Crusoe project ID, and requirements (GPU types, region).
  3. Provide Saturn Cloud a service account with resource creation permissions.
  4. Saturn Cloud provisions the cluster (15-30 minutes).
  5. Receive your Saturn Cloud URL and admin credentials.

Option 2: Self-Service Installation

Run the reference Terraform yourself.

1. Register your organization:

curl -X POST https://manager.saturnenterprise.io/api/v2/customers/register \
    -H "Content-Type: application/json" \
    -d '{
      "name": "your-organization-name",
      "email": "your-email@example.com",
      "cloud": "crusoe"
    }'

2. Activate your account. You’ll receive an activation email with a token. Either click the activation link or run:

curl -X POST https://manager.saturnenterprise.io/v2/activate \
    -H "Content-Type: application/json" \
    -d '{"token": "YOUR_ACTIVATION_TOKEN"}'

After activation, you’ll receive a terraform.tfvars file containing a 4-hour bootstrap token. If the token expires before you deploy, regenerate it:

curl -X POST https://manager.saturnenterprise.io/v2/resend-setup \
    -H "Content-Type: application/json" \
    -d '{"name": "your-organization-name", "email": "your-email@example.com"}'

3. Clone the reference Terraform:

git clone https://github.com/saturncloud/saturncloud-reference-terraform.git
cd saturncloud-reference-terraform/crusoe

4. Configure and deploy. Copy your terraform.tfvars into the directory, then:

terraform init
terraform plan
terraform apply

The Terraform provisions the CMK cluster, node groups, InfiniBand partitions, and installs the Saturn Cloud operator.

5. Verify:

export KUBECONFIG=./kubeconfig
kubectl get nodes
kubectl get pods -A

GPU nodes scale from zero and appear when users create GPU workloads. The installation typically completes in 15-30 minutes.
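
Because GPU node groups start at zero, a freshly installed cluster shows only the non-GPU system nodes. A quick way to watch the first scale-up, using standard kubectl:

# Watch nodes join as the autoscaler reacts to GPU workloads
kubectl get nodes -w

# Pods waiting on a scale-up show up as Pending
kubectl get pods -A --field-selector=status.phase=Pending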

Architecture

Infrastructure Layer

The reference Terraform provisions a Crusoe Managed Kubernetes (CMK) cluster with separate node groups for different workload types. A VPC network and subnet are created for cluster networking.

For multi-node GPU workloads, Terraform creates InfiniBand partitions. Crusoe’s crusoe_ib_networks data source exposes IB network capacity through the API, so you can check availability before provisioning rather than discovering capacity issues mid-deployment. Each 8-GPU instance consumes 8 slices from the IB network; the Terraform filters for networks with sufficient capacity and fails fast at plan time if none exist.

data "crusoe_ib_networks" "available" {}

locals {
  required_slices = var.vm_count * 8
  suitable_networks = [
    for net in data.crusoe_ib_networks.available.ib_networks :
    net if net.location == var.location && anytrue([
      for cap in net.capacities :
      cap.slice_type == var.slice_type && cap.quantity >= local.required_slices
    ])
  ]
}

resource "crusoe_ib_partition" "training" {
  name          = "training-partition"
  ib_network_id = local.suitable_networks[0].id
}

GPU instances attach to IB partitions via the host_channel_adapters block. All instances in the same partition can communicate over InfiniBand at 400 Gb/s. Node groups carry node.saturncloud.io/role labels for scheduling, and GPU nodes use Crusoe’s ubuntu22.04-nvidia-sxm-docker image with pre-installed drivers.
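
To see these labels and the advertised GPUs on a live cluster, standard kubectl queries are enough. A small sketch; only the node.saturncloud.io/role label key is documented above, so treat the node name as a placeholder:

# Show the Saturn role label for each node
kubectl get nodes -L node.saturncloud.io/role

# Confirm a GPU node advertises its GPUs to the scheduler
# (replace <gpu-node-name> with a node from the previous command)
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu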

Platform Layer

Saturn Cloud installs via a Kubernetes operator that manages platform components as custom resources. The saturn-helm-operator follows the standard operator pattern: it watches CRDs and reconciles Helm releases every 2 minutes. It ships as a Helm chart from oci://ghcr.io/saturncloud/charts/saturn-helm-operator-crusoe.
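
Since the chart lives in a public OCI registry, you can inspect it before Terraform installs it. A sketch using standard Helm commands; add --version if the registry requires an explicit tag:

# Inspect the operator chart without installing it
helm show chart oci://ghcr.io/saturncloud/charts/saturn-helm-operator-crusoe

# Render the default manifests locally for review
# (may need values the installer normally supplies, e.g. the bootstrap token)
helm template saturn-operator oci://ghcr.io/saturncloud/charts/saturn-helm-operator-crusoe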

The bootstrap process:

  1. Terraform provisions the CMK cluster and node groups
  2. Terraform installs the saturn-helm-operator via Helm with a short-lived bootstrap token
  3. The operator exchanges this bootstrap token for a long-lived token and stores it in cluster secrets
  4. The operator creates custom resources for each Saturn component
  5. The operator reconciles those CRs into Helm releases, installing all services
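
Once reconciliation finishes, the result is visible with ordinary kubectl. A rough sketch; the saturn namespace is an assumption, so adjust to what the reference Terraform actually creates:

# Saturn's custom resource definitions registered by the operator
kubectl get crds | grep -i saturn

# The operator and the components it reconciled (namespace name may differ)
kubectl get pods -n saturn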

Core services:

Component | Function
Atlas | API server and PostgreSQL-backed database for resource management
Auth-server | RS256 JWT tokens for user sessions and API access
Traefik | Ingress routing to workspaces, jobs, deployments, and UI
SSH-proxy | Gateway for IDE connections (VS Code, PyCharm, Cursor)
Cluster-autoscaler | Scales Crusoe node groups based on pending pods

Infrastructure services include cert-manager for TLS, Fluent Bit for logging, and Prometheus for metrics.

What Saturn Cloud Handles

Resource | Purpose | Access Methods
Workspaces | Development environments with persistent home directories | JupyterLab, RStudio, SSH (VS Code/PyCharm/Cursor)
Jobs | Training runs from Git, scheduled or on-demand | Single-node or multi-node distributed
Deployments | Long-running services behind authenticated endpoints | Model APIs, dashboards

Multi-node distributed training: Each worker needs its rank, the world size, and the leader address injected before the script starts. Workers must land on the same InfiniBand partition, or communication bottlenecks at 25 Gb/s instead of 400 Gb/s. NCCL needs the correct environment variables. When training fails, you need correlated logs from all nodes. Saturn Cloud handles the coordination: automatic environment setup (SATURN_JOB_LEADER, SATURN_JOB_RANK, worker DNS), IB partition scheduling, NCCL configuration, and log aggregation. You handle checkpointing and recovery logic.
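
As a rough illustration of what the injected environment replaces, a per-node launch command can consume those variables directly. A minimal sketch assuming PyTorch's torchrun; NUM_NODES, GPUS_PER_NODE, the port, and the script path are placeholders, not documented Saturn names:

# Hedged sketch: launch one node of a DDP job from Saturn-injected values.
# SATURN_JOB_LEADER and SATURN_JOB_RANK come from Saturn Cloud; everything
# else here is a placeholder you would set or derive in your own job config.
torchrun \
    --nnodes="${NUM_NODES}" \
    --node_rank="${SATURN_JOB_RANK}" \
    --nproc_per_node="${GPUS_PER_NODE:-8}" \
    --master_addr="${SATURN_JOB_LEADER}" \
    --master_port=29500 \
    train.py --checkpoint-dir /shared/checkpoints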

IAM/RBAC: User management, groups, project-based access, SSO integration.

Cost tracking: Per-user and per-project usage reports, not just cluster-level costs. Integrates with your internal cost allocation.

Certificate management: Automated TLS provisioning and renewal via cert-manager.

Image management: Pre-built images with NVIDIA libraries (CUDA, NeMo, RAPIDS), or bring your own from any registry.

Platform upgrades: Kubernetes compatibility, operator updates, security patches. Typically every 6 months, causing 1-2 minutes of UI/API downtime. User workloads continue running during upgrades.

AI/ML engineers contact Saturn Cloud support directly for workspace and job questions. Platform engineers work on their actual project backlog.

What You Handle

Training code and checkpointing: Saturn Cloud coordinates multi-node jobs; you write the training logic and decide checkpointing strategy.

Data pipelines: Saturn Cloud doesn’t manage your feature stores, ETL, or data versioning. Your existing tools (Prefect, Dagster, Airflow) run alongside Saturn Cloud in the same cluster.

Compliance specifics: Saturn Cloud provides the platform primitives (SSO, audit logs, network isolation). Industry-specific compliance automation is yours.

Object storage: Crusoe doesn’t have native managed object storage. Deploy MinIO on Crusoe, use third-party S3-compatible services (Backblaze B2, Cloudflare R2), or access existing hyperscaler storage during hybrid migration.
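
As one example, a self-hosted MinIO is a single Helm install into the same cluster. A sketch assuming the upstream MinIO chart; the release name, namespace, and sizing are illustrative:

# Hedged sketch: deploy MinIO for S3-compatible object storage in-cluster
helm repo add minio https://charts.min.io/
helm install object-store minio/minio \
    --namespace minio --create-namespace \
    --set mode=distributed \
    --set persistence.size=500Gi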

GPU Options and Pricing

Crusoe operates in US (Virginia, Texas), Canada (Alberta), and EU (Iceland, Norway).

GPU | Memory | On-Demand | Spot | Notes
H100 SXM | 80GB HBM3 | $3.90/GPU-hr | $1.60/GPU-hr | 8-GPU instances, InfiniBand
H200 SXM | 141GB HBM3e | $4.29/GPU-hr | - | 8-GPU instances, InfiniBand
B200 | 192GB HBM3e | Contact Crusoe | - | Blackwell architecture
GB200 | Blackwell | Contact Crusoe | - | Grace-Blackwell
MI300X | 192GB HBM3 | $3.45/GPU-hr | $0.95/GPU-hr | AMD, ROCm

Spot instances include 7-day advance notice before interruption, which is enough time to complete most training runs or migrate to on-demand. Reserved capacity contracts offer up to 81% savings on 3-year terms.

For comparison: an 8-GPU H100 instance runs $31.20/hour on Crusoe versus $88-98/hour on Azure/GCP.

Customization and Portability

Customization: Saturn Cloud runs your Docker containers with code from your Git repositories. Node pool configurations are customizable via Terraform. You can deploy additional services (Prefect, Dagster, ClickHouse, Datadog) into the same cluster and Saturn Cloud workloads can connect to them.
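
Connectivity works through ordinary Kubernetes service DNS: a Saturn workspace or job reaches anything else in the cluster by service name. A sketch with placeholder service and namespace names; the ports are the upstream defaults for Prefect and ClickHouse:

# From inside a Saturn workspace or job; names are placeholders for your deployments
curl http://prefect-server.prefect.svc.cluster.local:4200/api/health
curl http://clickhouse.analytics.svc.cluster.local:8123/ping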

Portability: Your workloads are standard Kubernetes pods. Resource configurations export as YAML recipes via CLI or API. All data stays in your Crusoe account. If you stop using Saturn Cloud, containers run on standard Kubernetes without modification.

Failure modes: If the saturn-helm-operator stops reconciling, existing workloads continue running, but new workloads fail after 12 hours when registry credentials expire, and SSL certificates lapse after 3 months. Saturn Cloud support receives automatic alerts when the operator stops.

When This Isn’t the Right Fit

You need direct pod spec control: Saturn Cloud is opinionated about how workspaces, jobs, and deployments work. You can’t use custom scheduler plugins or modify operator internals. If you need that level of control, build on raw Kubernetes.

You’re spending under $10k/month on GPU: Migration effort may not pay off at smaller scale.

You need regions Crusoe doesn’t cover: Crusoe operates in US, Canada, and EU. No Asia-Pacific, Middle East, or Latin America yet.

Saturn Cloud doesn’t have to be your entire platform. It handles GPU workspaces and training jobs; your existing orchestration, databases, and internal tools run next to it.

Resources