Saturn Cloud on Crusoe: Platform Architecture

The Platform Engineer’s Problem
If you’re running platform engineering for an AI team, you likely have a backlog of projects: integrating the training pipeline with your feature store, building compliance automation for your industry’s data retention requirements, setting up model drift monitoring specific to your domain, and optimizing the data loader for your particular workloads.
Before you can work on that backlog, you’re frontline support for issues that eat up your week:
- IAM/RBAC: explaining permissions, debugging access issues
- Cost tracking: manual usage reports, idle instance cleanup
- Certificate management: TLS renewal, cert-manager debugging
- Image building: Docker troubleshooting, registry management, CVE patching
- Resource quotas: constant adjustment requests
- Network debugging: service-to-service communication, VPN issues
- Kubernetes upgrades: compatibility testing, deprecated API fixes
- Logging: pipeline maintenance, retention management
- User support: access setup, answering the same questions
Building this infrastructure takes a few weeks. The problem is maintaining it: two years of edge cases, user requests, security patches, and daily operational interrupts.
Crusoe provides H100 GPUs at $3.90/hour (or $1.60/hour spot with 7-day advance notice) without the quota approvals and multi-month waitlists common on hyperscalers. Saturn Cloud handles the operational baseline on top, so AI/ML engineers get workspaces, jobs, and deployments without your team building and maintaining that infrastructure.
Comparing GPU cloud providers?
Download our GPU Cloud Comparison Report analyzing 17 providers across pricing, InfiniBand networking, storage, and enterprise readiness. Includes detailed Crusoe profile with infrastructure specs and use case recommendations.
The Stack
| Layer | Provider | What’s Included |
|---|---|---|
| GPU Compute | Crusoe | H100, H200, B200, GB200, AMD MI300X/MI355X. InfiniBand NDR 400 Gb/s. Spot with 7-day notice. |
| Kubernetes | Crusoe | Managed Kubernetes (CMK), control plane, node autoscaling, GPU drivers. |
| Block Storage | Crusoe | Persistent disks (Lightbits Labs), shared filesystem (VAST Data) for multi-node access. |
| Platform | Saturn Cloud | Workspaces, jobs, deployments, SSO, cost tracking, multi-node training coordination. |
| Support | Saturn Cloud | AI/ML engineers contact Saturn Cloud directly for workspace and job issues. |
All compute, storage, and networking stays in your Crusoe account. Saturn Cloud only accesses the Kubernetes API to manage the platform components.
Getting Started
Option 1: Managed Installation
Saturn Cloud runs the Terraform and operator deployment for you.
1. Have a Crusoe account with a project. Note your project ID.
2. Email support@saturncloud.io with your organization name, Crusoe project ID, and requirements (GPU types, region).
3. Provide Saturn Cloud a service account with resource creation permissions.
4. Saturn Cloud provisions the cluster (15-30 minutes).
5. Receive your Saturn Cloud URL and admin credentials.
Option 2: Self-Service Installation
Run the reference Terraform yourself.
1. Register your organization:
curl -X POST https://manager.saturnenterprise.io/api/v2/customers/register \
  -H "Content-Type: application/json" \
  -d '{
    "name": "your-organization-name",
    "email": "your-email@example.com",
    "cloud": "crusoe"
  }'
2. Activate your account. You’ll receive an activation email with a token. Either click the activation link or run:
curl -X POST https://manager.saturnenterprise.io/v2/activate \
  -H "Content-Type: application/json" \
  -d '{"token": "YOUR_ACTIVATION_TOKEN"}'
After activation, you’ll receive a terraform.tfvars file containing a 4-hour bootstrap token. If the token expires before you deploy, regenerate it:
curl -X POST https://manager.saturnenterprise.io/v2/resend-setup \
  -H "Content-Type: application/json" \
  -d '{"name": "your-organization-name", "email": "your-email@example.com"}'
3. Clone the reference Terraform:
git clone https://github.com/saturncloud/saturncloud-reference-terraform.git
cd saturncloud-reference-terraform/crusoe
4. Configure and deploy. Copy your terraform.tfvars into the directory, then:
terraform init
terraform plan
terraform apply
The Terraform provisions the CMK cluster, node groups, and InfiniBand partitions, then installs the Saturn Cloud operator.
5. Verify:
export KUBECONFIG=./kubeconfig
kubectl get nodes
kubectl get pods -A
GPU nodes scale from zero and appear when users create GPU workloads. The installation typically completes in 15-30 minutes.
Architecture
Infrastructure Layer
The reference Terraform provisions a Crusoe Managed Kubernetes (CMK) cluster with separate node groups for different workload types. A VPC network and subnet are created for cluster networking.
For multi-node GPU workloads, Terraform creates InfiniBand partitions. Crusoe’s crusoe_ib_networks data source exposes IB network capacity through the API, so you can check availability before provisioning rather than discovering capacity issues mid-deployment. Each 8-GPU instance consumes 8 slices from the IB network; the Terraform filters for networks with sufficient capacity and fails fast at plan time if none exist.
data "crusoe_ib_networks" "available" {}
locals {
required_slices = var.vm_count * 8
suitable_networks = [
for net in data.crusoe_ib_networks.available.ib_networks :
net if net.location == var.location && anytrue([
for cap in net.capacities :
cap.slice_type == var.slice_type && cap.quantity >= local.required_slices
])
]
}
resource "crusoe_ib_partition" "training" {
name = "training-partition"
ib_network_id = local.suitable_networks[0].id
}
GPU instances attach to IB partitions via the host_channel_adapters block. All instances in the same partition can communicate over InfiniBand at 400 Gb/s. Node groups carry node.saturncloud.io/role labels for scheduling, and GPU nodes use Crusoe’s ubuntu22.04-nvidia-sxm-docker image with pre-installed drivers.
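Once the cluster is up, you can inspect those labels directly. The first command below is a plain read-only check; the gpu-worker value in the second is an assumed example, not a documented label value:

# Show every node with its Saturn Cloud role label as a column
kubectl get nodes -L node.saturncloud.io/role

# Filter to a single role; "gpu-worker" is an assumed example value,
# so check the output above for the values your node groups actually use
kubectl get nodes -l node.saturncloud.io/role=gpu-worker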
Platform Layer
Saturn Cloud installs via a Kubernetes operator that manages platform components as custom resources. The saturn-helm-operator follows the standard operator pattern: it watches CRDs and reconciles Helm releases every 2 minutes. It ships as a Helm chart from oci://ghcr.io/saturncloud/charts/saturn-helm-operator-crusoe.
The bootstrap process:
- Terraform provisions the CMK cluster and node groups
- Terraform installs the saturn-helm-operator via Helm with a short-lived bootstrap token
- The operator exchanges this bootstrap token for a long-lived token and stores it in cluster secrets
- The operator creates custom resources for each Saturn component
- The operator reconciles those CRs into Helm releases, installing all services
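To confirm the bootstrap completed, a couple of read-only checks are usually enough. The grep pattern below assumes the Saturn CRDs carry "saturn" in their names, which may vary between releases:

# Helm releases the operator has reconciled (one per Saturn component)
helm list -A

# Custom resource definitions the operator manages
kubectl get crds | grep -i saturn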
Core services:
| Component | Function |
|---|---|
| Atlas | API server and PostgreSQL-backed database for resource management |
| Auth-server | RS256 JWT tokens for user sessions and API access |
| Traefik | Ingress routing to workspaces, jobs, deployments, and UI |
| SSH-proxy | Gateway for IDE connections (VS Code, PyCharm, Cursor) |
| Cluster-autoscaler | Scales Crusoe node groups based on pending pods |
Infrastructure services include cert-manager for TLS, Fluent Bit for logging, and Prometheus for metrics.
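A quick way to confirm those infrastructure services are running is to filter the pod list; exact pod and namespace names can vary by install:

# Look for the TLS, logging, and metrics components across all namespaces
kubectl get pods -A | grep -Ei 'cert-manager|fluent-bit|prometheus'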
What Saturn Cloud Handles
| Resource | Purpose | Access Methods |
|---|---|---|
| Workspaces | Development environments with persistent home directories | JupyterLab, RStudio, SSH (VS Code/PyCharm/Cursor) |
| Jobs | Training runs from Git, scheduled or on-demand | Single-node or multi-node distributed |
| Deployments | Long-running services behind authenticated endpoints | Model APIs, dashboards |
Multi-node distributed training: Each worker needs rank, world size, and leader address injected before the script starts. Workers must land on the same InfiniBand partition or communication bottlenecks at 25 Gbps instead of 400 Gbps. NCCL needs correct environment variables. When training fails, you need correlated logs from all nodes. Saturn Cloud handles the coordination: automatic environment setup (SATURN_JOB_LEADER, SATURN_JOB_RANK, worker DNS), IB partition scheduling, NCCL configuration, and log aggregation. You handle checkpointing and recovery logic.
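To make the coordination concrete, here is a minimal launch sketch for a PyTorch job that maps the injected variables onto torchrun. SATURN_JOB_LEADER and SATURN_JOB_RANK come from Saturn Cloud; NNODES, the rendezvous port, and train.py are placeholders to adapt:

#!/bin/bash
# Launch sketch: feed Saturn Cloud's injected coordination variables to torchrun.
# NNODES, the rendezvous port, and train.py are assumptions for illustration.
set -euo pipefail

NNODES=${NNODES:-2}    # total workers in the job (placeholder)
RDZV_PORT=29500        # rendezvous port (placeholder)

torchrun \
  --nnodes="${NNODES}" \
  --node_rank="${SATURN_JOB_RANK}" \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${SATURN_JOB_LEADER}:${RDZV_PORT}" \
  train.py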
IAM/RBAC: User management, groups, project-based access, SSO integration.
Cost tracking: Per-user and per-project usage reports, not just cluster-level costs. Integrates with your internal cost allocation.
Certificate management: Automated TLS provisioning and renewal via cert-manager.
Image management: Pre-built images with NVIDIA libraries (CUDA, NeMo, RAPIDS), or bring your own from any registry.
Platform upgrades: Kubernetes compatibility, operator updates, security patches. Typically every 6 months, causing 1-2 minutes of UI/API downtime. User workloads continue running during upgrades.
AI/ML engineers contact Saturn Cloud support directly for workspace and job questions. Platform engineers work on their actual project backlog.
What You Handle
Training code and checkpointing: Saturn Cloud coordinates multi-node jobs; you write the training logic and decide checkpointing strategy.
Data pipelines: Saturn Cloud doesn’t manage your feature stores, ETL, or data versioning. Your existing tools (Prefect, Dagster, Airflow) run alongside Saturn Cloud in the same cluster.
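As one concrete example, the official Apache Airflow chart installs alongside Saturn Cloud with a standard Helm release; the namespace and release name here are arbitrary:

# Run Airflow in its own namespace next to Saturn Cloud workloads
helm repo add apache-airflow https://airflow.apache.org
helm repo update
helm install airflow apache-airflow/airflow \
  --namespace airflow --create-namespace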
Compliance specifics: Saturn Cloud provides the platform primitives (SSO, audit logs, network isolation). Industry-specific compliance automation is yours.
Object storage: Crusoe doesn’t have native managed object storage. Deploy MinIO on Crusoe, use third-party S3-compatible services (Backblaze B2, Cloudflare R2), or access existing hyperscaler storage during hybrid migration.
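If you take the self-hosted route, MinIO deploys from its Helm chart; the mode and volume size below are placeholders to size for your workload, backed by Crusoe block storage:

# S3-compatible object storage inside the Crusoe cluster
helm repo add minio https://charts.min.io/
helm repo update
helm install minio minio/minio \
  --namespace minio --create-namespace \
  --set mode=standalone \
  --set persistence.size=500Gi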
GPU Options and Pricing
Crusoe operates in US (Virginia, Texas), Canada (Alberta), and EU (Iceland, Norway).
| GPU | Memory | On-Demand | Spot | Notes |
|---|---|---|---|---|
| H100 SXM | 80GB HBM3 | $3.90/GPU-hr | $1.60/GPU-hr | 8-GPU instances, InfiniBand |
| H200 SXM | 141GB HBM3e | $4.29/GPU-hr | - | 8-GPU instances, InfiniBand |
| B200 | 192GB HBM3e | Contact Crusoe | - | Blackwell architecture |
| GB200 | - | Contact Crusoe | - | Grace-Blackwell |
| MI300X | 192GB HBM3 | $3.45/GPU-hr | $0.95/GPU-hr | AMD, ROCm |
Spot instances include 7-day advance notice before interruption, which is enough time to complete most training runs or migrate to on-demand. Reserved capacity contracts offer up to 81% savings on 3-year terms.
For comparison: an 8-GPU H100 instance runs $31.20/hour (8 × $3.90) on Crusoe versus $88-98/hour on Azure/GCP.
Customization and Portability
Customization: Saturn Cloud runs your Docker containers with code from your Git repositories. Node pool configurations are customizable via Terraform. You can deploy additional services (Prefect, Dagster, ClickHouse, Datadog) into the same cluster and Saturn Cloud workloads can connect to them.
Portability: Your workloads are standard Kubernetes pods. Resource configurations export as YAML recipes via CLI or API. All data stays in your Crusoe account. If you stop using Saturn Cloud, containers run on standard Kubernetes without modification.
Failure modes: If the saturn-helm-operator stops reconciling, existing workloads continue running. New workloads fail after 12 hours when registry credentials expire, and TLS certificates stop being renewed (they lapse after 3 months). Saturn Cloud support receives automatic alerts when the operator stops.
When This Isn’t the Right Fit
You need direct pod spec control: Saturn Cloud is opinionated about how workspaces, jobs, and deployments work. You can’t use custom scheduler plugins or modify operator internals. If you need that level of control, build on raw Kubernetes.
You’re spending under $10k/month on GPU compute: Migration effort may not pay off at smaller scale.
You need regions Crusoe doesn’t cover: Crusoe operates in US, Canada, and EU. No Asia-Pacific, Middle East, or Latin America yet.
Saturn Cloud doesn’t have to be your entire platform. It handles GPU workspaces and training jobs; your existing orchestration, databases, and internal tools run next to it.
Saturn Cloud: the platform layer for AI teams
Saturn Cloud handles the infrastructure work every AI team needs but nobody wants to build: dev environments, job scheduling, deployments, SSO, and usage tracking. It runs on your cloud (including Crusoe) so your platform team can focus on what actually differentiates your business. Try the hosted version or talk to our team about enterprise deployment.
Resources
- Crusoe Cloud Documentation
- Crusoe Terraform Provider
- InfiniBand Configuration Guide
- Migrating from Hyperscalers to Crusoe