Saturn Cloud on Nebius: Setup Guide

How to deploy Saturn Cloud on Nebius for teams that need H100 and H200 GPUs without hyperscaler quota constraints.

The Platform Engineer’s Problem

If you’re running the platform for an AI team, you likely have a backlog of projects: integrating the training pipeline with your feature store, building compliance automation for your industry’s data retention requirements, setting up model drift monitoring specific to your domain, optimizing the data loader for your particular workloads.

Before you can work on that backlog, you’re frontline support for issues that eat up your week:

  • IAM/RBAC: explaining permissions, debugging access issues
  • Cost tracking: manual usage reports, idle instance cleanup
  • Certificate management: TLS renewal, cert-manager debugging
  • Image building: Docker troubleshooting, registry management, CVE patching
  • Resource quotas: constant adjustment requests
  • Network debugging: service-to-service communication, VPN issues
  • Kubernetes upgrades: compatibility testing, deprecated API fixes
  • Logging: pipeline maintenance, retention management
  • User support: access setup, answering the same questions

Building this infrastructure takes a few weeks. The problem is maintaining it: two years of edge cases, user requests, security patches, and daily operational interrupts.

The combination of Nebius and Saturn Cloud addresses both the GPU access problem and the platform maintenance burden. Nebius provides H100 GPUs at $2.95/hour per GPU with no quota approvals: an 8-GPU H100 instance runs $23.60/hour on Nebius versus $88-98/hour on Azure or GCP, with no multi-month waitlists. Saturn Cloud handles the operational baseline, detailed below.

Why Nebius

Nebius is a cloud provider focused on AI workloads, operating in US and EU regions.

GPU Availability: On-demand access to H100 and H200 instances without sales calls or capacity reservations. You can provision multi-node clusters with 16-32 GPUs immediately; larger clusters are available through commitment contracts.

Pricing: $2.95/hour per H100 GPU. An 8-GPU instance runs $23.60/hour versus $88-98/hour on Azure or GCP.

Networking: NVLink and InfiniBand come standard on all GPU instances. NFS storage delivers 12 GBps read throughput per 8-GPU VM (compared with AWS EFS at 1.5 GBps max).

Managed Services: Managed Kubernetes, PostgreSQL, and MLflow included.

Why Saturn Cloud

Saturn Cloud adds the platform layer on top of Nebius’s managed Kubernetes. It provides three core resource types:

  • Workspaces: Development environments (JupyterLab, RStudio, or SSH for VS Code/PyCharm) with persistent home directories and GPU access
  • Jobs: Scheduled or on-demand training runs that pull code from Git and execute on CPU or GPU instances
  • Deployments: Long-running services for model inference, APIs, or dashboards behind authenticated endpoints

The platform handles the operational baseline:

  • IAM/RBAC: User management, groups, project-based access, SSO integration
  • Cost tracking: Per-user and per-project usage reports, not just cluster-level costs. Integrates with your internal cost allocation.
  • Multi-node distributed training: Jobs with multiple GPU nodes get automatic environment setup (SATURN_JOB_LEADER, SATURN_JOB_RANK, worker DNS). All workers land on the same InfiniBand fabric for RDMA, NCCL is configured for InfiniBand, and logs from all nodes are accessible for debugging. You handle checkpointing strategy and recovery (see the launch sketch below).
  • Certificate management: Automated TLS provisioning and renewal
  • Image management: Pre-built images with NVIDIA libraries (CUDA, NeMo, RAPIDS), or bring your own
  • Resource quotas: Project-level limits
  • Platform upgrades: Kubernetes compatibility, operator updates, security patches
  • Logging: Fluent Bit aggregation, accessible via UI
  • User support: Direct support for AI/ML engineers on platform questions

AI/ML engineers contact Saturn Cloud support directly for workspace and job questions. Platform engineers work on their actual project backlog.
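
A minimal launch sketch for the multi-node case, assuming a PyTorch entrypoint named train.py, two 8-GPU nodes, and port 29500 for rendezvous (all placeholders), and assuming SATURN_JOB_LEADER resolves to the leader node's hostname:

# Hedged sketch: launch a 2-node PyTorch job with torchrun using the environment
# variables Saturn Cloud sets for multi-node jobs. The script name, node count,
# per-node GPU count, and port are placeholders.
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --node_rank="${SATURN_JOB_RANK}" \
  --master_addr="${SATURN_JOB_LEADER}" \
  --master_port=29500 \
  train.py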

GPU Options

Saturn Cloud on Nebius provides access to:

GPU      Memory         Configurations
H100     80GB HBM3      1-GPU, 8-GPU
H200     141GB HBM3e    1-GPU, 8-GPU
GB200    Blackwell      Via Nebius

All instances include NVLink (intra-node) and InfiniBand (inter-node).
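
If you want to verify the interconnects from inside a running workload, the standard NVIDIA and RDMA tools apply; this assumes your image ships nvidia-smi and the InfiniBand userspace utilities (rdma-core, infiniband-diags), which your own images may not include:

# Intra-node: show NVLink connectivity between GPUs.
nvidia-smi topo -m

# Inter-node: list the InfiniBand devices visible to the workload.
ibv_devinfo
ibstat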

Architecture

Infrastructure Layer

The reference Terraform provisions a Nebius MK8S cluster (Kubernetes 1.30) with public control plane endpoints and a single etcd instance (configurable for HA). A service account is created and added to your Nebius IAM viewers group to pull images from Nebius Container Registry.

For GPU workloads, Terraform creates separate InfiniBand GPU clusters. In EU (eu-north1), this means fabric-6 for H100 (configurable) and fabric-7 for H200. In US (us-central1), the fabrics are us-central1-a and us-central1-b. These clusters provide the low-latency RDMA networking required for multi-node distributed training.

Node groups are provisioned for different workload types. The system pool runs 2-100 cpu-d3 nodes (4vcpu-16gb) for the Saturn control plane. Three CPU pools handle general workloads: 4vcpu-16gb, 16vcpu-64gb, and 64vcpu-256gb, each scaling from 0 to 100 nodes. GPU pools include H100 and H200 configurations in 1-GPU and 8-GPU variants, also scaling from 0 to 100. All node groups carry node.saturncloud.io/role labels for scheduling, and GPU nodes use Nebius-managed CUDA 12 drivers via the gpu_settings.drivers_preset parameter. The 8-GPU nodes attach to their respective InfiniBand GPU clusters for distributed training.
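
A quick way to see how the pools map onto nodes once they exist is to surface the role label as a column; the specific label values depend on your Terraform configuration, and the node name below is a placeholder:

# Show each node with its Saturn role label as an extra column.
kubectl get nodes -L node.saturncloud.io/role

# Confirm the GPU capacity a specific GPU node advertises.
kubectl describe node <gpu-node-name> | grep -i nvidia.com/gpu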

Platform Layer

Saturn Cloud installs via a Kubernetes operator that manages platform components as custom resources. The saturn-helm-operator follows the standard operator pattern: it watches CRDs and reconciles Helm releases every 2 minutes. It ships as a Helm chart from oci://ghcr.io/saturncloud/charts/saturn-helm-operator-nebius.

The core services handle user-facing functionality. Atlas is the API server and PostgreSQL-backed database that manages resources (workspaces, jobs, deployments). The auth-server issues RS256 JWT tokens for user sessions and API access. Traefik acts as the ingress controller, routing traffic to workspaces, jobs, deployments, and the Saturn UI. The ssh-proxy provides an SSH gateway that proxies IDE connections (VS Code, PyCharm, Cursor) to running workspace pods.
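
For IDE access, the usual pattern is a plain SSH config entry that points at the gateway; everything in this sketch is a placeholder (alias, gateway hostname, username, key path), so copy the actual connection details Saturn Cloud shows for your workspace:

# Hypothetical ~/.ssh/config entry for a workspace behind the ssh-proxy.
cat >> ~/.ssh/config <<'EOF'
Host my-workspace
    HostName <your-saturn-ssh-gateway>
    User <your-saturn-username>
    IdentityFile ~/.ssh/id_ed25519
EOF

# Test the connection; VS Code, PyCharm, or Cursor can then target the same alias.
ssh my-workspace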

Infrastructure services provide cluster functionality. The cluster-autoscaler scales Nebius node groups based on pending pods. Cert-manager handles TLS certificate provisioning. Logging runs Fluent Bit for log aggregation. Monitoring deploys Prometheus for metrics collection. Network policy enforcement (Cilium) and DNS (CoreDNS) are managed by Nebius MK8S. Shared folders use Nebius’s native shared filesystem infrastructure rather than a separate NFS provisioner.

The bootstrap process works as follows: Terraform provisions the MK8S cluster and node groups, then installs the saturn-helm-operator via Helm with a short-lived bootstrap token. The operator exchanges this bootstrap token for a long-lived token and stores it in cluster secrets. It then creates custom resources for each Saturn component and reconciles those CRs into Helm releases, installing all services.
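
For reference, the operator install that Terraform performs looks roughly like this when run by hand; the namespace and the values key carrying the bootstrap token are assumptions here, and the reference Terraform encodes the real ones:

# Rough equivalent of the operator install automated by the reference Terraform.
helm install saturn-helm-operator \
  oci://ghcr.io/saturncloud/charts/saturn-helm-operator-nebius \
  --namespace saturn --create-namespace \
  --set bootstrapToken="<token-from-terraform.tfvars>"

# After a reconcile cycle or two, the component releases and pods should appear
# (the saturn namespace is an assumption).
helm list -A
kubectl get pods -n saturn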

All compute, storage, and networking stay in your Nebius account under your IAM and VPC policies. Data never leaves your account. Saturn Cloud only accesses the Kubernetes API to manage the operator and platform components.

Customization

Saturn Cloud runs your Docker containers with code from your Git repositories. You can use Saturn’s pre-built images with NVIDIA libraries or build your own. Node pool configurations are customizable via Terraform. You can deploy additional services into the same Kubernetes cluster (Prefect, Flyte, Dagster, ClickHouse, Datadog, CrowdStrike) and Saturn Cloud workloads can connect to them.
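
As one concrete example of co-locating another service, Prefect publishes a public Helm chart that can be installed into its own namespace in the same cluster; the namespace below is a placeholder:

# Install a Prefect server alongside Saturn Cloud from Prefect's Helm repository.
helm repo add prefect https://prefecthq.github.io/prefect-helm
helm repo update
helm install prefect-server prefect/prefect-server \
  --namespace prefect --create-namespace

Saturn Cloud workspaces, jobs, and deployments in the same cluster can then reach it over its in-cluster service DNS name (something like prefect-server.prefect.svc.cluster.local, depending on the service name the chart creates).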

See the Operations and Customization documentation for details on upgrades, failure modes, debugging, and integration patterns.

Installation

Both installation options use the reference Terraform and saturn-helm-operator described in the Architecture section. The Terraform is customizable (different node pool sizes, additional GPU types, network configurations), but the Saturn Cloud operator configuration must match your node pool setup. Saturn Cloud can only provision workloads on node pools that exist in your Terraform.

Option 1: Managed Installation

Saturn Cloud support runs the reference Terraform and operator deployment.

  1. Have a Nebius project with VPC and subnet configured. Note your subnet ID and project ID.

  2. Email support@saturncloud.io with:

    • Organization name
    • Nebius project ID
    • Subnet ID
    • Requirements (GPU types, region, network configuration)
  3. Provide Saturn Cloud a service account with permissions to create resources.

  4. Saturn Cloud runs the Terraform (MK8S cluster, node groups, InfiniBand GPU clusters) and deploys the operator. This takes 15-30 minutes.

  5. Receive your Saturn Cloud URL and admin credentials.

Option 2: Self-Service Installation

Run the reference Terraform and operator deployment yourself.

  1. Register:
curl -X POST https://manager.saturnenterprise.io/api/v2/customers/register \
    -H "Content-Type: application/json" \
    -d '{
      "name": "your-organization-name",
      "email": "your-email@example.com",
      "cloud": "nebius"
    }'
  2. Activate via email. You’ll receive a terraform.tfvars file containing a 4-hour bootstrap token.

  3. Clone and deploy the reference Terraform:

git clone https://github.com/saturncloud/saturncloud-reference-terraform.git
cd saturncloud-reference-terraform/nebius/eu-north1  # or us-central1
terraform init && terraform plan && terraform apply

The Terraform provisions the MK8S cluster, node groups, InfiniBand GPU clusters, and installs the saturn-helm-operator with your bootstrap token. The operator then deploys all Saturn Cloud components as described in the Architecture section.
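
One note on the bootstrap token: Terraform automatically loads a file named terraform.tfvars from the working directory, so the file you received at activation belongs in the region directory before you run the commands above (the source path below is a placeholder):

# Place the emailed terraform.tfvars (which carries the bootstrap token) in the
# region directory so terraform picks it up automatically.
cp /path/to/terraform.tfvars .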

  4. Verify:
export KUBECONFIG=./kubeconfig
kubectl get nodes
kubectl get pods -A

GPU nodes scale from zero and appear when users create GPU workloads.
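
To go a step beyond nodes and pods, you can check that the operator has registered its custom resource definitions and reconciled them into Helm releases; the grep pattern assumes the CRD names contain "saturn", which may differ in your install:

# Operator CRDs and the component releases it reconciles.
kubectl get crds | grep -i saturn
helm list -A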

Conclusion

Nebius provides H100 and H200 GPU access without hyperscaler quota constraints. Saturn Cloud provides the platform layer so AI/ML engineers can use that infrastructure without platform engineers building and maintaining workspaces, job scheduling, and deployment infrastructure.

The installation uses standard Terraform and a Kubernetes operator. The architecture is documented above. Operational details are in the Operations and Customization docs.

Contact support@saturncloud.io to get started.
