Multi-tenant GPU Kubernetes

Single-tenant clusters. Self-service. Automated end to end.

GPU cloud operators don't actually want to share a Kubernetes cluster between customers. They want every cluster to belong to exactly one tenant, and they want tenants to provision them on their own, on demand, without a ticket. Saturn Cloud is the platform layer that ships inside each cluster and the fleet console the operator uses to run the rest.


What you get

Single-tenant clusters, self-service, automated end to end

Every cluster belongs to exactly one tenant. Tenants provision clusters on demand from a portal: one cluster, or a fleet of them for prod, staging, research, and regional deployments. The operator manages the underlying fleet of hardware. Everything in between is automated.

Single-tenant by default

Every cluster belongs to exactly one tenant. Dedicated control plane, dedicated GPU nodes, dedicated network fabric. No shared kube-apiserver, no shared etcd, no shared CNI.

Self-service for the tenant

Tenants spin up new clusters from a web UI or API. A dev cluster, a prod cluster, a quick sandbox to test a training script. No ticket, no email, no waiting on the operator's network team.

Fully automated lifecycle

Provisioning, network setup, GPU drivers, upgrades, and teardown all happen by API. When a tenant churns, their clusters are destroyed and the hardware is wiped and returned to the pool automatically.
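A minimal sketch of that lifecycle, modeled as a state machine. The state names and the `advance` helper are illustrative assumptions, not Saturn Cloud or k0rdent identifiers, and a real pipeline would also need failure and retry states.

```python
from enum import Enum


class ClusterState(Enum):
    REQUESTED = "requested"
    PROVISIONING = "provisioning"
    READY = "ready"
    UPGRADING = "upgrading"
    TEARING_DOWN = "tearing_down"
    WIPING = "wiping"
    RETURNED = "returned"  # hardware is back in the shared pool


# Allowed transitions (hypothetical; failure states are omitted for brevity).
TRANSITIONS = {
    ClusterState.REQUESTED: {ClusterState.PROVISIONING},
    ClusterState.PROVISIONING: {ClusterState.READY},
    ClusterState.READY: {ClusterState.UPGRADING, ClusterState.TEARING_DOWN},
    ClusterState.UPGRADING: {ClusterState.READY},
    ClusterState.TEARING_DOWN: {ClusterState.WIPING},
    ClusterState.WIPING: {ClusterState.RETURNED},
    ClusterState.RETURNED: set(),
}


def advance(current: ClusterState, target: ClusterState) -> ClusterState:
    """Return the target state if the transition is allowed, else raise."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

The point of the explicit table is that a cluster can never go from ready straight back to the pool: teardown and wipe are mandatory steps on the way.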

InfiniBand and RoCE that just work

Multi-node training is the workload customers actually want, and it's the one most GPU clouds get wrong. RDMA fabric is configured at cluster provision time, NCCL discovers the topology on first launch, and a tenant's first torchrun across 16 GPUs runs without a support ticket.

One operator console across the fleet

The operator sees every cluster across every tenant: utilization, billing, idle workloads, hardware health. Set quotas, reclaim capacity, and roll upgrades from one place.

Reference architecture

k0rdent for cluster lifecycle, Netris for fabric, Saturn Cloud for the tenant platform

Saturn Cloud: tenant self-service (request a cluster, launch workspaces, run jobs, see usage) plus the operator fleet console for every cluster.

▼

k0rdent + Netris: k0rdent handles cluster provisioning and teardown via Cluster API and Metal3. Netris programs the per-cluster VPC, VLAN, and RDMA fabric at provision time.

▼

Shared GPU fleet: H100 / H200 / B200 nodes · InfiniBand or RoCE · Allocated on demand, returned on teardown

Tenants self-serve. The operator stays out of the loop.

A tenant logs into the Saturn Cloud portal, picks a region and a GPU shape, and requests a new cluster. Minutes later they have a kubeconfig and a working AI platform. No ticket, no human in the path.
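For tenants who script this instead of clicking through the portal, a cluster request might look roughly like the sketch below. Every field name, the `ClusterRequest` class, and the GPU shape strings are hypothetical and only illustrate the kind of payload involved; this is not the Saturn Cloud API.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical shape catalogue; the GPU models are the ones named in the
# reference architecture, the "-8x" suffix is an assumption.
GPU_SHAPES = {"h100-8x", "h200-8x", "b200-8x"}


@dataclass(frozen=True)
class ClusterRequest:
    tenant: str
    name: str
    region: str
    gpu_shape: str
    node_count: int

    def validate(self) -> None:
        if self.gpu_shape not in GPU_SHAPES:
            raise ValueError(f"unknown GPU shape: {self.gpu_shape}")
        if self.node_count < 1:
            raise ValueError("node_count must be at least 1")

    def to_json(self) -> str:
        """Validate, then serialize to the JSON body a portal API might accept."""
        self.validate()
        return json.dumps(asdict(self), sort_keys=True)


request = ClusterRequest("acme", "research-sandbox", "us-east", "h100-8x", 2)
payload = request.to_json()
print(payload)
```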

k0rdent provisions real clusters, by API

k0rdent picks free GPU nodes from the fleet, images them, and stands up a fresh Kubernetes cluster with a dedicated control plane. Teardown reverses the process and returns nodes to the pool. The same flow runs whether the tenant asks for one cluster or twenty.
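The allocate-and-return flow can be pictured with a small in-memory pool. This is a conceptual sketch only: `NodePool` and the node names are invented for illustration, and k0rdent's actual scheduling through Cluster API and Metal3 is not shown.

```python
class NodePool:
    """Tracks which fleet nodes are free and which cluster holds the rest."""

    def __init__(self, nodes):
        self.free = set(nodes)
        self.assigned = {}  # cluster name -> set of node names

    def allocate(self, cluster, count):
        """Reserve `count` free nodes for a new cluster."""
        if cluster in self.assigned:
            raise ValueError(f"cluster {cluster!r} already exists")
        if count > len(self.free):
            raise RuntimeError("not enough free nodes in the pool")
        picked = set(sorted(self.free)[:count])  # deterministic pick
        self.free -= picked
        self.assigned[cluster] = picked
        return picked

    def release(self, cluster):
        """Tear down a cluster and return its nodes to the pool."""
        nodes = self.assigned.pop(cluster)
        # A real pipeline would wipe and re-image the nodes before reuse.
        self.free |= nodes
        return nodes


pool = NodePool(["gpu-01", "gpu-02", "gpu-03"])
prod_nodes = pool.allocate("acme-prod", 2)
```

Because a node is either in `free` or in exactly one cluster's set, no node can serve two tenants at once, which is the single-tenant guarantee in miniature.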

Netris programs the fabric to match

Each new cluster gets its own L2 and L3 boundaries, its own RDMA partition, and its own outbound IPs, configured at the same time as the cluster. The network is part of the automation, not a follow-up task.
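To make the isolation boundaries concrete, here is a sketch of handing each cluster a non-overlapping VLAN, subnet, and RDMA partition number. The `FabricAllocator` class, the address range, and the numbering scheme are assumptions for illustration; Netris's own API and allocation logic are not represented here.

```python
import ipaddress
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class ClusterNetwork:
    cluster: str
    vlan_id: int
    subnet: ipaddress.IPv4Network
    rdma_partition: int  # on InfiniBand this would map to a partition key


class FabricAllocator:
    """Hands out one VLAN, one /24, and one RDMA partition per cluster."""

    def __init__(self, supernet: str, first_vlan: int = 100):
        self._subnets = ipaddress.ip_network(supernet).subnets(new_prefix=24)
        self._vlans = count(first_vlan)  # valid 802.1Q IDs end at 4094
        self._partitions = count(1)
        self.by_cluster = {}

    def allocate(self, cluster: str) -> ClusterNetwork:
        if cluster in self.by_cluster:  # idempotent for retries
            return self.by_cluster[cluster]
        net = ClusterNetwork(
            cluster,
            next(self._vlans),
            next(self._subnets),
            next(self._partitions),
        )
        self.by_cluster[cluster] = net
        return net


fabric = FabricAllocator("10.0.0.0/16")
prod = fabric.allocate("acme-prod")
dev = fabric.allocate("acme-dev")
```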

Multi-node training works on first launch

The NVIDIA Network Operator, MOFED, NCCL topology, and IPoIB are all configured by the cluster provisioning pipeline. A tenant logging into a fresh cluster can run distributed training across multiple nodes without tuning the fabric, debugging InfiniBand, or filing a support ticket about RDMA. This is the workload customers buy GPU clouds to run.
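As an illustration of what that first launch can look like from the tenant side, the sketch below builds a `torchrun` command for 16 GPUs spread over two 8-GPU nodes. The script name `train.py`, the host `node-0`, the port, and the job ID are placeholders; the same command would be run on each node.

```python
import shlex


def torchrun_command(script, nnodes, gpus_per_node, rdzv_host,
                     rdzv_port=29500, job_id="demo"):
    """Build the torchrun argv for a multi-node launch using c10d rendezvous."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={gpus_per_node}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={rdzv_host}:{rdzv_port}",
        f"--rdzv_id={job_id}",
        script,
    ]


# 2 nodes x 8 GPUs = 16 ranks in total.
cmd = torchrun_command("train.py", nnodes=2, gpus_per_node=8, rdzv_host="node-0")
print(shlex.join(cmd))
```

Nothing in the command refers to the fabric itself. With RDMA already configured by the provisioning pipeline, the intent is that NCCL picks it up without extra flags from the tenant.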

Saturn Cloud is what tenants log in to

Inside every tenant cluster, Saturn Cloud delivers JupyterLab and VS Code workspaces, jobs, deployments, SSO, and usage tracking. Tenants do not need to learn Kubernetes. They get a working AI platform, scoped to that cluster.

The operator sees the fleet, not just one cluster

A single Saturn Cloud operator console aggregates usage, billing, and health across every cluster that every tenant has provisioned. Set per-tenant quotas, reclaim hardware, and roll upgrades from one place.
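The aggregation behind such a view is simple to sketch. The records, tenant names, and quota figures below are made up for illustration, and the record layout is an assumption rather than the console's actual data model.

```python
from collections import defaultdict

# Hypothetical per-cluster usage records, as a fleet console might collect them.
usage = [
    {"tenant": "acme", "cluster": "prod", "gpu_hours": 1200.0},
    {"tenant": "acme", "cluster": "staging", "gpu_hours": 150.0},
    {"tenant": "globex", "cluster": "research", "gpu_hours": 640.0},
]
quotas = {"acme": 1000.0, "globex": 2000.0}  # GPU-hours per billing period


def gpu_hours_by_tenant(records):
    """Sum GPU-hours across all of each tenant's clusters."""
    totals = defaultdict(float)
    for record in records:
        totals[record["tenant"]] += record["gpu_hours"]
    return dict(totals)


def over_quota(totals, quotas):
    """Return tenants whose total usage exceeds their quota, sorted by name."""
    return sorted(
        tenant
        for tenant, used in totals.items()
        if used > quotas.get(tenant, float("inf"))
    )


totals = gpu_hours_by_tenant(usage)
flagged = over_quota(totals, quotas)
```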

What the operator runs

Capabilities on day one

Capability	How it works
Tenant self-service cluster provisioning	Tenants request clusters from the Saturn Cloud portal or API. k0rdent provisions and tears them down.
Per-cluster network fabric	Netris programs VPC, VLAN, and RDMA partitions at provision time. No tickets to the network team.
Multi-node training that works on first launch	NVIDIA Network Operator, MOFED, NCCL topology, and IPoIB pre-configured per cluster.
AI workspaces inside every cluster	JupyterLab, VS Code, jobs, and deployments delivered by Saturn Cloud, scoped to that cluster.
Per-cluster upgrades	Kubernetes and GPU stack upgrades roll cluster by cluster, with the operator setting the order.
Automated teardown and hardware reclaim	When a cluster is destroyed, nodes are wiped and returned to the shared pool automatically.
Fleet-wide operator console	Usage, billing, and health across every tenant cluster in one Saturn Cloud admin view.
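The per-cluster upgrade row in the table above can be sketched as a loop over an operator-chosen order that halts at the first failure. The `roll_upgrades` function, the cluster names, and the stand-in `fake_upgrade` callback are hypothetical; the real upgrade mechanics are not shown.

```python
def roll_upgrades(clusters, upgrade, target_version):
    """Upgrade clusters one at a time in the given order.

    `upgrade(name, version)` returns True on success. The rollout stops at
    the first failure so later clusters stay on the known-good version.
    Returns (upgraded_clusters, failed_cluster_or_None).
    """
    done = []
    for name in clusters:
        if not upgrade(name, target_version):
            return done, name
        done.append(name)
    return done, None


def fake_upgrade(name, version):
    # Stand-in for the real upgrade call; "acme-prod" fails for the demo.
    return name != "acme-prod"


# The operator puts low-risk clusters first.
order = ["acme-sandbox", "acme-staging", "acme-prod", "globex-prod"]
upgraded, failed = roll_upgrades(order, fake_upgrade, "1.31")
```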

Need hands-on help?

We also consult on building this stack

If you want to build multi-tenant GPU Kubernetes on your own infrastructure rather than deploy Saturn Cloud as a product, our engineering team does that as a consulting engagement. It covers cluster lifecycle, network fabric, tenant isolation, and the self-service portal layer. See the tenant platform consulting service.