Managed Slurm for GPU clouds
Slurm cluster orchestration for multi-tenant GPU operators
Slurm is what customers ask for, and Slurm is what takes weeks of setup per cluster. Saturn Cloud delivers per-tenant Slurm clusters as a managed service: fair-share scheduling, shared storage, RDMA-aware partitions, and a tenant-facing UI on top.
Why managed Slurm
Customers want Slurm. They don't want to build Slurm.
HPC and AI research teams expect Slurm. They also expect a working shared file system, a job UI, and a way to share GPUs across users without fighting over them. Building that per customer is a multi-week engagement. Building it once and offering it to every customer is a product.
Per-tenant Slurm clusters
Each customer gets a dedicated controller, dedicated login nodes, and isolated compute partitions. Quotas and reservations are enforced at the platform layer, not by Slurm config alone.
RDMA-aware partitions
InfiniBand and RoCE come pre-configured. Multi-node training jobs land on RDMA-connected nodes by default, so NCCL finds the high-speed interconnect without per-job tuning.
Shared file systems
Lustre, BeeGFS, or WEKA mounted on every node. Customers get the home and scratch directories they expect, without the storage build-out becoming a separate project.
Self-service workspaces on top
Tenants get a web UI for submitting jobs, watching the queue, and running interactive sessions. They can also SSH into login nodes for traditional workflows.
How it works
The stack
One operator console, many Slurm clusters
Provision a new tenant Slurm cluster in minutes. Quotas, reservations, and partition layouts come from a central config, not handcrafted slurm.conf files.
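To make the idea of a central config with per-tenant overrides concrete, here is a minimal sketch that renders slurm.conf partition lines from shared defaults. It is an illustration only, not Saturn Cloud's implementation: the `Tenant` class, the `DEFAULTS` values, and the tenant and node names are all hypothetical. Only the partition parameters themselves (`PartitionName`, `Nodes`, `Default`, `MaxTime`, `State`) are standard slurm.conf keys.

```python
from dataclasses import dataclass, field

# Hypothetical platform-wide defaults. The keys are standard slurm.conf
# partition parameters; the values are made up for illustration.
DEFAULTS = {"MaxTime": "2-00:00:00", "State": "UP", "Default": "NO"}


@dataclass
class Tenant:
    name: str
    nodes: str  # Slurm hostlist expression, e.g. "gpu[001-008]"
    overrides: dict = field(default_factory=dict)  # per-tenant settings


def partition_line(tenant: Tenant) -> str:
    """Merge defaults with tenant overrides and render one partition line."""
    params = {**DEFAULTS, **tenant.overrides}  # overrides win on conflict
    parts = [f"PartitionName={tenant.name}", f"Nodes={tenant.nodes}"]
    parts += [f"{key}={value}" for key, value in sorted(params.items())]
    return " ".join(parts)


# Hypothetical tenants: one changes the default flag, one shortens MaxTime.
tenants = [
    Tenant("acme", "gpu[001-008]", {"Default": "YES"}),
    Tenant("globex", "gpu[009-012]", {"MaxTime": "12:00:00"}),
]

for tenant in tenants:
    print(partition_line(tenant))
```

Because each tenant only records what differs from the defaults, changing a default in one place updates every cluster that has not overridden it.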
Mixed Slurm and Kubernetes on the same fleet
Run Slurm partitions and Kubernetes namespaces against the same GPU pool. Customers who want both get both. Capacity moves between them as demand shifts.
Built on standard Slurm
This is upstream Slurm with the standard accounting database, not a fork. Customers can lift and shift their existing job scripts and sbatch templates.
The shift
From handcrafted Slurm to multi-tenant Slurm
| One Slurm cluster per customer, handcrafted | With Saturn Cloud |
|---|---|
| Two weeks of setup per new customer | New tenant cluster in minutes |
| Bespoke slurm.conf and gres.conf per cluster | Central config, per-tenant overrides |
| Email threads to get InfiniBand topology right | RDMA partitions configured at install |
| No usage tracking outside Slurm accounting | GPU-hour billing per user, project, and tenant |
| Customer wants Jupyter? Another build-out. | Web workspaces alongside Slurm on the same nodes |
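
As a rough illustration of what GPU-hour billing per user, project, and tenant involves, the sketch below rolls job records up by any of those three keys. The page does not describe how Saturn Cloud computes billing, so this is an assumption-laden example: the record fields (`tenant`, `project`, `user`, `gpus`, `elapsed_s`) and the sample jobs are hypothetical. In practice, similar records could be derived from Slurm accounting data.

```python
from collections import defaultdict


def gpu_hours(jobs, key):
    """Sum GPU-hours (GPUs allocated x wall-clock hours) grouped by `key`."""
    totals = defaultdict(float)
    for job in jobs:
        totals[job[key]] += job["gpus"] * job["elapsed_s"] / 3600
    return dict(totals)


# Hypothetical job records; field names are illustrative, not a real schema.
jobs = [
    {"tenant": "acme", "project": "vision", "user": "ana", "gpus": 8, "elapsed_s": 7200},
    {"tenant": "acme", "project": "nlp", "user": "ben", "gpus": 4, "elapsed_s": 1800},
    {"tenant": "globex", "project": "sim", "user": "cy", "gpus": 2, "elapsed_s": 3600},
]

# The same records roll up at each billing level.
for level in ("tenant", "project", "user"):
    print(level, gpu_hours(jobs, level))
```

Grouping by a single key keeps the example short. A real billing pipeline would also need to handle jobs that span billing periods and GPUs of different types.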