Running SLURM on Kubernetes with Nebius
You have a Kubernetes cluster with H100s. Pods can request GPUs. Prometheus is scraping metrics. Everything works. So why would anyone want to run SLURM on top of it?
The answer is that Kubernetes and SLURM solve different scheduling problems. Kubernetes excels at running stateless services, keeping them healthy, and scaling them up or down. SLURM excels at running batch compute jobs with complex dependencies, resource quotas, and fair access across teams. For GPU training workloads, you often want both: SLURM’s job semantics and Kubernetes' operational model.
Nebius offers Soperator, an open-source solution that runs a full SLURM cluster on Kubernetes. This post explains why that combination makes sense and how it works.
Comparing GPU cloud providers?
Download our GPU Cloud Comparison Report analyzing 17 providers across pricing, InfiniBand networking, storage, and enterprise readiness. Includes detailed Nebius profile with infrastructure specs and use case recommendations.
What SLURM gives you that Kubernetes doesn’t
Kubernetes schedules pods. If resources are available, the pod runs. If not, it stays pending until resources free up. This works fine for web services but falls short for batch GPU workloads in several ways.
Job queuing with priorities. SLURM maintains a queue of pending jobs and schedules them based on configurable priority rules. A high-priority training run can preempt lower-priority jobs. In Kubernetes, you can set pod priorities, but there’s no queue management. Jobs either run or wait indefinitely.
Fairshare scheduling. When GPU time costs real money, you need to allocate it fairly across teams. SLURM’s fairshare algorithm tracks historical usage per user and project, then adjusts priorities so that teams who have used less than their share get priority over teams who have used more. Kubernetes has ResourceQuotas, but these are hard limits, not dynamic fairness policies.
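To make the queueing and fairness behavior above concrete, here is a minimal sketch of the relevant SLURM configuration. The weights, preemption mode, and account names are illustrative, not Soperator defaults, and fairshare also requires accounting via slurmdbd to be enabled.
# slurm.conf (excerpt): multifactor priority with fairshare and preemption
PriorityType=priority/multifactor
PriorityWeightFairshare=10000   # fairshare dominates the priority score
PriorityWeightAge=1000          # waiting jobs slowly gain priority
PriorityDecayHalfLife=7-0       # historical usage decays over a week
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE             # preempted jobs go back into the queue

# Accounting: give each team a share of the cluster (names are placeholders)
sacctmgr add account team-a fairshare=60
sacctmgr add account team-b fairshare=40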
Job dependencies. Training pipelines often have dependencies: preprocess data, then train, then evaluate. SLURM handles this natively:
# Submit preprocessing job
JOB1=$(sbatch --parsable preprocess.sh)
# Train only after preprocessing succeeds
JOB2=$(sbatch --parsable --dependency=afterok:$JOB1 train.sh)
# Evaluate only after training succeeds
sbatch --dependency=afterok:$JOB2 evaluate.sh
Kubernetes has no built-in job dependency system. You can use Argo Workflows or similar tools, but that’s another layer to deploy and maintain.
Resource reservations. SLURM can reserve specific nodes for specific times. If you have a deadline and need guaranteed access to 64 GPUs next Tuesday, you can reserve them. Kubernetes has no equivalent.
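As a sketch of that reservation workflow, the commands below block out eight 8-GPU nodes and then submit against the reservation. The node range, user, date, and reservation name are placeholders.
# Reserve 8 nodes (64 GPUs) for a deadline run
scontrol create reservation reservationname=deadline_run \
    starttime=2025-06-10T09:00 duration=24:00:00 \
    nodes=worker-[0-7] users=alice

# Jobs submitted with --reservation run only on the reserved nodes
sbatch --reservation=deadline_run --gres=gpu:8 --nodes=8 train.sh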
Native multi-node coordination. SLURM’s srun command launches processes across multiple nodes with the environment variables that distributed training frameworks expect (rank, world size, master address). In Kubernetes, you need to write this coordination logic yourself or use a framework-specific operator like the PyTorch Operator.
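To make that concrete, here is a minimal sketch of an sbatch script that uses srun and SLURM's environment variables to drive PyTorch's torchrun launcher across two 8-GPU nodes. The script name train.py and the port are placeholders; this is a common pattern, not anything Soperator-specific.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1

# Use the first node in the allocation as the rendezvous host
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# srun starts one torchrun per node; torchrun spawns one worker per GPU
srun torchrun \
    --nnodes="$SLURM_JOB_NUM_NODES" \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
    train.py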
What Kubernetes gives you that VMs don’t
Traditional SLURM deployments run on bare VMs or physical servers. This works, but it pushes a lot of operational burden onto the infrastructure team.
Self-healing. When a Kubernetes node fails, the control plane notices and reschedules pods to healthy nodes. When a VM in a traditional SLURM cluster fails, someone has to notice, provision a replacement, install the OS, configure SLURM, and add it back to the cluster.
Declarative infrastructure. A Kubernetes-based SLURM cluster can be defined entirely in Terraform and Helm. The entire cluster configuration lives in version control. Traditional SLURM clusters often accumulate configuration drift because changes are made ad-hoc via SSH.
No manual node bootstrapping. Adding nodes to a traditional SLURM cluster means installing packages, configuring the SLURM daemon, and updating slurm.conf. With Soperator, you change a number in Terraform and apply. Kubernetes handles the rest.
The jail filesystem. Soperator introduces a pattern called the “jail filesystem,” a shared network filesystem mounted as the root (/) on all SLURM nodes. This eliminates version drift. When you update a library, all nodes see the change immediately because they share the same filesystem. Traditional SLURM clusters require careful coordination to keep nodes in sync, and version mismatches between nodes cause subtle, hard-to-debug failures.
How it works on Nebius
Nebius provides managed Kubernetes (MK8s) and the Soperator project to run SLURM on top of it. The architecture looks like this:
- Controller nodes run slurmctld (the scheduler) and munge (authentication) as Kubernetes pods
- Worker nodes run slurmd (the compute daemon) on GPU-enabled Kubernetes nodes with InfiniBand networking
- Login nodes provide SSH access for job submission, exposed via a load balancer
- The jail filesystem is a Nebius Compute Filesystem mounted on all nodes via VirtioFS
The entire stack deploys via Terraform. A minimal GPU cluster (2x H100 nodes with 8 GPUs each) takes about 40 minutes to provision.
# Worker node configuration
worker_nodes = {
  default = {
    size     = 2
    platform = "gpu-h100-sxm"
    preset   = "8gpu-128vcpu-1600gb"
    gpu_cluster = {
      infiniband_fabric = "fabric-6"
    }
  }
}
Once deployed, users SSH into the login node and submit jobs with standard SLURM commands:
$ sinfo
PARTITION  AVAIL  NODES  STATE  NODELIST
main*      up     2      idle   worker-[0-1]

$ sbatch --gres=gpu:8 --nodes=2 train.sh
Submitted batch job 42
Behind the scenes, SLURM coordinates the job across both nodes, setting up the environment variables that PyTorch’s distributed launcher expects. The job runs on Kubernetes pods, benefits from Kubernetes' health monitoring, and stores checkpoints to the shared jail filesystem.
When to use this
Use SLURM on Kubernetes when:
- Multiple teams share GPU resources and need fair scheduling
- You have complex job pipelines with dependencies
- Your users already know SLURM from academic or HPC backgrounds
- You want the operational benefits of Kubernetes without rewriting your submission scripts
Use pure Kubernetes (with Kubeflow, Ray, etc.) when:
- You’re running inference workloads or services, not batch training
- Your team is already Kubernetes-native and doesn’t know SLURM
- You need tight integration with Kubernetes-native ML tools
Use traditional SLURM on VMs when:
- You have existing bare-metal infrastructure you can’t migrate
- You need features that Soperator doesn’t yet support
Saturn Cloud: the platform layer for AI teams
Saturn Cloud handles the infrastructure work every AI team needs but nobody wants to build: dev environments, job scheduling, deployments, SSO, and usage tracking. It runs on your cloud (including Nebius) so your platform team can focus on what actually differentiates your business. Try the hosted version or talk to our team about enterprise deployment.
The operational reality
Running SLURM on Kubernetes isn’t free. You’re adding a layer of abstraction, which means more components to understand when things go wrong. Here’s what that looks like in practice.
Debugging spans two systems. When a job fails, you need to check both SLURM logs (scontrol show job, sacct) and Kubernetes events (kubectl describe pod). A job might fail because SLURM couldn’t allocate resources, or because Kubernetes evicted the pod, or because the underlying node had a GPU error. You need to know where to look.
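In practice that means keeping both toolchains at hand. A typical first pass at a failed job might look like the following; the pod name and namespace are illustrative and depend on how Soperator was deployed.
# SLURM side: inspect the job record (job 42 from the earlier example)
scontrol show job 42
sacct -j 42 --format=JobID,State,ExitCode,NodeList

# Kubernetes side: check the pod and events for the node that backed the job
kubectl describe pod worker-0 -n slurm
kubectl get events -n slurm --field-selector involvedObject.name=worker-0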
The jail filesystem is a single point of failure. All nodes share the same root filesystem over the network. If the filesystem becomes unavailable or slow, every node is affected simultaneously. This is different from traditional SLURM clusters where each node has its own local filesystem. Monitor filesystem latency and have a plan for when it degrades.
Upgrades require coordination. Upgrading SLURM means rebuilding the jail filesystem image, which affects all nodes. You can’t do rolling upgrades the way you would with a stateless Kubernetes deployment. Plan for maintenance windows.
You still need SLURM expertise. Soperator handles deployment, but tuning fairshare policies, setting up partitions, and debugging scheduler behavior still requires someone who understands SLURM. If your team only knows Kubernetes, you’re adding a learning curve.
The tradeoff
The pitch for SLURM on Kubernetes is that you get the best of both worlds. The reality is closer to: you get most of both worlds, plus the complexity of running both systems.
For teams that already rely on SLURM semantics (job dependencies, fairshare, reservations), running it on Kubernetes is a clear win over managing bare VMs. The operational benefits of declarative infrastructure, self-healing nodes, and the jail filesystem outweigh the added complexity.
For teams starting fresh with no existing SLURM workflows, the calculus is different. You might be better served by Kubernetes-native batch systems like Volcano or Kueue, which provide queue management and fair scheduling without the extra layer. These tools are less mature than SLURM for HPC workloads, but they’re improving rapidly and integrate more naturally with the Kubernetes ecosystem.
The question to ask: does your team’s existing expertise and workflow justify the complexity of running two scheduling systems? If the answer is yes, Nebius' Soperator is the most production-ready way to do it.
Saturn Cloud provides customizable, ready-to-use cloud environments for collaborative data teams. Try Saturn Cloud and join thousands of users moving to the cloud without having to switch tools.