Cluster buildout and architecture
A production GPU cluster, built to last past day one
Getting a pod to see a GPU is straightforward. Building a cluster that schedules fairly under contention, survives a driver upgrade, partitions GPUs sensibly, and gives your ML teams a self-service path to use it is the actual work. We have built this for our own platform and will build it on your hardware or cloud.
What we deliver
From bare nodes to a cluster your teams can use
We work from the OS and Kubernetes layer up. The hardware, power, and physical network are yours or your provider's. Everything above that is what we build and hand over as code.
GPU Operator, configured correctly
The NVIDIA GPU Operator, device plugin, driver containers, DCGM exporter, and node feature discovery, with versions pinned to match your driver and kernel strategy. An unattended node reboot should not pull a driver your workloads have not been tested against.
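As a rough illustration of what pinning means in practice, the sketch below compares each node's reported driver and kernel against a pinned pair and reports any drift. The inventory format, the node names, the `find_drift` helper, and the version strings are all assumptions made for this example; in a real cluster the reported values would come from node labels or your inventory tooling.

```python
# Illustrative pins only; these version strings are example values, not recommendations.
PINNED = {"driver": "550.54.15", "kernel": "5.15.0-105-generic"}


def find_drift(nodes, pinned=PINNED):
    """Return, per node, the components whose reported version differs from the pin."""
    drift = {}
    for name, reported in nodes.items():
        mismatched = {
            component: reported.get(component, "unknown")
            for component, expected in pinned.items()
            if reported.get(component) != expected
        }
        if mismatched:
            drift[name] = mismatched
    return drift


# Hypothetical inventory: node name -> versions the node currently reports.
inventory = {
    "gpu-node-1": {"driver": "550.54.15", "kernel": "5.15.0-105-generic"},
    "gpu-node-2": {"driver": "555.42.02", "kernel": "5.15.0-105-generic"},
}

for node, components in find_drift(inventory).items():
    print(f"{node} has drifted from the pin: {components}")
```

A check like this could run in CI or as a periodic job, so drift is caught before a workload lands on the node.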
MIG and time-slicing where they fit
Not every workload needs a whole H100. We configure MIG profiles for inference and small jobs, time-slicing for bursty interactive work, and full-GPU allocation for training, and expose them as distinct schedulable resources.
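To make "distinct schedulable resources" concrete, here is a minimal sketch that maps a workload class to the extended resource a pod would request. The class names and the `gpu_limits` helper are hypothetical. The resource names assume the MIG "mixed" strategy and a time-slicing configuration that renames shared GPUs, so the names on your cluster may differ.

```python
# Assumed resource names; verify against what your nodes actually advertise.
RESOURCE_BY_CLASS = {
    "training": "nvidia.com/gpu",            # a whole GPU
    "inference": "nvidia.com/mig-1g.10gb",   # one MIG slice (mixed strategy)
    "interactive": "nvidia.com/gpu.shared",  # a time-sliced replica
}


def gpu_limits(workload_class, count=1):
    """Build the `resources` stanza of a container spec for a workload class."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if workload_class not in RESOURCE_BY_CLASS:
        raise ValueError(f"unknown workload class: {workload_class!r}")
    return {"limits": {RESOURCE_BY_CLASS[workload_class]: count}}


print(gpu_limits("inference"))
print(gpu_limits("training", count=8))
```

Because each class requests a different resource name, quota and scheduling policy can be set per class rather than per GPU.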
A scheduler that behaves under load
The default Kubernetes scheduler places pods one at a time, with no notion of gang requirements or fairness between teams. We add what your workloads need: gang scheduling for distributed jobs (Kueue, Volcano, or Run:ai if you have it), priority and preemption, and quota so one team cannot exhaust the cluster.
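The idea behind gang scheduling with quota can be sketched as an all-or-nothing admission check: a distributed job starts only if every worker fits at once and the team stays inside its quota. This is a simplified illustration, not how Kueue, Volcano, or Run:ai implement admission; the `Job` type and the `admit` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    team: str
    workers: int
    gpus_per_worker: int

    @property
    def total_gpus(self):
        return self.workers * self.gpus_per_worker


def admit(job, free_gpus, quota, in_use):
    """Admit the job only if the whole gang fits and the team stays within quota."""
    need = job.total_gpus
    within_quota = in_use.get(job.team, 0) + need <= quota.get(job.team, 0)
    gang_fits = need <= free_gpus
    return within_quota and gang_fits


quota = {"research": 16}
job = Job(team="research", workers=4, gpus_per_worker=2)

# 8 GPUs needed, 6 free: nothing starts, instead of 3 workers waiting on a 4th.
print(admit(job, free_gpus=6, quota=quota, in_use={}))
```

Without the all-or-nothing rule, partially placed jobs hold GPUs while making no progress, which is the typical source of pending jobs on a busy cluster.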
Storage matched to your workloads
High-throughput shared file systems for training datasets, fast local NVMe for scratch and checkpoints, and object storage for artifacts, mounted where jobs expect them. Each access pattern gets the storage tier that suits it.
Self-service access for ML teams
A path for users who should not need to write YAML: JupyterLab and VS Code workspaces, job submission, and namespaced quota behind your SSO. The platform team stops being a ticket queue.
Everything as code
Terraform or OpenTofu for the cluster, Helm or Argo for operators and the platform, committed to your repository. The cluster can be rebuilt from source, and your team can read how it was built.
The stack we build
Layers, from the OS up
Managed or self-hosted control plane
EKS, GKE, AKS, or a self-hosted control plane via kubeadm, k0rdent, or Rancher. We work with what you have, or recommend an option based on who will operate it long-term.
Driver and kernel versions pinned deliberately
The most common way a GPU cluster breaks is an unplanned driver or kernel change. We decide the upgrade strategy up front: pinned versions, a tested promotion path, and node draining before any change. Upgrades become a deliberate operation instead of a surprise after a reboot.
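One way to picture a promotion path is as an ordered plan of small batches: each batch is drained, upgraded, and verified before the next one starts. The sketch below only builds that plan. The stage names, node names, and the `rollout_batches` helper are assumptions for illustration, and the drain and upgrade steps themselves are left out.

```python
def rollout_batches(nodes_by_stage, stages, max_unavailable):
    """Return (stage, batch) pairs in promotion order, at most max_unavailable nodes each."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    plan = []
    for stage in stages:
        nodes = nodes_by_stage.get(stage, [])
        for start in range(0, len(nodes), max_unavailable):
            plan.append((stage, nodes[start:start + max_unavailable]))
    return plan


# Hypothetical fleet: one canary node, then production.
fleet = {
    "canary": ["gpu-node-1"],
    "production": ["gpu-node-2", "gpu-node-3", "gpu-node-4"],
}

for stage, batch in rollout_batches(fleet, ["canary", "production"], max_unavailable=2):
    print(f"{stage}: drain, upgrade, and verify {batch}")
```

Keeping `max_unavailable` small bounds how much capacity is out of service at any point, and a failed canary stops the rollout before it reaches production.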
Scheduler tuned for your workload mix
A research cluster running many small interactive jobs and a training cluster running a handful of large distributed jobs want opposite scheduler configurations. We tune for the workloads you run, not a generic default.
Multi-node training is a separate engagement
If your teams run distributed training, the network fabric and NCCL configuration are a deeper problem than the cluster itself. See multi-node training.
Common starting points
Where teams are when they call us
| Situation | What the engagement looks like |
|---|---|
| New GPU hardware, nothing on it yet | Greenfield buildout: control plane, GPU Operator, storage, scheduler, access layer, all as code. |
| A cluster that works but is unreliable | Audit and remediation: find why nodes go NotReady, why jobs sit pending, why a driver update took the fleet down, and fix it. |
| GPUs allocated but underutilized | Scheduling and quota rework, often paired with cost and chargeback to measure the improvement. |
| ML teams cannot self-serve | Add the access layer: workspaces, job submission, and namespaced quota behind SSO. |