Cluster buildout and architecture

A production GPU cluster, built to last past day one

Getting a pod to see a GPU is straightforward. Building a cluster that schedules fairly under contention, survives a driver upgrade, partitions GPUs sensibly, and gives your ML teams a self-service path to use it is the actual work. We have built this for our own platform and will build it on your hardware or cloud.


What we deliver

From bare nodes to a cluster your teams can use

We work from the OS and Kubernetes layer up. The hardware, power, and physical network are yours or your provider's. Everything above that is what we build and hand over as code.

GPU Operator, configured correctly

The NVIDIA GPU Operator, device plugin, driver containers, DCGM exporter, and node feature discovery, with versions pinned to match your driver and kernel strategy. An unattended node reboot should not pull a driver your workloads have not been tested against.
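
As a rough illustration of what pinning buys you, the sketch below compares what each node reports against a pinned driver and kernel pair and lists any drift. The `NodeInfo` type, the `drifted` helper, and the version strings are hypothetical placeholders, not values we recommend. In practice the node data would come from node labels or your inventory.

```python
from dataclasses import dataclass

# Illustrative placeholders only; real pins come from your own test matrix.
PINNED = {"driver": "550.90.07", "kernel": "5.15.0-113-generic"}


@dataclass(frozen=True)
class NodeInfo:
    """Hypothetical record of what a node currently reports."""
    name: str
    driver: str
    kernel: str


def drifted(nodes, pinned=PINNED):
    """Return {node name: [fields that differ from the pin]} for drifted nodes."""
    report = {}
    for node in nodes:
        diffs = [key for key in ("driver", "kernel") if getattr(node, key) != pinned[key]]
        if diffs:
            report[node.name] = diffs
    return report


nodes = [
    NodeInfo("gpu-a", "550.90.07", "5.15.0-113-generic"),
    NodeInfo("gpu-b", "555.42.02", "5.15.0-113-generic"),
]
print(drifted(nodes))  # {'gpu-b': ['driver']}
```

A check like this could run in CI or as a periodic job, so drift is caught before a workload lands on the node.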

MIG and time-slicing where they fit

Not every workload needs a whole H100. We configure MIG profiles for inference and small jobs, time-slicing for bursty interactive work, and full-GPU allocation for training, and expose them as distinct schedulable resources.
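
To make "distinct schedulable resources" concrete, here is a minimal sketch that maps a workload class to the extended resource a container would request. The resource names are assumptions about how the device plugin is configured: `nvidia.com/mig-1g.10gb` assumes a mixed MIG strategy on an 80 GB H100, and `nvidia.com/gpu.shared` assumes time-sliced GPUs are renamed with a `.shared` suffix. The `container_resources` helper is hypothetical.

```python
# Assumed resource names; check what your nodes actually advertise.
RESOURCE_FOR_CLASS = {
    "training": "nvidia.com/gpu",            # whole GPU
    "inference": "nvidia.com/mig-1g.10gb",   # MIG slice (assumed profile)
    "interactive": "nvidia.com/gpu.shared",  # time-sliced replica (assumed name)
}


def container_resources(workload_class, count=1):
    """Build the `resources` stanza of a container spec for a workload class."""
    if count < 1:
        raise ValueError("count must be at least 1")
    try:
        name = RESOURCE_FOR_CLASS[workload_class]
    except KeyError:
        raise ValueError(f"unknown workload class: {workload_class!r}") from None
    # Extended resources are requested via limits; requests default to the same value.
    return {"limits": {name: count}}


print(container_resources("inference", 2))
# {'limits': {'nvidia.com/mig-1g.10gb': 2}}
```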

A scheduler that behaves under load

The default Kubernetes scheduler places pods one at a time, with no notion of gang requirements or fairness between teams. We add what your workloads need: gang scheduling for distributed jobs (Kueue, Volcano, or Run:ai if you have it), priority and preemption, and quota so one team cannot exhaust the cluster.
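
The idea behind gang scheduling with quota can be sketched in a few lines: a job is admitted only if all of its workers fit at once and its team stays within quota. This is a toy model, not how Kueue, Volcano, or Run:ai are implemented. It ignores node topology, priority, and preemption, and the `Job` type and `admit` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    team: str
    workers: int
    gpus_per_worker: int

    @property
    def gpus(self):
        return self.workers * self.gpus_per_worker


def admit(queue, free_gpus, quota, used=None):
    """Admit whole jobs in queue order; never start part of a gang."""
    used = dict(used or {})
    admitted = []
    for job in queue:
        team_used = used.get(job.team, 0)
        within_quota = team_used + job.gpus <= quota.get(job.team, 0)
        if within_quota and job.gpus <= free_gpus:
            free_gpus -= job.gpus
            used[job.team] = team_used + job.gpus
            admitted.append(job.name)
    return admitted, free_gpus


queue = [
    Job("llm-pretrain", "research", workers=4, gpus_per_worker=8),
    Job("finetune", "applied", workers=2, gpus_per_worker=4),
    Job("sweep", "research", workers=1, gpus_per_worker=8),
]
print(admit(queue, free_gpus=40, quota={"research": 32, "applied": 16}))
# (['llm-pretrain', 'finetune'], 0)
```

Here `sweep` stays pending because the research team has already used its 32-GPU quota, which is the behavior that stops one team from exhausting the cluster.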

Storage matched to your workloads

High-throughput shared file systems for training datasets, fast local NVMe for scratch and checkpoints, and object storage for artifacts, mounted where jobs expect them. The right storage tier for each access pattern.

Self-service access for ML teams

A path for users who should not need to write YAML: JupyterLab and VS Code workspaces, job submission, and namespaced quota behind your SSO. The platform team stops being a ticket queue.

Everything as code

Terraform or OpenTofu for the cluster, Helm or Argo for operators and the platform, committed to your repository. The cluster can be rebuilt from source, and your team can read how it was built.

The stack we build

Layers, from the OS up

Access layer: Workspaces · job submission · namespaced quota · SSO

▲

Scheduling and GPU layer: GPU Operator · device plugin · MIG / time-slicing · gang scheduling · quota and priority · DCGM

▲

Kubernetes and OS: Managed or self-hosted control plane · CNI · CSI · pinned drivers and kernel

Managed or self-hosted control plane

EKS, GKE, AKS, or a self-hosted control plane via kubeadm, k0rdent, or Rancher. We work with what you have or recommend based on who will operate it long-term.

Driver and kernel versions pinned deliberately

The most common way a GPU cluster breaks is an unplanned driver or kernel change. We decide the upgrade strategy up front: pinned versions, a tested promotion path, and node draining before any change. Upgrades become a deliberate operation instead of a surprise after a reboot.
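
One way to picture a deliberate upgrade is as a plan: a canary node first, then fixed-size batches, so only a bounded number of nodes is out of service at any time. The sketch below is an assumption about how such a plan might be computed. The `upgrade_plan` function and its parameters are hypothetical, and the cordon, drain, and upgrade steps for each batch are left out.

```python
def upgrade_plan(nodes, max_unavailable, canaries=1):
    """Split nodes into a canary batch followed by batches of at most max_unavailable."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    if canaries < 0:
        raise ValueError("canaries must not be negative")
    nodes = list(nodes)
    plan = []
    if canaries:
        plan.append(nodes[:canaries])  # upgrade and validate these first
    rest = nodes[canaries:]
    for start in range(0, len(rest), max_unavailable):
        plan.append(rest[start:start + max_unavailable])
    # Drop empty batches (for example, when there are no nodes at all).
    return [batch for batch in plan if batch]


nodes = [f"gpu-{i:02d}" for i in range(1, 7)]
for batch in upgrade_plan(nodes, max_unavailable=2):
    print(batch)
# ['gpu-01']
# ['gpu-02', 'gpu-03']
# ['gpu-04', 'gpu-05']
# ['gpu-06']
```

Each batch would be drained before the driver or kernel changes and validated before the next batch starts.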

Scheduler tuned for your workload mix

A research cluster running many small interactive jobs and a training cluster running a handful of large distributed jobs want opposite scheduler configurations. We tune for the workloads you run, not a generic default.

Multi-node training is a separate engagement

If your teams run distributed training, the network fabric and NCCL configuration are a deeper problem than the cluster itself. See multi-node training.

Common starting points

Where teams are when they call us

Situation	What the engagement looks like
New GPU hardware, nothing on it yet	Greenfield buildout: control plane, GPU Operator, storage, scheduler, access layer, all as code.
A cluster that works but is unreliable	Audit and remediation: find why nodes go `NotReady`, why jobs sit pending, why a driver update took the fleet down, and fix it.
GPUs allocated but underutilized	Scheduling and quota rework, often paired with cost and chargeback to measure the improvement.
ML teams cannot self-serve	Add the access layer: workspaces, job submission, and namespaced quota behind SSO.