Day-2 operations and SRE

Keeping a GPU cluster healthy over time

Standing up a GPU cluster is one problem. Keeping it healthy through driver upgrades, node failures, Kubernetes version bumps, and a year of workload growth is another. Most GPU clusters degrade slowly because no one has built the operational practices to maintain them. We help you build those practices, or run them alongside your team while it builds them.


What we deliver

Operational practices transferred to your team

The goal is a cluster your team can operate without us: a tested upgrade process, observability that catches GPU problems before users hit them, and runbooks for the failures that occur on GPU fleets.

Upgrades that do not disrupt the fleet

A staged process for Kubernetes versions, GPU drivers, and the operator stack: test on a canary node pool, drain GPU nodes gracefully so running jobs finish, roll forward with a tested rollback path. Documented so your team can run the next upgrade without us.
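As a rough illustration of the staged shape described above, here is a minimal Python sketch. It is not our actual tooling: the `upgrade`, `healthy`, and `rollback` callables are hypothetical hooks you would supply (for example, wrapping a graceful drain plus the driver or kubelet upgrade), and the batch size is an assumed parameter.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    nodes: List[str]


def plan_stages(canary: List[str], fleet: List[str], batch_size: int) -> List[Stage]:
    """Put the canary pool first, then split the remaining nodes into batches."""
    canary_set = set(canary)
    rest = [n for n in fleet if n not in canary_set]
    stages = [Stage("canary", list(canary))]
    for i in range(0, len(rest), batch_size):
        stages.append(Stage(f"batch-{i // batch_size + 1}", rest[i:i + batch_size]))
    return stages


def run_upgrade(
    stages: List[Stage],
    upgrade: Callable[[str], None],   # hypothetical hook: drain gracefully, then upgrade
    healthy: Callable[[str], bool],   # hypothetical hook: post-upgrade health check
    rollback: Callable[[str], None],  # hypothetical hook: restore the previous version
) -> bool:
    """Upgrade one stage at a time. If a stage fails its health check,
    roll back every node touched so far (newest first) and stop."""
    touched: List[str] = []
    for stage in stages:
        for node in stage.nodes:
            upgrade(node)
            touched.append(node)
        if not all(healthy(node) for node in stage.nodes):
            for node in reversed(touched):
                rollback(node)
            return False
    return True
```

The point of the structure is that a failure on the canary pool stops the rollout before any production batch is touched, and the rollback path is exercised by the same code that rolls forward.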

GPU health monitoring and node lifecycle

Nodes fail in GPU-specific ways: a card throws Xid errors, ECC error counts climb, a node drops off the fabric while staying Ready. We configure DCGM-based health checks, automatic cordoning on GPU-level signals, and a draining policy so a degraded node stops receiving jobs.
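To make the cordon-versus-drain distinction concrete, the sketch below shows one way a policy could map GPU-level signals to an action. The thresholds, the field names, and the choice of which Xid codes count as critical are illustrative assumptions; a real policy should be set against NVIDIA's Xid documentation and your own fleet history.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable

# Assumption for illustration: treat Xid 48 (double-bit ECC error) and
# Xid 79 (GPU has fallen off the bus) as critical. Tune this set per fleet.
CRITICAL_XIDS = frozenset({48, 79})
# Hypothetical threshold for correctable (single-bit) ECC errors per hour.
SBE_RATE_LIMIT = 10.0

SEVERITY = {"ok": 0, "cordon": 1, "drain": 2}


@dataclass(frozen=True)
class GpuSignals:
    xid_codes: FrozenSet[int] = frozenset()  # Xid codes seen since last check
    double_bit_ecc: int = 0                  # uncorrectable ECC errors
    single_bit_ecc_per_hour: float = 0.0     # correctable ECC error rate
    on_fabric: bool = True                   # still reachable on the interconnect


def gpu_action(s: GpuSignals) -> str:
    """Return 'drain', 'cordon', or 'ok' for a single GPU."""
    if s.xid_codes & CRITICAL_XIDS or s.double_bit_ecc > 0 or not s.on_fabric:
        return "drain"   # stop new jobs and move running ones off
    if s.xid_codes or s.single_bit_ecc_per_hour > SBE_RATE_LIMIT:
        return "cordon"  # stop new jobs, let running ones finish
    return "ok"


def node_action(gpus: Iterable[GpuSignals]) -> str:
    """A node takes the most severe action of any of its GPUs."""
    return max((gpu_action(g) for g in gpus), key=SEVERITY.__getitem__, default="ok")
```

Note that kubelet readiness never appears in this decision: a node can be Ready and still be drained because of what its GPUs report.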

Observability for GPU fleets

Standard cluster monitoring does not surface GPU utilization, ECC errors, or Xid events. We wire up DCGM exporter, Prometheus, and dashboards covering the signals that matter: utilization, memory, temperature, Xid events, and per-job attribution.
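For a sense of what per-job attribution looks like at the data level, here is a small Python sketch that groups exporter samples by pod. It assumes the Prometheus text exposition format and metric names such as DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_GPU_TEMP, and DCGM_FI_DEV_XID_ERRORS as commonly exposed by DCGM exporter. The label names (`pod`, `Hostname`, `gpu`) and the sample values are assumptions that vary by exporter version and configuration, so check them against your own scrape output.

```python
import re
from collections import defaultdict
from typing import Dict, Tuple

# Illustrative scrape output; the values and label set are made up for this example.
SAMPLE = """
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a",pod="train-1"} 93
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a",pod="train-1"} 71
DCGM_FI_DEV_XID_ERRORS{gpu="0",Hostname="node-a",pod="train-1"} 0
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a",pod=""} 0
"""

LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')

Key = Tuple[str, str, str]  # (pod, hostname, gpu index)


def signals_by_pod(text: str) -> Dict[Key, Dict[str, float]]:
    """Group metric samples by (pod, host, gpu) so each job's GPUs can be inspected."""
    out: Dict[Key, Dict[str, float]] = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = LINE.match(line)
        if not match:
            continue  # skip samples without labels or in an unexpected shape
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels))
        pod = labels.get("pod") or "<unattributed>"
        key = (pod, labels.get("Hostname", "?"), labels.get("gpu", "?"))
        out[key][name] = float(value)
    return dict(out)
```

In practice this grouping is done in Prometheus queries and dashboards rather than in a script; the sketch only shows why the pod label on each sample is what makes attribution possible.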

Runbooks for GPU-specific failures

Written, tested procedures for the incidents that occur on GPU clusters: a node stuck NotReady, jobs pending with GPUs apparently free, a driver mismatch after a reboot, a hung NCCL job holding GPUs. Each one is structured so on-call engineers can diagnose and act without deep GPU expertise.

Quota and capacity hygiene

Quota drifts as teams grow. We set up quota enforcement, idle job reaping, and reporting on who holds what, so capacity stays available as the organization scales.
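The idle detection and "who holds what" reporting mentioned above can be sketched in a few lines of Python. The 5% utilization threshold, the 12-sample window, and the `GpuPod` record are hypothetical choices for illustration; the right window depends on your scrape interval and how bursty your workloads are.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class GpuPod:
    name: str
    owner: str
    gpus: int
    util_samples: Sequence[float]  # GPU utilization in percent, oldest first


def idle_pods(pods: List[GpuPod], util_threshold: float = 5.0,
              min_samples: int = 12) -> List[GpuPod]:
    """Pods whose last `min_samples` samples are all below the threshold.
    Pods with too little history are never flagged."""
    idle = []
    for pod in pods:
        recent = list(pod.util_samples)[-min_samples:]
        if len(recent) >= min_samples and max(recent) < util_threshold:
            idle.append(pod)
    return idle


def gpus_held_by_owner(pods: List[GpuPod]) -> Dict[str, int]:
    """Total GPUs currently allocated per owner, for the capacity report."""
    totals: Dict[str, int] = {}
    for pod in pods:
        totals[pod.owner] = totals.get(pod.owner, 0) + pod.gpus
    return totals
```

Requiring a full window of low samples is deliberate: a job that has just started, or one that paused briefly between training steps, should not be reaped.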

Optional retainer

If you want backup while your team builds operational depth, we offer a retainer: we cover GPU and Kubernetes incidents, and your team handles the rest and gradually takes on more.

Failure modes we harden against

What we have seen break on GPU clusters

The unplanned driver change

A node reboots and pulls a newer driver than the rest of the fleet. CUDA workloads fail in ways that look like application bugs. Pinning drivers and gating promotion keeps the fleet consistent and makes upgrades deliberate.
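A simple drift check captures the idea of pinning and gated promotion. The sketch below is a minimal Python illustration, assuming you can already collect each node's driver version into a mapping; the node names, version strings, and function names are hypothetical.

```python
from typing import Dict, List


def drifted_nodes(node_drivers: Dict[str, str], pinned: str) -> Dict[str, str]:
    """Nodes whose installed driver differs from the pinned fleet version."""
    return {node: ver for node, ver in sorted(node_drivers.items()) if ver != pinned}


def may_promote(candidate: str, canary_nodes: List[str],
                node_drivers: Dict[str, str], canary_healthy: bool) -> bool:
    """Gate: a candidate driver becomes the new pin only if every canary node
    is running it and the canary pool has passed its health checks."""
    if not canary_nodes:
        return False  # nothing has tested the candidate yet
    on_candidate = all(node_drivers.get(n) == candidate for n in canary_nodes)
    return on_candidate and canary_healthy
```

Run on a schedule, a check like `drifted_nodes` turns a silent version change after a reboot into an alert, before it shows up as failing CUDA workloads.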

The node that fails jobs but not health checks

A GPU starts throwing Xid errors but the node stays Ready, so the scheduler keeps placing jobs on it and they keep dying. Health checks that cordon on GPU-level signals, not just kubelet liveness, stop this.

The upgrade that keeps getting deferred

The cluster falls behind on Kubernetes and driver versions because no one has a tested process for upgrading it. We build and rehearse the upgrade path so it is a routine operation.

Allocation at 100%, utilization at 15%

GPUs are allocated but not doing useful work. Pods hold cards they are not using, interactive sessions are left open, and the job queue waits. Idle detection and reaping return that capacity to the pool. See also cost control and chargeback.

What handoff includes

Your team can run it after we leave

Deliverable	What it is
Upgrade runbook	Tested procedure for Kubernetes, driver, and operator upgrades, with a rollback path for each step.
Incident runbooks	Per-failure-mode procedures for GPU-specific incidents your on-call will encounter.
Monitoring and alerts	DCGM and Prometheus dashboards with alerts tuned to signal real problems, not noise.
Health and draining automation	Automatic cordoning and draining of unhealthy GPU nodes, as code in your repository.
On-call walkthrough	A working session with your team so the runbooks become theirs to use instead of sitting unread.