The Complete Guide to GPU Cloud Infrastructure

The architecture, operations, and failure modes of running a GPU cloud in 2026. Written for the people building them.

TL;DR

A GPU cloud is not a regular cloud with accelerators bolted on. The architecture, failure modes, and operational disciplines are different from the ground up.
The hardware layers (GPUs, networking, storage) get most of the attention. The operational layers (fleet lifecycle, multi-tenancy, observability, control plane) decide whether the business works.
Power is now the binding constraint, not GPU supply. The build-out is constrained by substation timelines, transformer lead times, and grid interconnection queues, not chip allocation.
Networking is where most builds quietly fail. Spec-sheet bandwidth and delivered bandwidth diverge significantly under real NCCL workloads, and the gap is usually only found by measuring.
Inference is now half the business. The training-cluster worldview that dominated 2023-2024 is no longer the default architecture.
Software is becoming the moat. Hardware access is converging; operational and platform sophistication is not.

Who this is for

This is written for the people standing up GPU infrastructure at scale: neocloud founders and CTOs, infrastructure leads at enterprises building private AI clouds, sovereign AI program teams, and engineers inside hyperscalers running accelerated compute. We assume you know what a GPU is, what training and inference are, and roughly what NVLink does. If you don’t, this isn’t the right starting point.

What follows is the surface map of the infrastructure stack and its operational realities. Each section links to a deeper piece. Use it as a reference and a checklist.

Why GPU clouds are structurally different

Internalize this first: the cloud playbook from the 2010s doesn’t transfer. The architecture that made AWS, Azure, and GCP work (heterogeneous workloads, oversubscribed CPUs, virtualization everywhere, eventual consistency on storage) is actively wrong for AI infrastructure.

GPU workloads have four properties that change everything downstream:

Tightly coupled and synchronous. A training job spans hundreds or thousands of GPUs that have to act as one machine. A single straggler at 90% speed drags the entire job down to 90% of its throughput, because every synchronous step waits for the slowest participant. There is no graceful degradation.
Correlated failure modes. A single NIC fault, a flaky optical link, or one degraded GPU can take down a job spanning hundreds of nodes. Blast radius is orders of magnitude larger than CPU clouds.
Hardware that depreciates in months. A GB200 deployed today will be competing with the next generation within 18 months. Idle time isn’t a missed-revenue problem; it’s a stranded-asset problem.
Operational density. A modern AI rack draws 100-130 kW. The same physical footprint that hosted 10 kW of CPU servers a decade ago now needs roughly an order of magnitude more power, cooling, and structural support.
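
To make the straggler arithmetic concrete, here is a minimal sketch. It assumes a purely synchronous step in which every GPU must finish before the next step begins; the function name and the fleet size are illustrative, not taken from any framework.

```python
def synchronous_throughput(relative_speeds):
    """Throughput of a synchronous job relative to an all-healthy fleet.

    Each step ends only when the slowest worker finishes, so the whole
    job runs at the speed of the slowest GPU.
    """
    if not relative_speeds:
        raise ValueError("need at least one worker")
    return min(relative_speeds)


# Illustrative fleet: 1,023 healthy GPUs and one straggler at 90% speed.
speeds = [1.0] * 1023 + [0.9]
throughput = synchronous_throughput(speeds)
slowdown = 1.0 / throughput - 1.0  # extra wall-clock time per step

print(f"job throughput: {throughput:.0%} of nominal")
print(f"wall-clock time per step: +{slowdown:.1%}")
```

Note the asymmetry: a 10% throughput loss means roughly 11% more wall-clock time, and it applies to every GPU in the job, not only the slow one.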

These properties cascade into every architectural decision: why bare metal beats virtualization for training, why lossless networking is mandatory, why scheduling matters more than it ever did for CPU workloads, why power and cooling are now first-class infrastructure problems rather than facilities problems, and why software operations decide whether the unit economics work.

Key fact: GPU cloud workloads differ from traditional cloud workloads on four dimensions that compound: synchronous tight coupling, correlated failure modes, sub-18-month hardware depreciation, and 10x power density per rack.¹

The physical layer: power, cooling, and the data center

The biggest shift in GPU cloud infrastructure over the past two years is that the data center itself is now the constraint, not the chips. This is the section most strategic overviews skip. Don’t.

Power is the bottleneck

A modern AI campus draws hundreds of megawatts. A 1 GW campus (and several are now operational or under construction) consumes roughly the electricity of a mid-sized city. The build-out is constrained by:

Utility interconnection queues. Connecting hundreds of MW to the grid takes years in most regions. Some U.S. markets now quote 4-7 year interconnection timelines.
Transformer and substation lead times. Large power transformers are on multi-year backorder globally. Operators are buying used transformers and pre-ordering against speculative builds.
Transmission capacity. Even if generation exists, getting it to the data center site requires transmission that may not. This is why operators are siting next to power generation rather than next to demand.

The implication: where you build matters more than what you put in the building. Regions with available power, cooperative utilities, and existing transmission are commanding premiums. Stranded power (generation that has nowhere to go because transmission can’t carry it) has become a real siting strategy for operators willing to co-locate with energy sources.²

Cooling has converged on liquid

Air cooling tops out around 30-40 kW per rack. GB200 NVL72 racks draw 120-130 kW. The math doesn’t work. Direct-to-chip liquid cooling is now standard for new builds, and rear-door heat exchangers handle the transition for retrofitted air-cooled facilities.

Operationally, this is a meaningful skill shift. Data center teams that have spent careers managing CRAC units and hot/cold aisle containment are now managing coolant distribution units, leak detection, and water chemistry. The failure modes are different. A coolant leak above a $40M rack is a different incident than a chiller failure.

Power oversubscription and load shaping

Because GPU workloads are highly variable in instantaneous power draw (a synchronized backward pass across 10,000 GPUs creates a coordinated current spike), operators are dealing with power management problems that don’t exist in traditional cloud. Some are deliberately shaping workload schedules to flatten power draw and avoid drawing more from the grid than their contracts permit. Others are deploying battery buffers to absorb spikes.
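
One way to picture load shaping is as a phase-offset problem. The sketch below assumes each job has a repeating power profile and that an operator can delay a job by some number of time slots; both the profile values (in MW) and the staggering approach are illustrative assumptions, not a description of any specific operator’s method.

```python
def peak_power(traces, offsets):
    """Peak of the summed draw when each periodic trace is shifted by its offset.

    traces: equal-length lists of power samples (MW), one list per job
    offsets: number of time slots each job is delayed by
    """
    period = len(traces[0])
    if any(len(trace) != period for trace in traces):
        raise ValueError("all traces must share one period")
    totals = [
        sum(trace[(t - off) % period] for trace, off in zip(traces, offsets))
        for t in range(period)
    ]
    return max(totals)


# Two hypothetical jobs that alternate between 10 MW and 6 MW phases.
job = [10, 10, 6, 6]
aligned = peak_power([job, job], offsets=[0, 0])
staggered = peak_power([job, job], offsets=[0, 2])

print(f"aligned peak: {aligned} MW, staggered peak: {staggered} MW")
```

The total energy consumed is identical in both cases; only the peak that the grid contract and the upstream electrical gear have to absorb changes.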

Key fact: GB200 NVL72 racks operate at approximately 120-132 kW, roughly 3-4x the density of prior-generation H100 deployments and well beyond the practical limit of air cooling.³

The compute layer

The accelerators themselves. In 2026, this primarily means NVIDIA’s Blackwell generation (B200, GB200 NVL72) for new builds, with H100 and H200 fleets still doing most of the production inference work. AMD’s MI300X and MI325X have found real customers, particularly for inference workloads where the larger HBM capacity helps. Google’s TPUs and AWS’s Trainium are vertically integrated alternatives available only inside those hyperscalers.

Hardware choice for an operator is mostly about supply allocation, not preference. The more interesting question is what mix: training-optimized configurations (GB200 NVL72 racks with dense NVLink domains) look very different from inference-optimized ones (single-node H100/H200 boxes with high HBM and looser networking requirements). Operators serving both training and inference workloads end up running heterogeneous fleets, which creates its own operational complexity around scheduling, billing, and capacity planning.

Procurement and supply chain reality

The supply chain dimension is consistently underestimated in strategic overviews. Real operators spend significant time managing:

GPU allocation cycles. Even with Blackwell shipping at volume, allocation is rationed across operators and timed against generation transitions.
Optics shortages. 800G optics have been intermittently constrained for two years. A cluster missing 5% of its optics is not a cluster.
Rack integration delays. OEM rack integration capacity is finite. Time from “GPUs in the warehouse” to “GPUs in a rack on the floor” is a meaningful KPI.
Burn-in and qualification. New hardware needs validation. Skip burn-in and the failures surface during paying customer workloads, not during qualification.
Firmware compatibility matrices. GPU firmware, NIC firmware, switch firmware, BIOS, and driver versions form a combinatorial space. A bad combination breaks NCCL in subtle ways.
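
A compatibility matrix is easiest to enforce when it is an explicit allowlist of qualified combinations. The sketch below is one possible shape for that check; the field names and every version string are placeholders, and a real inventory would be collected from the nodes rather than typed in.

```python
from typing import NamedTuple


class FirmwareSet(NamedTuple):
    gpu_vbios: str
    nic_fw: str
    switch_fw: str
    bios: str
    driver: str


# Hypothetical qualified combinations; real entries would come from
# qualification runs, not from this example.
QUALIFIED = {
    FirmwareSet("vbios-a", "nic-1.2", "sw-3.0", "bios-07", "drv-550"),
    FirmwareSet("vbios-a", "nic-1.3", "sw-3.0", "bios-07", "drv-550"),
}


def unqualified_nodes(inventory):
    """Return the node names whose firmware combination is not qualified."""
    return sorted(name for name, fw in inventory.items() if fw not in QUALIFIED)


inventory = {
    "node-001": FirmwareSet("vbios-a", "nic-1.2", "sw-3.0", "bios-07", "drv-550"),
    # Each version here is individually plausible, but the combination
    # with switch firmware sw-3.1 was never qualified.
    "node-002": FirmwareSet("vbios-a", "nic-1.3", "sw-3.1", "bios-07", "drv-550"),
}

print(unqualified_nodes(inventory))
```

Checking whole tuples rather than individual versions is the point: the failure mode is a bad combination, not a bad component.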

The networking layer

This is where most builds fail in ways that take months to diagnose. It deserves more attention than any other layer.

The fabric choice

AI training requires east-west bandwidth at levels traditional cloud networking cannot deliver. The two viable fabrics:

InfiniBand (NVIDIA/Mellanox). The incumbent for AI training. Lossless by design, well-understood, mature tooling, expensive, and increases NVIDIA dependency.
RoCEv2 over Ethernet, increasingly championed by the Ultra Ethernet Consortium. More flexible, cheaper at scale, broader vendor ecosystem, but requires careful tuning to achieve InfiniBand-class predictability under load.

The framing that “InfiniBand is required for AI training” is increasingly out of date. Multiple large operators have deployed Ethernet-based fabrics at training scale. The honest 2026 answer is that both work, the engineering effort to make them work is different, and the strategic calculus around vendor concentration is now part of the decision.

Topology matters more than bandwidth

The standard fat-tree topology with 1:1 oversubscription is the safe default, but it’s expensive. Operators serving cost-sensitive workloads experiment with oversubscribed topologies, dragonfly variants, and rail-optimized designs. Each choice has implications for collective performance:

Rail-optimized topologies preserve NCCL ring and tree performance by ensuring same-rail GPUs across nodes share short paths.
Topology-aware scheduling is the difference between gang-scheduled jobs achieving advertised performance and getting 60-70% of it. Schedulers that don’t understand which GPUs share NVLink domains, which nodes share leaf switches, and which leaf switches share spines, will place jobs in ways that destroy collective performance.
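
The placement logic can be sketched in a few lines. This is a deliberately simplified greedy heuristic, assuming a two-level fabric where the only topology fact the scheduler knows is which leaf switch each node hangs off; the node and leaf names are hypothetical, and production schedulers also weigh NVLink domains, rails, and spines.

```python
from collections import defaultdict


def leaves_spanned(nodes, leaf_of):
    """Number of distinct leaf switches a placement touches (lower is better)."""
    return len({leaf_of[n] for n in nodes})


def place_gang(free_nodes, needed, leaf_of):
    """Pick `needed` nodes, filling from the leaves with the most free nodes.

    Returns None when the job cannot be placed in full (all-or-nothing).
    """
    if len(free_nodes) < needed:
        return None
    by_leaf = defaultdict(list)
    for node in free_nodes:
        by_leaf[leaf_of[node]].append(node)
    chosen = []
    for leaf in sorted(by_leaf, key=lambda name: (-len(by_leaf[name]), name)):
        take = needed - len(chosen)
        chosen.extend(sorted(by_leaf[leaf])[:take])
        if len(chosen) == needed:
            break
    return chosen


# Hypothetical fabric: n0-n3 on leaf-a, n4-n7 on leaf-b.
leaf_of = {f"n{i}": ("leaf-a" if i < 4 else "leaf-b") for i in range(8)}
free = ["n0", "n1", "n4", "n5", "n6", "n7"]

naive = sorted(free)[:4]              # first four free nodes by name
aware = place_gang(free, 4, leaf_of)  # topology-aware choice

print(naive, leaves_spanned(naive, leaf_of))
print(aware, leaves_spanned(aware, leaf_of))
```

The naive choice straddles two leaves while the topology-aware one stays on a single leaf, which is the property that keeps collective traffic off the spine.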

The bandwidth gap

The single most important operational fact about AI networking: spec-sheet bandwidth and delivered bandwidth are not the same number, and the gap is often large. A nominally 400Gbps interconnect can deliver a fraction of that under real NCCL workloads if congestion control is mistuned, ECMP hashing is poor, optics are flaky, or buffer settings are wrong.

The discipline that separates serious operators from amateurs is measuring before assuming. NCCL all-reduce benchmarks across the actual cluster topology, under realistic message sizes, before any customer workload runs. A 20% bandwidth deficit, undiscovered, costs an operator millions of GPU-hours of effective capacity over a fleet’s lifetime.
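
Turning a raw all-reduce timing into a number comparable against the spec sheet takes one conversion. The sketch below uses the bus-bandwidth convention from the nccl-tests project for all-reduce (algorithm bandwidth scaled by 2(n-1)/n); the message size, timing, and expected figure are made-up inputs for illustration.

```python
def allreduce_bus_bandwidth(message_bytes, seconds, ranks):
    """Bus bandwidth in GB/s, following the nccl-tests all-reduce convention.

    algbw = bytes / time
    busbw = algbw * 2 * (ranks - 1) / ranks
    """
    if ranks < 2:
        raise ValueError("all-reduce needs at least two ranks")
    algbw = message_bytes / seconds / 1e9
    return algbw * 2 * (ranks - 1) / ranks


def bandwidth_deficit(measured, expected):
    """Fraction of expected bandwidth that is missing."""
    return 1.0 - measured / expected


# Hypothetical measurement: a 1 GB all-reduce across 8 ranks in 50 ms,
# compared against an assumed 48 GB/s target for the link.
measured = allreduce_bus_bandwidth(1e9, 0.05, 8)
deficit = bandwidth_deficit(measured, expected=48.0)

print(f"bus bandwidth: {measured:.1f} GB/s, deficit: {deficit:.0%}")
```

Running this across a sweep of message sizes and node groups, and alerting on the deficit, is what turns a one-off benchmark into a qualification gate.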

Failure modes operators learn the hard way

Silent rail degradation. One degraded rail in a multi-rail setup can drop aggregate collective bandwidth materially. NCCL collectives are bounded by the slowest link, so a single degraded rail throttles the entire ring without alarming any monitoring system that doesn’t specifically look for it.⁴
Optical link flaps. Optics fail more than people expect. A flapping link in the middle of a training run causes NCCL hangs that are difficult to attribute.
Congestion collapse. Under sustained collective traffic, mistuned RoCE deployments can enter congestion patterns that drop throughput dramatically with no warning.
Firmware drift. Switch firmware versions across a fabric must match. Drift produces subtle, intermittent NCCL failures.
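
Silent rail degradation is only silent if nobody compares rails to each other. A minimal detector, assuming you already collect per-rail bandwidth measurements for each node, might look like the following; the 10% tolerance and the sample values are assumptions to tune, not recommended thresholds.

```python
from statistics import median


def degraded_rails(rail_gbps, tolerance=0.10):
    """Flag rails measuring more than `tolerance` below the node's median rail."""
    baseline = median(rail_gbps.values())
    floor = baseline * (1.0 - tolerance)
    return sorted(rail for rail, bw in rail_gbps.items() if bw < floor)


# Hypothetical per-rail measurements (Gbps) for one 8-rail node.
measurements = {
    "rail0": 392.0, "rail1": 390.5, "rail2": 281.0, "rail3": 391.2,
    "rail4": 389.9, "rail5": 392.4, "rail6": 390.8, "rail7": 391.0,
}

print("degraded:", degraded_rails(measurements))
print("collective is bounded by:", min(measurements.values()), "Gbps")
```

Comparing against the median of sibling rails rather than a fixed number keeps the check valid across link generations and avoids hard-coding spec-sheet values.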

Key fact: Topology-aware scheduling that respects NVLink domains, leaf switches, and rail boundaries can be the difference between achieving most of the theoretical collective bandwidth on a large job and degrading significantly. The gap depends on how badly the job’s allocation is fragmented across the fabric.⁵

The storage layer

The old “object store for cheap, block store for fast” split breaks down at GPU cloud scale. Training a frontier model means streaming tokenized data to thousands of GPUs at line rate, then periodically writing multi-terabyte checkpoints fast enough that they don’t stall the cluster.

What’s actually in production

Parallel file systems (WEKA, VAST, DDN Lustre, IBM Storage Scale) for the hot training tier. These deliver the IOPS and throughput training jobs need, with the metadata performance to handle small-file-heavy datasets.
High-performance object storage (often the same vendors, sometimes S3-compatible layers) for datasets, model artifacts, and cold tiers.
Local NVMe on each GPU node for scratch, intermediate state, and dataset caching.
GPUDirect Storage paths that let GPUs read directly from NVMe and network storage without bouncing through CPU memory.

The non-obvious storage problems

Most operators discover these the hard way, three months into production:

Checkpoint write storms. A large training job checkpointing simultaneously across thousands of GPUs creates a synchronized write burst that can saturate storage and stall the cluster. Asynchronous checkpointing, sharded checkpointing, and tiered checkpoint storage are the workarounds.
Metadata bottlenecks. Training datasets with billions of small files (common for vision and multi-modal training) crush file system metadata services even when raw bandwidth is plentiful. The fix is dataset packing (WebDataset, MosaicML’s MDS) and is non-trivial to retrofit.
Data loader starvation. GPUs idle waiting on tokenized data show up as low utilization that looks like a scheduling problem and is actually a storage problem. Pipeline depth, caching strategy, and prefetch tuning matter.
Cache locality. Datasets accessed repeatedly should live in local NVMe or RAM cache. Operators who treat storage as a single undifferentiated tier leave significant performance on the table.
RDMA-aware storage paths. Storage that integrates with the RDMA fabric performs differently than storage that doesn’t. Architecture matters at the protocol level.
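
The idea behind dataset packing is simple: replace many small files with a few large archives so the metadata service sees a handful of opens instead of billions. The sketch below packs samples into tar shards with the Python standard library, which is the container format WebDataset reads; it is not WebDataset’s own writer, and the sample names and shard size are illustrative.

```python
import io
import tarfile


def pack_shards(samples, samples_per_shard):
    """Pack (name, payload_bytes) samples into in-memory tar shards."""
    shards = []
    for start in range(0, len(samples), samples_per_shard):
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w") as tar:
            for name, payload in samples[start:start + samples_per_shard]:
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
        shards.append(buf.getvalue())
    return shards


# Ten tiny hypothetical samples become three shard objects.
samples = [(f"{i:06d}.txt", b"x" * 10) for i in range(10)]
shards = pack_shards(samples, samples_per_shard=4)

with tarfile.open(fileobj=io.BytesIO(shards[0])) as tar:
    first_shard_names = tar.getnames()

print(len(shards), first_shard_names)
```

In production the shards would be written to the parallel file system or object store and sized in the hundreds of megabytes to gigabytes, so that reads are large and sequential.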

Key fact: Checkpoint operations on large training jobs can consume meaningful portions of total wall-clock time without explicit checkpoint optimization. Operators running large training jobs without asynchronous or sharded checkpointing routinely report training stalls measured in minutes per checkpoint, aggregating to substantial productivity loss over a multi-week run.⁶
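
A back-of-envelope model shows how quickly synchronous checkpoints add up. The sketch assumes the whole job stalls for the full duration of each write and that the storage tier sustains a fixed aggregate bandwidth; the checkpoint size, bandwidth, and interval are hypothetical inputs.

```python
def checkpoint_overhead(checkpoint_tb, write_gb_per_s, interval_minutes):
    """Stall seconds per checkpoint and the fraction of wall-clock time lost.

    checkpoint_tb: checkpoint size in terabytes
    write_gb_per_s: sustained aggregate write bandwidth in gigabytes per second
    interval_minutes: useful training time between checkpoints
    """
    stall_s = checkpoint_tb * 1000.0 / write_gb_per_s
    interval_s = interval_minutes * 60.0
    return stall_s, stall_s / (stall_s + interval_s)


# Hypothetical job: 10 TB checkpoint, 50 GB/s sustained, every 30 minutes.
stall_s, fraction = checkpoint_overhead(10, 50, 30)

print(f"stall per checkpoint: {stall_s:.0f} s")
print(f"wall-clock time lost: {fraction:.0%}")
```

Asynchronous checkpointing attacks the stall term by overlapping the write with training; sharded checkpointing attacks it by raising the effective write bandwidth.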

The orchestration layer

How jobs get placed onto hardware. Most coverage of this stops at “Slurm or Kubernetes.” The real operational problems are downstream of that choice.

Slurm, Kubernetes, or both

The honest 2026 answer: serious GPU clouds run both. Slurm for large training jobs that need gang scheduling and tight topology awareness. Kubernetes for inference services, development environments, and anything long-running and service-shaped. Kubernetes-only deployments with extensions like Kueue and Volcano are growing, but training workloads that actually pay the bills still mostly want Slurm semantics.

The real scheduling problems

What separates a working orchestration layer from a broken one:

Gang scheduling that actually works. Large jobs need all-or-nothing placement. Partial placement that hangs waiting for missing nodes is worse than rejection.
Topology-aware bin packing. Placing a 64-GPU job across 8 nodes that share a leaf switch is dramatically faster than placing it across 8 nodes scattered through the fabric. The scheduler has to know this.
Backfilling. Large reserved jobs leave gaps. A scheduler that can opportunistically backfill those gaps with small jobs is the difference between 60% and 80% utilization.
Preemption that preserves work. Killing a 6-hour training job to make room for a higher-priority workload is sometimes necessary. Doing it without checkpoint coordination is a customer-relationship event.
Fairness and quota policies. Multi-tenant clusters need queue policies that prevent any one customer from monopolizing capacity while still allowing burst usage. Getting this wrong shows up as customer complaints, not metrics.
Reservation systems. Enterprise customers want guaranteed capacity windows. Reservations interact with backfilling, preemption, and spot pricing in non-trivial ways.
Elastic training. Some training frameworks can grow and shrink their GPU allocation dynamically. Schedulers that can take advantage of this recapture significant capacity. Most can’t.
Multi-cluster federation. Operators with multiple data centers eventually need to schedule across them. This is harder than it sounds; latency-sensitive collective workloads do not federate.
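
Backfilling is the easiest of these to show in code. The sketch below is a simplified, conservative variant: a queued job may start only if it fits in the currently idle nodes and its declared runtime ends before the next reservation begins. The job fields and queue contents are hypothetical, and real schedulers track multiple reservations and per-node availability.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int
    runtime_h: float  # user-declared wall-time limit


def backfill(queue, free_nodes, hours_until_reservation):
    """Start queued jobs that fit in the idle gap before a reserved job.

    The reserved job is never delayed: a candidate must fit in the free
    nodes and finish before the reservation starts.
    """
    started = []
    for job in queue:
        if job.nodes <= free_nodes and job.runtime_h <= hours_until_reservation:
            started.append(job.name)
            free_nodes -= job.nodes
    return started


queue = [
    Job("a", nodes=8, runtime_h=2.0),
    Job("b", nodes=4, runtime_h=6.0),  # too long for the gap
    Job("c", nodes=4, runtime_h=1.0),
    Job("d", nodes=8, runtime_h=1.0),  # no nodes left by the time it is considered
]

print(backfill(queue, free_nodes=12, hours_until_reservation=3.0))
```

The scheme depends entirely on honest runtime limits, which is why schedulers kill jobs at their declared wall time.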

This layer is where operational maturity is most visible. The sophistication gap between a 6-month-old neocloud and a 3-year-old one is wider here than at any other layer.

The multi-tenancy layer

A GPU cloud has to run multiple customers' workloads on shared hardware without leaking performance, data, or failures between them. The CPU cloud playbook for this (namespaces, cgroups, VPCs, IAM) doesn’t map cleanly onto GPUs.

The isolation primitives

MIG (Multi-Instance GPU). Partitions a single GPU into hardware-isolated slices for inference workloads. Operationally, keeping MIG profiles from fragmenting across a fleet is its own source of complexity.
Full-GPU and full-node allocation. The default for training workloads where partitioning isn’t viable.
GPU passthrough vs. bare metal. Passing GPUs through hypervisors adds operational surface area (SR-IOV configuration, driver coordination, firmware lifecycle) that many operators decide isn’t worth the flexibility.
Network isolation at the fabric level. Harder than VPC isolation, particularly on InfiniBand where partition keys are the primary mechanism.
Storage isolation across shared parallel file systems, which most weren’t originally designed for multi-tenant use.
Identity, RBAC, and quota that ties all of the above together.

The problems operators actually hit

Noisy neighbor on shared fabric. One tenant’s collective traffic degrades another’s. Fabric-level QoS is the answer; most parallel file systems and switches support it imperfectly.
DMA isolation concerns. GPUs do DMA. Cross-tenant DMA isolation is a security boundary that requires careful configuration.
Tenant trust boundaries. Inference serving where one tenant’s request hits a model another tenant fine-tuned has a different threat model than training. Many architectures conflate them.
Kubernetes device plugin instability. The plugin ecosystem for GPUs is maturing but historically fragile. Plugin restarts mid-job are real.
MIG operational fragmentation. A fleet with mixed MIG profiles is harder to schedule than a uniform fleet. Profile changes require GPU resets. Capacity planning across MIG profiles is non-trivial.
Firmware drift across the fleet. Tenants running workloads sensitive to specific driver or firmware versions get angry when versions drift.

Key fact: Multi-tenant GPU isolation requires solving four distinct problems with imperfect primitives: compute partitioning (MIG and allocation), network isolation at the fabric layer (partition keys on InfiniBand, VLAN/VRF on Ethernet), storage isolation on shared parallel file systems, and identity-bound quota across all three.⁷

The fleet lifecycle layer

What nobody talks about until they live through it: a GPU fleet is not a static asset. It needs continuous management across its operational life.

Firmware rollout strategy. GPU firmware, NIC firmware, switch firmware, BIOS: all need coordinated updates without taking the fleet offline. Rolling maintenance windows, canary deployments, and rollback strategies all apply.
BIOS and driver drift. Across thousands of nodes, configuration drifts. Detecting and remediating drift is an ongoing job, not a one-time setup.
Cluster qualification. New hardware requires structured validation before joining the production pool. This is its own pipeline.
Burn-in testing. New GPUs have higher early failure rates. Burn-in catches them before customers do.⁸
Automated drain-and-replace. When a node fails (and they will, daily, at scale), draining workloads, isolating the node, replacing components, requalifying, and returning to the pool needs to be automated. Manual handling doesn’t scale past a few hundred nodes.
Rolling maintenance. Patches, upgrades, hardware swaps happen continuously. Doing them without disrupting tenants is a discipline.
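
Drain-and-replace automation usually starts as an explicit state machine, so that no node can return to the pool without passing requalification. The states and transitions below are one plausible layout, assumed for illustration rather than taken from any particular fleet manager.

```python
from enum import Enum


class NodeState(Enum):
    IN_POOL = "in_pool"
    DRAINING = "draining"
    ISOLATED = "isolated"
    REPAIRING = "repairing"
    REQUALIFYING = "requalifying"


# Allowed transitions. A failed requalification sends the node back to repair.
TRANSITIONS = {
    NodeState.IN_POOL: {NodeState.DRAINING},
    NodeState.DRAINING: {NodeState.ISOLATED},
    NodeState.ISOLATED: {NodeState.REPAIRING},
    NodeState.REPAIRING: {NodeState.REQUALIFYING},
    NodeState.REQUALIFYING: {NodeState.IN_POOL, NodeState.REPAIRING},
}


def advance(current, target):
    """Move a node to `target`, rejecting any transition not in the table."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


state = NodeState.IN_POOL
for step in (NodeState.DRAINING, NodeState.ISOLATED, NodeState.REPAIRING,
             NodeState.REQUALIFYING, NodeState.IN_POOL):
    state = advance(state, step)

print(state.name)
```

Encoding the table as data makes the invariant auditable: there is no path from REPAIRING to IN_POOL that skips REQUALIFYING.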

The reliability dimension matters. GPU hardware fails meaningfully more often than CPU hardware at the densities operators are now running. The Meta OPT-175B training paper documented 35 manual restarts, more than 70 automated restarts, and over 100 hosts cycled due to hardware issues across a 2-month training run on 992 A100 GPUs.⁹ More recent Meta disclosures around Llama 3.1 405B training (16,384 H100s) reported 419 unexpected interruptions over a 54-day period, roughly one every three hours, with about 78% attributed to confirmed or suspected hardware issues. The underlying rates have not dramatically improved with newer generations.¹⁰
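
The “one every three hours” figure falls straight out of the reported numbers, and the same arithmetic gives a per-GPU rate that can be scaled to other cluster sizes. The projection at the end assumes interruptions are independent and that the per-GPU rate stays constant, which is a simplification; the 100,000-GPU cluster is hypothetical.

```python
def hours_between_interruptions(days, interruptions):
    """Mean wall-clock hours between interruptions for the whole job."""
    return days * 24.0 / interruptions


def gpu_hours_per_interruption(days, interruptions, gpus):
    """Mean GPU-hours accumulated across the fleet per interruption."""
    return gpus * days * 24.0 / interruptions


# Figures quoted above for the Llama 3.1 405B run.
cluster_interval = hours_between_interruptions(54, 419)
per_gpu = gpu_hours_per_interruption(54, 419, 16_384)

# Assumption: same per-GPU rate on a larger, hypothetical cluster.
projected_interval = per_gpu / 100_000

print(f"cluster-level interval: {cluster_interval:.2f} h")
print(f"GPU-hours per interruption: {per_gpu:,.0f}")
print(f"projected interval at 100,000 GPUs: {projected_interval:.2f} h")
```

Individual GPUs look reliable in isolation; it is the multiplication by fleet size that makes daily failures a certainty.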

The inference serving layer

Inference is now half the business, and the architecture is diverging fast from training. A complete picture of GPU cloud infrastructure has to cover it.

The serving stack that mattered two years ago (Triton plus custom batching) has been substantially displaced. The current production stack centers on:

vLLM, SGLang, and TensorRT-LLM as the serving engines. Each has different strengths (vLLM for breadth and PagedAttention, SGLang for structured decoding workloads, TensorRT-LLM for absolute NVIDIA-stack performance).
Continuous batching rather than static batching. Requests of different lengths get folded into the same batch dynamically, dramatically improving throughput.
PagedAttention and KV cache management. KV cache memory dominates inference memory usage for long contexts. How it’s allocated, paged, and shared across requests is a primary performance lever.
Prefix caching. Multi-turn conversations and system-prompted workloads share long common prefixes. Caching the KV state for shared prefixes improves throughput meaningfully on the right workloads. Published benchmarks show roughly 13% throughput gain on vLLM and 35% on TensorRT-LLM for shared-prefix scenarios, with time-to-first-token improvements substantially larger at high cache hit rates.¹¹
Speculative decoding. A small draft model proposes tokens that the large model verifies. Done well, this can improve throughput meaningfully on the right workloads.
Disaggregated inference. Splitting the prefill and decode phases of inference onto different hardware. Prefill is compute-bound and benefits from dense GPUs; decode is memory-bandwidth-bound and tolerates older or less expensive hardware. Operators serving high-volume inference are increasingly running disaggregated topologies.
Token scheduling and routing. Routing requests to KV cache that already contains relevant state, balancing load across serving replicas, and managing tail latency under load.
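
To see why KV cache management is a primary lever, it helps to size the cache. The sketch uses the standard accounting for a transformer with grouped-query attention: one key and one value vector per layer, per KV head, per token. The model dimensions below are an illustrative configuration, not the specification of any named model.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_element):
    """Bytes of KV cache one token occupies across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element  # 2 = K and V


def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_element):
    """Total KV cache for `tokens` cached tokens (across all sequences)."""
    return tokens * kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_element)


# Illustrative config: 80 layers, 8 KV heads, head dimension 128, 16-bit values.
per_token = kv_bytes_per_token(80, 8, 128, 2)
one_long_request = kv_cache_bytes(32_768, 80, 8, 128, 2)

print(f"per token: {per_token / 1024:.0f} KiB")
print(f"one 32K-token request: {one_long_request / 2**30:.0f} GiB")
```

At this size a handful of long-context requests consumes a large share of a single GPU’s HBM, which is why paging, prefix sharing, and cache-aware routing pay off.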

The implication for infrastructure: inference clusters have a different shape than training clusters. Smaller node counts. Lower networking requirements per node. Different storage patterns. Different scheduling. An operator serving both needs heterogeneous infrastructure, not one fleet shoehorned into two roles.

The control plane

The layer that ties everything above into something a customer can actually use. Disclosure: this is what Saturn Cloud builds, so weight this section accordingly.

A GPU cloud control plane is responsible for:

Provisioning bare metal, Kubernetes, and Slurm clusters
Managing the lifecycle of training jobs and inference services
Enforcing multi-tenancy, RBAC, and quotas
Exposing self-service APIs and UIs to customers
Integrating with metering, billing, and identity
Supporting white-label deployments for resellers
Coordinating with the layers below: fleet management, scheduling, storage, networking, observability

The reason this layer exists as its own concern: each lower layer has its own configuration surface, its own failure modes, and its own operational discipline. The control plane is what makes them coherent from a tenant’s point of view. Without it, customers experience a GPU cloud as a collection of disconnected primitives, not a product.

A brief note on the market

In 2026, the GPU cloud landscape has three tiers: hyperscalers (AWS, Azure, GCP, Oracle) with distribution and breadth; large neoclouds (CoreWeave, Nebius, Crusoe, Lambda) with GPU-focused unit economics and rapid growth; and a long tail of 150-200 smaller and sovereign operators. Consolidation is underway.

The closing thesis: software is the moat

Three years ago, the differentiator in this market was access. Whoever could get H100s could sell H100s. That window has closed. GPU allocation is still rationed, but enough operators have access at enough scale that “we have GPUs” is no longer a strategy.

Two years ago, the differentiator was assumed to be power. Operators raced to lock down hundreds of MW of capacity in cheap-power regions, on the thesis that whoever controlled the substations controlled the business. Power still matters enormously, but it’s becoming a competitive necessity rather than a differentiator. Everyone serious is securing power.

The differentiator that’s actually emerging, and that operators underestimate consistently, is software.

Not “AI software” in the model sense. The operational and platform software that determines:

Whether you onboard a customer in days or months
Whether your fleet runs at 80% utilization or 55%
Whether your multi-tenancy is tight enough to charge premium prices
Whether you turn over a GPU generation without a six-month rebuild
Whether your operators sleep through the night

These are software outcomes. They’re decided by the orchestration layer, the multi-tenancy layer, the fleet lifecycle layer, the observability layer, and the control plane that ties them together. They compound: an operator running 80% utilization on a 10,000-GPU fleet has roughly 45% more revenue capacity than one running 55% on the same hardware. Over a depreciation cycle, that gap is the business.
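
The 45% figure is just the ratio of the two utilization rates. A short calculation, assuming revenue scales linearly with sold GPU-hours and ignoring pricing differences, makes the absolute gap visible as well.

```python
HOURS_PER_YEAR = 8760


def sold_gpu_hours(gpus, utilization, hours=HOURS_PER_YEAR):
    """GPU-hours actually sold in a year at a given utilization."""
    return gpus * utilization * hours


high = sold_gpu_hours(10_000, 0.80)
low = sold_gpu_hours(10_000, 0.55)
gain = high / low - 1.0

print(f"relative gain: {gain:.0%}")
print(f"extra GPU-hours per year: {high - low:,.0f}")
```

On a 10,000-GPU fleet the difference is about 21.9 million GPU-hours per year, all of it delivered by hardware that has already been paid for.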

The neoclouds that survive consolidation will not be the ones with the most GPUs. They will be the ones whose software lets them operate the GPUs they have closer to the theoretical maximum, onboard customers faster than competitors, and turn over hardware generations without rebuilding the stack. Hardware is becoming a commodity. Operational sophistication is not.

This is uncomfortable for many operators because hardware is what their boards understand and what their press releases announce. Software competence doesn’t show up on a capacity dashboard. It shows up in margin, retention, and time-to-revenue, three quarters or more after the investment.

The operators who are winning right now have figured this out. The ones who haven’t will figure it out in 2027 or 2028, by which time consolidation will have started without them.

Saturn Cloud provides the control plane for GPU clouds. Partners include NVIDIA, AWS, Azure, Oracle, Nebius, Crusoe, AMD, and Intel.

Sources

1. Sheng et al., “Power for AI Data Centers: Energy Demand, Grid Impacts, Challenges and Perspectives,” Energies 19(3), 722 (MDPI, January 2026). https://www.mdpi.com/1996-1073/19/3/722. Confirms AI data center rack densities typically exceeding 40 kW and 100 kW+ for liquid-cooled equipment, against historical 5-10 kW for traditional racks. Hanwha Data Centers (December 2025) independently confirms the 10-15 kW traditional / 50-150 kW AI rack comparison: https://www.hanwhadatacenters.com/blog/what-are-the-power-requirements-for-ai-data-centers/.
2. Lawrence Berkeley National Laboratory queued interconnection data (PJM, MISO, ERCOT) documents typical 4-5+ year wait times across major U.S. RTOs with up to 7+ years in congested regions. Hanwha Data Centers (December 2025) corroborates 4-8 year grid interconnection timelines for AI data center scale.
3. Network World (March 2026), “Why AI rack densities make liquid cooling nonnegotiable,” confirming GB200 NVL72 racks at 120-130 kW. CUDO Compute Engineering (May 2026), “Cooling innovations: Immersion, liquid-to-chip, and advanced air cooling for AI infrastructure,” confirming 120-132 kW range and air-cooling limits at 8-12 kW.
4. Behavior follows directly from NCCL collective communication design: ring and tree topologies are bounded by the slowest participating link, so a single degraded rail throttles aggregate throughput. See NVIDIA NCCL documentation: https://docs.nvidia.com/deeplearning/nccl/.
5. General principle supported by published topology-aware GPU scheduling research, including Amaral et al., “Topology-Aware GPU Scheduling for Learning Workloads in Cloud Environments” (IEEE, 2017): https://ieeexplore.ieee.org/document/9926236/, and Ayadi et al., “TAMG: Topology-Aware Multi-GPU Allocation via Deep Reinforcement Learning” (2023). Specific magnitude depends heavily on workload, topology, and degree of fragmentation.
6. Zhang et al., “OPT: Open Pre-trained Transformer Language Models,” arXiv:2205.01068 (May 2022) and the accompanying Meta training chronicles document repeated training disruptions including checkpoint-related stalls during the OPT-175B run. https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/chronicles. Specific percentage of wall-clock time consumed by checkpointing varies by workload; framework-level work on asynchronous and sharded checkpointing (e.g., PyTorch DCP, DeepSpeed) exists specifically to address this overhead.
7. NVIDIA, “Multi-Instance GPU (MIG) User Guide,” https://docs.nvidia.com/datacenter/tesla/mig-user-guide/. InfiniBand Architecture Specification, Volume 1, on Partition Key (P_Key) based fabric isolation.
8. Meta Engineering, “The Infrastructure Behind Llama 3” (2024), and Zhang et al. (2022) OPT-175B paper both document elevated early-life failure rates in large GPU fleets, motivating burn-in qualification before production assignment. AWS, “Reduce ML training costs with Amazon SageMaker HyperPod” (2025), summarizes published failure data: https://aws.amazon.com/blogs/machine-learning/reduce-ml-training-costs-with-amazon-sagemaker-hyperpod/.
9. Zhang et al., “OPT: Open Pre-trained Transformer Language Models,” arXiv:2205.01068 (May 2022). The paper documents 35 manual restarts, 70+ automated restarts, and over 100 hosts cycled due to hardware issues across the 2-month training run on 992 A100 GPUs. Independently summarized in InfoQ (June 2022): https://www.infoq.com/news/2022/06/meta-opt-175b/.
10. Meta AI, “The Llama 3 Herd of Models” (2024), reporting 419 unexpected interruptions, most of them attributed to hardware issues, over a 54-day Llama 3 405B training run on 16,384 H100 GPUs (roughly one every three hours). Summarized in Tom’s Hardware: https://www.tomshardware.com/tech-industry/artificial-intelligence/faulty-nvidia-h100-gpus-and-hbm3-memory-caused-half-of-the-failures-during-llama-3-training-one-failure-every-three-hours-for-metas-16384-gpu-training-cluster and analyzed in detail by Epoch AI: https://epoch.ai/blog/hardware-failures-wont-limit-ai-scaling.
11. SqueezeBits, “vLLM vs TensorRT-LLM #12: Automatic Prefix Caching” (February 2025), benchmarking shared-prefix workloads: https://blog.squeezebits.com/vllm-vs-tensorrtllm-12-automatic-prefix-caching-38189. Reports ~13.3% throughput improvement on vLLM and ~34.7% on TensorRT-LLM, with larger TTFT gains at high cache hit rates. Note that workloads with very long shared prefixes and high hit rates can produce larger improvements than the headline numbers; the SqueezeBits dataset is representative but not maximal.