What an LLM Inference Stack Actually Looks Like

The layers of a production LLM inference stack, why each one exists, and which parts you get from open source versus build yourself.

TL;DR

“Serving an LLM” sounds like one component. In production it is three planes: a data plane that moves tokens to GPUs, a control plane that decides what runs where, and a tenant plane that meters, logs, and bills.
The data plane is a commodity. vLLM, SGLang, and TensorRT-LLM all converge on the same ideas, and you should not build your own.
A single serving engine gives you KV-cache reuse, continuous batching, and multi-LoRA serving for free, but only inside one replica. The cross-replica versions of those wins are your problem.
The control plane and tenant plane are where most of the engineering goes, because no off-the-shelf component knows about your tenants, your model names, or your billing.
Routing latency does not matter. Routing quality matters a lot, but only if your traffic has reusable structure (shared prompts, multi-turn, many fine-tunes off one base).

Who this is for

If you have only ever run vllm serve and pointed a client at it, you have seen the engine but not the stack. The gap between “a model answers on my laptop” and “a multi-tenant inference service that bills correctly” is large, and most of it is invisible until you hit it.

This post walks the layers from the GPU up, explains why each one exists rather than just what it is, and marks where open source stops and your own integration starts. It is for engineers who are about to stand up serving infrastructure and want to understand the problem before picking components.

Start with why inference is shaped the way it is

Every layer above the GPU is a response to two facts about how transformers generate text. Understanding both makes the rest of the stack obvious instead of arbitrary.

Fact one: generation is autoregressive and slow. A model produces one token at a time, and each token requires attention over every token before it. Generating a few hundred output tokens takes seconds. This is why throughput, batching, and caching dominate the design: the work is inherently sequential and memory-bound, so the whole stack is built to amortize it across requests.

Fact two: the KV cache is the real capacity limit. To avoid recomputing attention for the entire prefix on every step, the engine stores each token’s keys and values once and reuses them. That cache grows with sequence length and with the number of concurrent requests. It is per-layer, and for large models it runs to gigabytes per long request. Model weights are fixed; KV cache is what actually decides how many requests fit on a GPU.
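To make "gigabytes per long request" concrete, here is a back-of-the-envelope sketch of the arithmetic. The shape below (80 layers, 8 KV heads, head dimension 128, 16-bit values) is an illustrative assumption for a 70B-class model with grouped-query attention, and the 40 GiB weight figure is hypothetical; substitute your own model's config.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelShape:
    """Illustrative attention shape; replace with your model's real config."""
    num_layers: int
    num_kv_heads: int   # KV heads, not query heads (they differ under grouped-query attention)
    head_dim: int
    dtype_bytes: int    # 2 for fp16/bf16

def kv_bytes_per_token(m: ModelShape) -> int:
    # One key vector and one value vector per KV head, per layer.
    return 2 * m.num_layers * m.num_kv_heads * m.head_dim * m.dtype_bytes

def max_cached_tokens(gpu_bytes: int, weight_bytes: int, m: ModelShape) -> int:
    # Memory left after the weights is the KV budget (ignores activations and runtime overhead).
    return max(0, gpu_bytes - weight_bytes) // kv_bytes_per_token(m)

GIB = 1024 ** 3
shape = ModelShape(num_layers=80, num_kv_heads=8, head_dim=128, dtype_bytes=2)

per_token = kv_bytes_per_token(shape)            # 327,680 bytes, i.e. 320 KiB per token
per_32k_request = per_token * 32_768 / GIB       # 10.0 GiB for one 32k-token request
budget = max_cached_tokens(80 * GIB, 40 * GIB, shape)  # 131,072 tokens, i.e. four such requests
```

Under these assumptions, an 80 GiB GPU holding 40 GiB of weights fits only four full 32k-token requests at once, which is why the cache, not the weights, sets concurrency.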

Almost every optimization in the layers below is a KV-cache management trick.

The three planes

A production stack separates into three concerns. They are worth naming because they have different owners, different failure modes, and very different build-versus-buy answers.

Plane	Job	Where it comes from
Data plane	Move tokens to GPUs efficiently	Open source, commodity
Control plane	Decide what models run where, and scale them	Mostly open source, policy is yours
Tenant plane	Meter, log, quota, bill per customer	Capture is open source, your domain model is not

Most “inference stack” diagrams only draw the data plane. The other two are where the engineering goes.

The data plane

This is the part everyone pictures: a request comes in, a GPU produces tokens, they stream back.

The serving engine

At the bottom is the engine that actually runs the model: vLLM, SGLang, or TensorRT-LLM. They have converged enough that you can reason about them as a category, with a few distinctions that matter.

Engine	Origin idea	Strength	Cost
vLLM	PagedAttention	Broadest model coverage, easiest to operate, fastest to new models	None for most users; it is the safe default
SGLang	RadixAttention	Automatic prefix reuse, strong on multi-turn and agentic workloads	Slightly less mature tooling, improving fast
TensorRT-LLM	Ahead-of-time compilation	Peak performance on NVIDIA	Per-model, per-GPU-arch compile step; NVIDIA only; operational complexity

The thing to internalize is that TensorRT-LLM’s compile-ahead model is a poor fit for serving many arbitrary models across mixed GPU types, because every combination of model and GPU architecture needs its own compilation. vLLM and SGLang load weights and run, which fits a platform serving lots of different models far better. Reach for TensorRT-LLM on a fixed, high-volume, known endpoint where the per-token savings justify the pipeline, not as your substrate.

You inherit several things from any of these for free, and they are worth knowing by name because they explain the engine’s behavior:

PagedAttention. The idea that made vLLM. Treat KV cache like operating-system virtual memory: split it into fixed-size pages, allocate on demand, and let sequences share pages. Before this, servers allocated one contiguous block per request sized to the maximum length, and wasted most of their KV memory to fragmentation. Paging is why modern engines pack so many concurrent requests onto one GPU.
Continuous batching. Requests join and leave the running batch as they arrive and finish, instead of waiting for a fixed batch to fill and drain. Paging is what makes this cheap.
Prefix caching. If many requests share a prefix (a long system prompt, few-shot examples, a growing conversation history, a common RAG document), compute that prefix’s KV once and reuse it. vLLM does this with block hashing; SGLang’s RadixAttention does it automatically across any shared prefix path. For workloads with heavy sharing this is a large throughput win.
Multi-LoRA serving. Serve many fine-tuned adapters off one base model loaded once. This is the economic foundation of serving many customer fine-tunes, because the expensive base weights are shared and the adapters are small.
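Paging and prefix caching fit together, and a toy model shows how. The sketch below is not vLLM's or SGLang's implementation; it is a minimal illustration, assuming a tiny block size and no eviction, of fixed-size KV blocks allocated on demand and shared between requests whose prompts start with the same full blocks. `PagedKVCache` and `block_hashes` are hypothetical names.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per block; toy value chosen for readability

def block_hashes(tokens: List[int]) -> List[str]:
    """Hash each full block chained with its parent, so one hash identifies a whole prefix."""
    hashes: List[str] = []
    parent = ""
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        chunk = ",".join(map(str, tokens[start:start + BLOCK_SIZE]))
        parent = hashlib.sha256(f"{parent}|{chunk}".encode()).hexdigest()
        hashes.append(parent)
    return hashes

class PagedKVCache:
    def __init__(self, num_blocks: int) -> None:
        self.free: List[int] = list(range(num_blocks))
        self.by_hash: Dict[str, int] = {}    # prefix hash -> physical block id
        self.refcount: Dict[int, int] = {}   # how many requests reference each block

    def admit(self, tokens: List[int]) -> Tuple[List[int], int]:
        """Return (block table, number of prompt tokens served from cache)."""
        hashes = block_hashes(tokens)
        total_blocks = -(-len(tokens) // BLOCK_SIZE)  # ceiling division
        hit = 0
        while hit < len(hashes) and hashes[hit] in self.by_hash:
            hit += 1
        if total_blocks - hit > len(self.free):
            raise MemoryError("no free KV blocks; a real engine would queue or preempt")
        table = [self.by_hash[h] for h in hashes[:hit]]  # shared, already computed
        for idx in range(hit, total_blocks):
            block = self.free.pop()                      # allocated on demand
            if idx < len(hashes):
                self.by_hash[hashes[idx]] = block        # only full blocks are shareable
            table.append(block)
        for block in table:
            self.refcount[block] = self.refcount.get(block, 0) + 1
        return table, hit * BLOCK_SIZE

cache = PagedKVCache(num_blocks=8)
system_prompt = list(range(8))                                    # two full blocks
table_a, cached_a = cache.admit(system_prompt + [100, 101])       # cold: 0 cached tokens
table_b, cached_b = cache.admit(system_prompt + [200, 201, 202])  # warm: 8 cached tokens
```

The second request reuses the two blocks holding the shared system prompt and allocates only one new block for its own suffix. Real engines add eviction, copy-on-write, and block release when requests finish, all omitted here.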

Prefill and decode

The KV cache splits every request into two phases with different resource profiles, and this shows up everywhere in the stack:

Prefill processes the whole prompt at once to fill the cache. It is compute-bound and determines time-to-first-token.
Decode generates output one token at a time, streaming the entire cache through the GPU per token. It is memory-bandwidth-bound and determines how fast tokens come out after the first.

This asymmetry is why advanced stacks run prefill and decode on separate GPU pools, and why a long prompt from one user can stall everyone else’s token stream if you are not careful.
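A rough way to see the asymmetry is to estimate a floor for each phase: decode is limited by how fast memory can be read, prefill by how fast arithmetic can be done. The sketch below uses the common approximation of about 2 FLOPs per parameter per token for prefill. Every hardware and model number is a hypothetical round value, not a measurement, and real systems land above these floors.

```python
def decode_step_seconds(weight_bytes: float, kv_bytes_in_batch: float,
                        mem_bandwidth_bytes_per_s: float) -> float:
    """Floor for one decode step: all weights and all cached K/V are read once."""
    return (weight_bytes + kv_bytes_in_batch) / mem_bandwidth_bytes_per_s

def prefill_seconds(num_params: float, prompt_tokens: int, flops_per_s: float) -> float:
    """Floor for prefill, assuming roughly 2 FLOPs per parameter per prompt token."""
    return 2.0 * num_params * prompt_tokens / flops_per_s

# Hypothetical round numbers, not figures for any specific GPU or model.
WEIGHTS = 16e9       # bytes: an 8B-parameter model in 16-bit precision
BANDWIDTH = 2e12     # bytes/s of GPU memory bandwidth
COMPUTE = 4e14       # sustained FLOP/s
KV_PER_SEQ = 0.5e9   # bytes of KV cache held by each running sequence

solo = decode_step_seconds(WEIGHTS, 1 * KV_PER_SEQ, BANDWIDTH)
batched = decode_step_seconds(WEIGHTS, 32 * KV_PER_SEQ, BANDWIDTH)

solo_tokens_per_s = 1 / solo                       # about 121 tokens/s for one sequence
batched_tokens_per_s = 32 / batched                # about 2000 tokens/s across 32 sequences
ttft_floor = prefill_seconds(8e9, 4096, COMPUTE)   # about 0.16 s for a 4096-token prompt
```

Batching 32 sequences roughly doubles the step time but multiplies aggregate output by about sixteen, because the weight read is shared across the batch. A long prefill, by contrast, occupies the compute units for a noticeable fraction of a second, which is the stall other users feel.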

The data plane is a solved, commodity problem. Use a real engine. Do not write your own.

The control plane

Here is the first thing that is not free. The engine optimizes one replica. The moment you run more than one replica of a model, which you will the instant one GPU is not enough, the engine’s caches become private to each replica and stop coordinating.

Why a dumb load balancer hurts

Picture four replicas behind round-robin, with other traffic interleaved. A multi-turn conversation’s first turn lands on replica 1 and warms its cache. The second turn round-robins to replica 3, which is cold and recomputes the entire conversation history. The third turn lands on replica 2, also cold. Your prefix-cache hit rate collapses toward zero, not because the engine failed, but because the router fought the cache.

The fix is an inference router: a layer that routes a request to the replica that already holds its prefix, or to the least-loaded replica, or to the one that already has the right LoRA adapter loaded. This is what projects like the vLLM Production Stack and AIBrix provide, and it sits above the engine because the engine cannot see across replicas.

This is also where a common worry gets resolved. People ask whether adding a router hop hurts latency. It does not. A routing decision is sub-millisecond to a few milliseconds; a completion is seconds. The hop is noise. What matters is the quality of the decision: sending a request to a warm replica instead of a cold one can save hundreds of milliseconds to seconds of prefill. Routing overhead is negligible; routing intelligence is one of your largest latency levers, but it only pays off if your traffic has structure to exploit.
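The routing decision itself can be sketched in a few lines. This is a toy under stated assumptions, not how the vLLM Production Stack or AIBrix implement it: the router keeps its own approximate record of which prefix blocks each replica has seen, prefers the longest match, and breaks ties by load. `PrefixAwareRouter`, `Replica`, and the block size are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set

BLOCK = 16  # tokens per routing block; hypothetical granularity

def prefix_keys(tokens: List[int]) -> List[int]:
    """One key per full block; each key covers the whole prefix up to that block."""
    return [hash(tuple(tokens[:end])) for end in range(BLOCK, len(tokens) + 1, BLOCK)]

@dataclass
class Replica:
    name: str
    in_flight: int = 0
    seen: Set[int] = field(default_factory=set)  # router's guess at what this replica has cached

    def matched_blocks(self, keys: List[int]) -> int:
        count = 0
        for key in keys:
            if key not in self.seen:
                break
            count += 1
        return count

class PrefixAwareRouter:
    def __init__(self, replicas: List[Replica], max_in_flight: int = 64) -> None:
        self.replicas = replicas
        self.max_in_flight = max_in_flight

    def route(self, tokens: List[int]) -> Replica:
        keys = prefix_keys(tokens)
        # Skip saturated replicas unless every replica is saturated.
        open_replicas = [r for r in self.replicas if r.in_flight < self.max_in_flight]
        candidates = open_replicas or self.replicas
        # Longest cached prefix wins; ties go to the least-loaded replica.
        best = max(candidates, key=lambda r: (r.matched_blocks(keys), -r.in_flight))
        best.seen.update(keys)
        best.in_flight += 1
        return best

    def finish(self, replica: Replica) -> None:
        replica.in_flight -= 1

replicas = [Replica(f"r{i}") for i in range(4)]
router = PrefixAwareRouter(replicas)
turn_one = list(range(32))
first = router.route(turn_one)                         # cold everywhere, picks r0
other = router.route(list(range(1000, 1032)))          # unrelated request, picks idle r1
second = router.route(turn_one + list(range(32, 48)))  # follow-up returns to r0
```

The follow-up turn goes back to the replica that served the first turn even though that replica is busier, which is the behavior round-robin cannot give you. A production router also has to age out its view as replicas evict blocks; this sketch never forgets.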

Scaling and placement

The other control-plane jobs:

Autoscaling on the right signal. Scale on KV-cache utilization and queue depth, not CPU. That means exposing those as custom metrics (Prometheus plus a metrics adapter or KEDA) and scaling on them. CPU-based autoscaling is meaningless for LLM serving.
Scale-to-zero and cold start. If you serve many fine-tunes and most are idle, you cannot keep them all resident. You need to evict idle models and load them on demand, which means eating a cold start: pulling multi-gigabyte weights into GPU memory takes tens of seconds. Plain Kubernetes HPA does not scale to zero; KEDA or a Knative-style layer does.
Placement across heterogeneous GPUs. On a mixed fleet, deciding that this model needs an H100 while that one is fine on an L40S, and that a LoRA adapter should land where its base model already lives, is scheduling work the default Kubernetes scheduler does not do unless you teach it with node selectors, affinities, and GPU resource classes.
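The "right signal" point can be sketched as a scaling rule. The function below applies the same proportional formula the Kubernetes HPA uses, ceil(current * observed / target), to two LLM-specific metrics and takes the more demanding answer. The target values and replica bounds are placeholder assumptions to tune against your own traffic.

```python
import math

def desired_replicas(current: int, kv_cache_util: float, queued_per_replica: float,
                     target_kv_util: float = 0.75, target_queue: float = 4.0,
                     min_replicas: int = 1, max_replicas: int = 16) -> int:
    """Proportional scaling on KV-cache utilization and queue depth; the larger demand wins."""
    if current < 1:
        # Going from zero to one is an activation decision, not a ratio.
        raise ValueError("current must be at least 1")
    by_kv = math.ceil(current * kv_cache_util / target_kv_util)
    by_queue = math.ceil(current * queued_per_replica / target_queue)
    return max(min_replicas, min(max_replicas, max(by_kv, by_queue)))

scale_up = desired_replicas(current=4, kv_cache_util=0.9, queued_per_replica=2.0)    # 5
scale_down = desired_replicas(current=4, kv_cache_util=0.3, queued_per_replica=1.0)  # 2
```

Note what is absent: CPU never appears. A replica can sit at low CPU with a full KV cache and a growing queue, and this rule still scales it up.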

The highest-leverage idea here for anyone serving many fine-tunes: keep a handful of base models warm and hot-swap LoRA adapters rather than cold-loading full models. Adapters are megabytes, not gigabytes. This changes the cold-start economics entirely, and it is exactly the trick that a per-model autoscaling mental model misses, because it treats every fine-tune as an independent deployment to scale instead of an adapter to multiplex.
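One way to picture the multiplexing policy is a least-recently-used pool of adapters sitting next to one warm base model. The sketch below only tracks residency; `AdapterPool` is a hypothetical name, and the actual load and unload calls depend on your engine and are deliberately left out.

```python
from collections import OrderedDict
from typing import List

class AdapterPool:
    """LRU set of LoRA adapters kept resident alongside one warm base model."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._resident: "OrderedDict[str, None]" = OrderedDict()
        self.evicted: List[str] = []  # kept only so the example can show what was dropped

    def acquire(self, adapter_id: str) -> bool:
        """Return True on a warm hit, False if the adapter had to be loaded."""
        if adapter_id in self._resident:
            self._resident.move_to_end(adapter_id)  # mark as most recently used
            return True
        if len(self._resident) >= self.capacity:
            oldest, _ = self._resident.popitem(last=False)
            self.evicted.append(oldest)  # a real system would unload it from the engine here
        self._resident[adapter_id] = None  # a real system would load the adapter here
        return False

pool = AdapterPool(capacity=2)
hits = [pool.acquire(a) for a in ["tenant-a", "tenant-b", "tenant-a", "tenant-c"]]
```

In this run the third request is a warm hit, and loading `tenant-c` evicts `tenant-b`, the least recently used adapter. A miss here costs an adapter load measured in megabytes, not a full model cold start.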

Open source covers most of the control-plane mechanics. The policy (which adapters stay warm, how aggressively to scale, what lands on which GPU) is yours, because it depends on your traffic and your hardware.

The tenant plane

This is the layer that turns a fast inference server into a multi-tenant product, and it is the part no open-source component can hand you complete, because none of them know about your customers.

Metering

If you bill per token, you need accurate token counts. The right source is the engine itself: vLLM and SGLang return an OpenAI-style usage block with prompt, completion, and cached token counts. The engine counted the tokens it actually processed, so it is the authority. Do not re-tokenize at the gateway to count; you will use a different tokenizer version and bill a number that does not match reality.

The work is capturing that number reliably and joining it to a tenant. A few requirements make metering harder than logging:

Streaming. Most responses stream over server-sent events, and the usage block arrives at the end. Your metering point has to read the stream to completion to capture it, and has to handle client disconnects: the user incurred GPU cost for the tokens generated before they hung up, so capture partial usage on the abort path. With OpenAI-compatible engines you usually must force stream_options.include_usage=true, or you silently get null usage and under-bill.
Cached tokens are cheaper to serve. Prefix caching means a request served largely from cache cost you far less than its raw prompt-token count implies. A competitive price has at least three rates: fresh input, cached input (cheaper), and output (most expensive). Your metering record has to carry the cached-token count to support that.
Metering is money, so it must be durable and idempotent. Use the request ID as an idempotency key, emit events at-least-once off the hot path, and dedupe downstream. Losing events is lost revenue; double-counting is an angry customer.
Separate metering from rating. Capture raw token counts as immutable events. Apply price later. That way you can change prices, issue credits, or re-rate without re-capturing, and you keep an auditable trail of what happened versus what you charged.
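The capture requirements above can be sketched as one function that reads a stream to the end and one store that dedupes. This assumes OpenAI-style server-sent events where the final chunk carries `usage`, and that the cached count appears under `prompt_tokens_details.cached_tokens` as in the OpenAI schema; check the field names your engine actually emits. `UsageEvent`, `capture_usage`, and `MeteringLedger` are hypothetical names, and the in-memory dict stands in for a durable store.

```python
import json
from dataclasses import dataclass
from typing import Dict, Iterable

@dataclass(frozen=True)
class UsageEvent:
    request_id: str         # doubles as the idempotency key
    tenant_id: str
    prompt_tokens: int
    cached_tokens: int
    completion_tokens: int
    complete: bool          # False when the stream ended before the usage block arrived

def capture_usage(request_id: str, tenant_id: str, sse_lines: Iterable[str]) -> UsageEvent:
    usage = None
    content_chunks = 0
    for line in sse_lines:
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]
        content_chunks += len(chunk.get("choices") or [])
    if usage is None:
        # Abort path: chunk count only approximates tokens generated, and the
        # prompt size has to be filled in from another source.
        return UsageEvent(request_id, tenant_id, 0, 0, content_chunks, complete=False)
    details = usage.get("prompt_tokens_details") or {}
    return UsageEvent(
        request_id, tenant_id,
        usage.get("prompt_tokens", 0),
        details.get("cached_tokens") or 0,
        usage.get("completion_tokens", 0),
        complete=True,
    )

class MeteringLedger:
    """Stand-in for a durable store; dedupes at-least-once delivery by request ID."""

    def __init__(self) -> None:
        self._events: Dict[str, UsageEvent] = {}

    def record(self, event: UsageEvent) -> bool:
        if event.request_id in self._events:
            return False  # duplicate delivery, already counted
        self._events[event.request_id] = event
        return True
```

The event stores raw counts only and carries no price, which is what keeps rating a separate, repeatable step. A real ledger would also need a rule for replacing a partial event if a complete one for the same request arrives later.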

There is also a version of this that is for you, not the customer: capture which GPU type served each request. On a mixed fleet the same request costs you differently on an H100 than an L40S, and carrying the serving GPU in the metering record is the difference between billing revenue and understanding margin.
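Rating and margin can then be computed later from the stored counts. The sketch below is a minimal illustration: every price and GPU cost is invented, `rate` and `margin` are hypothetical helpers, and it assumes cached tokens are counted inside the prompt-token total. How you attribute GPU seconds to a single request in a batched engine is a policy choice this example takes as given.

```python
from decimal import Decimal

# Hypothetical prices in dollars per million tokens.
PRICE_PER_MILLION = {
    "fresh_input": Decimal("0.50"),
    "cached_input": Decimal("0.10"),
    "output": Decimal("1.50"),
}
# Hypothetical internal cost in dollars per attributed GPU-second.
GPU_COST_PER_SECOND = {"H100": Decimal("0.0010"), "L40S": Decimal("0.0003")}
MILLION = Decimal(1_000_000)

def rate(prompt_tokens: int, cached_tokens: int, completion_tokens: int) -> Decimal:
    """Price a request at three rates: fresh input, cached input, output."""
    fresh = prompt_tokens - cached_tokens  # assumes cached tokens are a subset of prompt tokens
    total = (fresh * PRICE_PER_MILLION["fresh_input"]
             + cached_tokens * PRICE_PER_MILLION["cached_input"]
             + completion_tokens * PRICE_PER_MILLION["output"])
    return total / MILLION

def margin(revenue: Decimal, gpu_type: str, gpu_seconds: Decimal) -> Decimal:
    """Revenue minus what the serving GPU cost you for this request."""
    return revenue - GPU_COST_PER_SECOND[gpu_type] * gpu_seconds

revenue = rate(prompt_tokens=1_000_000, cached_tokens=400_000, completion_tokens=200_000)
on_h100 = margin(revenue, "H100", Decimal(100))
on_l40s = margin(revenue, "L40S", Decimal(100))
```

With these made-up numbers the same usage earns 0.64 in revenue but leaves 0.54 of margin on an H100 and 0.61 on an L40S. `Decimal` is used so that money never passes through binary floating point.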

Quotas

Real-time spend enforcement (“cut this tenant off at the monthly cap”) and accurate billing are two different jobs. Enforcement needs a fast, approximate running counter you check before serving; it can be slightly wrong and self-correct. Billing needs the durable, reconciled event stream; it cannot. Do not try to serve both from one path.
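The enforcement side can be sketched as a small gate that is allowed to be wrong for a while. In this illustration the counter lives in process memory, whereas a real deployment would more likely keep it in a shared store; `QuotaGate` and its methods are hypothetical names. The durable ledger remains the source of truth and periodically overwrites the estimate.

```python
from typing import Dict

class QuotaGate:
    """Fast, approximate spend check performed before serving a request."""

    def __init__(self, monthly_caps: Dict[str, int]) -> None:
        self.caps = monthly_caps               # tenant -> token cap for the period
        self.approx_used: Dict[str, int] = {}  # running estimate; allowed to drift

    def allow(self, tenant_id: str) -> bool:
        # Unknown tenants have a cap of zero and are refused.
        return self.approx_used.get(tenant_id, 0) < self.caps.get(tenant_id, 0)

    def add_estimate(self, tenant_id: str, tokens: int) -> None:
        # Called on the hot path with whatever count is cheaply available.
        self.approx_used[tenant_id] = self.approx_used.get(tenant_id, 0) + tokens

    def reconcile(self, tenant_id: str, billed_tokens: int) -> None:
        # Called off the hot path with the total from the durable event stream.
        self.approx_used[tenant_id] = billed_tokens

gate = QuotaGate({"tenant-a": 1_000})
gate.add_estimate("tenant-a", 1_200)   # estimate overshoots the cap
blocked = not gate.allow("tenant-a")   # tenant is cut off
gate.reconcile("tenant-a", 900)        # ledger says real usage was lower
restored = gate.allow("tenant-a")      # estimate self-corrects and access resumes
```

The gate never writes to the ledger, and the ledger never sits on the request path, so neither job inherits the other's constraints.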

Logging

Tenants want to see what their models did, for debugging and audit. The capture point is the same one metering uses (it already sees the full request, the full streamed response, and the usage), so design them together off one interceptor. But the input/output log has different retention, privacy, and access rules than the metering ledger, and it carries the PII risk, so fork them after capture rather than treating them as one store.

The part that is irreducibly yours

No open-source gateway knows what a “tenant” is in your system. A proxy like LiteLLM gives you a lot of this as configuration (virtual keys, per-team budgets, spend tracking, logging callbacks), and a gateway like Kong has AI metering plugins. But all of them model identity as their own object (a key, a consumer, a team), and you still have to map that to your real tenants and keep the two in sync as tenants are created, suspended, and rotate keys. That mapping, plus the customer-facing usage and log views in your own UI, is the irreducible glue. It is integration, not invention, but it is yours.
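The sync itself is usually a reconciliation loop: compare what your tenant records say should exist with what the gateway currently holds, and emit the difference. The sketch below only computes the plan. `Tenant`, `plan_sync`, and the key-to-owner mapping are hypothetical shapes, and applying each step means calling whatever admin API your gateway exposes, which is not shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class Tenant:
    tenant_id: str
    active: bool
    key_ids: Set[str] = field(default_factory=set)

def plan_sync(tenants: List[Tenant], gateway_keys: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return sorted (action, key_id) steps that make the gateway match your tenant records.

    gateway_keys maps each key ID present at the gateway to the tenant it is filed under.
    """
    # Only active tenants should have live keys; suspension revokes everything.
    wanted = {key: t.tenant_id for t in tenants if t.active for key in t.key_ids}
    steps: List[Tuple[str, str]] = []
    for key, owner in wanted.items():
        if key not in gateway_keys:
            steps.append(("create", key))
        elif gateway_keys[key] != owner:
            steps.append(("reassign", key))
    for key in gateway_keys:
        if key not in wanted:
            steps.append(("revoke", key))
    return sorted(steps)

tenants = [
    Tenant("acme", active=True, key_ids={"k1", "k2"}),
    Tenant("globex", active=False, key_ids={"k3"}),   # suspended
]
gateway = {"k1": "acme", "k3": "globex", "k9": "acme"}  # k9 was rotated out
plan = plan_sync(tenants, gateway)
```

Here the plan creates the missing `k2` and revokes both the suspended tenant's `k3` and the stale `k9`. Computing the plan as data before applying it makes the sync easy to log, dry-run, and retry.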

Putting it together

A reasonable maximal-open-source assembly looks like this:

client
  -> ingress / API front door     (TLS, auth; you likely already run one)
  -> LLM gateway                  (token metering, token rate limits, OpenAI API shape, logging hooks)
  -> inference router             (KV-aware and LoRA-aware routing across replicas)
  -> vLLM replicas
  + KEDA                          (scale-to-zero, scale on custom metrics)
  + Prometheus / Grafana          (metrics, the autoscaling signal, dashboards)
  + an event sink                 (durable store for metering events and I/O logs)

What that buys you, honestly:

Plane	Open-source coverage	What you still own
Data plane	Essentially complete	Streaming and timeout configuration end to end
Control plane	High	Metric wiring, warm-pool and scaling policy
Tenant plane (capture)	Most of it	Mapping the gateway’s identity to your tenants; the customer-facing UI
Tenant plane (your domain)	None, by definition	The sync between your tenants and the gateway, and the billing join

Two cautions worth stating plainly. First, “no build” is not “no work”: you are now operating a gateway, a router, KEDA, Prometheus, and an event store, which is a real operational surface. Sometimes a thin purpose-built interceptor is less total burden than integrating five systems. Second, this whole picture is gated on traffic shape. If your traffic is short, unique, single-turn completions, there is little prefix to reuse, the smart router earns little, and a plain load balancer over autoscaled replicas is genuinely enough. If your traffic is multi-turn, agentic, shares system prompts, or runs many fine-tunes off shared bases, the router and the LoRA multiplexing do most of the work.

The takeaway

The data plane is commodity, and trying to out-engineer vLLM or NVIDIA on kernels is a losing move. The control plane and tenant plane are where a serving product is actually built, because they are the parts that depend on your tenants, your traffic, and your hardware, and no component ships knowing those. If you are deciding where to spend engineering effort, spend it above the engine.

Part 2 picks up where this leaves off: “Where NVIDIA Dynamo Fits in an Inference Stack” covers the component that actually does the prefix-aware routing and prefill/decode disaggregation described here, how you configure it per model type, and where the payoff is gated on your hardware.