“AI engineer” now covers two jobs that share a vocabulary but very little tooling. One builds applications on top of models: agents, retrieval, prompts, evals. The other builds the models themselves, or adapts them: fine-tuning, serving, scaling inference. A tool that is central to one job is often irrelevant to the other.
This post maps the landscape into categories rather than ranking products head to head. For each category it covers the core problem, the features that define a serious entry, and how the leading products differentiate. The point is to know where a given tool fits, so that when a new one launches next quarter you can place it without re-reading ten comparison posts.
The map splits cleanly into two halves. The application side is everything between a hosted model API and your users. The model side is everything involved in producing, adapting, and serving the weights. We cover the application side first because more engineers work there.
Part 1: The Application Side
Agent frameworks
The problem: A single model call is stateless and one-shot. An agent needs to loop: call a model, run a tool, feed the result back, decide what to do next, and repeat until done. Doing this by hand means writing your own control flow, state management, retry logic, and tool-dispatch layer. Agent frameworks provide that scaffolding.
Core features that define a serious framework: tool/function calling abstractions, multi-step control flow (often a graph or state machine), persistence so a run can pause and resume, multi-agent coordination, and human-in-the-loop interrupts. The dividing line in 2026 is between frameworks that give you an explicit, inspectable control graph and those that hide coordination behind higher-level roles.
| Framework | Model | Differentiator |
|---|---|---|
| LangGraph | Explicit state graph | State persistence, conditional routing, rollback; the default for complex stateful workflows |
| CrewAI | Role/task crews | Readable “team of specialists” model where you define roles and goals, not graph edges |
| AutoGen | Conversational agents | Multi-agent conversation patterns, good for experiments |
| LlamaIndex | RAG-first agents | Retrieval-grounded agents where the data layer is the center |
| Claude Agent SDK / vendor SDKs | Vendor-native | Tight coupling to one provider’s model and tool ecosystem |
The 2026 shift is consolidation of the crowded middle. Enterprise teams that need real state and rollback have moved from CrewAI to LangGraph, while teams that just want to ship against one provider reach for vendor SDKs (Anthropic’s Claude Agent SDK, the Vercel AI SDK, Pydantic AI). CrewAI holds the cases where readable role definitions matter more than graph-level control. If you are choosing today: pick the explicit graph when failures are expensive and you need to inspect every transition, and pick a vendor SDK when you are committed to one model family and want less surface area.
RAG and retrieval frameworks
The problem: Models do not know your private data and have a fixed knowledge cutoff. Retrieval-augmented generation fetches relevant documents at query time and injects them into the prompt. The framework handles ingestion, chunking, embedding, retrieval, reranking, and assembling the final context.
Core features: document loaders and chunking strategies, embedding-model integration, hybrid search (dense vectors plus keyword/BM25), reranking, and a query engine that ties it together. In 2026, hybrid retrieval and reranking are the production baseline, not optional extras. On most real corpora, dense-plus-sparse retrieval scores noticeably higher than pure vector search.
The two anchors are LlamaIndex and Haystack, and they differ on philosophy more than features. LlamaIndex optimizes for the shortest path from documents to a working query engine: point it at a folder of PDFs and get a queryable index in a few lines. Haystack optimizes for explicit, auditable pipelines where each stage (process, retrieve, rerank, generate) is a named component you can inspect, which is what regulated industries want. LangChain also lives here, but in practice teams pair it (orchestration) with LlamaIndex (ingestion and retrieval) rather than choosing between them.
Vector databases
The problem: RAG and any semantic-search feature need to find the nearest vectors to a query embedding across millions of records, fast, with metadata filtering. General-purpose databases do this poorly. Vector databases are built around approximate nearest-neighbor (ANN) indexes.
Core features: an ANN index (usually HNSW), metadata filtering that stays fast under load, hybrid search, multi-tenancy, and a managed-versus-self-hosted deployment story. The 2026 market has consolidated around four serious products plus pgvector for teams who want to stay inside Postgres.
| Database | Position | Differentiator |
|---|---|---|
| Pinecone | Fully managed | Zero operational overhead, serverless pricing; pay for simplicity |
| Qdrant | Self-host or managed | Best price-performance; low p50 latency; strong filtered search |
| Weaviate | Open-source, enterprise | Native hybrid search, built-in multi-tenancy, GraphQL API |
| Chroma | Developer-first | Easiest local start; great under ~1M vectors, less so at extreme scale |
| pgvector | Postgres extension | No new system to operate if you already run Postgres |
The decision usually comes down to operational appetite. A small team that wants nothing to manage takes Pinecone. A team comfortable running a VPS gets Qdrant or self-hosted Weaviate at a fraction of the cost. Filtered search performance (how much latency degrades when you add metadata filters) is the benchmark that separates these more than raw recall, since real queries almost always filter.
LLM gateways
The problem: Production apps call multiple model providers, need to fail over when one is down, want to track spend per team, and need a single place to enforce rate limits and access control. Hard-coding one provider’s SDK throughout your codebase makes all of this painful.
Core features: a unified, usually OpenAI-compatible API across many providers, routing and fallback, caching, spend tracking and budgets, and access control. The split is between open-source self-hosted proxies and managed services.
- LiteLLM is the open-source default: a Python proxy speaking an OpenAI-compatible API across 100+ providers, self-hosted so you keep full control and avoid lock-in.
- OpenRouter is the managed counterpart: one API key and a prepaid credit system that covers every provider, no infrastructure.
- Portkey bundled a gateway with observability and governance; note that Palo Alto Networks acquired it in April 2026 and folded it into their AI security platform, which changes its positioning for independent teams.
- Braintrust ships a gateway wired directly into its eval and observability platform, which matters if you want routing and quality measurement in one tool.
Pick LiteLLM for control, OpenRouter for zero-ops breadth, and a bundled platform when you want the gateway to feed an eval or governance workflow you are already using.
Observability and evaluation
The problem: LLM apps fail in ways traditional monitoring cannot see. A response can be syntactically fine and semantically wrong. You need to trace multi-step agent runs, inspect token usage and latency per step, and evaluate output quality systematically rather than by eyeballing samples.
Core features: tracing of nested calls (especially agent execution graphs), token and cost accounting, evaluation primitives (LLM-as-judge, dataset scoring, regression tests), and prompt management. Six platforms anchor the category, and they differ on lock-in, deployment, and eval rigor.
| Platform | Strength | Trade-off |
|---|---|---|
| LangSmith | Detailed tracing for LangChain/LangGraph: node-by-node state diffs, replay against new models | Pays off most when you already use LangChain |
| Langfuse | Open-source leader, self-hostable (MIT), free to run | You operate it |
| Arize Phoenix | ML-grade eval rigor from observability heritage | Heavier than a drop-in proxy |
| Helicone | Drop-in proxy: change one base URL, get traces | Less depth than SDK-integrated tools |
The decision axes are framework lock-in (LangSmith pays off most if you are all-in on LangChain), deployment model (Langfuse and Phoenix self-host cleanly), and how seriously you take evals. Helicone is the lowest-effort way to start, since it captures traces at the proxy with no SDK changes. Most teams end up running one tracing tool and one eval workflow, and the question is whether one product can do both.
Part 2: The Model Side
The application side treats the model as a fixed API. The model side is for teams that adapt weights, run their own inference, or both. The tooling is more infrastructure-heavy and the failure modes are about GPUs, memory, and throughput rather than prompt quality.
Fine-tuning and training frameworks
The problem: A base model is general. Adapting it to your domain, format, or task means continued training, usually with parameter-efficient methods (LoRA, QLoRA, DoRA) so you do not need to update every weight. These frameworks wrap the training loop, distributed-training backends, and PEFT methods into something configurable.
Core features: PEFT method support, distributed-training backends (FSDP, DeepSpeed), config-driven pipelines, memory optimization, and increasingly full RLHF/preference-tuning stages. Three frameworks lead, and they sort by scale.
| Framework | Best for | Differentiator |
|---|---|---|
| Unsloth | Single GPU | ~2x faster training, large VRAM savings; speed on consumer hardware |
| Axolotl | Multi-GPU clusters | YAML-driven pipelines, FSDP/DeepSpeed, full RLHF |
| TorchTune | PyTorch-native teams | Lean, abstraction-free, just PyTorch you can read and extend |
The choice maps almost directly to your hardware and how much you want to see. On a single GPU where iteration speed matters, Unsloth wins on wall-clock time and memory. On a multi-node cluster running preference tuning, Axolotl’s config-driven pipelines and distributed backends are built for it. TorchTune is for teams who want to own and modify the training code without fighting a framework’s abstractions. Many teams use TRL underneath for advanced training objectives regardless of which wrapper they pick.
This is also where the line between “tool” and “infrastructure” blurs: the framework is the easy part. Coordinating multi-node training (setting up the environment variables for torchrun or DeepSpeed across nodes, handling failures mid-run, tracking which job belongs to which user) is the operational work that none of these frameworks solve on their own.
Inference and serving engines
The problem: Serving an open-weights model in production is not model.generate() in a loop. You need high throughput across concurrent requests, low latency, and efficient GPU memory use. The serving engine handles batching, KV-cache management, and hardware-specific optimization.
Core features: continuous batching, paged KV-cache attention, quantization support, and hardware targeting. Three engines matter in 2026, and TGI, the former Hugging Face default, entered maintenance mode at the end of 2025 (Hugging Face’s own endpoints now default to vLLM).
| Engine | Best for | Differentiator |
|---|---|---|
| vLLM | General default | Fast start, broad model and hardware support, largest community |
| SGLang | Shared-prefix workloads | RadixAttention gives real gains when requests share long prefixes; has matched or beaten vLLM on throughput |
| TensorRT-LLM | Stable NVIDIA workloads | Best raw throughput and latency, at the cost of a long compilation step |
The trade-off is start-up flexibility versus peak performance. vLLM is the right default: it starts fast, runs almost any model, and supports non-NVIDIA hardware. SGLang is the pick when your workload has heavy shared prefixes (many requests sharing a long system prompt or few-shot preamble), where RadixAttention’s prefix caching pays off. TensorRT-LLM gives the best numbers if your model is stable enough to absorb a multi-minute compile step per change and you are locked to NVIDIA.
AI coding tools
These cross both halves, since the AI engineers building everything above increasingly use AI agents to do it. The category split into three shapes in 2026:
- Inline suggestion and chat (GitHub Copilot’s origin): fast for small edits, the lowest entry price, tightest GitHub integration.
- Autonomous coding agents (Claude Code, OpenAI Codex, Kiro): plan, execute, and verify whole features in one run.
- Agent-integrated IDEs (Cursor, Windsurf, Google Antigravity): full editors with parallel agents and, in Windsurf’s case, a cloud agent for one-click delegation to a remote VM.
The differentiation here moves week to week on benchmarks and pricing, so treat any specific ranking as perishable. What holds steady is the three-way split: whether you want suggestions in your existing editor, a terminal agent that owns whole tasks, or an IDE built around agents.
Where the categories meet
Two patterns are worth noting.
First, the categories are merging at the platform level. Gateways now ship observability, observability tools now ship gateways, and agent frameworks now include retrieval. A 2026 tool decision is increasingly “which bundle” rather than “which point tool,” and the risk is buying a bundle for one strong component and inheriting four mediocre ones.
Second, every model-side category has an operational layer the frameworks do not cover. Fine-tuning frameworks do not coordinate multi-node training or track cost per user. Serving engines do not handle idle-GPU detection or usage attribution. This is the undifferentiated infrastructure every AI team rebuilds: dev environments, job orchestration with failure handling, usage tracking, SSO. It is the 90% that does not differentiate your business, and it is the layer Saturn Cloud provides so your team can spend its time on the 10% that does.
Sources for the product details and category trends in this post: pecollective on agent frameworks, DataCamp on vector databases, Yotta Labs on inference engines, Spheron on fine-tuning frameworks, Braintrust on LLM gateways, digitalapplied on agent observability, and artificialanalysis on coding agents.

