The Open Source AI Framework Landscape in 2026: A Map for AI Engineers

A list of specific OSS projects goes stale within a quarter. The categories do not. If you know what problem each category solves and the axes along which the projects in it differ, you can place any new tool on the map and evaluate it quickly.

This post is organized that way. It splits the ecosystem into the application side (building agents and LLM-powered apps) and the model side (serving, training, fine-tuning, and post-training), and covers each category with the leading projects and the differences between them. The focus is architecture and trade-offs rather than feature lists.

One caveat that affects how you should read the rest: most “open source” projects here are VC-backed open-core. The framework is open, but a company sells a hosted tier and the roadmap follows that product. The projects under neutral governance are the exception, and they are worth knowing for that reason. We flag them where they come up.

Part 1: The application side (agents and LLM apps)

Agent frameworks and orchestration

This is the most crowded category. The useful way to cut it is by control model rather than by vendor. There are three:

Graph-based: you define agent behavior as an explicit state machine. Nodes do work, edges route, and state persists. This gives the most control at the cost of more upfront structure.
Role-based: you define agents by role and goal, then assemble them into teams. Less code, but less control over the exact execution path.
Minimal / typed: a thin layer over the model API with a few primitives such as tools, handoffs, and structured output. Built for engineers who distrust heavy abstractions.
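
To make the graph-based model concrete, here is a minimal sketch in plain Python that is not tied to any framework. The names (run_graph, draft, review, route_after_review) are hypothetical and chosen only for illustration. Real frameworks layer persistence, streaming, and interrupts on top of this basic shape.

```python
from typing import Callable, Dict, Optional

State = Dict[str, object]
Node = Callable[[State], State]
Router = Callable[[State], Optional[str]]  # returns the next node name, or None to stop


def draft(state: State) -> State:
    # Node: does work and returns an updated copy of the state.
    attempts = int(state.get("attempts", 0)) + 1
    return {**state, "attempts": attempts, "text": f"draft v{attempts}"}


def review(state: State) -> State:
    # Node: approves once the draft has been produced at least twice.
    return {**state, "approved": int(state["attempts"]) >= 2}


def route_after_review(state: State) -> Optional[str]:
    # Conditional edge: loop back to "draft" until the review approves.
    return None if state["approved"] else "draft"


NODES: Dict[str, Node] = {"draft": draft, "review": review}
EDGES: Dict[str, Router] = {
    "draft": lambda state: "review",
    "review": route_after_review,
}


def run_graph(start: str, state: State, max_steps: int = 20) -> State:
    current: Optional[str] = start
    for _ in range(max_steps):
        if current is None:
            return state
        state = NODES[current](state)
        current = EDGES[current](state)
    raise RuntimeError("graph did not terminate within max_steps")


final_state = run_graph("draft", {})
```

Because the state is an explicit value passed between nodes, saving it after each step is what would make checkpointing and resumption possible in a fuller implementation.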

Framework	Control model	Language	Maintainer	Notes
LangGraph	Graph-based	Python / JS	LangChain Inc.	The orchestration substrate others build on
LangChain	Abstraction layer	Python / JS	LangChain Inc.	Re-architected onto the LangGraph runtime
LlamaIndex	Retrieval-centric	Python / TS	LlamaIndex Inc.	Data and document workflows
CrewAI	Role-based	Python	CrewAI Inc.	Crews of role-defined agents
OpenAI Agents SDK	Minimal	Python / TS	OpenAI	Agents, handoffs, guardrails
Pydantic AI	Minimal, typed	Python	Pydantic	Type-first, “FastAPI for agents”
Smolagents	Minimal, code-exec	Python	Hugging Face	Agents act by writing and running code
Google ADK	Graph-based	Python / Java / TS	Google	GCP-native, tight Gemini integration
Mastra	Workflow + agents	TypeScript	Mastra Inc.	The leading TS-first framework

LangGraph has become the default substrate for serious agent work. It models an agent as a directed graph with conditional edges and provides the features you need in production: durable execution, checkpointing so a run can resume from saved state, streaming, and human-in-the-loop interrupts. Use it when an agent needs to loop, branch, and survive restarts.

LangChain itself did not die, but its scope shrank. The 1.0 release deprecated the old AgentExecutor and chain abstractions, moved the new create_agent entry point on top of the LangGraph runtime, and split legacy code into a separate langchain-classic package. So this is less “LangGraph killed LangChain” than “the same company consolidated everything onto the LangGraph runtime and slimmed LangChain down to thin agent and integration helpers.” The two ship from the same company and are designed to be used together. One thing worth knowing: LangChain’s reputation for leaky abstractions drove a “skip the framework, call the SDK directly” backlash, which is part of why the minimal camp grew.

CrewAI is the canonical role-based option. You define agents with a role and a goal, group them into a crew, and express a multi-agent workflow in far less code than raw orchestration would take. It is strong for structured business-process automation and weaker when you need fine-grained control over the execution path. Open-core, with a hosted deployment platform on top.

OpenAI Agents SDK and Pydantic AI anchor the minimal camp. The OpenAI SDK exposes three primitives (agents, handoffs for delegation, and guardrails) plus tracing, and it is the natural default for teams already on the OpenAI stack. Pydantic AI is the type-first option from the Pydantic team: rigorous typed inputs and outputs with validation baked in, which appeals to engineers who want the model boundary to behave like any other typed interface.

Smolagents (Hugging Face) takes a different approach: its agents act by writing and executing Python code rather than emitting JSON tool calls. It is a niche tool rather than a market leader, but the code-execution model is a distinct design point.

Mastra is the one to know if you work in TypeScript. It bundles agents, workflows, and RAG in one framework and has become the leading TS-first option, with the Vercel AI SDK as its main rival there. The agent world was effectively Python-only for a while, and that is no longer the case.

A note on AutoGen, since it featured heavily in older comparisons: it has fragmented. The original creators maintain a community fork (AG2), while Microsoft moved its own effort into the Microsoft Agent Framework, which merges AutoGen’s multi-agent ideas with Semantic Kernel’s enterprise tooling. The “AutoGen” name no longer points at one thing, so evaluate the successors directly rather than the original.

The protocol layer: MCP and A2A

This is the biggest structural change in the application stack, and the standardization is real rather than aspirational.

MCP (Model Context Protocol), introduced by Anthropic in late 2024, standardizes the agent-to-tool and agent-to-data layer: how an agent reads files, calls functions, and pulls in context. Through 2025 it was adopted across OpenAI, Google, and Microsoft, plus the major coding tools. It has effectively won the tool-integration layer, and today you build tool integrations as MCP servers.
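
As a rough illustration of what a tool call looks like on the wire, the sketch below builds and answers one JSON-RPC 2.0 message using only the standard library. The message shapes (the tools/call method, the name and arguments parameters, and the content list in the result) reflect my reading of the MCP specification and should be checked against the current version. The add tool, the TOOLS registry, and the handle function are hypothetical, and this is not how the official SDKs are used.

```python
import json
from typing import Any, Callable, Dict


def make_tool_call(request_id: int, tool: str, arguments: Dict[str, Any]) -> str:
    # JSON-RPC 2.0 request asking a server to run one tool.
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


# Hypothetical tool registry standing in for a server's tool implementations.
TOOLS: Dict[str, Callable[..., str]] = {
    "add": lambda a, b: str(a + b),
}


def handle(raw_request: str) -> str:
    request = json.loads(raw_request)
    params = request["params"]
    text = TOOLS[params["name"]](**params["arguments"])
    # The result carries a list of content items; here, a single text item.
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request["id"],
        "result": {"content": [{"type": "text", "text": text}], "isError": False},
    })


response = json.loads(handle(make_tool_call(1, "add", {"a": 2, "b": 3})))
```

The point of the standard is that any client able to produce the request can use any server able to answer it, regardless of which framework sits on either side.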

A2A (Agent-to-Agent), introduced by Google, handles the problem MCP explicitly leaves out: how independent agents discover and coordinate with each other. The two are complementary. A realistic production system runs MCP for tools and A2A for inter-agent communication.

What makes this durable is the governance. Both protocols now sit under the Linux Foundation rather than a single vendor: MCP was donated to the Agentic AI Foundation, which is backed by the major labs and clouds (OpenAI, Anthropic, Google, Microsoft, AWS, and others), and A2A was donated by Google as a Linux Foundation project. Neutral governance is why the interoperability question is largely settled instead of being contested between vendors. These are among the few projects in this post that are not single-company open-core.

Retrieval (RAG) frameworks and vector databases

Two sub-categories: the frameworks that orchestrate retrieval, and the databases that store and search the vectors.

On the framework side, the two durable choices are LlamaIndex and Haystack. LlamaIndex is retrieval-first and has deepened its indexing, context-assembly, and document-intelligence primitives. Haystack (from deepset, Apache-2.0) takes an explicit-pipeline approach: retrievers, rankers, and generators wired into an inspectable, serializable graph. Haystack’s differentiator is operability, since the pipelines are designed to be deployed, monitored, and run in production. If RAG is one feature of a larger app, LlamaIndex tends to fit. If RAG is the product and has to be operated as a service, Haystack’s pipeline model is the better fit.

On the database side, the practical selection heuristic in 2026 is blunt:

Database	Form	License	Pick it when
pgvector	Postgres extension	OSS (PostgreSQL)	You already run Postgres and are under a few million vectors
Qdrant	Standalone (Rust)	Apache-2.0	You need a dedicated store; strong payload filtering
Weaviate	Standalone	OSS (BSD-3)	You want built-in vectorization modules
Milvus	Distributed	Apache-2.0	You are at very large (billion-vector) scale
Chroma	Embedded / light	Apache-2.0	Prototyping a new RAG project
LanceDB	Embedded (columnar)	Apache-2.0	In-process, local, or multimodal data

The short version most teams converge on: pgvector if you already have Postgres, Qdrant if you do not, Milvus only when scale actually demands a distributed system. Chroma is the prototype default, but teams usually plan a migration once filtering needs or dataset size grow. One governance note: Milvus sits under Linux Foundation (LF AI and Data) governance and pgvector rides on Postgres, while the rest are single-vendor open-core with a hosted cloud attached.
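
The heuristic is simple enough to write down as a function. This is a sketch of the rule of thumb above, not a benchmark-backed recommendation. The function name pick_vector_store and both thresholds are assumptions made for illustration.

```python
def pick_vector_store(
    has_postgres: bool,
    vector_count: int,
    prototyping: bool = False,
) -> str:
    # Thresholds are illustrative assumptions, not figures from any vendor.
    distributed_scale = 1_000_000_000  # "billion-vector" scale
    pgvector_comfort = 5_000_000       # "a few million vectors"

    if vector_count >= distributed_scale:
        return "Milvus"
    if prototyping:
        return "Chroma"
    if has_postgres and vector_count <= pgvector_comfort:
        return "pgvector"
    return "Qdrant"
```

For example, pick_vector_store(True, 1_000_000) returns "pgvector", while the same dataset without Postgres returns "Qdrant".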

LLM gateways and routing

When you call more than one model provider, you want a single API surface, centralized spend tracking, and key management in one place. The key distinction for an infra team is self-hosted proxy versus hosted marketplace.

LiteLLM is the self-hosted default: a Python SDK plus a proxy server that exposes one OpenAI-compatible API across a hundred-plus providers, with routing, spend tracking, and key management you own and run. Portkey is the hybrid option, an open-source gateway plus a managed tier that adds semantic caching, guardrails, and audit trails for enterprise use. OpenRouter is frequently mentioned alongside these, but it is a hosted SaaS marketplace, not something you self-host, so it belongs in a different mental bucket: zero infra to run, at the cost of routing your traffic through a third party.
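
The core job of a gateway fits in a few lines. The sketch below shows one call surface in front of several providers, with spend recorded centrally. It is a toy under stated assumptions and not LiteLLM's or Portkey's API: the Gateway class, the "provider/model" naming convention, the prices, and the fake providers are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

# A provider takes a prompt and returns (completion_text, tokens_used).
Provider = Callable[[str], Tuple[str, int]]


@dataclass
class Gateway:
    providers: Dict[str, Provider]
    price_per_1k_tokens: Dict[str, float]
    spend_by_team: Dict[str, float] = field(default_factory=lambda: defaultdict(float))

    def complete(self, team: str, model: str, prompt: str) -> str:
        # Model names use a "provider/model" convention (an assumption of this sketch).
        provider_name = model.split("/", 1)[0]
        text, tokens = self.providers[provider_name](prompt)
        self.spend_by_team[team] += tokens / 1000 * self.price_per_1k_tokens[provider_name]
        return text


# Fake providers so the sketch runs without network access or API keys.
def fake_provider_a(prompt: str) -> Tuple[str, int]:
    return f"A:{prompt}", 500


def fake_provider_b(prompt: str) -> Tuple[str, int]:
    return f"B:{prompt}", 2000


gateway = Gateway(
    providers={"a": fake_provider_a, "b": fake_provider_b},
    price_per_1k_tokens={"a": 1.0, "b": 0.5},
)
first = gateway.complete("search-team", "a/small", "hi")
second = gateway.complete("search-team", "b/large", "hi")
```

Callers never touch provider credentials or pricing; both live in the gateway, which is also why the gateway is such a sensitive component.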

One point relevant to DevOps: this layer sits directly in your credential path, which makes it a supply-chain target. LiteLLM had malicious code published in two point releases on PyPI in early 2026, followed quickly by a clean release. The takeaway is not to avoid the tool. It is to pin and verify dependencies for anything that can see your provider keys and cloud credentials.
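
Verification here means comparing what you downloaded against a digest you recorded when you reviewed the release. A minimal sketch with the standard library follows; the byte strings stand in for a real package file, and the function names are hypothetical.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, pinned_sha256: str) -> bool:
    # Constant-time comparison of the computed digest against the pinned one.
    return hmac.compare_digest(sha256_hex(data), pinned_sha256.lower())


# Stand-in bytes for a downloaded wheel; a real check would read the file from disk.
trusted_release = b"contents of the release you reviewed"
pinned = sha256_hex(trusted_release)

tampered_release = trusted_release + b" plus injected code"
```

In practice you rarely write this yourself: pip's hash-checking mode (the --require-hashes option with hashes listed in the requirements file) applies the same check at install time.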

Observability and evaluation

The defining axis here is open-source-and-portable versus proprietary-and-framework-locked. The industry is converging on OpenTelemetry (and OpenInference) as the shared trace format, which favors the portable tools.

Langfuse (MIT, self-hostable, OpenTelemetry-native, framework-agnostic) is the open-source favorite for tracing, evals, and prompt management. Self-hosting is a primary supported mode rather than an afterthought, which matters when inputs and outputs must not leave your infrastructure.
Phoenix (Arize) is built on OpenTelemetry/OpenInference and is strong for agent and LLM trace troubleshooting. Its source is public, but note that the license is source-available (Elastic License v2) rather than OSI-approved open source.
LangSmith is the LangChain company’s product. It auto-traces LangChain and LangGraph with almost no setup, which is its strength and its lock-in. Be precise here: LangSmith is proprietary, not open source.

For evaluation specifically, the common open-source stack is promptfoo for CI-style prompt testing and red-teaming, DeepEval for unit-test-style assertions on outputs (“Pytest for LLMs”), and Ragas for RAG-specific metrics (faithfulness, context precision and recall). These are components, not platforms, and they feed results into an observability tool like Langfuse or Phoenix.
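
To show what the retrieval metrics measure, here are simplified, set-based versions of context precision and recall. These are illustrative definitions only and not the Ragas implementations, which are more elaborate and typically rely on a model as judge. The chunk IDs are made up.

```python
from typing import List, Set


def context_precision(retrieved: List[str], relevant: Set[str]) -> float:
    # Share of retrieved chunks that are actually relevant.
    if not retrieved:
        return 0.0
    return sum(1 for chunk_id in retrieved if chunk_id in relevant) / len(retrieved)


def context_recall(retrieved: List[str], relevant: Set[str]) -> float:
    # Share of relevant chunks that the retriever managed to return.
    if not relevant:
        return 1.0
    return len(relevant.intersection(retrieved)) / len(relevant)


retrieved_ids = ["doc-1", "doc-4", "doc-7", "doc-9"]
relevant_ids = {"doc-1", "doc-7", "doc-8"}

precision = context_precision(retrieved_ids, relevant_ids)
recall = context_recall(retrieved_ids, relevant_ids)
```

Here two of the four retrieved chunks are relevant (precision 0.5) and two of the three relevant chunks were found (recall about 0.67). Wrapping assertions on numbers like these in a test suite is the unit-test style the evaluation tools encourage.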

Guardrails and structured output

Two problems that get conflated. One is making the model return data your code can parse. The other is keeping a conversation inside policy.

For structured output, Instructor is the pragmatic default: it patches the LLM client, uses Pydantic (or equivalent) models as the schema, and validates and retries on failure. Outlines takes a different approach, constraining the model’s decoding at the token level so the output is structurally valid by construction rather than validated after the fact. Outlines is the better fit when you control the model weights (local or open models); Instructor fits when you are calling a hosted API. Note that native structured-output modes in the major provider APIs, and typed frameworks like Pydantic AI, have absorbed the simpler use cases.
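
The validate-and-retry pattern can be sketched without any library. This is not Instructor's API: it uses a dataclass and hand-written checks where Instructor would use a Pydantic model, and fake_model, Invoice, and extract_with_retry are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Invoice:
    customer: str
    total_cents: int


def parse_invoice(raw: str) -> Invoice:
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("customer"), str):
        raise ValueError("customer must be a string")
    total = data.get("total_cents")
    if isinstance(total, bool) or not isinstance(total, int):
        raise ValueError("total_cents must be an integer")
    return Invoice(customer=data["customer"], total_cents=total)


def extract_with_retry(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> Invoice:
    errors: List[str] = []
    for _ in range(max_attempts):
        # Feed earlier validation errors back so the model can correct itself.
        feedback = "".join(f"\nPrevious error: {e}" for e in errors)
        try:
            return parse_invoice(call_model(prompt + feedback))
        except ValueError as exc:
            errors.append(str(exc))
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {errors}")


# Fake model: returns prose first, then valid JSON once it sees an error in the prompt.
def fake_model(prompt: str) -> str:
    if "Previous error" in prompt:
        return '{"customer": "Acme", "total_cents": 1250}'
    return "Sure! The total is $12.50."


invoice = extract_with_retry(fake_model, "Extract the invoice as JSON.")
```

Constrained decoding removes the retry loop entirely, because invalid tokens are never sampled, but it requires access to the decoding step, which hosted APIs generally do not give you.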

For policy and safety, Guardrails AI validates output content and structure (PII, format, content rules) through a library of pre-built validators, while NeMo Guardrails (NVIDIA) controls conversation flow and dialog policy through a dedicated DSL. They solve different layers, and a common pattern is to run NeMo for conversation policy and Guardrails AI for output validation in the same system.

Part 2: The model side (serving, training, fine-tuning, post-training)

A note on benchmarks before the tables: every “X is 30% faster than Y” figure in this half of the ecosystem is specific to a workload, model, hardware setup, and version, and most circulating numbers come from the winning project’s own benchmarks. The architectural differences below hold up over time; the percentages do not, so this section mostly leaves them out.

Inference and serving engines

The production GPU-serving race has consolidated. vLLM is the default, SGLang is the credible challenger, TensorRT-LLM is the NVIDIA-locked performance ceiling, and the previous-generation servers have largely stepped back.

Engine	Maintainer	Role	Key idea
vLLM	PyTorch Foundation project	Default production engine	PagedAttention; widest hardware and model support
SGLang	sgl-project community	Challenger, strong on agentic/RAG	RadixAttention (automatic prefix-cache reuse)
TensorRT-LLM	NVIDIA	Performance ceiling, niche	Compiler-optimized FP8/FP4 on NVIDIA only
llama.cpp	ggml community	Local / CPU / edge standard	C/C++ engine, GGUF format
Ollama	Ollama Inc.	Local dev default	Wraps llama.cpp, runs in minutes
KServe	CNCF (graduated)	Kubernetes orchestration	Wraps engines; autoscaling, scale-to-zero, canary
Ray Serve	Anyscale	Multi-model orchestration	Typed service calls, multi-node, queue-based autoscaling

The concepts that actually distinguish these engines:

PagedAttention (vLLM’s original breakthrough) manages the KV cache the way an OS manages virtual memory: the cache is split into fixed-size blocks that are allocated on demand and need not be contiguous. It cuts the memory fragmentation that otherwise wastes GPU RAM, which lets you serve more concurrent requests on the same card.
RadixAttention (SGLang) stores the KV cache in a radix tree so that shared prefixes (a long system prompt, few-shot examples, multi-turn history, RAG context) are detected and reused automatically. This is why SGLang shows its biggest wins on prefix-heavy and agentic workloads and roughly matches vLLM on plain single-turn generation.
Continuous batching (all the serious engines do it) swaps finished sequences out and new ones in every decode step instead of waiting for a whole batch to finish. This is table stakes now, not a differentiator.
Disaggregated prefill/decode runs the compute-bound prefill phase and the memory-bandwidth-bound decode phase on separate GPU pools so you can tune time-to-first-token and inter-token latency independently. This is the frontier of production serving, and it is enabling a new Kubernetes layer (llm-d, from Red Hat, Google, and IBM) that does KV-cache-aware routing on top of vLLM.
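
The block-based bookkeeping behind PagedAttention can be sketched as a toy allocator. This shows only the memory-management idea, not vLLM's implementation: there are no tensors or attention kernels here, and the class and method names are hypothetical.

```python
from typing import Dict, List


class PagedKVCache:
    """Toy allocator: the KV cache is a pool of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}  # sequence id -> physical block ids
        self.lengths: Dict[str, int] = {}             # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:
            # Last block is full (or none allocated yet): grab any free block.
            if not self.free_blocks:
                raise MemoryError("KV cache is full")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        # A finished sequence returns its blocks to the pool immediately.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self) -> int:
        # Waste is bounded by the unfilled tail of each sequence's last block.
        return sum(
            len(self.block_tables[s]) * self.block_size - self.lengths[s]
            for s in self.block_tables
        )


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("request-a")
for _ in range(5):
    cache.append_token("request-b")
```

A 20-token sequence occupies two blocks and a 5-token sequence one, so the waste is at most one partial block per sequence. Reserving a contiguous region sized for the maximum possible length would waste far more.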

Who wins what: vLLM is the safe default for the breadth of its ecosystem, hardware, and model coverage. SGLang wins when your workload is prefix-heavy. TensorRT-LLM wins if you must extract maximum throughput from NVIDIA hardware and can accept the lock-in and the longer setup. For local and desktop use, pick Ollama for ease, with llama.cpp underneath. KServe and Ray Serve are not engines themselves; they orchestrate the engines, on Kubernetes and in Python respectively.

Inference routers and the serving control plane

A serving engine optimizes one replica. The optimizations above (PagedAttention, RadixAttention, prefix caching, multi-LoRA) all operate inside a single engine process. The moment a model needs more than one replica, those wins stop coordinating: each replica has its own private KV cache and does not know what the others hold. A round-robin load balancer in front of several replicas will route a multi-turn conversation to a different replica each turn, so the prefix cache misses and every turn recomputes the full history. The cache-aware win evaporates at exactly the point where you scale.
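
A small simulation makes the cache-miss effect visible. It is a toy model under simplifying assumptions: a "cache hit" just means the replica has seen the conversation before, arrivals are perfectly interleaved, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Turn = Tuple[str, int]  # (conversation id, turn number)


def count_cache_hits(
    turns: List[Turn], route: Callable[[Turn, int], int], num_replicas: int
) -> int:
    # Each replica remembers which conversations it has already served.
    seen: Dict[int, Set[str]] = {r: set() for r in range(num_replicas)}
    hits = 0
    for index, turn in enumerate(turns):
        conversation, _ = turn
        replica = route(turn, index) % num_replicas
        if conversation in seen[replica]:
            hits += 1  # the earlier turns' KV cache is already on this replica
        seen[replica].add(conversation)
    return hits


def round_robin(turn: Turn, index: int) -> int:
    return index


def session_affinity(turn: Turn, index: int) -> int:
    # Stable per-conversation choice; summing bytes avoids Python's randomized str hash.
    return sum(turn[0].encode())


# Three conversations, four turns each, interleaved as they would arrive.
conversations = ["conv-a", "conv-b", "conv-c"]
turns = [(c, t) for t in range(4) for c in conversations]

rr_hits = count_cache_hits(turns, round_robin, num_replicas=4)
affinity_hits = count_cache_hits(turns, session_affinity, num_replicas=4)
```

With four replicas, round-robin gets zero hits across the twelve turns, while pinning each conversation to one replica hits on every turn after the first (nine of twelve). Real traffic is less regular, but the direction of the effect is the same.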

Closing that gap is a separate category from the engine: an inference router (sometimes called an inference scheduler or serving control plane) that routes across replicas of a model on live serving state. The signals it routes on are the ones a plain gateway cannot see, because they live inside the engine: which replica already holds the request’s KV-cache prefix, which replica has the shortest queue right now, and which already has the requested LoRA adapter loaded. This is what distinguishes an inference router from an LLM gateway. The gateway (LiteLLM and the others above) routes between provider endpoints on API-level facts and treats each backend as an opaque URL; the inference router routes into engine internals on cache and load state. They sit at different layers and compose: a gateway in front for keys and metering, an inference router behind it for cache-aware placement.

Project	Maintainer	Role	Key idea
vLLM Production Stack	vLLM project	Reference router for vLLM at scale	Prefix/KV-aware routing across replicas, plus autoscaling and metrics
AIBrix	vLLM ecosystem (ByteDance origin)	Fuller control plane	LoRA-aware scheduling, heterogeneous-GPU handling, multiple models in one fabric
llm-d	Red Hat, Google, IBM	Kubernetes-native distributed serving	Disaggregated prefill/decode and KV-aware routing, aligned to the Gateway API Inference Extension
NVIDIA Dynamo	NVIDIA	Datacenter-scale serving	Disaggregated serving and KV tiering; engine-agnostic (vLLM, SGLang, TensorRT-LLM), so it does not require TensorRT-LLM
Gateway API Inference Extension	Kubernetes SIG	The standard, not a product	An inference-aware Endpoint Picker extending the Kubernetes Gateway API

The axis that matters when choosing is how much routing logic you need beyond prefix affinity. The vLLM Production Stack offers a small set of selectable strategies (round-robin, session affinity, prefix-aware) and is the simplest thing that does cache-aware routing for one model’s replicas. AIBrix, llm-d, and Dynamo sit at the richer end: multiple models behind one fabric, LoRA-aware scheduling, heterogeneous GPUs, and disaggregated prefill/decode. None of them exposes routing policy as the kind of configurable rule engine you might expect. The interesting tuning knob in practice is the scoring function that arbitrates between prefix affinity and load, because pure prefix affinity creates hotspots (everyone with the same prefix routes to the same replica, which is then the busiest).
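
The trade-off between prefix affinity and load can be sketched as a single scoring function. The linear form, the load_weight parameter, and the Replica fields are assumptions for illustration and do not come from any of the routers above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    name: str
    cached_prefix_tokens: int  # tokens of this request's prefix already in KV cache
    queue_depth: int           # requests currently waiting


def score(replica: Replica, prompt_tokens: int, load_weight: float) -> float:
    # Higher is better: reward cache reuse, penalize queued work.
    prefix_hit_ratio = replica.cached_prefix_tokens / max(prompt_tokens, 1)
    return prefix_hit_ratio - load_weight * replica.queue_depth


def pick_replica(replicas: List[Replica], prompt_tokens: int, load_weight: float) -> Replica:
    return max(replicas, key=lambda r: score(r, prompt_tokens, load_weight))


replicas = [
    Replica("hot", cached_prefix_tokens=900, queue_depth=12),
    Replica("idle", cached_prefix_tokens=0, queue_depth=0),
]

affinity_only = pick_replica(replicas, prompt_tokens=1000, load_weight=0.0)
balanced = pick_replica(replicas, prompt_tokens=1000, load_weight=0.1)
```

With load_weight at zero the request goes to the replica that holds the prefix even though twelve requests are queued there. A modest weight sends it to the idle replica instead and accepts the recompute cost.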

This is the fastest-moving category on the model side, and it is converging on the Gateway API Inference Extension as the standard interface. The practical consequence for anyone building on it: integrate at that standard rather than hard-wiring one router, and the specific implementation underneath becomes swappable as the projects leapfrog each other. A note on scope: like the engines, these routers operate per model and on system state (cache, load, hardware). They are deliberately blind to tenants and billing, so per-customer metering and access control are not their job and live in a layer above them.

Distributed training frameworks

The consolidation here: PyTorch-native FSDP2 has become a credible default into the hundred-billion-parameter range, DeepSpeed remains the offload specialist, and NVIDIA’s Megatron-Core is the engine for true frontier scale, especially Mixture-of-Experts.

Framework	Maintainer	Role	Differentiation
PyTorch FSDP2	Meta / PyTorch	Default for most teams	DTensor-based sharding, composes with the PyTorch stack
DeepSpeed	Microsoft	Offload king	ZeRO stages; CPU/NVMe offload (ZeRO-Infinity)
Megatron-Core	NVIDIA	Frontier-scale engine	All parallelism dimensions, mature expert parallel, FP8/FP4
NeMo	NVIDIA	End-to-end enterprise stack	Builds on Megatron-Core
Accelerate	Hugging Face	Thin wrapper	One interface over FSDP and DeepSpeed
Ray Train	Anyscale	Orchestration layer	Wraps the backends; multi-node lifecycle, fault tolerance
Horovod	Linux Foundation	Legacy	Ring-allreduce, superseded by FSDP/DeepSpeed

The vocabulary that matters is the set of parallelism strategies, because picking a framework is mostly about which combinations you need:

Data parallel: replicate the model, split the batch.
Sharding (FSDP / ZeRO): split parameters, gradients, and optimizer state across GPUs so no single GPU holds the whole model. ZeRO stages 1, 2, and 3 progressively shard optimizer state, then gradients, then parameters; FSDP2 is the PyTorch-native equivalent.
Tensor parallel: split individual matrix multiplications across GPUs (within a layer).
Pipeline parallel: split layers into stages across GPUs (across layers).
Sequence / context parallel: split along the sequence dimension for very long context.
Expert parallel: distribute Mixture-of-Experts experts across GPUs. As frontier models go MoE, mature expert parallelism is the most common reason teams reach past FSDP2 to Megatron-Core.
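
The effect of the ZeRO stages is easiest to see as arithmetic. The sketch below assumes mixed-precision Adam, counting 16 bytes of model state per parameter as in the ZeRO paper's accounting, and ignores activations, buffers, and fragmentation. The 7B model on 8 GPUs is an arbitrary example.

```python
def zero_bytes_per_gpu(num_params: float, num_gpus: int, stage: int) -> float:
    # Assumes mixed-precision Adam: 2 bytes/param for fp16 weights, 2 for fp16
    # gradients, and 12 for fp32 optimizer state (master weights, momentum, variance).
    params, grads, optim = 2 * num_params, 2 * num_params, 12 * num_params
    if stage >= 1:
        optim /= num_gpus   # stage 1 shards optimizer state
    if stage >= 2:
        grads /= num_gpus   # stage 2 also shards gradients
    if stage >= 3:
        params /= num_gpus  # stage 3 also shards parameters
    return params + grads + optim


GB = 1e9
per_gpu_gb = {stage: zero_bytes_per_gpu(7e9, 8, stage) / GB for stage in range(4)}
```

Under these assumptions the per-GPU model state drops from 112 GB with plain data parallelism to 38.5 GB, 26.25 GB, and 14 GB at stages 1, 2, and 3. Only stage 3 fits this example on a single 80 GB card with room left for activations.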

Selection heuristic: FSDP2 if you are PyTorch-native and training in the 7B to 100B range on a handful of GPUs. DeepSpeed if CPU/NVMe offload is a first-order requirement. Megatron-Core or NeMo for frontier-scale dense or MoE training where every parallelism dimension has to be tuned together. Accelerate and Ray Train are not backends; they wrap and orchestrate the ones above.

Fine-tuning libraries

Library	Maintainer	Sweet spot	Differentiation
Unsloth	Unsloth AI	Single-GPU speed and VRAM	Custom Triton kernels; multi-GPU is gated behind paid tiers
Axolotl	Axolotl AI	Multi-GPU clusters, full pipelines	YAML config; FSDP/DeepSpeed; SFT through RLHF
LLaMA-Factory	community	Fast first run, broad model zoo	Web UI, very wide model coverage
PEFT	Hugging Face	The adapter library	LoRA/QLoRA/DoRA implementations the others build on
TRL	Hugging Face	Reference post-training trainers	SFT/DPO/GRPO/PPO, tight Transformers integration

The workhorse technique is still LoRA and QLoRA. LoRA trains small low-rank adapter matrices instead of full weights. QLoRA adds 4-bit quantization of the frozen base model so it fits in far less VRAM. Because all of these libraries build on the same PEFT and safetensors conventions, adapters are largely portable: an adapter trained in Unsloth loads in Axolotl and vice versa.
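
A quick calculation shows why these two techniques make fine-tuning fit on small hardware. The hidden size, rank, and 7B parameter count are hypothetical example values, and the 4-bit figure ignores the small overhead of quantization constants.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    # LoRA replaces the update to a d_out x d_in weight with B @ A,
    # where A is rank x d_in and B is d_out x rank.
    return rank * d_in + d_out * rank


def base_weight_bytes(num_params: int, bits: int) -> float:
    return num_params * bits / 8


d = 4096                       # hypothetical hidden size
full = d * d                   # one square projection matrix
adapter = lora_trainable_params(d, d, rank=16)
fraction = adapter / full

fp16_gb = base_weight_bytes(7_000_000_000, 16) / 1e9
four_bit_gb = base_weight_bytes(7_000_000_000, 4) / 1e9
```

At rank 16 the adapter for one 4096 by 4096 matrix has 131,072 trainable parameters, under 1% of the full matrix, and quantizing a 7B base model from 16 bits to 4 bits cuts its weights from 14 GB to 3.5 GB.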

Selection heuristic: Unsloth for solo, single-GPU iteration (budget for the paid tier if you need multi-GPU, since that is gated in the open version). Axolotl for serious multi-GPU clusters and end-to-end RLHF. LLaMA-Factory for the fastest path to a first run and the widest model support. TRL when you are already in the Hugging Face ecosystem and want reference-quality trainers.

One project to handle carefully if you read older guides: torchtune went into maintenance mode in mid-2025 (bug and security fixes only). Its role is being split rather than cleanly inherited, with PyTorch-native post-training moving toward a newer, scale-focused project. Do not start new work on it.

RL and post-training

This is among the most dynamic categories, driven by the rise of reinforcement learning for reasoning models. The leaders in 2026 are verl and OpenRLHF for scale, with TRL as the accessible entry point.

Framework	Maintainer	Role	Differentiation
verl	verl community (started at ByteDance)	Leading at scale	Hybrid-controller design; GRPO/PPO/DAPO and more in a few lines
OpenRLHF	OpenRLHF community	Scalable, production-ready	Ray plus vLLM architecture
TRL	Hugging Face	Most accessible	Tight Transformers/PEFT integration
NeMo-Aligner	NVIDIA	Enterprise / NVIDIA stack	Megatron-Core-backed alignment

The algorithm story behind this category is GRPO (Group Relative Policy Optimization). It drops the separate value model that PPO requires and instead computes advantages from a group of sampled responses scored by a reward, often a rule-based, verifiable one. For reasoning, math, and code tasks where the answer can be checked, this is cheaper and simpler than PPO, which is why it became the dominant algorithm for training reasoning models. A rough rule of thumb: GRPO for verifiable-reward reasoning tasks, DPO for offline preference alignment with no online generation, PPO when you need a general learned reward model and can afford the infrastructure.
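
The group-relative advantage at the heart of GRPO is a few lines of arithmetic. This sketch covers only that step, under the common formulation of normalizing each reward by the group mean and standard deviation. It omits the policy-gradient loss, clipping, and KL term, and the sampled answers are made up.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    # Normalize each reward against the other samples for the same prompt.
    baseline = mean(rewards)
    spread = pstdev(rewards)
    return [(r - baseline) / (spread + eps) for r in rewards]


def verifiable_reward(answer: str, expected: str) -> float:
    # Rule-based reward: 1.0 if the final answer matches, else 0.0.
    return 1.0 if answer.strip() == expected else 0.0


# Four sampled answers to the same math prompt (hypothetical outputs).
samples = ["42", "41", "42 ", "7"]
rewards = [verifiable_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)
```

Correct answers get a positive advantage and incorrect ones a negative advantage, with the group mean playing the role that the value model plays in PPO. If every sample in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.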

Selection heuristic: verl if you are doing serious RL post-training at scale (it is the de facto leader in both research and industry). OpenRLHF as the other scalable Ray-plus-vLLM option. TRL to get started or for smaller jobs. NeMo-Aligner if you are committed to the NVIDIA and Megatron stack.

Core libraries and quantization

Underneath all of the above sit a handful of libraries that almost everything depends on. Hugging Face Transformers is the universal model definition and loading layer. PyTorch remains dominant for open-source LLM work (JAX is strong inside Google and on TPUs). Triton is the GPU-kernel language that powers torch.compile, Unsloth’s speedups, and cross-vendor kernel portability. FlashAttention is the exact-attention kernel nearly everyone relies on, with recent versions targeting Hopper and Blackwell GPUs.

For quantization, the landscape sorts cleanly by where the model runs:

Method	Best for	Notes
GGUF	Local / CPU / edge	From llama.cpp, the local-inference standard
AWQ	GPU serving (INT4)	Has overtaken GPTQ for new work on quality and kernel speed
GPTQ	GPU serving (INT4), legacy	Still common, losing ground to AWQ
bitsandbytes	QLoRA fine-tuning	The 4-bit base-model quant during training
FP8 / FP4	H100 and Blackwell serving and training	GPU-native low precision, no calibration for dynamic quant

The trajectory: GGUF owns local and CPU, AWQ has displaced GPTQ as the default for GPU-served INT4 weights, bitsandbytes owns the QLoRA training path, and FP8 (with FP4 arriving on Blackwell) is where both serving and training are heading on current NVIDIA hardware.

How to use this map

The categories outlast the specific projects. When a new framework launches, the questions are the same: which category does it sit in, what control model or architectural idea does it bet on, and is it neutral-governance or single-vendor open-core with a hosted tier? Answer those three and you can place almost anything.

A few cross-cutting points:

The protocol layer is settling. Build tool integrations as MCP servers and agent coordination on A2A, and you are aligned with where the major labs converged.
The minimal-framework backlash is real. If a heavy framework’s abstractions are fighting you, calling the provider SDK directly (or using a typed minimal layer) is a reasonable and increasingly common choice.
Open-core is the norm. Read the license and the governance before you build a hard dependency. The neutral-governance projects (MCP and A2A under the Linux Foundation, Milvus, pgvector, vLLM under the PyTorch Foundation) are the ones least likely to change direction under you.
The model side has a portable path and a locked one. vLLM, SGLang, FSDP2, and the PyTorch-native tools are the portable alternative to NVIDIA’s TensorRT-LLM and Megatron stack. NVIDIA’s tooling sets the performance ceiling, and the open stack is closing the gap.
“Router” means two different things. An LLM gateway routes between providers on API-level facts; an inference router routes across replicas on engine-internal cache and load state. They are different layers that compose, not substitutes. If a project’s routing decision needs to see inside the engine, it is an inference router; if the backend is an opaque URL to it, it is a gateway.

None of this tooling decides where it runs. Most teams assemble a stack from several of these projects: an agent framework, a serving engine, a fine-tuning library for the model work. They then have to operate all of it on GPU infrastructure, with usage tracking, multi-node coordination, idle detection, and access control. That operational layer is the same regardless of which frameworks you pick, and it is the part Saturn Cloud handles, so your team can spend its time on the model and product work that differentiates you.