A list of specific OSS projects goes stale within a quarter. The categories do not. If you know what problem each category solves and the axes along which the projects in it differ, you can place any new tool on the map and evaluate it quickly.
This post is organized that way. It splits the ecosystem into the application side (building agents and LLM-powered apps) and the model side (serving, training, fine-tuning, and post-training), and covers each category with the leading projects and the differences between them. The focus is architecture and trade-offs rather than feature lists.
One caveat that affects how you should read the rest: most “open source” projects here are VC-backed open-core. The framework is open, but a company sells a hosted tier and the roadmap follows that product. The projects under neutral governance are the exception, and they are worth knowing for that reason. We flag them where they come up.
Part 1: The application side (agents and LLM apps)
Agent frameworks and orchestration
This is the most crowded category. The useful way to cut it is by control model rather than by vendor. There are three:
- Graph-based: you define agent behavior as an explicit state machine. Nodes do work, edges route, and state persists. This gives the most control at the cost of more upfront structure.
- Role-based: you define agents by role and goal, then assemble them into teams. Less code, but less control over the exact execution path.
- Minimal / typed: a thin layer over the model API with a few primitives such as tools, handoffs, and structured output. Built for engineers who distrust heavy abstractions.
| Framework | Model | Language | Maintainer | Notes |
|---|---|---|---|---|
| LangGraph | Graph-based | Python / JS | LangChain Inc. | The orchestration substrate others build on |
| LangChain | Abstraction layer | Python / JS | LangChain Inc. | Re-architected onto the LangGraph runtime |
| LlamaIndex | Retrieval-centric | Python / TS | LlamaIndex Inc. | Data and document workflows |
| CrewAI | Role-based | Python | CrewAI Inc. | Crews of role-defined agents |
| OpenAI Agents SDK | Minimal | Python / TS | OpenAI | Agents, handoffs, guardrails |
| Pydantic AI | Minimal, typed | Python | Pydantic | Type-first, “FastAPI for agents” |
| Smolagents | Minimal, code-exec | Python | Hugging Face | Agents act by writing and running code |
| Google ADK | Graph-based | Python / Java / TS | Google | GCP-native, tight Gemini integration |
| Mastra | Workflow + agents | TypeScript | Mastra Inc. | The leading TS-first framework |
LangGraph has become the default substrate for serious agent work. It models an agent as a directed graph with conditional edges and provides the features you need in production: durable execution, checkpointing so a run can resume from saved state, streaming, and human-in-the-loop interrupts. Use it when an agent needs to loop, branch, and survive restarts.
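To make the graph-based control model concrete, here is a minimal plain-Python sketch. It is not LangGraph's API: `run_graph`, `NODES`, `route`, and the in-memory `checkpoints` list are hypothetical names chosen for illustration, and a real runtime would persist checkpoints to durable storage rather than a list.

```python
import copy
from typing import Callable, Dict, List, Optional

State = Dict[str, object]

def draft(state: State) -> State:
    # Node: do one unit of work and return the updated state.
    state["attempts"] = int(state["attempts"]) + 1
    state["draft"] = f"draft v{state['attempts']}"
    return state

def review(state: State) -> State:
    # Node: approve once the draft has been produced at least twice.
    state["approved"] = int(state["attempts"]) >= 2
    return state

# Hypothetical node registry: node name -> function over the shared state.
NODES: Dict[str, Callable[[State], State]] = {"draft": draft, "review": review}

def route(node: str, state: State) -> Optional[str]:
    # Conditional edges: loop back to "draft" until the review approves.
    if node == "draft":
        return "review"
    if node == "review":
        return None if state["approved"] else "draft"
    return None

def run_graph(state: State, start: Optional[str], checkpoints: List[dict]) -> State:
    node = start
    while node is not None:
        state = NODES[node](state)
        nxt = route(node, state)
        # Checkpoint after every node so a restarted run can resume from here.
        checkpoints.append({"next": nxt, "state": copy.deepcopy(state)})
        node = nxt
    return state

checkpoints: List[dict] = []
final = run_graph({"attempts": 0, "approved": False}, "draft", checkpoints)

# Resume from a saved checkpoint, as if the process had restarted mid-run.
saved = checkpoints[1]
resumed = run_graph(copy.deepcopy(saved["state"]), saved["next"], [])
```

The loop, the branch, and the resume all fall out of three pieces: node functions, a routing function, and a saved `(next node, state)` pair. That is the structure you pay for upfront in exchange for control.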
LangChain itself did not die, but its scope shrank. The 1.0 release deprecated the old AgentExecutor and chain abstractions, moved the new create_agent entry point on top of the LangGraph runtime, and split legacy code into a separate langchain-classic package. So this is less “LangGraph killed LangChain” than “the same company consolidated everything onto the LangGraph runtime and slimmed LangChain down to thin agent and integration helpers.” The two ship from the same company and are designed to be used together. One thing worth knowing: LangChain’s reputation for leaky abstractions drove a “skip the framework, call the SDK directly” backlash, which is part of why the minimal camp grew.
CrewAI is the canonical role-based option. You define agents with a role and a goal, group them into a crew, and express a multi-agent workflow in far less code than raw orchestration would take. It is strong for structured business-process automation and weaker when you need fine-grained control over the execution path. Open-core, with a hosted deployment platform on top.
OpenAI Agents SDK and Pydantic AI anchor the minimal camp. The OpenAI SDK exposes three primitives (agents, handoffs for delegation, and guardrails) plus tracing, and it is the natural default for teams already on the OpenAI stack. Pydantic AI is the type-first option from the Pydantic team: rigorous typed inputs and outputs with validation baked in, which appeals to engineers who want the model boundary to behave like any other typed interface.
Smolagents (Hugging Face) takes a different approach: its agents act by writing and executing Python code rather than emitting JSON tool calls. It is a niche tool rather than a market leader, but the code-execution model is a distinct design point.
Mastra is the one to know if you work in TypeScript. It bundles agents, workflows, and RAG in one framework and has become the leading TS-first option, with the Vercel AI SDK as its main rival there. The agent world was effectively Python-only for a while, and that is no longer the case.
A note on AutoGen, since it featured heavily in older comparisons: it has fragmented. The original creators maintain a community fork (AG2), while Microsoft moved its own effort into the Microsoft Agent Framework, which merges AutoGen’s multi-agent ideas with Semantic Kernel’s enterprise tooling. The “AutoGen” name no longer points at one thing, so evaluate the successors directly rather than the original.
The protocol layer: MCP and A2A
This is the biggest structural change in the application stack, and the standardization is real rather than aspirational.
MCP (Model Context Protocol), introduced by Anthropic in late 2024, standardizes the agent-to-tool and agent-to-data layer: how an agent reads files, calls functions, and pulls in context. Through 2025 it was adopted across OpenAI, Google, and Microsoft, plus the major coding tools. It has effectively won the tool-integration layer, and today you build tool integrations as MCP servers.
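On the wire, MCP is JSON-RPC 2.0, and tools are exposed through the `tools/list` and `tools/call` methods. The sketch below shows only that message shape with a toy in-process dispatcher. It is an illustration under simplifying assumptions, not an MCP server: it skips the initialization handshake, transports, and notifications, and the `TOOLS` registry, `handle` function, and `add` tool are hypothetical. For real work, use an official MCP SDK.

```python
import json
from typing import Any, Dict

def add(a: float, b: float) -> float:
    return a + b

# Hypothetical registry: tool name -> description, JSON Schema for inputs, function.
TOOLS: Dict[str, Dict[str, Any]] = {
    "add": {
        "description": "Add two numbers",
        "inputSchema": {
            "type": "object",
            "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
            "required": ["a", "b"],
        },
        "fn": add,
    }
}

def handle(raw: str) -> str:
    """Dispatch one JSON-RPC request string and return the response string."""
    req = json.loads(raw)
    method, params = req["method"], req.get("params", {})
    if method == "tools/list":
        result = {
            "tools": [
                {"name": name, "description": t["description"], "inputSchema": t["inputSchema"]}
                for name, t in TOOLS.items()
            ]
        }
    elif method == "tools/call":
        tool = TOOLS[params["name"]]
        value = tool["fn"](**params.get("arguments", {}))
        # Tool results come back as a list of content items.
        result = {"content": [{"type": "text", "text": str(value)}], "isError": False}
    else:
        error = {"code": -32601, "message": "Method not found"}
        return json.dumps({"jsonrpc": "2.0", "id": req.get("id"), "error": error})
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})

call = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
        "params": {"name": "add", "arguments": {"a": 2, "b": 3}}}
response = json.loads(handle(json.dumps(call)))
listing = json.loads(handle(json.dumps({"jsonrpc": "2.0", "id": 2, "method": "tools/list"})))
```

Because every tool is described by a name plus a JSON Schema, any MCP-aware client can discover and call it without bespoke glue code, which is the point of standardizing this layer.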
A2A (Agent-to-Agent), introduced by Google, handles the problem MCP explicitly leaves out: how independent agents discover and coordinate with each other. The two are complementary. A realistic production system runs MCP for tools and A2A for inter-agent communication.
What makes this durable is the governance. Both protocols now sit under the Linux Foundation rather than a single vendor: MCP was donated to the Agentic AI Foundation, which the Linux Foundation hosts with backing from the major labs and clouds (OpenAI, Anthropic, Google, Microsoft, AWS, and others), and Google donated A2A to the Linux Foundation as its own project. Neutral governance is why the interoperability question is largely settled instead of being contested between vendors. These are among the few projects in this post that are not single-company open-core.
Retrieval (RAG) frameworks and vector databases
Two sub-categories: the frameworks that orchestrate retrieval, and the databases that store and search the vectors.
On the framework side, the two durable choices are LlamaIndex and Haystack. LlamaIndex is retrieval-first and has deepened its indexing, context-assembly, and document-intelligence primitives. Haystack (from deepset, Apache-2.0) takes an explicit-pipeline approach: retrievers, rankers, and generators wired into an inspectable, serializable graph. Haystack’s differentiator is operability, since the pipelines are designed to be deployed, monitored, and run in production. If RAG is one feature of a larger app, LlamaIndex tends to fit. If RAG is the product and has to be operated as a service, Haystack’s pipeline model is the better fit.
On the database side, the practical selection heuristic in 2026 is blunt:
| Database | Form | License | Pick it when |
|---|---|---|---|
| pgvector | Postgres extension | OSS (PostgreSQL) | You already run Postgres and are under a few million vectors |
| Qdrant | Standalone (Rust) | Apache-2.0 | You need a dedicated store; strong payload filtering |
| Weaviate | Standalone | OSS (BSD-3) | You want built-in vectorization modules |
| Milvus | Distributed | Apache-2.0 | You are at very large (billion-vector) scale |
| Chroma | Embedded / light | Apache-2.0 | Prototyping a new RAG project |
| LanceDB | Embedded (columnar) | Apache-2.0 | In-process, local, or multimodal data |
The short version most teams converge on: pgvector if you already have Postgres, Qdrant if you do not, Milvus only when scale actually demands a distributed system. Chroma is the prototype default, but teams typically plan a migration once filtering needs or dataset size grow. One governance note: Milvus sits under Linux Foundation (LF AI and Data) governance and pgvector rides on Postgres, while the rest are single-vendor open-core with a hosted cloud attached.
LLM gateways and routing
When you call more than one model provider, you want a single API surface, centralized spend tracking, and key management in one place. The key distinction for an infra team is self-hosted proxy versus hosted marketplace.
LiteLLM is the self-hosted default: a Python SDK plus a proxy server that exposes one OpenAI-compatible API across a hundred-plus providers, with routing, spend tracking, and key management you own and run. Portkey is the hybrid option, an open-source gateway plus a managed tier that adds semantic caching, guardrails, and audit trails for enterprise use. OpenRouter is frequently mentioned alongside these, but it is a hosted SaaS marketplace, not something you self-host, so it belongs in a different mental bucket: zero infra to run, at the cost of routing your traffic through a third party.
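What a self-hosted gateway does can be sketched in a few lines: one call surface, ordered fallback across providers, and spend attributed to a caller. This is a toy under stated assumptions, not LiteLLM's or Portkey's API. The `Gateway` and `Provider` classes, the prices, and the whitespace-based token estimate are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Provider:
    name: str
    usd_per_1k_tokens: float        # made-up price for illustration
    call: Callable[[str], str]      # stand-in for a real provider SDK call

@dataclass
class Gateway:
    providers: List[Provider]
    spend_usd: Dict[str, float] = field(default_factory=dict)

    def complete(self, team: str, prompt: str) -> str:
        errors = []
        for provider in self.providers:          # try providers in priority order
            try:
                text = provider.call(prompt)
            except RuntimeError as exc:
                errors.append(f"{provider.name}: {exc}")
                continue                          # fall back to the next provider
            # Crude token estimate; a real gateway uses provider-reported usage.
            tokens = len(prompt.split()) + len(text.split())
            cost = tokens / 1000 * provider.usd_per_1k_tokens
            self.spend_usd[team] = self.spend_usd.get(team, 0.0) + cost
            return text
        raise RuntimeError("all providers failed: " + "; ".join(errors))

def flaky(prompt: str) -> str:
    raise RuntimeError("rate limited")

def stable(prompt: str) -> str:
    return "echo: " + prompt

gateway = Gateway([Provider("primary", 1.0, flaky), Provider("backup", 2.0, stable)])
reply = gateway.complete("search-team", "hello world")
```

The application code only ever sees `complete`. Which provider answered, and what it cost, is the gateway's concern, which is also why the gateway ends up holding every provider key.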
One point relevant to DevOps: this layer sits directly in your credential path, which makes it a supply-chain target. LiteLLM had malicious code published in two point releases on PyPI in early 2026, followed quickly by a clean release. The takeaway is not to avoid the tool. It is to pin and verify dependencies for anything that can see your provider keys and cloud credentials.
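Pinning and verifying comes down to comparing an artifact's digest against one recorded when the dependency was reviewed. pip can enforce this for you with `--require-hashes` and `--hash=sha256:...` entries in a requirements file. The sketch below shows the underlying check in plain Python; `verify_artifact` and the byte strings are hypothetical stand-ins for a downloaded wheel.

```python
import hashlib
import hmac

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify_artifact(data: bytes, pinned_sha256: str) -> bool:
    # Compare against the digest recorded at review time (for example in a lock file).
    return hmac.compare_digest(sha256_hex(data), pinned_sha256.lower())

wheel_bytes = b"pretend this is a downloaded wheel"
pinned = sha256_hex(wheel_bytes)            # recorded once, when the release was vetted
tampered = wheel_bytes + b" plus injected code"

ok = verify_artifact(wheel_bytes, pinned)
rejected = not verify_artifact(tampered, pinned)
```

A version pin alone would not catch a republished or compromised release at the same version number. A hash pin does, because any change to the bytes changes the digest.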
Observability and evaluation
The defining axis here is open-source-and-portable versus proprietary-and-framework-locked. The industry is converging on OpenTelemetry (and OpenInference) as the shared trace format, which favors the portable tools.
- Langfuse (MIT, self-hostable, OpenTelemetry-native, framework-agnostic) is the open-source favorite for tracing, evals, and prompt management. Self-hosting is a primary supported mode rather than an afterthought, which matters when inputs and outputs must not leave your infrastructure.
- Phoenix (Arize) is built on OpenTelemetry/OpenInference and is strong for agent and LLM trace troubleshooting. The code is public and self-hostable, but note the license is source-available (Elastic License v2) rather than OSI-approved open source.
- LangSmith is the LangChain company’s product. It auto-traces LangChain and LangGraph with almost no setup, which is its strength and its lock-in. Be precise here: LangSmith is proprietary, not open source.
For evaluation specifically, the common open-source stack is promptfoo for CI-style prompt testing and red-teaming, DeepEval for unit-test-style assertions on outputs (“Pytest for LLMs”), and Ragas for RAG-specific metrics (faithfulness, context precision and recall). These are components, not platforms, and they feed results into an observability tool like Langfuse or Phoenix.
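To show what a RAG retrieval metric measures, here is a deliberately simplified, label-based version of context precision and recall. It is not how Ragas computes its metrics, which are rank-aware and typically LLM-judged. The function name and the document ids are hypothetical.

```python
from typing import Iterable, Set, Tuple

def context_precision_recall(retrieved: Iterable[str], relevant: Set[str]) -> Tuple[float, float]:
    """Set-based precision and recall over retrieved chunk ids."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# The retriever returned four chunks; a human labelled three chunks as relevant.
precision, recall = context_precision_recall(
    ["doc1", "doc4", "doc7", "doc9"],
    {"doc1", "doc2", "doc7"},
)
```

Precision asks how much of what was retrieved was useful, and recall asks how much of what was useful got retrieved. In a CI setup you would assert thresholds on numbers like these and ship the results to your observability tool.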
Guardrails and structured output
Two problems that get conflated. One is making the model return data your code can parse. The other is keeping a conversation inside policy.
For structured output, Instructor is the pragmatic default: it patches the LLM client, uses Pydantic (or equivalent) models as the schema, and validates and retries on failure. Outlines takes a different approach, constraining the model’s decoding at the token level so the output is structurally valid by construction rather than validated after the fact. Outlines is the better fit when you control the model weights (local or open models); Instructor fits when you are calling a hosted API. Note that native structured-output modes in the major provider APIs, and typed frameworks like Pydantic AI, have absorbed the simpler use cases.
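The validate-and-retry pattern can be sketched without any library. This is an illustration of the idea, not Instructor's API: it uses a dataclass and hand-written checks where Instructor would use a Pydantic model, and `stub_model` is a hypothetical stand-in for a hosted LLM call.

```python
import json
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Invoice:
    customer: str
    total_cents: int

def parse_invoice(raw: str) -> Invoice:
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("customer"), str):
        raise ValueError("customer must be a string")
    total = data.get("total_cents")
    if not isinstance(total, int) or isinstance(total, bool):
        raise ValueError("total_cents must be an integer")
    return Invoice(data["customer"], total)

def extract(model: Callable[[str], str], prompt: str, max_retries: int = 2) -> Invoice:
    message = prompt
    for _ in range(max_retries + 1):
        raw = model(message)
        try:
            return parse_invoice(raw)
        except ValueError as exc:
            # Feed the validation error back so the model can correct itself.
            message = f"{prompt}\nYour last reply was invalid ({exc}). Return only valid JSON."
    raise ValueError(f"no valid output after {max_retries + 1} attempts")

# Stub model: the first reply has the wrong type, the second is valid.
replies = iter([
    '{"customer": "Acme", "total_cents": "12.50"}',
    '{"customer": "Acme", "total_cents": 1250}',
])
calls: List[str] = []

def stub_model(message: str) -> str:
    calls.append(message)
    return next(replies)

invoice = extract(stub_model, "Extract the invoice as JSON.")
```

The contrast with constrained decoding is where the guarantee lives: here invalid output is produced and then rejected, so you pay for retries, whereas token-level constraints prevent the invalid output from being generated at all.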
For policy and safety, Guardrails AI validates output content and structure (PII, format, content rules) through a library of pre-built validators, while NeMo Guardrails (NVIDIA) controls conversation flow and dialog policy through a dedicated DSL. They solve different layers, and a common pattern is to run NeMo for conversation policy and Guardrails AI for output validation in the same system.
Part 2: The model side (serving, training, fine-tuning, post-training)
A note on benchmarks before the tables: every “X is 30% faster than Y” figure in this half of the ecosystem is specific to a workload, model, hardware setup, and version, and most circulating numbers come from the winning project’s own benchmarks. The architectural differences below hold up over time; the percentages do not, so this section mostly leaves them out.
Inference and serving engines
The production GPU-serving race has consolidated. vLLM is the default, SGLang is the credible challenger, TensorRT-LLM is the NVIDIA-locked performance ceiling, and the previous-generation servers have largely stepped back.
| Engine | Maintainer | Role | Key idea |
|---|---|---|---|
| vLLM | PyTorch Foundation project | Default production engine | PagedAttention; widest hardware and model support |
| SGLang | sgl-project community | Challenger, strong on agentic/RAG | RadixAttention (automatic prefix-cache reuse) |
| TensorRT-LLM | NVIDIA | Performance ceiling, niche | Compiler-optimized FP8/FP4 on NVIDIA only |
| llama.cpp | ggml community | Local / CPU / edge standard | C/C++ engine, GGUF format |
| Ollama | Ollama Inc. | Local dev default | Wraps llama.cpp, runs in minutes |
| KServe | CNCF (incubating) | Kubernetes orchestration | Wraps engines; autoscaling, scale-to-zero, canary |
| Ray Serve | Anyscale | Multi-model orchestration | Typed service calls, multi-node, queue-based autoscaling |
The concepts that actually distinguish these engines:
- PagedAttention (vLLM’s original breakthrough) manages the KV cache like OS virtual memory, paging blocks in and out. It cuts the memory fragmentation that otherwise wastes GPU RAM, which lets you serve more concurrent requests on the same card.
- RadixAttention (SGLang) stores the KV cache in a radix tree so that shared prefixes (a long system prompt, few-shot examples, multi-turn history, RAG context) are detected and reused automatically. This is why SGLang shows its biggest wins on prefix-heavy and agentic workloads and roughly matches vLLM on plain single-turn generation.
- Continuous batching (all the serious engines do it) swaps finished sequences out and new ones in every decode step instead of waiting for a whole batch to finish. This is table stakes now, not a differentiator.
- Disaggregated prefill/decode runs the compute-bound prefill phase and the memory-bandwidth-bound decode phase on separate GPU pools so you can tune time-to-first-token and inter-token latency independently. This is the frontier of production serving, and it is enabling a new Kubernetes layer (llm-d, from Red Hat, Google, and IBM) that does KV-cache-aware routing on top of vLLM.
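The paging idea from the list above can be sketched as a block allocator: each sequence gets a block table that maps its logical KV positions to fixed-size physical blocks, allocated only as tokens arrive. This is a toy bookkeeping model, not vLLM's implementation. `PagedKVCache`, `BLOCK_SIZE`, and the block counts are hypothetical, and no tensors are involved.

```python
from typing import Dict, List

BLOCK_SIZE = 16  # tokens per KV block (illustrative value)

class PagedKVCache:
    def __init__(self, num_blocks: int) -> None:
        self.free: List[int] = list(range(num_blocks))   # free physical block ids
        self.block_tables: Dict[str, List[int]] = {}     # sequence id -> physical blocks
        self.lengths: Dict[str, int] = {}                # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % BLOCK_SIZE == 0:      # no block yet, or the last block is full
            if not self.free:
                raise MemoryError("no free KV blocks; a real engine would preempt or queue")
            table.append(self.free.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id: str) -> None:
        # A finished sequence returns its blocks to the pool immediately.
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(40):
    cache.append_token("a")    # 40 tokens need 3 blocks of 16
for _ in range(5):
    cache.append_token("b")    # 5 tokens need 1 block
blocks_in_use = 8 - len(cache.free)
blocks_for_a = len(cache.block_tables["a"])
cache.release("a")
```

Waste is bounded by one partially filled block per sequence, instead of a contiguous reservation sized for the maximum possible length. That bound is what lets more concurrent requests fit on the same card.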
Who wins what: vLLM is the safe default for the breadth of its ecosystem, hardware, and model coverage. SGLang wins when your workload is prefix-heavy. TensorRT-LLM is the choice if you must extract maximum throughput from NVIDIA hardware and can accept the lock-in and the longer setup. For local and desktop use, pick Ollama for ease, with llama.cpp underneath. KServe and Ray Serve are not engines themselves; they orchestrate the engines, KServe on Kubernetes and Ray Serve from Python.
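Why prefix-heavy workloads favor prefix caching is easy to see in a sketch. The toy below uses a plain token-level trie to count how many leading tokens of a request are already cached. It is an illustration, not SGLang's data structure: a radix tree compresses single-child chains and the real cache stores KV tensors with eviction, while `PrefixCache` here only tracks token ids.

```python
from typing import Dict, List

class PrefixNode:
    def __init__(self) -> None:
        self.children: Dict[int, "PrefixNode"] = {}

class PrefixCache:
    def __init__(self) -> None:
        self.root = PrefixNode()

    def match_and_insert(self, tokens: List[int]) -> int:
        """Return how many leading tokens were already cached, then cache the rest."""
        node, reused = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                # First miss: everything from here on is new and must be prefilled.
                child = PrefixNode()
                node.children[tok] = child
            else:
                reused += 1
            node = child
        return reused

system_prompt = list(range(100))            # stand-in for a long shared system prompt
request_1 = system_prompt + [1001, 1002]
request_2 = system_prompt + [2001]

cache = PrefixCache()
first = cache.match_and_insert(request_1)    # cold cache
second = cache.match_and_insert(request_2)   # shares the system prompt
third = cache.match_and_insert(request_1)    # exact repeat
```

The second request only needs prefill for its final token, since the 100 shared tokens are already cached. With no shared prefix there is nothing to reuse, which is consistent with the two engines being roughly level on plain single-turn generation.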
Distributed training frameworks
The consolidation here: PyTorch-native FSDP2 has become a credible default into the hundred-billion-parameter range, DeepSpeed remains the offload specialist, and NVIDIA’s Megatron-Core is the engine for true frontier scale, especially Mixture-of-Experts.
| Framework | Maintainer | Role | Differentiation |
|---|---|---|---|
| PyTorch FSDP2 | Meta / PyTorch | Default for most teams | DTensor-based sharding, composes with the PyTorch stack |
| DeepSpeed | Microsoft | Offload king | ZeRO stages; CPU/NVMe offload (ZeRO-Infinity) |
| Megatron-Core | NVIDIA | Frontier-scale engine | All parallelism dimensions, mature expert parallel, FP8/FP4 |
| NeMo | NVIDIA | End-to-end enterprise stack | Builds on Megatron-Core |
| Accelerate | Hugging Face | Thin wrapper | One interface over FSDP and DeepSpeed |
| Ray Train | Anyscale | Orchestration layer | Wraps the backends; multi-node lifecycle, fault tolerance |
| Horovod | Linux Foundation | Legacy | Ring-allreduce, superseded by FSDP/DeepSpeed |
The vocabulary that matters is the set of parallelism strategies, because picking a framework is mostly about which combinations you need:
- Data parallel: replicate the model, split the batch.
- Sharding (FSDP / ZeRO): split parameters, gradients, and optimizer state across GPUs so no single GPU holds the whole model. ZeRO stages 1, 2, and 3 progressively shard optimizer state, then gradients, then parameters; FSDP2 is the PyTorch-native equivalent.
- Tensor parallel: split individual matrix multiplications across GPUs (within a layer).
- Pipeline parallel: split layers into stages across GPUs (across layers).
- Sequence / context parallel: split along the sequence dimension for very long context.
- Expert parallel: distribute Mixture-of-Experts experts across GPUs. As frontier models go MoE, mature expert parallelism is the most common reason teams reach past FSDP2 to Megatron-Core.
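The effect of the sharding stages is easiest to see as arithmetic. The sketch below uses the common mixed-precision Adam accounting of 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights, momentum, and variance). That accounting is an assumption, it excludes activations and buffers, and the function name is hypothetical.

```python
def per_gpu_model_state_gb(params_billion: float, num_gpus: int, zero_stage: int) -> float:
    """Approximate model-state memory per GPU in GB, excluding activations."""
    weights, grads, optim = 2.0, 2.0, 12.0   # bytes per parameter, unsharded
    if zero_stage >= 1:
        optim /= num_gpus                    # stage 1 shards optimizer state
    if zero_stage >= 2:
        grads /= num_gpus                    # stage 2 also shards gradients
    if zero_stage >= 3:
        weights /= num_gpus                  # stage 3 also shards parameters
    # 1e9 parameters times bytes per parameter, divided by 1e9 bytes per GB.
    return params_billion * (weights + grads + optim)

# A 7B-parameter model on 8 GPUs, at each stage.
memory_by_stage = {stage: per_gpu_model_state_gb(7, 8, stage) for stage in range(4)}
```

Under these assumptions the 7B model drops from 112 GB per GPU with plain data parallelism to 38.5, 26.25, and 14 GB at stages 1, 2, and 3. Full sharding, which is also what FSDP does, is what makes the model fit on an 80 GB card at all.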
Selection heuristic: FSDP2 if you are PyTorch-native and training in the 7B to 100B range, from a single multi-GPU node at the small end to a multi-node cluster at the large end. DeepSpeed if CPU/NVMe offload is a first-order requirement. Megatron-Core or NeMo for frontier-scale dense or MoE training where every parallelism dimension has to be tuned together. Accelerate and Ray Train are not backends; they wrap and orchestrate the ones above.
Fine-tuning libraries
| Library | Maintainer | Sweet spot | Differentiation |
|---|---|---|---|
| Unsloth | Unsloth AI | Single-GPU speed and VRAM | Custom Triton kernels; multi-GPU is gated behind paid tiers |
| Axolotl | Axolotl AI | Multi-GPU clusters, full pipelines | YAML config; FSDP/DeepSpeed; SFT through RLHF |
| LLaMA-Factory | community | Fast first run, broad model zoo | Web UI, very wide model coverage |
| PEFT | Hugging Face | The adapter library | LoRA/QLoRA/DoRA implementations the others build on |
| TRL | Hugging Face | Reference post-training trainers | SFT/DPO/GRPO/PPO, tight Transformers integration |
The workhorse technique is still LoRA and QLoRA. LoRA trains small low-rank adapter matrices instead of full weights. QLoRA adds 4-bit quantization of the frozen base model so it fits in far less VRAM. Because all of these libraries build on the same PEFT and safetensors conventions, adapters are largely portable: an adapter trained in Unsloth loads in Axolotl and vice versa.
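The LoRA idea fits in a few lines of plain Python: keep the weight matrix W frozen and learn a low-rank update, so the layer computes y = W x + (alpha / r) B A x. This is a sketch with lists instead of tensors, not PEFT's implementation. The function names and the tiny matrices are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]

def matvec(m: Matrix, v: List[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: List[float], alpha: float) -> List[float]:
    """y = W x + (alpha / r) * B (A x). W is frozen; only A and B are trained."""
    r = len(A)                               # adapter rank
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def lora_param_counts(d_out: int, d_in: int, r: int) -> Tuple[int, int]:
    """Return (full fine-tuning parameters, adapter parameters) for one layer."""
    return d_out * d_in, r * (d_in + d_out)

W = [[1.0, 2.0], [3.0, 4.0]]    # frozen d_out x d_in weight
A = [[0.5, -0.5]]               # r x d_in, randomly initialised in practice
B = [[0.0], [0.0]]              # d_out x r, zero-initialised so training starts at W
x = [1.0, 1.0]

y_start = lora_forward(W, A, B, x, alpha=2.0)
full, adapter = lora_param_counts(4096, 4096, r=16)
```

For a 4096 by 4096 projection at rank 16, the adapter has 131,072 trainable parameters against 16,777,216 for the full matrix, under 1 percent. Since only A and B are saved, the adapter file is small and independent of whichever library trained it.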
Selection heuristic: Unsloth for solo, single-GPU iteration (budget for the paid tier if you need multi-GPU, since that is gated in the open version). Axolotl for serious multi-GPU clusters and end-to-end RLHF. LLaMA-Factory for the fastest path to a first run and the widest model support. TRL when you are already in the Hugging Face ecosystem and want reference-quality trainers.
One project to handle carefully if you read older guides: torchtune went into maintenance mode in mid-2025 (bug and security fixes only). Its role is being split rather than cleanly inherited, with PyTorch-native post-training moving toward a newer, scale-focused project. Do not start new work on it.
RL and post-training
This is the most dynamic category, driven by the rise of reinforcement learning for reasoning models. The leaders in 2026 are verl and OpenRLHF for scale, with TRL as the accessible entry point.
| Framework | Maintainer | Role | Differentiation |
|---|---|---|---|
| verl | verl community (started at ByteDance) | Leading at scale | Hybrid-controller design; GRPO/PPO/DAPO and more in a few lines |
| OpenRLHF | OpenRLHF community | Scalable, production-ready | Ray plus vLLM architecture |
| TRL | Hugging Face | Most accessible | Tight Transformers/PEFT integration |
| NeMo-Aligner | NVIDIA | Enterprise / NVIDIA stack | Megatron-Core-backed alignment |
The algorithm story behind this category is GRPO (Group Relative Policy Optimization). It drops the separate value model that PPO requires and instead computes advantages from a group of sampled responses scored by a reward, often a rule-based, verifiable one. For reasoning, math, and code tasks where the answer can be checked, this is cheaper and simpler than PPO, which is why it became the dominant algorithm for training reasoning models. A rough rule of thumb: GRPO for verifiable-reward reasoning tasks, DPO for offline preference alignment with no online generation, PPO when you need a general learned reward model and can afford the infrastructure.
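The group-relative advantage at the heart of GRPO can be sketched directly: sample several responses to the same prompt, score each one, and normalize the rewards within the group. This is a minimal illustration of that one step, not a trainer. Using the population standard deviation and this particular `eps` are assumptions, and implementations differ on both.

```python
import statistics
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Advantage of each sampled response relative to its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)     # population std; an illustrative choice
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one math prompt, scored 1.0 if the checker accepts them.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# If every sample gets the same reward, the group carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

The group mean plays the role of the baseline that PPO would otherwise get from a learned value model, which is where the savings come from. The flat case also shows a practical limit: prompts that are always solved or never solved contribute zero advantage.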
Selection heuristic: verl if you are doing serious RL post-training at scale (it is the de facto leader in both research and industry). OpenRLHF as the other scalable Ray-plus-vLLM option. TRL to get started or for smaller jobs. NeMo-Aligner if you are committed to the NVIDIA and Megatron stack.
Core libraries and quantization
Underneath all of the above sit a handful of libraries that almost everything depends on. Hugging Face Transformers is the universal model definition and loading layer. PyTorch remains dominant for open-source LLM work (JAX is strong inside Google and on TPUs). Triton is the GPU-kernel language that powers torch.compile, Unsloth’s speedups, and cross-vendor kernel portability. FlashAttention is the exact-attention kernel nearly everyone relies on, with recent versions targeting Hopper and Blackwell GPUs.
For quantization, the landscape sorts cleanly by where the model runs:
| Method | Best for | Notes |
|---|---|---|
| GGUF | Local / CPU / edge | From llama.cpp, the local-inference standard |
| AWQ | GPU serving (INT4) | Has overtaken GPTQ for new work on quality and kernel speed |
| GPTQ | GPU serving (INT4), legacy | Still common, losing ground to AWQ |
| bitsandbytes | QLoRA fine-tuning | The 4-bit base-model quant during training |
| FP8 / FP4 | H100 and Blackwell serving and training | GPU-native low precision, no calibration for dynamic quant |
The trajectory: GGUF owns local and CPU, AWQ has displaced GPTQ as the default for GPU-served INT4 weights, bitsandbytes owns the QLoRA training path, and FP8 (with FP4 arriving on Blackwell) is where both serving and training are heading on current NVIDIA hardware.
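To ground what "INT4 weights" means, here is plain symmetric round-to-nearest quantization of one small weight group. It is the naive baseline, not AWQ or GPTQ, which both use calibration data to choose scales or rounding more carefully. The function names and weight values are hypothetical.

```python
from typing import List, Tuple

def quantize_int4(group: List[float]) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization of one weight group to [-7, 7]."""
    max_abs = max(abs(w) for w in group)
    scale = max_abs / 7 if max_abs > 0 else 1.0     # one scale shared by the group
    codes = [max(-7, min(7, round(w / scale))) for w in group]
    return codes, scale

def dequantize(codes: List[int], scale: float) -> List[float]:
    return [c * scale for c in codes]

weights = [0.70, -0.32, 0.11, 0.02]
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored as a 4-bit code plus a shared per-group scale, roughly a quarter of the fp16 footprint, and the reconstruction error is at most half a quantization step. The methods in the table differ mainly in how they pick scales and which weights they protect, not in this basic storage scheme.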
How to use this map
The categories outlast the specific projects. When a new framework launches, the questions are the same: which category does it sit in, what control model or architectural idea does it bet on, and is it neutral-governance or single-vendor open-core with a hosted tier? Answer those three and you can place almost anything.
A few cross-cutting points:
- The protocol layer is settling. Build tool integrations as MCP servers and agent coordination on A2A, and you are aligned with where the major labs converged.
- The minimal-framework backlash is real. If a heavy framework’s abstractions are fighting you, calling the provider SDK directly (or using a typed minimal layer) is a reasonable and increasingly common choice.
- Open-core is the norm. Read the license and the governance before you build a hard dependency. The neutral-governance projects (MCP and A2A under the Linux Foundation, Milvus, pgvector, vLLM under the PyTorch Foundation) are the ones least likely to change direction under you.
- The model side has a portable path and a locked one. vLLM, SGLang, FSDP2, and the PyTorch-native tools are the portable alternative to NVIDIA’s TensorRT-LLM and Megatron stack. NVIDIA’s tooling sets the performance ceiling, and the open stack is closing the gap.
None of this tooling decides where it runs. Most teams assemble a stack from several of these projects: an agent framework, a serving engine, a fine-tuning library for the model work. They then have to operate all of it on GPU infrastructure, with usage tracking, multi-node coordination, idle detection, and access control. That operational layer is the same regardless of which frameworks you pick, and it is the part Saturn Cloud handles, so your team can spend its time on the model and product work that differentiates you. If you are standing up infrastructure for any of the stacks above, visit Saturn Cloud to learn more.

