Inference Provider Comparison Report: The Token Factory Landscape
Executive Summary
This report analyzes the providers that serve open-weight large language models as a metered API: the “token factory” layer that turns GPU infrastructure into per-token inference. It compares them across the dimensions that determine production fit: price per million tokens, throughput and latency, deployment model (serverless vs dedicated vs self-hosted), context limits, and enterprise compliance.
Pricing across providers is hard to compare directly because each one serves a different mix of models at different prices. To make the comparison apples-to-apples, we anchor on two reference models that nearly every provider serves:
- Llama 3.3 70B Instruct as the mid-size dense reference (one of the most widely hosted open models)
- DeepSeek V4 Pro as the leading open-weight / large-MoE reference
Key Findings
- Price spread is ~9x for the same model: Llama 3.3 70B ranges from ~$0.12/M tokens (DeepInfra Turbo FP8) to ~$1.05/M (Scaleway) for the identical weights. The provider, not the model, drives most of the cost.
- Custom silicon owns the speed tier: Groq (~319 tok/s), SambaNova (~306 tok/s), and Cerebras run Llama 3.3 70B 2-3x faster than GPU-based hosts, at competitive token prices.
- Serverless vs dedicated is the real decision: per-token serverless wins for bursty/low-volume traffic; dedicated GPU endpoints ($4-12/GPU/hr) win once utilization is high and sustained.
- Self-hosting crosses over high: on reserved GPU capacity, self-hosting open models typically beats cheap per-token APIs only into the tens of millions of tokens/day per model, before counting the platform engineering it requires.
- Compliance is now table stakes at the top tier: Nebius Token Factory, Fireworks, Together, and Baseten all carry SOC 2 Type II + HIPAA with zero-retention modes; the cheapest hosts often do not.
Report Structure:
This report provides comparative tables across providers (pricing, throughput, deployment, compliance), an analysis of the serverless / dedicated / self-hosted economic crossover, then detailed profiles assessing each provider’s strengths, gaps, and best-fit use cases. Recommendations by workload type are in the “Choosing a Provider” section.
The Token Factory Landscape
A “token factory” is the operational layer that turns raw GPU compute into a metered, OpenAI-compatible API billed per token. The buyer stops managing GPUs and starts buying tokens. The provider absorbs model loading, batching, autoscaling, and hardware failure.
This is a different market from the GPU cloud (neocloud) market. Neoclouds rent you the GPU by the hour and you run the server. Token factories run the server for you and charge by the token. Many providers now do both: Nebius sells both raw H100s and a Token Factory; RunPod sells both pods and serverless inference.
The market splits along two axes that matter more than brand:
- Who runs the model (serverless shared endpoint, dedicated single-tenant endpoint, or your own self-hosted server)
- What hardware it runs on (NVIDIA/AMD GPUs vs custom inference silicon like Groq LPU, Cerebras WSE, SambaNova RDU)
Why this market exists. Open-weight models (Llama, DeepSeek, Qwen, Mistral, gpt-oss) are free to download but expensive to serve well. Getting good throughput requires vLLM, SGLang, or TensorRT-LLM, continuous batching, tensor/pipeline parallelism across GPUs, KV-cache management, and autoscaling. Token factories do that work once and amortize it across many tenants, which is why per-token prices for popular models keep falling.
Market Segmentation
| Tier | Description | Pricing Model | Best For |
|---|---|---|---|
| Neutral open-model hosts | Multi-model serverless + dedicated, vendor-neutral | Per-token + per-GPU-hour | Most teams serving open models |
| Custom-silicon hosts | Proprietary inference chips, speed-optimized | Per-token | Latency-critical, high tok/s needs |
| Cloud-native inference | Inference arm of a GPU cloud or hyperscaler | Per-token + per-GPU-hour | Teams already on that cloud |
| Serverless GPU platforms | Bring-your-own-container, scale-to-zero | Per-second GPU | Custom models, spiky traffic |
| Aggregators | Route to other providers behind one API | Passthrough + margin | Multi-model apps, failover |
Tier 1: Neutral open-model hosts
Vendor-neutral platforms that serve a broad catalog of open models via serverless API, with dedicated single-tenant endpoints for higher volume. The core of this report.
Providers: Together AI, Fireworks AI, DeepInfra, Novita, Baseten, Nebius Token Factory
Characteristics:
- OpenAI-compatible API, 50-100+ models
- Serverless per-token pricing with no idle charge
- Dedicated GPU endpoints for sustained or latency-sensitive load
- Fine-tuning and LoRA serving common
Tier 2: Custom-silicon hosts
Providers running proprietary inference accelerators instead of (or alongside) NVIDIA GPUs. They compete on tokens-per-second and time-to-first-token.
Providers: Groq (LPU), Cerebras (WSE), SambaNova (RDU)
Characteristics:
- 2-10x faster output speed than GPU hosts on the same model
- Smaller model catalog (silicon must be tuned per model)
- Strong fit for agentic loops and reasoning models where token latency compounds
Tier 3: Cloud-native inference
The managed-inference offering of a GPU cloud or hyperscaler. You get the inference API plus the surrounding cloud (storage, IAM, networking, billing).
Providers: Amazon Bedrock, Google Vertex AI, Azure AI Foundry, plus neocloud inference arms (Nebius, CoreWeave, Lambda)
Characteristics:
- Tight integration with the parent cloud’s IAM, VPC, and billing
- Enterprise procurement and compliance already in place
- Open-model token prices usually higher than pure-play hosts
Tier 4: Serverless GPU platforms
You bring a container or model and the platform runs it with scale-to-zero. Not token-metered: you pay for GPU-seconds while your worker is up.
Providers: RunPod Serverless, Modal, Replicate
Characteristics:
- Per-second GPU billing, scale to zero between requests
- Cold-start latency is the key differentiator (sub-200ms to 60s+)
- Best when you need a custom model the token factories don’t host
Tier 5: Aggregators
A single API that routes requests to other providers, picking on price, speed, or availability.
Providers: OpenRouter, Hugging Face Inference Providers
Characteristics:
- One key, one bill, many backends
- Automatic failover and price routing
- Small margin on top of the underlying provider’s price
Note: Many providers span tiers. Nebius is both a neocloud (Tier 3 raw GPUs) and a Tier 1 token factory. RunPod sells pods, serverless, and routes traffic. Groq is a custom-silicon host that also appears behind aggregators. Match the offering, not the company name, to your workload.
Reference-Model Pricing
Token prices are quoted per million tokens (1M). “Blended” uses a 3:1 input:output weighting unless noted. Figures are on-demand serverless list prices as of mid-2026 and change frequently; verify before committing.
Reference Model A: Llama 3.3 70B Instruct (mid-size dense)
One of the most widely hosted open models, served by nearly every provider. The cleanest cross-provider benchmark.
| Provider | Input ($/M) | Output ($/M) | Blended ($/M) | Output Speed | Source |
|---|---|---|---|---|---|
| DeepInfra (Turbo, FP8) | $0.10 | $0.32 | ~$0.12 | ~25 tok/s | Link |
| Novita | $0.14 | ~$0.39 | ~$0.16 | Varies | Link |
| Nebius Token Factory | $0.13 | $0.40 | ~$0.16 | Varies | Link |
| Groq | $0.59 | $0.79 | ~$0.64 | ~319 tok/s | Link |
| Together AI | $1.04 | $1.04 | $1.04 | Varies | Link |
| Fireworks AI | $0.90 | $0.90 | $0.90 | Varies | Link |
| Google Vertex AI | Contact | Contact | Varies | ~144 tok/s | Link |
| Amazon Bedrock | ~$0.72 | ~$0.72 | ~$0.72 | Varies | Link |
| Scaleway | Contact | Contact | ~$1.05 | Varies | Link |
For Llama 3.3 70B, the same weights span roughly $0.12 to $1.05 per million tokens (about 9x) depending only on who serves them. DeepInfra’s quantized “Turbo” tier is cheapest; GPU-based neutral hosts cluster around $0.90 (Fireworks) to $1.04 (Together); hyperscaler endpoints land between (Bedrock ~$0.72) and higher.
Reference Model B: DeepSeek V4 Pro (leading open-weight / large MoE)
A 1.6T-parameter mixture-of-experts model (released April 2026, MIT license), currently the #2 open-weight reasoning model on independent benchmarks. The Blended column below uses Artificial Analysis’s cache-aware measured blend (3:1 input:output weighting), which runs well below each provider’s headline list price; the List (in/out) column shows the provider’s own published per-token rate for comparison.
| Provider | Blended ($/M, AA) | List in/out ($/M) | Output Speed | Time-to-First-Token | Source |
|---|---|---|---|---|---|
| DeepSeek (official) | ~$0.50 | $0.43 / $0.87 | Varies | Varies | Link |
| GMI Cloud | ~$0.64 | $1.39 / $2.78 | ~57 tok/s | ~80s | Link |
| Fireworks AI | ~$0.79 | Contact | ~123 tok/s | ~37s | Link |
| SiliconFlow | ~$0.80 | $1.74 / $3.48 | ~56 tok/s | ~80s | Link |
| DeepInfra (FP4 mixed) | ~$0.80 | $1.30 / $2.60 | ~38 tok/s | ~1.2s | Link |
| Together AI | Contact | Contact | ~120 tok/s | ~38s | Link |
| Nebius Token Factory | ~$1.93 | $1.75 / $3.50 | Varies | Varies | Link |
The official first-party DeepSeek API is cheapest, but its privacy policy states data is processed and stored in China with no zero-retention option, which is a non-starter for many regulated buyers. Among Western hosts, GMI Cloud and Fireworks lead on price; the long time-to-first-token on this reasoning model (37-80s) reflects the reasoning tokens generated before the first answer token, not a defect, and is why caching and dedicated endpoints matter for these models.
Why the spread is so wide: quantization (FP8/FP4 vs BF16) changes both price and quality, batching policy changes throughput, and a provider serving a model at a loss to acquire users prices differently than one running at margin. A lower per-token price can mean a more aggressively quantized model. Benchmark output quality on your own prompts, not just price.
Throughput and Latency
For interactive and agentic workloads, speed often matters more than per-token price. Two numbers govern user-perceived performance:
- Time-to-first-token (TTFT): latency before the first token streams. Dominated by prompt length and queue depth.
- Output speed (tokens/sec): how fast tokens stream after the first. Determines completion time for long outputs.
| Provider | Hardware | Llama 3.3 70B Output Speed | TTFT | Notes |
|---|---|---|---|---|
| Groq | LPU | ~319 tok/s | ~0.93s | Fastest GPU-class alternative; speculative-decode variant higher |
| SambaNova | RDU | ~306 tok/s | Not documented | Large-model + agentic focus |
| Cerebras | WSE | Very high | Low | Wafer-scale, OpenAI-compatible API |
| Google Vertex | TPU/GPU | ~144 tok/s | ~0.66s | Lowest TTFT among the GPU-class hosts measured |
| CompactifAI | GPU | ~123 tok/s | ~1.17s | Compressed-model host |
| Fireworks | GPU | Fast (Multi-LoRA) | ~1.0s | Tuned serving stack |
| CoreWeave | GPU | Varies | ~0.93s | Neocloud inference |
Key observations:
- Custom silicon wins decisively on output speed. Groq, SambaNova, and Cerebras run 2-3x the tokens/sec of GPU hosts on the same model. For agentic loops (many short round-trips) and reasoning models (long chains of thought), that compounds into large wall-clock and cost-per-task differences.
- TTFT is a separate axis from speed. A provider can be fast per token but slow to start (large reasoning models commonly show 30s+ TTFT on long prompts). Prompt caching and dedicated endpoints attack TTFT directly.
- Quantization confounds the comparison. An FP8 or FP4 deployment will out-throughput a BF16 one of the same model. Always check the precision a quoted speed was measured at.
Deployment Models
The single most consequential choice is how the model runs. The same provider often offers all three.
| Model | Billing | Scales to Zero | Latency | Best For |
|---|---|---|---|---|
| Serverless (shared) | Per token | Yes (no idle cost) | Variable (shared queue) | Bursty, low/moderate volume, prototyping |
| Dedicated endpoint | Per GPU-hour | No (you pay for idle) | Consistent, sub-second | Sustained or latency-sensitive volume |
| Self-hosted | Per GPU-hour (your infra) | Your responsibility | You control | High volume, data control, custom serving |
Serverless vs Dedicated: the crossover
Serverless per-token pricing is cheapest when utilization is low: you pay nothing between requests. Dedicated endpoints cost a fixed GPU-hour rate but deliver consistent latency and a lower marginal token cost at high utilization.
The crossover is a utilization question. A dedicated 8x H100 node at, say, $20-25/hr is ~$15-18K/month whether you use it or not. Serverless at $0.16/M blended reaches that monthly spend at roughly 90-110M tokens/month. Below that, serverless is cheaper and simpler. Above it, and especially if you need predictable latency, dedicated wins.
| Provider | Dedicated GPU (on-demand) | Notes |
|---|---|---|
| Together AI | H100 ~$5.49-6.49/hr; B200 ~$9.95-11.95/hr | Reserved H100 from ~$3.99/hr; GB200 contact sales |
| Fireworks AI | H100/H200 ~$7.00/hr; B200 ~$10.00/hr | Per-GPU dedicated deployments (B300 ~$12.00/hr) |
| Baseten | H100 ~$6.50/hr | Per-minute billing, T4 through B200 |
| Nebius | H100 ~$3.85/hr (on-demand) | $2.15/hr preemptible; raw GPU or Token Factory dedicated endpoint |
| RunPod | H100 ~$1.99/hr (Community) / ~$2.89/hr (Secure) | Plus per-second serverless workers |
Serverless GPU platforms (bring-your-own-model)
When the token factories don’t host the model you need (a custom fine-tune, an unusual architecture, a multimodal pipeline), serverless GPU platforms run your container with scale-to-zero. Here cold-start latency is the differentiator.
| Provider | Billing | Cold Start | Notes |
|---|---|---|---|
| RunPod Serverless | Per second | Sub-200ms (FlashBoot, marketed) | Fast cold start; lowest per-second pod rates |
| Modal | Per second | Sub-second to a few seconds (Memory Snapshots) | Strong Python-native DX |
| Replicate | Per second + per token | 60s+ on custom models | Large community model library |
Self-Hosting Economics
The third option is to run the model yourself on rented or owned GPUs. This is where teams that already have GPU capacity (or strict data-control requirements) land.
When self-hosting beats per-token APIs
Published 2026 break-even analyses converge on a wide band, because the answer depends on the comparison model and your utilization:
- Against frontier closed APIs, crossover lands around 2-5M tokens/day on reserved GPU capacity over a 12-month window.
- Against cheap open-model token factories (which are already near cost), crossover is much higher: as a rough estimate, into the tens of millions of tokens/day per model on H100/H200 on-demand before self-hosting is cheaper on raw GPU cost alone. Published figures scatter widely with the comparison API and assumed utilization, so treat this as a range, not a fixed number.
The reason the second number is so high: token factories run at high multi-tenant utilization with an optimized serving stack. A single team self-hosting rarely keeps a GPU as busy, so its effective cost-per-token is higher until volume fills the hardware.
The hidden multiplier
Raw GPU rental is the smallest part of the self-hosting bill. Published analyses add a 3-5x multiplier for the platform engineering around it: serving-stack tuning (vLLM/SGLang/TensorRT-LLM), autoscaling, GPU-failure handling, model-update cycles, observability, and the DevOps/MLOps salaries to run all of it. A single H100 runs roughly $1,500-2,800/month on cloud depending on provider and commitment (budget specialized clouds at the low end, hyperscalers higher); the team to operate a fleet of them well is the dominant cost.
This is the same build-vs-buy gap the GPU cloud market has at the infrastructure layer, one level up. The model weights are free and the GPUs are rentable, but the inference platform between them is months of engineering that produces no competitive differentiation.
Why teams self-host anyway
- Data control: prompts and completions never leave your infrastructure (regulatory, IP, or contractual requirements).
- Custom models: fine-tunes, merged models, or architectures no token factory hosts.
- Sustained high volume: above the crossover, marginal token cost on owned/reserved GPUs is lower.
- Latency/locality: pin the model in a specific region or next to your data.
The Platform Layer: Saturn Cloud
The Inference Platform Gap
Open weights are free and GPUs are rentable, but the platform that turns them into reliable, observable, cost-allocated inference is not. Teams that self-host for data control or volume reasons build this layer in-house, or run it as a managed service on their own infrastructure.
What Self-Hosted Inference Actually Requires
Teams choosing to serve open models on their own GPU capacity (a neocloud, a reserved cluster, or on-prem) typically build the following before they have production inference:
| Capability | Purpose | Typical Build Time |
|---|---|---|
| Model Deployment & Serving | Stand up vLLM/SGLang/TensorRT-LLM endpoints with continuous batching | 2-3 months |
| Autoscaling & Scale-to-Zero | Match GPU count to traffic without paying for idle | 2-3 months |
| Usage Tracking & Cost Allocation | Per-token and per-GPU spend attributed by user, team, and project | 1-2 months |
| Idle Resource Detection | Automated shutdown of unused GPU endpoints | 1-2 months |
| Access Control & SSO | SAML/OIDC, RBAC, and per-team quotas on inference endpoints | 1-2 months |
This is operationally necessary and competitively undifferentiated: every team self-hosting inference needs the same pieces.
Saturn Cloud for Self-Hosted Inference
Saturn Cloud provides the platform layer as a managed service deployable on any Kubernetes cluster (Nebius, CoreWeave, Crusoe, or bare-metal GPUs). For teams that have decided to self-host (for data control, custom models, or sustained volume) it supplies the operational tooling so they serve tokens instead of building a serving platform.
Relevant Platform Features:
- Deploy model-serving endpoints (vLLM and similar) on GPU nodes with autoscaling
- Convert a single-GPU deployment to multi-GPU serving with tensor parallelism configured automatically
- Real-time GPU usage dashboards with tracking by user, team, and project for inference cost allocation
- Configurable idle shutdown to stop endpoints that aren’t serving traffic
- Enterprise SSO (SAML/OIDC) with role-based access control on deployments
Deployment Model:
Saturn Cloud deploys via Helm chart to an existing Kubernetes cluster. Prompts, completions, and model weights stay inside customer infrastructure: Saturn Cloud provides the control plane and user interface, not the data path.
When it fits:
Consider self-hosting with Saturn Cloud when per-token API spend is consistently above the crossover for your volume, when data-control requirements rule out third-party token factories, or when you serve custom models no factory hosts, and you would otherwise spend months building deployment, autoscaling, and cost-tracking yourself.
Enterprise Readiness
For regulated buyers, compliance and data handling gate adoption regardless of price or speed.
| Provider | Compliance | Zero-Retention | Data Residency | Source |
|---|---|---|---|---|
| Nebius Token Factory | SOC 2 Type II, HIPAA, ISO 27001 | Yes | US, Finland, France | Link |
| Fireworks AI | SOC 2 Type II, HIPAA, ISO 27001 | Yes (default) | US, EU, APAC | Link |
| Together AI | SOC 2 Type II, HIPAA | Customer-controlled | US | Link |
| Baseten | SOC 2 Type II, HIPAA | Yes (default) | US, EU, UK, Australia | Link |
| Amazon Bedrock | Inherits AWS (SOC, HIPAA, FedRAMP, etc.) | Yes | All AWS regions | Link |
| Google Vertex AI | Inherits GCP | Yes | All GCP regions | Link |
| Azure AI Foundry | Inherits Azure | Yes | All Azure regions | Link |
| DeepInfra | SOC 2, ISO 27001 | Yes (0-retention policy) | US | Link |
| Groq | Enterprise terms available | Documented for enterprise | US, EU, Middle East, APAC | Link |
| DeepSeek (official) | Not documented for Western buyers | No | China | Link |
Key observations:
- The top neutral hosts now match hyperscaler compliance. Nebius, Fireworks, Together, and Baseten carry SOC 2 (often Type II) plus HIPAA with zero-retention modes, which removes the historical reason to default to Bedrock/Vertex for regulated work.
- Hyperscaler inference inherits the parent cloud’s compliance, which is its main advantage and why enterprises already on AWS/GCP/Azure often start there despite higher open-model token prices.
- First-party DeepSeek pricing comes with jurisdictional cost. DeepSeek’s own privacy policy states data is processed and stored in China with no zero-retention option; Western hosts re-serving the same open weights are the compliant path.
Choosing a Provider
Provider selection should follow the workload, not the brand. The recommendations below categorize by primary use case.
Lowest cost, moderate volume
Recommended: DeepInfra, Novita, Nebius Token Factory
- Cheapest per-token serverless on popular open models ($0.12-0.16/M for Llama 3.3 70B)
- Serverless with no idle charge fits bursty traffic
- Verify quantization level and benchmark output quality on your prompts before committing
Lowest latency / highest throughput
Recommended: Groq, Cerebras, SambaNova
- Custom silicon delivers 2-3x the output speed of GPU hosts on the same model
- Best for agentic loops and reasoning models where token latency compounds
- Smaller catalogs; confirm the model you need is supported
Production open-model serving with compliance
Recommended: Nebius Token Factory, Fireworks AI, Together AI, Baseten
- SOC 2 Type II + HIPAA with zero-retention, plus dedicated endpoints for consistent latency
- Fine-tuning and LoRA serving for custom variants
- Nebius: broadest compliance (SOC 2 + HIPAA + ISO 27001), EU + US residency, and the same vendor sells raw GPUs if you later self-host
- Fireworks: fastest tuned GPU serving stack, Multi-LoRA, zero-retention by default
- Together: broad catalog plus operated GPU clusters for dedicated/reserved capacity
- Baseten: per-minute dedicated GPU billing, strong fit for regulated single-model production
Already on a hyperscaler
Recommended: Amazon Bedrock, Google Vertex AI, Azure AI Foundry
- Inherits the parent cloud’s IAM, VPC, billing, and compliance
- Higher open-model token prices, offset by zero new-vendor procurement
- Prompt caching and batch (often 50% off) materially cut the effective rate
Custom models or spiky traffic
Recommended: RunPod Serverless, Modal, Replicate
- Bring-your-own-container with scale-to-zero when no token factory hosts your model
- RunPod: fastest cold starts (sub-200ms) and lowest per-second GPU rates
- Modal: best Python-native developer experience
- Replicate: largest community model library for fast experimentation
High volume or strict data control (self-host)
Recommended: Self-host on a neocloud (Nebius, CoreWeave, Crusoe) with a platform layer
- Into the tens of millions of tokens/day per model, or when prompts cannot leave your infrastructure
- Pair reserved GPU capacity with Saturn Cloud (or equivalent) for deployment, autoscaling, and cost allocation rather than building it
- See the companion GPU Cloud Comparison Report for choosing the underlying GPU provider
Provider Profiles
Each profile covers what the provider offers, strengths, gaps, and best-fit use cases.
Nebius Token Factory

Overview
Nebius (spun off from Yandex N.V. in 2024) operates both a GPU neocloud and Token Factory, its managed open-model inference platform. It serves 60+ open models (DeepSeek, Llama, Qwen, Mistral, gpt-oss, NVIDIA Nemotron) via an OpenAI-compatible API, with serverless per-token pricing, dedicated single-tenant endpoints, fine-tuning, and a RAG/embeddings stack.
What it offers
- Serverless inference across 60+ models with production SLAs
- Dedicated endpoints with 99.9% uptime, custom autoscaling, sub-second latency for sustained load
- Post-training/fine-tuning with transparent per-token pricing on the resulting custom model
- Data Lab (curate training sets from production logs) and embedding/RAG tooling
Strengths
- Broadest compliance among neutral hosts: SOC 2 Type II, HIPAA, ISO 27001, with zero-retention mode
- EU + US data residency (Finland, France, US), with DPAs for enterprise
- Same vendor sells raw GPUs, so a team can move from API to self-hosted without changing providers
- Input/output price separation and volume discounts; batch API at ~50% off
Gaps
- Per-token prices on the largest models (e.g. DeepSeek V4 Pro) run higher than the cheapest hosts
- Inference platform is newer than the underlying GPU cloud
- Throughput leadership belongs to custom-silicon hosts, not GPU-based serving
Best for: Regulated teams serving open models who want compliance, EU/US residency, and a path to self-hosting with the same vendor.
Fireworks AI

Overview
Fireworks AI is a neutral open-model host known for an aggressively optimized serving stack and Multi-LoRA support (serving many fine-tuned adapters off one base model). It offers serverless per-token inference plus dedicated GPU deployments (H100/H200 ~$7.00/hr, B200 ~$10.00/hr, B300 ~$12.00/hr).
Strengths
- Among the fastest GPU-based serving stacks; competitive output speed on large models
- Multi-LoRA: serve many fine-tunes cheaply against a shared base
- SOC 2 Type II, HIPAA, zero data retention by default
- Serverless ($0.10-0.90/M depending on model size) plus dedicated for sustained load
Gaps
- Per-token prices on small/mid models higher than the cheapest quantized hosts
- Catalog focused on popular models rather than breadth
- Dedicated GPU deployments are priced at a premium ($7-12/hr) vs raw neocloud rates
Best for: Production teams serving fine-tuned open models who need speed, Multi-LoRA, and compliance.
Together AI

Overview
Together AI is a broad-catalog neutral host that also builds and operates GPU clusters on NVIDIA Cloud Partner reference architectures (H100, H200, B200, GB200 with InfiniBand). It spans serverless per-token inference, dedicated endpoints, fine-tuning, batch inference, and raw GPU cluster rental, making it a one-stop path from prototype to dedicated capacity.
Strengths
- Large model catalog with serverless and dedicated options
- Operates GPU clusters: dedicated H100 ~$5.49-6.49/hr on-demand, from ~$3.99/hr reserved
- Fine-tuning integrated with serving (no weight migration between services)
- Batch inference (~50% off) and a Startup Accelerator (up to $50K credits)
- SOC 2 Type II and HIPAA with customer-controlled data
Gaps
- Mid-model serverless pricing (~$1.04/M for Llama 3.3 70B) above the cheapest hosts
- No free tier; $5 minimum credit
- Throughput trails custom-silicon hosts
Best for: Teams that want one vendor across serverless, dedicated, fine-tuning, and raw GPU clusters.
DeepInfra

Overview
DeepInfra is a price-leader serverless host. Its “Turbo” FP8 and FP4 tiers deliver some of the lowest per-token prices in the market (Llama 3.3 70B around $0.12/M blended), with an OpenAI-compatible API and cached-token discounts.
Strengths
- Lowest or near-lowest per-token pricing on popular open models
- FP8/FP4 quantized tiers for further savings
- Cached-token rate cuts cost for repeated system prompts and tool definitions
- Simple OpenAI-compatible drop-in
- SOC 2 and ISO 27001 certified, with a documented zero-retention policy
Gaps
- Aggressive quantization can affect output quality; benchmark on your prompts
- Fewer dedicated/enterprise features than Nebius/Fireworks/Together
- US-only data residency (no EU/regional option)
Best for: Cost-sensitive, moderate-volume workloads where you’ve validated quality at the quantization tier offered.
Groq

Overview
Groq builds the LPU, a custom inference accelerator, and serves open models at output speeds GPU hosts can’t match (~319 tok/s on Llama 3.3 70B, far higher with speculative decoding). In December 2025, NVIDIA acquired Groq’s assets and licensed its technology in a reported ~$20B deal (an asset purchase plus acquihire of senior leadership, with Groq remaining nominally independent), the landmark inference-silicon deal of late 2025.
Strengths
- Fastest GPU-class output speed among widely available hosts
- Low time-to-first-token; ideal for agentic loops and reasoning models
- Competitive per-token pricing despite the speed ($0.59 in / $0.79 out for Llama 3.3 70B)
- OpenAI-compatible API
Gaps
- Smaller model catalog (each model must be tuned to the silicon)
- US-headquartered, with a growing international footprint (Europe, Middle East, APAC) rather than full regional coverage
- Post-acquisition roadmap and pricing may shift following the NVIDIA deal
Best for: Latency-critical and agentic workloads where tokens/sec and TTFT drive cost-per-task.
Cerebras & SambaNova

Overview
Cerebras (wafer-scale WSE) and SambaNova (RDU) are the other custom-silicon inference hosts. Cerebras offers an OpenAI-compatible API and signed a multi-year, 750MW staged compute agreement with OpenAI in January 2026 (reported at $10B+; the megawatt total and timeline are company-stated, the dollar figure is press-attributed). SambaNova’s SN50 (announced Feb 2026, shipping H2 2026) targets agentic workloads and very large models (claimed support up to 10T parameters, 10M context).
Strengths
- Extreme output speed, competitive with Groq on supported models (Cerebras leads on several)
- SambaNova positions for large models (405B+); the 10M-token context target is forward-looking with the SN50
- Cerebras OpenAI-compatible API eases adoption
Gaps
- Catalogs narrower than GPU hosts
- SambaNova’s public enterprise compliance documentation is thin; Cerebras reportedly holds SOC 2 and HIPAA
- Shipped long-context on the largest models has historically been capped below the headline numbers
- Availability and pricing evolving rapidly
Best for: Teams whose workload is dominated by a supported large or reasoning model and where speed is the priority.
Baseten

Overview
Baseten is a horizontal multi-model inference platform offering both serverless Model APIs and dedicated deployments with per-minute GPU billing (T4 through B200; H100 ~$6.50/hr) and a strong compliance posture. It serves LLMs, image, transcription, TTS, and embedding models across dev-tools, enterprise, and healthcare customers.
Strengths
- Per-minute dedicated GPU billing; pay only while the model runs
- SOC 2 Type II, HIPAA, with no input/output retention by default
- Wide GPU selection (T4, L4, A10G, A100, H100, H100 MIG, H200, B200)
- Regional environments in US, EU, UK, and Australia for data residency
- Strong production tooling for high-performance serving
Gaps
- Dedicated-GPU pricing is higher than per-token serverless for low volume
- Less of a broad open-LLM serverless catalog than Together/Fireworks
Best for: Teams running specific models in production who want dedicated capacity with compliance, regional residency, and per-minute billing.
Amazon Bedrock / Google Vertex AI / Azure AI Foundry

Overview
The hyperscalers all offer managed inference for open models (Llama and others) alongside their first-party and partner models. Bedrock, Vertex AI, and Azure AI Foundry differ mostly in catalog and pricing detail but share the same core value: inherit the parent cloud’s IAM, networking, billing, and compliance.
Strengths
- Compliance and procurement already in place for existing customers
- Prompt caching (up to ~90% savings) and batch (often 50% off) cut effective rates
- Provisioned-throughput / reserved capacity for predictable high volume (~20-45% off, varying by commitment length and utilization)
- Open models sit next to first-party models behind one API and one bill
Gaps
- Open-model per-token prices typically above pure-play hosts
- Less aggressive on the newest open models than neutral hosts
- Throughput trails custom-silicon hosts
Best for: Enterprises already standardized on AWS/GCP/Azure that value zero new-vendor procurement over the lowest token price.
RunPod / Modal / Replicate (Serverless GPU)

Overview
These platforms run your container or model on GPUs with per-second billing and scale-to-zero, rather than metering tokens. They’re the answer when no token factory hosts the model you need. Cold-start latency is the main differentiator.
Strengths
- Run arbitrary models/containers, not just a fixed catalog
- Scale-to-zero: pay only for GPU-seconds in use
- RunPod: fastest cold starts (sub-200ms on ~48% of starts), lowest per-second GPU rates
- Modal: Python-native DX, FlashBoot snapshot cold starts (~5-25s)
- Replicate: largest community model library, fast to experiment
Gaps
- You own the serving stack, batching, and optimization
- Cold starts can be 60s+ for large custom models (Replicate)
- Per-token economics worse than token factories for high steady volume
Best for: Custom models and spiky traffic where a fixed token-factory catalog doesn’t fit.
OpenRouter (Aggregator)

Overview
OpenRouter routes one OpenAI-compatible API to 300+ models across dozens of underlying providers, picking on price, speed, or availability, with one key and one bill, and passes through provider pricing with no inference markup. Hugging Face Inference Providers offers a similar routing layer. Aggregators are how multi-model apps get failover and price routing without integrating each provider directly.
Strengths
- One integration, many backends; automatic failover and price routing
- Fast access to new models as providers add them
- Useful for benchmarking providers against each other on real traffic
Gaps
- Fees apply on credit purchases and on BYOK usage, though inference itself is passed through at the underlying provider’s price with no markup
- Compliance and data handling depend on whichever backend serves the request
- Less control over exactly which provider/quantization serves a given call
Best for: Multi-model applications wanting one API, failover, and price routing across providers.
Last updated: June 2026. Token prices, model availability, and provider features change frequently. Llama 3.3 70B figures are on-demand serverless list prices; DeepSeek V4 Pro blended figures are Artificial Analysis measured blends (which run below providers' headline list rates, also shown). All prices depend on quantization. Verify current offerings and benchmark output quality on your own prompts before making decisions.