A fintech startup called their API bill a rounding error when they launched with GPT-4 in 2024. Twelve months later, they were spending $47,000 a month on OpenAI tokens. Not because their product was wildly successful. Because every request — from a simple "summarise this transaction" to a complex multi-step compliance check — hit the same frontier model at the same price. They had no routing. No caching. No fallback. No visibility into which requests actually needed frontier-level reasoning and which could have been handled by a model costing one-twentieth as much.
They are not unusual. They are the default. According to Menlo Ventures, enterprise spending on foundation model APIs reached $12.5 billion in 2025 — up 2x from 2024. Gartner projects that more than 80% of enterprises will have deployed generative AI applications by 2026. Yet most of these deployments talk directly to a single provider’s API with no abstraction layer, no routing logic, and no cost controls beyond a prayer that the monthly invoice stays under budget.
The fix is not a better model. It is a gateway.
What an LLM Gateway Actually Does
An LLM gateway sits between your application code and the model providers. It intercepts every LLM API call and makes decisions about where to send it, whether to cache the response, how to handle failures, and what to log. Think of it as the reverse proxy layer that web applications have had since the 1990s — NGINX for language models.
Without a gateway, your application code contains hardcoded provider SDK calls scattered across dozens of files. Switching from OpenAI to Anthropic means rewriting every callsite. Adding a fallback provider means wrapping every call in try-catch logic. Implementing cost controls means building your own token counting, rate limiting, and budget enforcement. Adding observability means instrumenting every request manually. Each of these is a solved problem. But without a gateway, you solve each one independently, in every service, maintained by whoever happened to write that particular integration.
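The core abstraction is simple: every callsite talks to one client, and the provider behind it is configuration, not code. A minimal sketch of that shape, with stub functions standing in for provider SDKs (all names here are hypothetical; real gateways expose this as an OpenAI-compatible HTTP endpoint rather than in-process functions):

```python
# Sketch of the abstraction a gateway provides: every callsite goes through
# one client, and the provider behind it is a config value, not code.

def call_openai(prompt: str) -> str:      # stand-in for a provider SDK call
    return f"openai:{prompt}"

def call_anthropic(prompt: str) -> str:   # stand-in for another provider
    return f"anthropic:{prompt}"

PROVIDERS = {"openai": call_openai, "anthropic": call_anthropic}

class GatewayClient:
    def __init__(self, config: dict):
        self.config = config

    def complete(self, prompt: str) -> str:
        # The provider choice lives in config -- swapping providers is a
        # one-line config change, not a rewrite of every callsite.
        provider = PROVIDERS[self.config["provider"]]
        return provider(prompt)
```

Switching providers becomes `{"provider": "anthropic"}` instead of a find-and-replace across dozens of files.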
The Five Jobs of a Gateway
What a production LLM gateway handles
1. Unified API. Your application code calls the gateway. The gateway translates to OpenAI, Anthropic, Google, Mistral, Bedrock, Azure, or any other provider. Swap providers by changing a config, not rewriting code. Portkey supports 250+ models through a single API. OpenRouter exposes 623+ models. LiteLLM provides an OpenAI-compatible interface that routes to over 100 providers.
2. Intelligent routing. Not every request needs GPT-4o or Claude Opus. A classification task, a simple extraction, a summary — these can run on models that cost 5–20x less with negligible quality loss. RouterBench research demonstrates that multi-LLM routers can match or exceed single-model quality while cutting average inference cost. SciForce found that hybrid routing achieves 37–46% reduction in LLM usage by reserving frontier models for complex tasks only.
3. Semantic caching. The gateway recognises when incoming prompts mean the same thing as previous ones — even when worded differently — and returns cached responses instead of making a new API call. Cache hits return in approximately 5ms versus 2,000ms+ for a fresh LLM call. Redis LangCache reports up to 73% cost reduction in high-repetition workloads. Production teams using semantic caching with OpenAI, Claude, and Gemini APIs see cache hit rates exceeding 80%.
4. Failover and reliability. OpenAI has had multiple significant outages. Anthropic has had rate limit storms. Google has had capacity constraints. If your production system calls one provider with no fallback, you inherit their availability as your SLA. A gateway routes around failures automatically — primary to fallback in milliseconds, with retry logic, circuit breaking, and health checks built in.
5. Observability and cost controls. Token counts per request, cost per user, cost per feature, latency percentiles, error rates, model quality scores — all in one dashboard. Without a gateway, this data is scattered across provider consoles, application logs, and billing pages. With a gateway, every request is instrumented automatically. Portkey, Helicone, and Braintrust lead on AI-specific observability. Budget enforcement, RBAC, and spend alerts prevent the $47,000 surprise.
The 2026 Gateway Landscape: Who Does What
Five gateways dominate production deployments in 2026. Each occupies a different niche. Choosing between them is less about which is "best" and more about which tradeoffs match your constraints.
Bifrost: Raw Speed, Open Source
Built in Go by the Maxim team. In sustained benchmarks at 5,000 requests per second, Bifrost adds 11 microseconds of gateway overhead per request. Not milliseconds. Microseconds. At 500 RPS on AWS t3.xlarge instances, Bifrost maintains a P99 latency of 520ms while LiteLLM reaches 28,000ms. At 1,000 RPS, Bifrost remains stable while LiteLLM crashes from memory exhaustion. It supports 1,000+ models, includes adaptive load balancing and cluster mode, and ships with guardrails and fallback chains. Open source under Apache 2.0.
The tradeoff: Bifrost is relatively new. The ecosystem is smaller than LiteLLM’s. Enterprise governance features like RBAC, SSO, and hierarchical budgets are still maturing. If you need raw throughput and want to self-host, Bifrost is the choice. If you need enterprise compliance out of the box, look elsewhere.
Portkey: Enterprise Governance, Managed
Portkey raised a $15M Series A in February 2026 and has positioned itself as the enterprise-grade managed gateway. 250+ models, 50+ guardrails, prompt management with versioning, FinOps dashboard, RBAC, SSO, SOC 2 and HIPAA compliance, and an MCP Gateway that shipped as GA in January 2026 for agent workflows. The unified API is clean. The observability is comprehensive — logs, traces, and cost attribution per request.
The tradeoff: managed means vendor dependency. Pricing scales with usage. Self-hosting is limited compared to LiteLLM or Bifrost. If your compliance team requires a managed vendor with certifications and your team does not want to operate infrastructure, Portkey is the path of least resistance.
LiteLLM: Self-Hosted Flexibility, Python
LiteLLM is the most widely adopted open-source gateway. OpenAI-compatible interface routing to 100+ providers. Battle-tested in production by thousands of teams. The community is large, the documentation is deep, and the plugin ecosystem covers most use cases. It is the default choice for teams that want full control and are comfortable operating Python infrastructure.
The tradeoff: Python. At high concurrency, LiteLLM’s performance degrades significantly compared to Go-based alternatives. Kong’s benchmark showed LiteLLM with 86% higher latency and a fraction of the throughput (Kong reports an 859% throughput advantage for its own gateway) under identical conditions. Enterprise governance — RBAC, workspaces, budgets, audit logs — requires additional tooling. If your traffic is moderate and your team knows Python, LiteLLM is excellent. If you are pushing thousands of RPS, benchmark carefully.
Kong AI Gateway: API Management Extended
Kong took their existing API gateway — already handling trillions of API transactions — and added an AI layer. The result inherits Kong’s mature plugin ecosystem, enterprise features, and operational tooling. Their own benchmark showed 65% lower latency than Portkey and 86% lower latency than LiteLLM in proxy mode. Semantic caching and token-aware rate limiting are built in.
The tradeoff: complexity. Kong is a full API management platform. If you already run Kong for your non-AI APIs, adding AI routing is a natural extension. If you do not, you are adopting a heavyweight platform for a single use case. Pricing reflects the enterprise positioning.
Cloudflare AI Gateway: Edge Network Advantage
Cloudflare’s gateway runs on their global edge network — 300+ data centres worldwide. For applications where geographic latency matters, this is a structural advantage no other gateway can match. Setup is minimal if you are already on Cloudflare. Basic caching, rate limiting, and analytics are included.
The tradeoff: feature depth. No semantic caching. No MCP support. Limited routing intelligence. 10–50ms of proxy latency per request. SaaS-only — no self-hosting option. Cloudflare AI Gateway is a good starting point for teams already on Cloudflare who need basic observability and caching. It is not sufficient for teams that need intelligent routing, semantic caching, or multi-model orchestration.
| Gateway | Language | Hosting | Overhead at Scale | Best For |
|---|---|---|---|---|
| Bifrost | Go | Self-hosted (open source) | 11µs at 5K RPS | High-throughput, latency-sensitive workloads |
| Portkey | N/A (managed) | Managed + limited self-host | Mid-range | Enterprise compliance, teams wanting managed ops |
| LiteLLM | Python | Self-hosted (open source) | Degrades at >500 RPS | Flexibility, Python teams, moderate traffic |
| Kong AI | Lua/Go | Self-hosted + managed | 65% lower than Portkey | Teams already on Kong, full API management |
| Cloudflare | N/A (managed) | SaaS only (edge) | 10–50ms proxy latency | Global edge distribution, Cloudflare users |
Routing Strategies That Actually Work
A gateway without routing logic is just a proxy. The value comes from sending the right request to the right model. Here are the four routing strategies production teams use, ordered from simplest to most sophisticated.
Rule-Based Routing: The Starting Point
Map request types to models using explicit rules. Summarisation goes to Claude Haiku. Code generation goes to Claude Sonnet. Complex reasoning goes to Claude Opus or GPT-4o. Simple classification goes to Mistral or Llama. This requires knowing your workload distribution upfront, but it is deterministic, debuggable, and effective. Most teams start here and it covers 80% of the value.
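In practice this is little more than a lookup table kept in version control. A minimal sketch, with illustrative task labels and model names:

```python
# Rule-based routing: an explicit, auditable mapping from request type to
# model. Task labels and model names are illustrative, not prescriptive.
ROUTING_RULES = {
    "summarise":      "claude-haiku",
    "code":           "claude-sonnet",
    "reasoning":      "claude-opus",
    "classification": "mistral-small",
}

def route(task_type: str, default: str = "claude-sonnet") -> str:
    # Unknown task types fall through to a safe mid-tier default.
    return ROUTING_RULES.get(task_type, default)
```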
Cost-Optimised Routing: The Budget Controller
Set a cost ceiling per request or per user and let the gateway pick the cheapest model that meets a quality threshold. If the task can be handled by a model at $0.25 per million tokens, do not send it to a model at $15 per million tokens. This requires defining quality thresholds — usually through evaluation scores on representative samples — but once calibrated, it runs autonomously.
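Once evaluation scores exist, the selection itself is a one-liner: filter by the quality floor, then take the cheapest survivor. A sketch with made-up prices and scores:

```python
# Cost-optimised selection: pick the cheapest model whose evaluation score
# clears a quality floor. Prices ($/1M tokens) and scores are illustrative.
MODELS = [
    {"name": "small",    "price": 0.25,  "eval_score": 0.81},
    {"name": "mid",      "price": 3.00,  "eval_score": 0.88},
    {"name": "frontier", "price": 15.00, "eval_score": 0.95},
]

def cheapest_above(threshold: float) -> str:
    candidates = [m for m in MODELS if m["eval_score"] >= threshold]
    if not candidates:
        raise ValueError("no model meets the quality floor")
    return min(candidates, key=lambda m: m["price"])["name"]
```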
Latency-Based Routing: The Performance Optimizer
For real-time applications — chatbots, autocomplete, inline suggestions — latency matters more than cost. Route to whichever model and provider currently has the lowest response time. This requires real-time latency tracking across providers, which every gateway in this comparison supports. Combine with geographic routing (Cloudflare’s edge advantage) for global applications.
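One common implementation, sketched here under the assumption of an exponential moving average per provider (gateways differ in how they smooth latency data):

```python
# Latency-aware routing: track an exponential moving average of response
# time per provider and send each request to the current fastest one.
class LatencyRouter:
    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha                  # smoothing factor for the EMA
        self.ema: dict[str, float] = {}

    def record(self, provider: str, latency_ms: float) -> None:
        prev = self.ema.get(provider, latency_ms)
        self.ema[provider] = self.alpha * latency_ms + (1 - self.alpha) * prev

    def pick(self) -> str:
        # Route to whichever provider currently has the lowest EMA latency.
        return min(self.ema, key=self.ema.get)
```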
Cascading Routing: The Escalation Pattern
Start with the cheapest model. If the response fails a quality check — confidence score below threshold, format validation failure, safety filter trigger — escalate to the next tier. This is the most cost-effective strategy for diverse workloads. 70–80% of requests resolve at the cheapest tier. Only 5–10% escalate to frontier models. The downside is added latency on escalated requests and the need for reliable quality checks at each tier.
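The escalation loop itself is short; the hard part is the quality check at each tier. A sketch, assuming three illustrative tiers and a placeholder check (a real one would be a confidence score, schema validation, or safety filter):

```python
# Cascading escalation: try tiers cheapest-first, escalate when the
# response fails a quality check. Tier names and the check are illustrative.
TIERS = ["cheap", "mid", "frontier"]

def quality_ok(response: str) -> bool:
    # Placeholder check; substitute confidence scoring or format validation.
    return "LOW_CONFIDENCE" not in response

def cascade(prompt: str, call_model) -> tuple[str, str]:
    for tier in TIERS:
        response = call_model(tier, prompt)
        if quality_ok(response):
            return tier, response           # resolved at this tier
    # Every tier failed the check; surface the frontier answer anyway.
    return TIERS[-1], response
```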
Semantic Caching: The Highest-ROI Gateway Feature
Traditional caching matches exact strings. Semantic caching matches meaning. "What is the capital of France?" and "Tell me France’s capital city" hit the same cache entry because the gateway embeds both queries into vector space and recognises them as semantically equivalent. The impact on cost and latency is dramatic.
How Semantic Caching Works
When a request arrives, the gateway generates an embedding of the prompt using a lightweight embedding model. It performs a vector similarity search against previously cached request-response pairs. If the similarity exceeds a configurable threshold — typically 0.92–0.98 depending on the use case — it returns the cached response without calling the LLM. Cache misses that require embedding generation plus vector search take approximately 60ms (50ms embedding, 10ms search) versus 2,000ms+ for a full provider round-trip.
The similarity threshold is the critical tuning parameter. Too low and you serve incorrect cached responses for semantically different queries. Too high and the cache never hits. Customer support workloads with repetitive queries tolerate lower thresholds (0.92–0.95). Code generation workloads where precision matters need higher thresholds (0.97–0.99) or should skip semantic caching entirely and use exact-match caching on normalised prompts.
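The lookup reduces to embed, search, compare against the threshold. A toy sketch using a bag-of-words vector as a stand-in for a real embedding model (production systems use a proper embedding model and a vector index; the threshold here is illustrative):

```python
import math

# Toy semantic cache: embed prompts, then serve the cached response when
# cosine similarity to a previous prompt exceeds the threshold.
def embed(text: str) -> dict[str, int]:
    # Bag-of-words stand-in for a real embedding model.
    words = text.lower().replace("?", "").split()
    return {w: words.count(w) for w in set(words)}

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[dict, str]] = []

    def get(self, prompt: str):
        query = embed(prompt)
        for vec, response in self.entries:
            if cosine(query, vec) >= self.threshold:
                return response             # cache hit: no LLM call made
        return None                         # cache miss: caller invokes the LLM

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))
```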
Semantic caching pays off where queries repeat:
- Customer support: high query repetition, tolerance for approximate matches, massive cost savings
- Internal knowledge bases: employees asking similar questions about policies, processes, and tools
- Classification and extraction: same document types processed repeatedly with similar prompts
- Content generation with templates: product descriptions, email drafts, summaries with consistent structure
It backfires where precision, freshness, or personalisation matter:
- Code generation: subtle prompt differences produce vastly different code — exact caching only
- Agentic workflows: each step depends on prior context that changes across runs
- Personalised responses: user-specific data means cached responses are wrong for other users
- Time-sensitive queries: "What happened today?" must not return yesterday’s cached answer
The MCP Gateway Convergence
Model Context Protocol introduced a standard for how AI agents communicate with external tools and data sources. In January 2026, Portkey shipped an MCP Gateway as generally available. Bifrost added MCP operations with sub-3ms latency overhead. This convergence matters because agents need both LLM routing and tool routing — and the gateway is the natural place to handle both.
An agentic workflow that plans with Claude Opus, executes code with Claude Sonnet, queries a database through an MCP server, calls a third-party API through another MCP server, and synthesises results with GPT-4o needs a routing layer that understands both LLM calls and tool calls. Without a unified gateway, you build separate infrastructure for each — an LLM router, an MCP proxy, a tool registry, and custom glue code to connect them.
The gateways that win in 2026 will be the ones that unify LLM routing, MCP tool routing, observability, and cost controls into a single layer. Portkey and Bifrost are ahead. Kong is extending from the API management side. Cloudflare and LiteLLM are behind on MCP integration.
“The AI gateway is becoming what the API gateway was in 2015 — optional infrastructure that everyone ignores until their production system falls over at 3am and they realise they have no fallback, no observability, and no idea which model call caused the cascade.”
How to Add a Gateway Without Rewriting Your Application
Production gateway adoption in five steps
Step 1: Audit your usage. Before choosing a gateway, understand your workload. How many LLM calls per day? What is the distribution of request types? Which calls need frontier models and which could run on cheaper alternatives? What is your current monthly spend? Most teams are shocked by the audit results. The "simple classification" endpoint that nobody thought about is responsible for 40% of the token spend.
Step 2: Deploy in pass-through mode. Deploy the gateway as a transparent proxy first. No routing changes. No caching. Just intercept every request, log it, and forward it to the existing provider. This gives you observability — cost per endpoint, latency distribution, error rates — without changing any behaviour. Run this for two weeks to build a baseline.
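Pass-through mode can be as thin as a wrapper that forwards unchanged and records metrics. A sketch, with the provider callable and token counting as stand-ins (real gateways use the provider's reported token counts, not whitespace splitting):

```python
import time

# Pass-through instrumentation: forward every call unchanged, but record
# latency and a token estimate so a cost baseline exists before any
# routing or caching changes are made.
def instrumented(provider_call, log: list):
    def wrapper(prompt: str):
        start = time.perf_counter()
        response = provider_call(prompt)
        log.append({
            "latency_ms": (time.perf_counter() - start) * 1000,
            "prompt_tokens": len(prompt.split()),  # crude tokenizer stand-in
        })
        return response                     # behaviour is unchanged: observe only
    return wrapper
```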
Step 3: Enable semantic caching. Identify the endpoints with the highest request repetition. Customer support, FAQ, classification, and extraction endpoints typically have 40–80% semantic overlap. Enable semantic caching for these endpoints first. Measure the cache hit rate and cost reduction. Most teams see 30–50% savings in the first week.
Step 4: Introduce model routing. Route simple tasks to cheap models. Summarisation, classification, extraction, and formatting tasks rarely need frontier models. Build an evaluation set of 50–100 representative requests per task type. Test each candidate model against the evaluation set. If the cheaper model scores within 5% of the frontier model on your quality metrics, route to it. This step typically captures another 20–40% cost reduction.
Step 5: Add failover and budget controls. Configure provider failover: primary to secondary in under 500ms. Set per-team, per-feature, and per-user budget limits. Enable spend alerts at 70% and 90% of budget thresholds. This prevents the $47,000 surprise and ensures your production system survives the next provider outage.
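The budget-alert half of this step is simple bookkeeping. A sketch of a per-key tracker that fires at the 70% and 90% thresholds suggested above (class and field names are illustrative):

```python
# Budget enforcement sketch: accumulate spend for one key (a team, feature,
# or user) and fire alerts at 70% and 90% of the limit.
class BudgetTracker:
    def __init__(self, limit_usd: float):
        self.limit = limit_usd
        self.spent = 0.0
        self.alerts: list[str] = []

    def record(self, cost_usd: float) -> bool:
        """Record spend; return False once the budget is exhausted."""
        self.spent += cost_usd
        for pct in (0.7, 0.9):
            label = f"{int(pct * 100)}%"
            if self.spent >= self.limit * pct and label not in self.alerts:
                self.alerts.append(label)   # fire each alert exactly once
        return self.spent < self.limit
```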
What Could Go Wrong
Gateway adoption is not risk-free. Here are the failure modes we see in production deployments.
Gateway as a single point of failure
If the gateway goes down, every LLM call fails. This is the same risk as any reverse proxy, and the mitigation is the same: redundancy, health checks, and a circuit breaker that bypasses the gateway and calls providers directly if the gateway is unhealthy. Not every gateway supports this pattern natively. Ask before you adopt.
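The bypass pattern looks like a standard circuit breaker with the direct provider call as the open-state path. A sketch under the assumption that a direct provider client is kept configured alongside the gateway (thresholds and callables are illustrative):

```python
# Circuit breaker for the gateway-as-single-point-of-failure risk: after N
# consecutive gateway failures, call the provider directly until the
# gateway recovers.
class GatewayBreaker:
    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, via_gateway, direct, prompt: str):
        if self.failures >= self.max_failures:
            return direct(prompt)           # breaker open: bypass the gateway
        try:
            response = via_gateway(prompt)
            self.failures = 0               # a success closes the breaker
            return response
        except Exception:
            self.failures += 1
            return direct(prompt)           # fall back for this request too
```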
Semantic cache poisoning
If an incorrect response gets cached, every semantically similar query returns the wrong answer until the cache entry expires or is manually invalidated. This is particularly dangerous for factual queries and compliance-sensitive responses. Mitigation: short TTLs for critical endpoints, quality-score-gated caching where only high-confidence responses enter the cache, and manual cache invalidation APIs.
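Those three mitigations compose naturally into one cache policy: a confidence gate on writes, a TTL on reads, and an explicit invalidation call. A sketch with illustrative names and thresholds:

```python
import time

# Quality-gated cache writes: only responses above a confidence floor enter
# the cache, every entry carries a TTL, and entries can be invalidated
# manually.
class GatedCache:
    def __init__(self, min_confidence: float = 0.9, ttl_s: float = 300.0):
        self.min_confidence = min_confidence
        self.ttl_s = ttl_s
        self.store: dict[str, tuple[str, float]] = {}

    def put(self, key: str, response: str, confidence: float) -> bool:
        if confidence < self.min_confidence:
            return False                    # low-confidence answers never cached
        self.store[key] = (response, time.monotonic() + self.ttl_s)
        return True

    def get(self, key: str):
        entry = self.store.get(key)
        if entry is None or time.monotonic() > entry[1]:
            self.store.pop(key, None)       # expired entries are evicted
            return None
        return entry[0]

    def invalidate(self, key: str) -> None:
        self.store.pop(key, None)           # manual invalidation API
```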
Routing quality regression
Routing a request to a cheaper model saves money but degrades quality. If the quality degradation is subtle — slightly worse summaries, slightly less accurate classifications — it can go undetected for weeks while user satisfaction erodes. Mitigation: continuous evaluation against golden sets, A/B testing of routing changes, and quality monitoring dashboards that alert on degradation before users notice.
Vendor lock-in at the gateway layer
Ironically, a tool designed to reduce provider lock-in can introduce gateway lock-in. Portkey’s guardrails, Kong’s plugin ecosystem, and Cloudflare’s edge integration all create switching costs. Mitigation: use OpenAI-compatible interfaces where possible, keep routing rules in version-controlled configuration, and ensure your gateway choice supports export of all observability data.
Where Fordel Builds
We have deployed LLM gateways for clients in fintech, SaaS, healthcare, and insurance. The pattern is consistent: teams start with a single-provider integration, costs spiral as usage grows, and adding a gateway layer after the fact is harder than building it in from day one. The routing logic, the caching strategy, the fallback chains, and the observability pipeline are all infrastructure decisions that compound over time.
We help teams audit their LLM spend, select the right gateway for their constraints, and implement routing strategies that cut costs without cutting quality. If your AI spend is growing faster than your revenue, the gateway layer is where you start. Reach out.