Research
Engineering & AI · 16 min read

You Are Paying 3x Too Much for AI: The Gateway Layer Your Production Stack Is Missing

Over 90% of production AI teams now run five or more LLMs simultaneously. Most of them route every request to the same frontier model regardless of complexity. The result: bloated API bills, single-provider outage risk, and zero visibility into what each token actually bought. The LLM gateway layer — routing, caching, fallback, observability — is the infrastructure category that separates teams burning money on AI from teams running AI profitably. Bifrost, Portkey, LiteLLM, Kong, and Cloudflare are fighting over this layer. Here is what each does, what it costs, and how to pick.

Author: Abhishek Sharma · Fordel Studios

A fintech startup called their API bill a rounding error when they launched with GPT-4 in 2024. Twelve months later, they were spending $47,000 a month on OpenAI tokens. Not because their product was wildly successful. Because every request — from a simple "summarise this transaction" to a complex multi-step compliance check — hit the same frontier model at the same price. They had no routing. No caching. No fallback. No visibility into which requests actually needed frontier-level reasoning and which could have been handled by a model costing one-twentieth as much.

They are not unusual. They are the default. According to Menlo Ventures, enterprise spending on foundation model APIs reached $12.5 billion in 2025 — up 2x from 2024. Gartner projects that more than 80% of enterprises will have deployed generative AI applications by 2026. Yet most of these deployments talk directly to a single provider’s API with no abstraction layer, no routing logic, and no cost controls beyond a prayer that the monthly invoice stays under budget.

The fix is not a better model. It is a gateway.

  • 90%+ of production AI teams run 5+ LLMs simultaneously (Source: Maxim AI industry survey, 2026)
  • $12.5B spent on foundation model APIs in 2025 (Source: Menlo Ventures State of Generative AI report — up 2x from $6.3B in 2024)
  • 50–70% cost reduction achievable with routing + caching + prompt optimization (Source: Prem AI production benchmarks across enterprise deployments)
···

What an LLM Gateway Actually Does

An LLM gateway sits between your application code and the model providers. It intercepts every LLM API call and makes decisions about where to send it, whether to cache the response, how to handle failures, and what to log. Think of it as the reverse proxy layer that web applications have had since the 1990s — NGINX for language models.

Without a gateway, your application code contains hardcoded provider SDK calls scattered across dozens of files. Switching from OpenAI to Anthropic means rewriting every callsite. Adding a fallback provider means wrapping every call in try-catch logic. Implementing cost controls means building your own token counting, rate limiting, and budget enforcement. Adding observability means instrumenting every request manually. Each of these is a solved problem. But without a gateway, you solve each one independently, in every service, maintained by whoever happened to write that particular integration.

The Five Jobs of a Gateway

What a production LLM gateway handles

01
Unified API: one interface, every provider

Your application code calls the gateway. The gateway translates to OpenAI, Anthropic, Google, Mistral, Bedrock, Azure, or any other provider. Swap providers by changing a config, not rewriting code. Portkey supports 250+ models through a single API. OpenRouter exposes 623+ models. LiteLLM provides an OpenAI-compatible interface that routes to over 100 providers.
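The "one interface, every provider" idea is easiest to see as a request shape. A minimal sketch, assuming a hypothetical internal gateway URL and illustrative provider-prefixed model names (the OpenAI-compatible payload format itself is the real convention gateways like LiteLLM expose):

```python
# Illustrative sketch: every callsite builds the same request shape and
# sends it to the gateway. GATEWAY_URL and the model strings are
# placeholders, not a specific product's API.

GATEWAY_URL = "https://llm-gateway.internal/v1/chat/completions"  # hypothetical

def gateway_request(model: str, user_message: str) -> dict:
    """Build the single request shape used across the codebase.

    Switching from OpenAI to Anthropic becomes a change to the model
    string (or a gateway-side config), not a code rewrite.
    """
    return {
        "url": GATEWAY_URL,
        "json": {
            "model": model,  # e.g. "openai/gpt-4o" or "anthropic/claude-sonnet"
            "messages": [{"role": "user", "content": user_message}],
        },
    }

req = gateway_request("anthropic/claude-sonnet", "Summarise this transaction")
```

Because every service emits this one shape, the gateway can translate it to any provider's native API behind the scenes.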

02
Intelligent routing: right model for each request

Not every request needs GPT-4o or Claude Opus. A classification task, a simple extraction, a summary — these can run on models that cost 5–20x less with negligible quality loss. RouterBench research demonstrates that multi-LLM routers can match or exceed single-model quality while cutting average inference cost. SciForce found that hybrid routing achieves 37–46% reduction in LLM usage by reserving frontier models for complex tasks only.

03
Semantic caching: stop paying for duplicate work

Semantic caching recognises when incoming prompts mean the same thing as previous ones — even when worded differently — and returns cached responses instead of making a new API call. Cache hits return in approximately 5ms versus 2,000ms+ for a fresh LLM call. Redis LangCache reports up to 73% cost reduction in high-repetition workloads. Production teams using semantic caching with OpenAI, Claude, and Gemini APIs see cache hit rates exceeding 80%.

04
Failover and load balancing: survive provider outages

OpenAI has had multiple significant outages. Anthropic has had rate limit storms. Google has had capacity constraints. If your production system calls one provider with no fallback, you inherit their availability as your SLA. A gateway routes around failures automatically — primary to fallback in milliseconds, with retry logic, circuit breaking, and health checks built in.
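The failover pattern reduces to "try providers in order, move on when one fails." A minimal sketch, with stand-in provider callables (a real gateway layers per-provider retries, circuit breaking, and health checks on top of this loop):

```python
# Minimal failover sketch: ordered provider chain, first success wins.
# Provider callables and the error handling are simplified assumptions.

def call_with_fallback(prompt, providers):
    """Try each (name, callable) pair in order; return the first success."""
    errors = {}
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # production code catches provider-specific errors
            errors[name] = str(exc)
    raise RuntimeError(f"all providers failed: {errors}")

def flaky_primary(prompt):
    raise TimeoutError("provider outage")

def stable_fallback(prompt):
    return f"response to: {prompt}"

used, result = call_with_fallback("hello", [
    ("openai", flaky_primary),
    ("anthropic", stable_fallback),
])
```

The caller never sees the primary's outage; it only sees which provider ultimately answered.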

05
Observability and cost controls: know what you are spending

Token counts per request, cost per user, cost per feature, latency percentiles, error rates, model quality scores — all in one dashboard. Without a gateway, this data is scattered across provider consoles, application logs, and billing pages. With a gateway, every request is instrumented automatically. Portkey, Helicone, and Braintrust lead on AI-specific observability. Budget enforcement, RBAC, and spend alerts prevent the $47,000 surprise.

The 2026 Gateway Landscape: Who Does What

Five gateways dominate production deployments in 2026. Each occupies a different niche. Choosing between them is less about which is "best" and more about which tradeoffs match your constraints.

Bifrost: Raw Speed, Open Source

Built in Go by the Maxim team. In sustained benchmarks at 5,000 requests per second, Bifrost adds 11 microseconds of gateway overhead per request. Not milliseconds. Microseconds. At 500 RPS on AWS t3.xlarge instances, Bifrost maintains a P99 latency of 520ms while LiteLLM reaches 28,000ms. At 1,000 RPS, Bifrost remains stable while LiteLLM crashes from memory exhaustion. It supports 1,000+ models, includes adaptive load balancing and cluster mode, and ships with guardrails and fallback chains. Open source under Apache 2.0.

The tradeoff: Bifrost is relatively new. The ecosystem is smaller than LiteLLM’s. Enterprise governance features like RBAC, SSO, and hierarchical budgets are still maturing. If you need raw throughput and want to self-host, Bifrost is the choice. If you need enterprise compliance out of the box, look elsewhere.

Portkey: Enterprise Governance, Managed

Portkey raised a $15M Series A in February 2026 and has positioned itself as the enterprise-grade managed gateway. 250+ models, 50+ guardrails, prompt management with versioning, FinOps dashboard, RBAC, SSO, SOC 2 and HIPAA compliance, and an MCP Gateway that shipped as GA in January 2026 for agent workflows. The unified API is clean. The observability is comprehensive — logs, traces, and cost attribution per request.

The tradeoff: managed means vendor dependency. Pricing scales with usage. Self-hosting is limited compared to LiteLLM or Bifrost. If your compliance team requires a managed vendor with certifications and your team does not want to operate infrastructure, Portkey is the path of least resistance.

LiteLLM: Self-Hosted Flexibility, Python

LiteLLM is the most widely adopted open-source gateway. OpenAI-compatible interface routing to 100+ providers. Battle-tested in production by thousands of teams. The community is large, the documentation is deep, and the plugin ecosystem covers most use cases. It is the default choice for teams that want full control and are comfortable operating Python infrastructure.

The tradeoff: Python. At high concurrency, LiteLLM’s performance degrades significantly compared to Go-based alternatives. The Kong benchmark showed LiteLLM with 86% higher latency and roughly one-tenth the throughput of Kong’s gateway under identical conditions. Enterprise governance — RBAC, workspaces, budgets, audit logs — requires additional tooling. If your traffic is moderate and your team knows Python, LiteLLM is excellent. If you are pushing thousands of RPS, benchmark carefully.

Kong AI Gateway: API Management Extended

Kong took their existing API gateway — already handling trillions of API transactions — and added an AI layer. The result inherits Kong’s mature plugin ecosystem, enterprise features, and operational tooling. Their own benchmark showed 65% lower latency than Portkey and 86% lower latency than LiteLLM in proxy mode. Semantic caching and token-aware rate limiting are built in.

The tradeoff: complexity. Kong is a full API management platform. If you already run Kong for your non-AI APIs, adding AI routing is a natural extension. If you do not, you are adopting a heavyweight platform for a single use case. Pricing reflects the enterprise positioning.

Cloudflare AI Gateway: Edge Network Advantage

Cloudflare’s gateway runs on their global edge network — 300+ data centres worldwide. For applications where geographic latency matters, this is a structural advantage no other gateway can match. Setup is minimal if you are already on Cloudflare. Basic caching, rate limiting, and analytics are included.

The tradeoff: feature depth. No semantic caching. No MCP support. Limited routing intelligence. 10–50ms of proxy latency per request. SaaS-only — no self-hosting option. Cloudflare AI Gateway is a good starting point for teams already on Cloudflare who need basic observability and caching. It is not sufficient for teams that need intelligent routing, semantic caching, or multi-model orchestration.

| Gateway    | Language      | Hosting                     | Overhead at Scale      | Best For                                        |
|------------|---------------|-----------------------------|------------------------|-------------------------------------------------|
| Bifrost    | Go            | Self-hosted (open source)   | 11µs at 5K RPS         | High-throughput, latency-sensitive workloads    |
| Portkey    | N/A (managed) | Managed + limited self-host | Mid-range              | Enterprise compliance, teams wanting managed ops |
| LiteLLM    | Python        | Self-hosted (open source)   | Degrades at >500 RPS   | Flexibility, Python teams, moderate traffic     |
| Kong AI    | Lua/Go        | Self-hosted + managed       | 65% lower than Portkey | Teams already on Kong, full API management      |
| Cloudflare | N/A (managed) | SaaS only (edge)            | 10–50ms proxy latency  | Global edge distribution, Cloudflare users      |

Routing Strategies That Actually Work

A gateway without routing logic is just a proxy. The value comes from sending the right request to the right model. Here are the four routing strategies production teams use, ordered from simplest to most sophisticated.

Rule-Based Routing: The Starting Point

Map request types to models using explicit rules. Summarisation goes to Claude Haiku. Code generation goes to Claude Sonnet. Complex reasoning goes to Claude Opus or GPT-4o. Simple classification goes to Mistral or Llama. This requires knowing your workload distribution upfront, but it is deterministic, debuggable, and effective. Most teams start here and it covers 80% of the value.
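In code, rule-based routing is little more than a dictionary lookup. A minimal sketch using the task types and model names from the examples above (the labels and the frontier fallback are illustrative choices, not a product's defaults):

```python
# Rule-based routing sketch: explicit task-type -> model map.
# Task labels and model names mirror the examples in the text.

ROUTING_RULES = {
    "summarisation": "claude-haiku",
    "code_generation": "claude-sonnet",
    "complex_reasoning": "claude-opus",
    "classification": "mistral-small",
}

def route(task_type: str) -> str:
    # Unknown task types fall back to the frontier model:
    # safe on quality, expensive on cost.
    return ROUTING_RULES.get(task_type, "claude-opus")
```

The map lives in version-controlled config, which is what makes this strategy debuggable: every routing decision can be traced to one line.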

Cost-Optimised Routing: The Budget Controller

Set a cost ceiling per request or per user and let the gateway pick the cheapest model that meets a quality threshold. If the task can be handled by a model at $0.25 per million tokens, do not send it to a model at $15 per million tokens. This requires defining quality thresholds — usually through evaluation scores on representative samples — but once calibrated, it runs autonomously.
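Once quality thresholds are calibrated, the selection rule is simple: cheapest model that clears the bar. A sketch with made-up prices and eval scores (the numbers echo the $0.25 vs $15 per million tokens contrast above but are otherwise illustrative):

```python
# Cost-optimised selection sketch: pick the lowest-cost model whose
# measured eval score meets the quality threshold. Prices and scores
# are illustrative, not real benchmark data.

MODELS = [
    {"name": "small",    "cost_per_mtok": 0.25,  "quality": 0.81},
    {"name": "mid",      "cost_per_mtok": 3.00,  "quality": 0.90},
    {"name": "frontier", "cost_per_mtok": 15.00, "quality": 0.96},
]

def cheapest_above(threshold: float) -> str:
    """Return the cheapest model whose eval score meets the threshold."""
    eligible = [m for m in MODELS if m["quality"] >= threshold]
    if not eligible:
        raise ValueError("no model meets the quality bar")
    return min(eligible, key=lambda m: m["cost_per_mtok"])["name"]
```

The hard part is not this function; it is producing trustworthy quality scores per task type, which is why the evaluation sets mentioned above come first.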

Latency-Based Routing: The Performance Optimizer

For real-time applications — chatbots, autocomplete, inline suggestions — latency matters more than cost. Route to whichever model and provider currently has the lowest response time. This requires real-time latency tracking across providers, which every gateway in this comparison supports. Combine with geographic routing (Cloudflare’s edge advantage) for global applications.
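The tracking piece can be as simple as an exponential moving average per provider. A sketch with invented latency numbers (real gateways feed this from live request timings):

```python
# Latency-based routing sketch: route to the provider with the lowest
# current moving-average latency. Starting values are made up.

latency_ms = {"openai": 850.0, "anthropic": 620.0, "google": 710.0}

def record_latency(provider: str, observed_ms: float, alpha: float = 0.2):
    """Exponential moving average keeps the estimate current without
    overreacting to a single slow request."""
    latency_ms[provider] = (1 - alpha) * latency_ms[provider] + alpha * observed_ms

def fastest_provider() -> str:
    return min(latency_ms, key=latency_ms.get)
```

When a provider slows down, its average rises and traffic drains away from it within a handful of requests; when it recovers, traffic returns just as gradually.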

Cascading Routing: The Escalation Pattern

Start with the cheapest model. If the response fails a quality check — confidence score below threshold, format validation failure, safety filter trigger — escalate to the next tier. This is the most cost-effective strategy for diverse workloads. 70–80% of requests resolve at the cheapest tier. Only 5–10% escalate to frontier models. The downside is added latency on escalated requests and the need for reliable quality checks at each tier.
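The escalation loop itself is short. A sketch with stand-in tier names and injected `generate`/`passes_check` callables (in production these are real model calls and validators such as format checks or confidence scores):

```python
# Cascading escalation sketch: cheapest tier first, escalate on a
# failed quality check. Tier names and the check are illustrative.

TIERS = ["haiku", "sonnet", "opus"]

def cascade(prompt, generate, passes_check):
    """Walk the tiers; return the first answer that clears the check."""
    answer = None
    for model in TIERS:
        answer = generate(model, prompt)
        if passes_check(answer):
            return model, answer
    # Every tier failed the check: return the top tier's answer anyway
    # and let monitoring flag the request.
    return TIERS[-1], answer

# Toy run where only the mid tier satisfies the check.
used, out = cascade(
    "classify this",
    generate=lambda model, prompt: f"{model}:{prompt}",
    passes_check=lambda answer: answer.startswith("sonnet"),
)
```

Note the worst case: a request that escalates through all three tiers pays for three model calls, which is why the quality check at each tier must be cheap and reliable.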

Semantic Caching: The Highest-ROI Gateway Feature

Traditional caching matches exact strings. Semantic caching matches meaning. "What is the capital of France?" and "Tell me France’s capital city" hit the same cache entry because the gateway embeds both queries into vector space and recognises them as semantically equivalent. The impact on cost and latency is dramatic.

  • ~5ms cache hit response time versus ~2,000ms for a fresh LLM call (Source: Bifrost semantic caching benchmarks — 400x faster than provider round-trip)
  • 73% cost reduction in high-repetition workloads with semantic caching (Source: Redis LangCache production deployment data)
  • 80%+ cache hit rates achievable in production with major providers (Source: production deployments using OpenAI, Claude, and Gemini APIs)

How Semantic Caching Works

When a request arrives, the gateway generates an embedding of the prompt using a lightweight embedding model. It performs a vector similarity search against previously cached request-response pairs. If the similarity exceeds a configurable threshold — typically 0.92–0.98 depending on the use case — it returns the cached response without calling the LLM. Cache misses that require embedding generation plus vector search take approximately 60ms (50ms embedding, 10ms search) versus 2,000ms+ for a full provider round-trip.

The similarity threshold is the critical tuning parameter. Too low and you serve incorrect cached responses for semantically different queries. Too high and the cache never hits. Customer support workloads with repetitive queries tolerate lower thresholds (0.92–0.95). Code generation workloads where precision matters need higher thresholds (0.97–0.99) or should skip semantic caching entirely and use exact-match caching on normalised prompts.

When Semantic Caching Works Best
  • Customer support: high query repetition, tolerance for approximate matches, massive cost savings
  • Internal knowledge bases: employees asking similar questions about policies, processes, and tools
  • Classification and extraction: same document types processed repeatedly with similar prompts
  • Content generation with templates: product descriptions, email drafts, summaries with consistent structure
When Semantic Caching Fails
  • Code generation: subtle prompt differences produce vastly different code — exact caching only
  • Agentic workflows: each step depends on prior context that changes across runs
  • Personalised responses: user-specific data means cached responses are wrong for other users
  • Time-sensitive queries: "What happened today?" must not return yesterday’s cached answer

The MCP Gateway Convergence

Model Context Protocol introduced a standard for how AI agents communicate with external tools and data sources. In January 2026, Portkey shipped an MCP Gateway as generally available. Bifrost added MCP operations with sub-3ms latency overhead. This convergence matters because agents need both LLM routing and tool routing — and the gateway is the natural place to handle both.

An agentic workflow that plans with Claude Opus, executes code with Claude Sonnet, queries a database through an MCP server, calls a third-party API through another MCP server, and synthesises results with GPT-4o needs a routing layer that understands both LLM calls and tool calls. Without a unified gateway, you build separate infrastructure for each — an LLM router, an MCP proxy, a tool registry, and custom glue code to connect them.

The gateways that win in 2026 will be the ones that unify LLM routing, MCP tool routing, observability, and cost controls into a single layer. Portkey and Bifrost are ahead. Kong is extending from the API management side. Cloudflare and LiteLLM are behind on MCP integration.

The AI gateway is becoming what the API gateway was in 2015 — optional infrastructure that everyone ignores until their production system falls over at 3am and they realise they have no fallback, no observability, and no idea which model call caused the cascade.

How to Add a Gateway Without Rewriting Your Application

Production gateway adoption in five steps

01
Audit your current LLM calls

Before choosing a gateway, understand your workload. How many LLM calls per day? What is the distribution of request types? Which calls need frontier models and which could run on cheaper alternatives? What is your current monthly spend? Most teams are shocked by the audit results. The "simple classification" endpoint that nobody thought about is responsible for 40% of the token spend.

02
Start with a proxy, not a router

Deploy the gateway as a transparent proxy first. No routing changes. No caching. Just intercept every request, log it, and forward it to the existing provider. This gives you observability — cost per endpoint, latency distribution, error rates — without changing any behaviour. Run this for two weeks to build a baseline.

03
Add semantic caching for high-repetition endpoints

Identify the endpoints with the highest request repetition. Customer support, FAQ, classification, and extraction endpoints typically have 40–80% semantic overlap. Enable semantic caching for these endpoints first. Measure the cache hit rate and cost reduction. Most teams see 30–50% savings in the first week.

04
Implement rule-based routing for obvious cost wins

Route simple tasks to cheap models. Summarisation, classification, extraction, and formatting tasks rarely need frontier models. Build an evaluation set of 50–100 representative requests per task type. Test each candidate model against the evaluation set. If the cheaper model scores within 5% of the frontier model on your quality metrics, route to it. This step typically captures another 20–40% cost reduction.
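The acceptance rule in this step is a one-line comparison. A sketch, treating "within 5%" as a relative margin on the eval score (the scores below are invented for illustration):

```python
# Evaluation gate sketch for step 04: accept the cheaper model only if
# its eval score is within 5% (relative) of the frontier model's.

def within_margin(cheap_score: float, frontier_score: float,
                  margin: float = 0.05) -> bool:
    """True if the cheap model's score clears frontier * (1 - margin)."""
    return cheap_score >= frontier_score * (1 - margin)

# 0.91 vs 0.94 clears the 5% margin; 0.85 vs 0.94 does not.
```

Run this per task type, not globally: a model that passes for summarisation may fail badly for extraction.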

05
Add fallback chains and budget controls

Configure provider failover: primary to secondary in under 500ms. Set per-team, per-feature, and per-user budget limits. Enable spend alerts at 70% and 90% of budget thresholds. This prevents the $47,000 surprise and ensures your production system survives the next provider outage.
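The budget side of this step can be sketched as a spend tracker that fires alerts at the 70% and 90% thresholds named above and blocks calls past the limit (the alert mechanism and hard-stop behaviour are illustrative choices; some teams prefer soft limits):

```python
# Budget enforcement sketch: per-team spend tracking, threshold alerts,
# hard stop at the limit. Thresholds match the text; behaviour is an
# illustrative assumption.

class Budget:
    ALERT_LEVELS = (0.7, 0.9)

    def __init__(self, limit_usd: float):
        self.limit = limit_usd
        self.spent = 0.0
        self.alerts: list[str] = []

    def charge(self, cost_usd: float) -> bool:
        """Record spend; return False (block the call) if over budget."""
        if self.spent + cost_usd > self.limit:
            return False
        before = self.spent / self.limit
        self.spent += cost_usd
        after = self.spent / self.limit
        for level in self.ALERT_LEVELS:
            if before < level <= after:  # fire each alert exactly once
                self.alerts.append(f"spend crossed {int(level * 100)}% of budget")
        return True

team = Budget(limit_usd=100.0)
```

Scoping one of these per team, per feature, and per user is what turns a surprise invoice into a mid-month alert.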

What Could Go Wrong

Gateway adoption is not risk-free. Here are the failure modes we see in production deployments.

Gateway as a single point of failure

If the gateway goes down, every LLM call fails. This is the same risk as any reverse proxy, and the mitigation is the same: redundancy, health checks, and a circuit breaker that bypasses the gateway and calls providers directly if the gateway is unhealthy. Not every gateway supports this pattern natively. Ask before you adopt.
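The bypass pattern can be sketched as a small circuit breaker around the gateway path itself (the failure threshold and the two call paths are illustrative assumptions; a production breaker also adds a half-open state that periodically retries the gateway):

```python
# Circuit-breaker sketch for the gateway itself: after N consecutive
# gateway failures, calls bypass it and hit the provider directly.

class GatewayBreaker:
    def __init__(self, via_gateway, direct, max_failures: int = 3):
        self.via_gateway = via_gateway  # normal path through the gateway
        self.direct = direct            # emergency path straight to the provider
        self.max_failures = max_failures
        self.failures = 0

    def call(self, prompt):
        if self.failures >= self.max_failures:
            return self.direct(prompt)  # breaker open: skip the gateway entirely
        try:
            result = self.via_gateway(prompt)
            self.failures = 0  # healthy response closes the breaker
            return result
        except ConnectionError:
            self.failures += 1
            return self.direct(prompt)

def broken_gateway(prompt):
    raise ConnectionError("gateway unreachable")

breaker = GatewayBreaker(broken_gateway, direct=lambda p: f"direct:{p}")
```

The direct path loses routing, caching, and observability, but it keeps the product up while the gateway recovers, which is the point.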

Semantic cache poisoning

If an incorrect response gets cached, every semantically similar query returns the wrong answer until the cache entry expires or is manually invalidated. This is particularly dangerous for factual queries and compliance-sensitive responses. Mitigation: short TTLs for critical endpoints, quality-score-gated caching where only high-confidence responses enter the cache, and manual cache invalidation APIs.

Routing quality regression

Routing a request to a cheaper model saves money but degrades quality. If the quality degradation is subtle — slightly worse summaries, slightly less accurate classifications — it can go undetected for weeks while user satisfaction erodes. Mitigation: continuous evaluation against golden sets, A/B testing of routing changes, and quality monitoring dashboards that alert on degradation before users notice.

Vendor lock-in at the gateway layer

Ironically, a tool designed to reduce provider lock-in can introduce gateway lock-in. Portkey’s guardrails, Kong’s plugin ecosystem, and Cloudflare’s edge integration all create switching costs. Mitigation: use OpenAI-compatible interfaces where possible, keep routing rules in version-controlled configuration, and ensure your gateway choice supports export of all observability data.

···

Where Fordel Builds

We have deployed LLM gateways for clients in fintech, SaaS, healthcare, and insurance. The pattern is consistent: teams start with a single-provider integration, costs spiral as usage grows, and adding a gateway layer after the fact is harder than building it in from day one. The routing logic, the caching strategy, the fallback chains, and the observability pipeline are all infrastructure decisions that compound over time.

We help teams audit their LLM spend, select the right gateway for their constraints, and implement routing strategies that cut costs without cutting quality. If your AI spend is growing faster than your revenue, the gateway layer is where you start. Reach out.
