Every week someone asks me which AI gateway to buy. Here is the short answer.
What happens when your primary AI provider goes down?
If you are building AI features in production, you have already hit this. OpenAI returns 429s during peak hours. Anthropic has capacity limits. Google changes their API surface mid-quarter. The standard advice is to buy a gateway platform — Portkey, Helicone, whatever raised their Series A most recently — and route through that.
That works. It also costs $2K-50K per month depending on volume, adds another point of failure, and means a third party sees every prompt you send. For most teams shipping AI features in 2026, that is overkill.
Here is how we set up multi-provider failover at Fordel without any of that. You will need:
- A Next.js 14+ project (App Router)
- API keys for at least two providers (OpenAI + Anthropic recommended)
- The Vercel AI SDK installed: npm install ai @ai-sdk/openai @ai-sdk/anthropic
- Basic familiarity with server actions or API routes
- About 45 minutes
How does the AI SDK handle multiple providers?
The Vercel AI SDK has a concept most tutorials skip over: provider-agnostic model references. Instead of importing openai() or anthropic() directly in your route handler, you define a registry of models and reference them by alias. This means swapping providers is a config change, not a code change.
Here is the foundation:
Create a file at lib/ai/providers.ts:
```ts
import { experimental_createProviderRegistry as createProviderRegistry } from "ai";
import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";

export const registry = createProviderRegistry({ openai, anthropic });
```
Now instead of calling openai("gpt-4o") in your route, you call registry.languageModel("openai:gpt-4o"). The model string is provider:model-name. This seems like a trivial change. It is not. It means every downstream function is provider-agnostic by default.
How do you build the failover chain?
The registry gives you provider abstraction. Now you need automatic fallback. Here is the pattern we use — a wrapper function that tries providers in order:
```ts
// lib/ai/failover.ts
import { generateText } from "ai";
import { registry } from "./providers";

// e.g. ["anthropic:claude-sonnet-4-5-20250514", "openai:gpt-4o", "openai:gpt-4o-mini"]
type ModelChain = string[];

export async function generateWithFailover(
  chain: ModelChain,
  options: { prompt: string; system?: string; maxTokens?: number }
) {
  for (const modelId of chain) {
    try {
      const model = registry.languageModel(modelId);
      const result = await generateText({ model, ...options });
      return {
        ...result,
        modelUsed: modelId,
        failedProviders: chain.slice(0, chain.indexOf(modelId)),
      };
    } catch (error: any) {
      if (isRetryable(error)) continue;
      throw error;
    }
  }
  throw new Error("All providers in failover chain exhausted");
}

function isRetryable(error: any): boolean {
  const status = error?.status || error?.statusCode;
  if (status === 429 || status === 503 || status === 529) return true;
  if (error?.code === "ECONNREFUSED" || error?.code === "ETIMEDOUT") return true;
  return false;
}
```
The key decisions in this code:
We only retry on rate limits (429), service unavailable (503), overloaded (529), and network errors. A 400 means your prompt is malformed — retrying with a different provider will not fix that. A 401 means your key is wrong. These should throw immediately.
We track which providers failed. This matters for logging. If your primary provider is returning 429s three times an hour, you want to know that before your bill arrives.
How do you add cost-aware routing?
Failover solves availability. But you also want cost control. Not every request needs Claude Opus. Most need Sonnet or GPT-4o-mini.
We classify requests into tiers and assign different chains:
```ts
// lib/ai/routing.ts
const CHAINS = {
  critical: [
    "anthropic:claude-sonnet-4-5-20250514",
    "openai:gpt-4o",
    "anthropic:claude-haiku-4-5-20251001",
  ],
  standard: ["openai:gpt-4o-mini", "anthropic:claude-haiku-4-5-20251001"],
  bulk: ["anthropic:claude-haiku-4-5-20251001", "openai:gpt-4o-mini"],
} as const;

export function getChain(tier: keyof typeof CHAINS): string[] {
  return [...CHAINS[tier]];
}
```
In your API route or server action, the caller decides the tier:
```ts
const result = await generateWithFailover(getChain("standard"), {
  prompt: userMessage,
  system: "You are a helpful assistant.",
  maxTokens: 1024,
});
```
This is deliberately simple. You do not need a machine learning model to route to your machine learning models. A three-tier classification covers 95% of use cases.
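If you would rather not make every caller choose, a trivial default heuristic is enough. This is a sketch, not part of the routing module above: the `RequestContext` shape and the rules are hypothetical, and you would adapt them to whatever signals your product actually has.

```typescript
// Hypothetical default classifier. The RequestContext shape is an
// assumption; adapt the rules to your own product's signals.
type Tier = "critical" | "standard" | "bulk";

interface RequestContext {
  userFacing: boolean; // a human is waiting on the response
  batch: boolean;      // background job, backfill, evaluation run
}

export function classifyTier(ctx: RequestContext): Tier {
  if (ctx.batch) return "bulk";          // latency-insensitive, cost-sensitive
  if (ctx.userFacing) return "critical"; // quality and availability matter most
  return "standard";                     // internal tooling, drafts, previews
}
```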
How do you handle streaming with failover?
Streaming adds complexity because you cannot retry mid-stream. If the provider dies after sending 200 tokens, those tokens are gone. Here is how we handle it:
```ts
import { streamText } from "ai";
import { registry } from "./providers";

export async function streamWithFailover(
  chain: string[],
  options: Parameters<typeof streamText>[0] & { prompt: string }
) {
  for (const modelId of chain) {
    try {
      const model = registry.languageModel(modelId);
      const result = streamText({ model, ...options });

      // Test that the stream actually starts by reading the first chunk.
      const reader = result.textStream.getReader();
      const first = await reader.read();

      if (!first.done) {
        // Stream is alive — return a new ReadableStream that yields the
        // first chunk, then pumps the rest through.
        return {
          stream: new ReadableStream<string>({
            async start(controller) {
              controller.enqueue(first.value);
              try {
                while (true) {
                  const { done, value } = await reader.read();
                  if (done) break;
                  controller.enqueue(value);
                }
                controller.close();
              } catch (err) {
                // Mid-stream failure: too late to retry, surface the error.
                controller.error(err);
              }
            },
          }),
          modelUsed: modelId,
        };
      }
    } catch {
      continue;
    }
  }
  throw new Error("All providers exhausted");
}
```
The trick is reading the first chunk before committing. If the provider accepts the connection but fails immediately, you catch it before streaming garbage to the client.
How do you monitor which provider is actually serving traffic?
Without observability, failover is a black box. You need to know: which provider handled each request, what the latency was, and how often failovers actually triggered.
We log three things on every AI call:
1. The model that actually served the response (not the one requested).
2. The latency in milliseconds.
3. An array of providers that were tried and failed before the successful one.
```ts
// In your API route:
const start = Date.now();
const result = await generateWithFailover(getChain("standard"), { prompt });

console.log(
  JSON.stringify({
    event: "ai_completion",
    model: result.modelUsed,
    latency_ms: Date.now() - start,
    failed_providers: result.failedProviders,
    tokens: result.usage,
  })
);
```
Ship this to your existing logging pipeline — Datadog, Axiom, even stdout if you are on Vercel (it goes to Log Drains). Build a dashboard that shows failover rate by hour. If it spikes above 5%, your primary provider has a problem and you should consider swapping the chain order.
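Computing that failover rate from the log events is a few lines. A sketch, assuming the `ai_completion` event shape from the snippet above:

```typescript
// Failover rate = fraction of completions that were not served by the
// first provider in their chain. Event shape matches the log line above.
interface AiEvent {
  event: string;
  model: string;
  failed_providers: string[];
}

export function failoverRate(events: AiEvent[]): number {
  const completions = events.filter((e) => e.event === "ai_completion");
  if (completions.length === 0) return 0;
  const failedOver = completions.filter((e) => e.failed_providers.length > 0);
  return failedOver.length / completions.length;
}
```

Run it over an hourly window and alert when it crosses 0.05.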
“The goal is not zero failovers. The goal is that failovers are invisible to your users.”
What are the common mistakes?
After setting this up for multiple clients, here is what goes wrong:
Retrying on all errors. A 400 from OpenAI means your prompt exceeded context length or had invalid parameters. Sending that same prompt to Anthropic will also fail. Only retry on infrastructure errors (429, 503, network).
Not normalizing prompts across providers. If your system prompt uses OpenAI-specific XML tags or Anthropic-specific formatting, the failover will technically work but produce worse results on the backup provider. Keep prompts provider-agnostic.
Forgetting about token limits. Claude Sonnet supports 200K context. GPT-4o supports 128K. If your failover chain goes Claude then GPT-4o, a 150K token prompt will fail on GPT-4o — not with a retryable error but with a 400. Either truncate before sending or make sure your chain respects context windows.
Adding too many providers. Three is plenty. Each additional provider is another API key to manage, another billing relationship, another set of rate limits to understand. We have never needed more than three in a chain.
Not setting timeouts. A provider that hangs for 30 seconds before returning a 503 is worse than one that fails fast. Set a 10-second timeout on each attempt. The AI SDK supports this via the abortSignal option.
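The timeout fix can be sketched as a small helper. The `withAttemptTimeout` name is ours, it assumes Node 17.3+ for `AbortSignal.timeout`, and the returned object is what you would spread into the `generateText` call inside the failover loop:

```typescript
// Hypothetical helper: attach a per-attempt abort signal to the call
// options, so a hanging provider fails fast instead of stalling the chain.
// Assumes Node 17.3+ for AbortSignal.timeout.
type AttemptOptions = { prompt: string; system?: string; maxTokens?: number };

export function withAttemptTimeout(options: AttemptOptions, ms = 10_000) {
  return { ...options, abortSignal: AbortSignal.timeout(ms) };
}

// Inside the failover loop:
//   const result = await generateText({ model, ...withAttemptTimeout(options) });
// And in isRetryable, treat the timeout as an infrastructure error:
//   if (error?.name === "TimeoutError") return true;
```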
Is this actually worth it versus a managed gateway?
It depends on your scale and team.
The managed gateway sells you analytics, a dashboard, prompt caching, and semantic routing. Those are real features. But at early and mid scale, you are paying $2K/month for a dashboard you check once a week. The failover itself — the part that actually keeps your product alive when a provider goes down — is 150 lines of TypeScript.
What do you have now?
If you followed along, you have:
- Provider-agnostic model references via the AI SDK registry
- Automatic failover with smart retry logic (only on infrastructure errors)
- Cost-aware routing with three tiers (critical, standard, bulk)
- Streaming failover that validates the connection before committing
- Structured logging for monitoring failover rates and provider health
The total code is under 200 lines. It has no external dependencies beyond the AI SDK. It runs on any Node.js host — Vercel, Railway, a VPS, whatever. And when OpenAI has their next rate limit incident at 2 PM on a Tuesday, your users will not notice.
We have been running this pattern in production for four months across three client projects. Zero user-facing outages from provider issues. The longest failover delay was 1.2 seconds. Most are under 400ms.