Every Next.js tutorial shows you how to get a chatbot running in 15 minutes. None of them show you what happens when a user pastes in a 40,000-token document and asks for a summary. Here is how we actually set this up at Fordel.
Why does streaming matter for AI features?
When you call an LLM without streaming, the user stares at a blank screen for 3 to 15 seconds while the model generates its full response. That is an eternity in web UX. Streaming sends tokens to the browser as they are generated, so the user sees the first word in under 500 milliseconds.
But streaming is not just about perceived performance. It also changes your infrastructure economics. A non-streaming request ties up a serverless function for the full generation time. On Vercel, that means you are paying for 10 to 30 seconds of function execution per request. With streaming, you are still paying for that time — but the user experience justifies it, and you can add guardrails to cut generation short when it goes off the rails.
What do you need before starting?
- A Next.js 14+ project with the App Router (Pages Router works but the patterns differ)
- An OpenAI or Anthropic API key (this guide uses OpenAI, but the AI SDK abstracts the provider)
- Node.js 18+ (for native fetch and streaming support)
- Basic familiarity with React Server Components and Route Handlers
If you do not have a Next.js project yet, npx create-next-app@latest works. We are assuming TypeScript throughout — if you are writing plain JavaScript in 2026, that is a separate conversation.
How do you install and configure the AI SDK?
Step 1: Install the packages
You need two packages: ai (the core Vercel AI SDK) and the provider package for your model. For OpenAI:
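Assuming npm (swap in pnpm or yarn if that is your tooling), the install is:

```shell
npm install ai @ai-sdk/openai
```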
The AI SDK is provider-agnostic. You write your streaming logic once, and swapping from GPT-4o to Claude to Gemini is a one-line change. This matters more than you think — we have switched providers mid-project twice in the last year when pricing or quality shifted.
Step 2: Set your environment variable
Add your API key to .env.local. Do not commit this file. Do not put it in .env. Do not hardcode it.
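A minimal .env.local looks like this; the key shown is a placeholder:

```shell
# .env.local (keep this file out of version control)
OPENAI_API_KEY=sk-your-key-here
```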
How do you build the streaming API route?
This is where most tutorials stop at the happy path. We will build the route handler with cost controls baked in from the start.
Step 3: Create the route handler
Create app/api/chat/route.ts. This is your server-side endpoint that talks to the LLM and streams the response back to the browser.
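Here is a minimal version, assuming AI SDK 4.x with the @ai-sdk/openai provider; the model choice and limits are illustrative, and validation and the system prompt are added in later steps:

```typescript
// app/api/chat/route.ts
import { openai } from '@ai-sdk/openai';
import { streamText } from 'ai';

// Allow up to 30 seconds of execution instead of Vercel's default.
export const maxDuration = 30;

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: openai('gpt-4o-mini'),
    messages,
    maxTokens: 1024, // hard cap on output length per request
    abortSignal: req.signal, // stop generating if the client disconnects
  });

  // Streams tokens back in the protocol the useChat hook understands.
  return result.toDataStreamResponse();
}
```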
Let us break down what matters here.
maxDuration = 30 tells Vercel to allow this function to run for up to 30 seconds instead of the default 10. LLM responses regularly take 10 to 20 seconds for longer outputs. Without this, your function times out and the user sees a generic error.
maxTokens: 1024 is your first cost guardrail. Without it, the model will generate until it decides it is done, which could be 4,000 tokens for a simple question. At GPT-4o's roughly $10 per million output tokens, that is the difference between about $0.01 for a capped response and $0.04 for an uncapped one. Multiply by 10,000 users and you are looking at real money.
Step 4: Add input validation
The route handler above trusts whatever the client sends. That is how you get a $500 bill from one user pasting in an entire codebase.
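One way to add both checks, as a sketch with illustrative limits (validateChat and the threshold constants are our names, not SDK API):

```typescript
// Rough guardrails; tune both limits for your use case and budget.
const MAX_CHARS = 50_000;  // ~12,500 tokens at ~4 chars per token
const MAX_MESSAGES = 30;   // cap conversation depth

type ChatMessage = { role: 'user' | 'assistant'; content: string };

// Returns an error message for the client, or null if the input is acceptable.
export function validateChat(messages: ChatMessage[]): string | null {
  if (!Array.isArray(messages) || messages.length === 0) {
    return 'No messages provided.';
  }
  if (messages.length > MAX_MESSAGES) {
    return 'Conversation too long. Please start a new chat.';
  }
  const totalChars = messages.reduce((sum, m) => sum + m.content.length, 0);
  if (totalChars > MAX_CHARS) {
    return 'Input too large. Please shorten your message.';
  }
  return null;
}
```

In the route handler, call it before streamText and bail out early: `const error = validateChat(messages); if (error) return new Response(error, { status: 400 });`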
The character count check is a rough proxy for tokens. One token averages about 4 characters in English, so 50,000 characters is roughly 12,500 tokens. For GPT-4o-mini at $0.15 per million input tokens, that caps a single request at about $0.002. Adjust the limit based on your use case and budget.
The conversation depth limit prevents a different cost vector: users who keep chatting in the same thread until the context window fills up. Every message in the conversation gets re-sent with each request, so cost grows quadratically with conversation length.
How do you build the frontend that consumes the stream?
Step 5: Create the chat component
The AI SDK ships a useChat hook that handles streaming, message state, loading states, and error handling out of the box.
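A minimal client component might look like this, assuming AI SDK 4.x where the hook lives in @ai-sdk/react (earlier releases exported it from ai/react):

```typescript
// app/chat/page.tsx
'use client';

import { useChat } from '@ai-sdk/react';

export default function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat({
    api: '/api/chat', // the route handler from step 3
  });

  return (
    <div>
      {messages.map((m) => (
        <div key={m.id}>
          <strong>{m.role === 'user' ? 'You' : 'AI'}:</strong> {m.content}
        </div>
      ))}
      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={handleInputChange}
          placeholder="Say something..."
        />
      </form>
    </div>
  );
}
```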
The useChat hook does more than you would expect. It manages the full message array, handles the streaming protocol from toDataStreamResponse, provides loading and error states, and automatically appends new messages. You do not need to write any fetch or EventSource code.
Step 6: Add abort control
Users need a way to stop generation. Maybe the model misunderstood the question and is generating a 1,000-token response they do not need. Every token generated after the user mentally checks out is wasted money.
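The same hook exposes what you need. This fragment assumes the component from the previous step:

```typescript
// Destructure stop and isLoading alongside the rest:
const { messages, input, handleInputChange, handleSubmit, stop, isLoading } =
  useChat({ api: '/api/chat' });

// ...and in the JSX, show a stop button only while a response is streaming:
{isLoading && (
  <button type="button" onClick={() => stop()}>
    Stop generating
  </button>
)}
```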
The stop function aborts the fetch request, which terminates the stream. If your route handler forwards the request's abort signal to the model call (the abortSignal option on streamText), the server stops generating too, and you stop accruing tokens. This is not just a UX nicety; it is a cost control mechanism.
How do you add system prompts without exposing them?
Step 7: Define your system prompt server-side
Your system prompt defines the AI's behavior, constraints, and personality. It should never be sent from the client — that is how prompt injection happens.
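In the route handler, that looks like the following; the prompt text itself is an illustrative placeholder:

```typescript
// app/api/chat/route.ts (excerpt)
// Lives only on the server; the client never sees or overrides it.
const SYSTEM_PROMPT =
  'You are a concise support assistant for Fordel. ' +
  'Answer only questions about the product. Refuse off-topic requests.';

const result = streamText({
  model: openai('gpt-4o-mini'),
  system: SYSTEM_PROMPT, // separate from the client-supplied messages array
  messages,
  maxTokens: 1024,
});
```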
The system parameter is separate from the messages array. It gets prepended to every request but lives entirely on the server. The client never sees it, cannot override it, and cannot extract it through prompt injection — at least not through your API layer.
Keep your system prompt short. Every token in it gets sent with every request. A 500-token system prompt across 10,000 daily requests is 5 million input tokens per day. At GPT-4o-mini rates, that is $0.75 per day just for the system prompt; on the flagship models it is more than ten times that. These numbers add up.
How do you monitor costs in production?
Step 8: Log token usage
The AI SDK gives you access to token usage after each request completes. Log it.
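One way to do it is a small helper wired into streamText's onFinish callback. The per-million-token rates below are assumptions based on gpt-4o-mini list pricing; check current rates before trusting the estimate:

```typescript
type Usage = { promptTokens: number; completionTokens: number };

// Assumed gpt-4o-mini list prices in USD per million tokens.
const RATES = { input: 0.15, output: 0.6 };

// Produces one structured log line per request, with a rough cost estimate.
export function usageLogLine(usage: Usage): string {
  const estimatedCostUsd =
    (usage.promptTokens * RATES.input + usage.completionTokens * RATES.output) /
    1_000_000;
  return JSON.stringify({ ...usage, estimatedCostUsd });
}

// In the route handler, attach it to the streamText call:
// streamText({ ..., onFinish: ({ usage }) => console.log(usageLogLine(usage)) });
```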
Ship this from day one. Not after you get the bill. We learned this the hard way — a client's chatbot feature went live without token logging, and we did not realize until the weekly invoice that one power user was burning through $40 per day in API calls by pasting in full PDF transcripts.
“The API bill is never a surprise if you log every request from day one. It is only a surprise when you do not.”
Step 9: Set spending alerts
OpenAI and Anthropic both support usage limits and spending alerts in their dashboards. Set them. A hard cap at your monthly budget prevents runaway costs even if your application-level guardrails fail.
Here is the checklist we run before shipping any AI feature:
- Set maxTokens on every streamText call so the model never decides how long to generate
- Validate input length server-side before sending to the LLM
- Limit conversation depth to prevent quadratic cost growth
- Log token usage on every request with estimated cost
- Set API provider spending limits as a backstop
- Give users a stop button to abort unnecessary generation
- Use gpt-4o-mini or claude-3.5-haiku for features that do not need the flagship model
What are the most common mistakes?
After building AI features into six production apps this year, here are the patterns we see go wrong repeatedly.
Mistake 1: Using the flagship model for everything
GPT-4o costs more than ten times as much per token as GPT-4o-mini, and Claude Opus is a similar multiple of Haiku. Most conversational features (summarization, Q&A, simple chat) work just as well on the smaller model. Start with mini or haiku and only upgrade if quality is genuinely insufficient. The AI SDK makes switching a one-line change.
Mistake 2: Not setting maxTokens
Without maxTokens, the model generates until it hits its internal limit or decides the response is complete. For a simple yes-or-no question, that might be 50 tokens. For a vague open-ended question, it could be 4,000. Your cost per request becomes unpredictable, which makes budgeting impossible.
Mistake 3: Sending the full conversation history forever
Every message in the messages array gets sent to the API on every request. A 50-message conversation means the model processes all 50 messages — plus its response — every time the user sends a new message. Either limit conversation length, implement a sliding window, or summarize older messages.
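A sliding window can be as simple as the sketch below (slidingWindow is our name for it; a production version would also summarize what it drops rather than discarding it). Note that the system prompt is unaffected, since it lives server-side outside the messages array:

```typescript
type ChatMessage = { role: 'user' | 'assistant'; content: string };

// Keep only the most recent `keep` messages before sending to the model.
export function slidingWindow(
  messages: ChatMessage[],
  keep = 10
): ChatMessage[] {
  return messages.length <= keep ? messages : messages.slice(-keep);
}
```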
Mistake 4: No rate limiting
A single user can fire 100 requests per minute from the browser if you do not stop them. Add rate limiting at the API route level. Vercel KV or Upstash Redis work well for this. Even a simple in-memory counter per IP address is better than nothing.
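As a stopgap, an in-memory fixed-window counter looks like the sketch below. It resets on cold starts and is not shared across serverless instances, which is exactly why Upstash or Vercel KV is the real answer; the names and limits here are illustrative:

```typescript
const WINDOW_MS = 60_000; // one-minute window
const MAX_REQUESTS = 20;  // per IP, per window

const hits = new Map<string, { count: number; windowStart: number }>();

// Returns true once an IP exceeds MAX_REQUESTS in the current window.
export function isRateLimited(ip: string, now = Date.now()): boolean {
  const entry = hits.get(ip);
  if (!entry || now - entry.windowStart >= WINDOW_MS) {
    hits.set(ip, { count: 1, windowStart: now }); // start a fresh window
    return false;
  }
  entry.count += 1;
  return entry.count > MAX_REQUESTS;
}
```

In the route handler, check it before anything else and return a 429 when it trips.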
Mistake 5: Exposing the API key client-side
This sounds obvious, but we have seen it in production code reviews. If your OPENAI_API_KEY is in an environment variable that starts with NEXT_PUBLIC_, it is in the browser bundle. Anyone can extract it and run up your bill. API keys belong in server-side route handlers only.
What do you have now?
If you followed this guide, you have a streaming AI chat feature in Next.js with input validation, token limits, cost logging, abort controls, and a server-side system prompt. That is not a demo — that is the foundation of a production feature.
From here, the next steps depend on your use case. If you need retrieval-augmented generation, the AI SDK supports tool calling and custom data sources. If you need multi-provider failover, we wrote a separate guide on that. If you need to fine-tune cost versus quality, start by analyzing the token usage logs you are now collecting.
The streaming part is the easy part. The cost controls, error handling, and input validation are what separate a demo from software people actually pay for.