We started tracking prompt redundancy across our production AI systems six months ago. The numbers were embarrassing. Across twelve production features, an average of 73% of input tokens were identical between requests. A support ticket classifier sends the same system prompt and few-shot examples on every call. A document extraction pipeline sends the same schema thousands of times per day. We were paying frontier model prices to re-read the same instructions over and over.
The first layer is provider-level prompt caching. Anthropic and OpenAI both offer native prompt caching for their APIs. When you send a request with a long system prompt, the provider caches the processed prefix. Subsequent requests sharing that prefix get a steep discount on cached input tokens: roughly 90% off cache reads with Anthropic (which requires explicit cache breakpoints), and about 50% off with OpenAI's automatic caching. The implementation is nearly free: restructure prompts so static content comes first, dynamic content last. On a client processing 50,000 classification requests per day, this change alone dropped the monthly API bill from $4,200 to $1,800.
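To make the static-first restructuring concrete, here is a minimal sketch using Anthropic's `cache_control` breakpoint. The model name, prompt text, and `build_request` helper are illustrative, not from the production system; the point is the shape of the payload: cacheable prefix first, per-request content last.

```python
# Sketch: static-first request structure for provider-level prompt caching.
# Everything before the cache_control breakpoint is eligible for caching;
# only the per-request ticket text varies between calls.

SYSTEM_PROMPT = "You are a support ticket classifier. Respond with one label."
FEW_SHOT = "Example: 'Card declined twice' -> billing\nExample: 'App crashes on login' -> technical"

def build_request(ticket_text: str) -> dict:
    """Build kwargs for an Anthropic messages.create call."""
    return {
        "model": "claude-3-5-haiku-latest",  # illustrative model choice
        "max_tokens": 64,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT + "\n\n" + FEW_SHOT,
                # Marks the end of the cacheable static prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Dynamic content goes last, after the cached prefix.
        "messages": [{"role": "user", "content": ticket_text}],
    }

req = build_request("My invoice shows a double charge.")
```

The same ordering principle applies with OpenAI, where caching kicks in automatically for shared prefixes past a minimum length, so no explicit breakpoint is needed.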
The second layer is application-level semantic caching. We hash the full prompt and check Redis before making an API call. For classification tasks where identical inputs recur, this catches about 35% of requests. We extended this with approximate caching using vector similarity: compute an embedding for each user input and search Qdrant for semantically similar previous requests. If cosine similarity exceeds 0.97 and the task is deterministic, return the cached response. This bumped our overall hit rate to 52%.
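The two application-level layers can be sketched as follows. This is a simplified stand-in, assuming an in-process dict in place of Redis and a linear scan in place of Qdrant's index; the class and method names are my own, and the 0.97 threshold is the one from the article.

```python
import hashlib

SIM_THRESHOLD = 0.97  # similarity cutoff from the article

class SemanticCache:
    """Exact-match layer (prompt hash -> response) plus an approximate
    layer keyed on embeddings. Dicts/lists stand in for Redis/Qdrant."""

    def __init__(self):
        self.exact = {}    # sha256(prompt) -> response
        self.vectors = []  # list of (embedding, response)

    def _key(self, prompt: str) -> str:
        return hashlib.sha256(prompt.encode()).hexdigest()

    @staticmethod
    def _cosine(a, b) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = sum(x * x for x in a) ** 0.5
        norm_b = sum(x * x for x in b) ** 0.5
        return dot / (norm_a * norm_b)

    def get(self, prompt, embedding=None):
        # 1. Exact match on the hashed prompt (the Redis layer).
        hit = self.exact.get(self._key(prompt))
        if hit is not None:
            return hit
        # 2. Approximate match on embedding similarity (the Qdrant layer).
        if embedding is not None:
            for vec, resp in self.vectors:
                if self._cosine(embedding, vec) >= SIM_THRESHOLD:
                    return resp
        return None

    def put(self, prompt, embedding, response):
        self.exact[self._key(prompt)] = response
        if embedding is not None:
            self.vectors.append((embedding, response))
```

In production the approximate lookup would be an indexed vector search rather than a scan, and the deterministic-task check would gate whether the approximate layer is consulted at all.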
The risks are real. Cache poisoning is the biggest: an incorrect response gets served to every similar future request. We mitigate with confidence-based caching (only high-confidence responses get cached), a 24-hour TTL on all entries, and a nightly validation job that re-runs 5% of cached responses against fresh model calls to detect drift.
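The confidence gate and TTL can be combined in one small wrapper. A minimal sketch follows; the 0.9 confidence cutoff and the injected clock are my own illustrative choices (the article specifies only the 24-hour TTL), and the nightly revalidation job is left out.

```python
import time

CONF_THRESHOLD = 0.9     # illustrative cutoff, not from the article
TTL_SECONDS = 24 * 3600  # 24-hour TTL from the article

class GuardedCache:
    """Cache writes are gated on model confidence; reads honor a TTL
    so a poisoned entry can live at most 24 hours."""

    def __init__(self, clock=time.time):
        self._store = {}    # key -> (response, expires_at)
        self._clock = clock  # injectable for testing

    def put(self, key, response, confidence):
        # Only high-confidence responses are cached at all.
        if confidence >= CONF_THRESHOLD:
            self._store[key] = (response, self._clock() + TTL_SECONDS)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        response, expires_at = entry
        if self._clock() > expires_at:
            del self._store[key]  # expired: evict and miss
            return None
        return response
```

With Redis, the TTL half of this is a single `SETEX`; the confidence gate stays in application code because only the caller sees the model's confidence signal.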
The latency bonus was unexpected. Redis lookups take 1-2ms. Vector searches take 5-15ms. LLM calls take 300-2000ms. For cached requests, response time drops by two orders of magnitude. On user-facing features where AI classification determines routing, this turned a noticeable delay into an imperceptible one.
Our caching infrastructure costs about $40 per month total. It saves roughly $3,000 per month across all clients. Before optimizing prompts for fewer tokens, optimize your architecture for fewer calls. The cheapest LLM call is the one you never make.
About the Author
Fordel Studios
AI-native app development for startups and growing teams. 14+ years of experience shipping production software.
We love talking shop. If this article resonated, let's connect.