We started tracking prompt redundancy across our production AI systems six months ago. The numbers were embarrassing. Across twelve production features, an average of 73% of input tokens were identical between requests. A support ticket classifier sends the same system prompt and few-shot examples on every call. A document extraction pipeline sends the same schema thousands of times per day. We were paying frontier model prices to re-read the same instructions over and over.
The first layer is provider-level prompt caching. Anthropic and OpenAI both offer native prompt caching for their APIs. When you send a request with a long system prompt, the provider caches the processed prefix. Subsequent requests sharing that prefix get a steep discount on cached input tokens: roughly 90% off cache reads with Anthropic (which requires explicit cache breakpoints), and about 50% off with OpenAI's automatic caching. The implementation is nearly free: restructure prompts so static content comes first, dynamic content last. On a client processing 50,000 classification requests per day, this change alone dropped the monthly API bill from $4,200 to $1,800.
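To make the static-first restructuring concrete, here is a minimal sketch using Anthropic's `cache_control` breakpoint. The model name, prompt text, and `build_request` helper are illustrative, not from the production system; the point is the shape of the payload: cacheable prefix first, per-request content last.

```python
# Sketch: static-first request structure for provider-level prompt caching.
# Everything before the cache_control breakpoint is eligible for caching;
# only the per-request ticket text varies between calls.

SYSTEM_PROMPT = "You are a support ticket classifier. Respond with one label."
FEW_SHOT = "Example: 'Card declined twice' -> billing\nExample: 'App crashes on login' -> technical"

def build_request(ticket_text: str) -> dict:
    """Build kwargs for an Anthropic messages.create call."""
    return {
        "model": "claude-3-5-haiku-latest",  # illustrative model choice
        "max_tokens": 64,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT + "\n\n" + FEW_SHOT,
                # Marks the end of the cacheable static prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Dynamic content goes last, after the cached prefix.
        "messages": [{"role": "user", "content": ticket_text}],
    }

req = build_request("My invoice shows a double charge.")
```

The same ordering principle applies with OpenAI, where caching kicks in automatically for shared prefixes past a minimum length, so no explicit breakpoint is needed.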
The second layer is application-level semantic caching. We hash the full prompt and check Redis before making an API call. For classification tasks where identical inputs recur, this catches about 35% of requests. We extended this with approximate caching using vector similarity: compute an embedding for each user input and search Qdrant for semantically similar previous requests. If cosine similarity exceeds 0.97 and the task is deterministic, return the cached response. This bumped our overall hit rate to 52%.
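The two application-level layers can be sketched as follows. This is a simplified stand-in, assuming an in-process dict in place of Redis and a linear scan in place of Qdrant's index; the class and method names are my own, and the 0.97 threshold is the one from the article.

```python
import hashlib

SIM_THRESHOLD = 0.97  # similarity cutoff from the article

class SemanticCache:
    """Exact-match layer (prompt hash -> response) plus an approximate
    layer keyed on embeddings. Dicts/lists stand in for Redis/Qdrant."""

    def __init__(self):
        self.exact = {}    # sha256(prompt) -> response
        self.vectors = []  # list of (embedding, response)

    def _key(self, prompt: str) -> str:
        return hashlib.sha256(prompt.encode()).hexdigest()

    @staticmethod
    def _cosine(a, b) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = sum(x * x for x in a) ** 0.5
        norm_b = sum(x * x for x in b) ** 0.5
        return dot / (norm_a * norm_b)

    def get(self, prompt, embedding=None):
        # 1. Exact match on the hashed prompt (the Redis layer).
        hit = self.exact.get(self._key(prompt))
        if hit is not None:
            return hit
        # 2. Approximate match on embedding similarity (the Qdrant layer).
        if embedding is not None:
            for vec, resp in self.vectors:
                if self._cosine(embedding, vec) >= SIM_THRESHOLD:
                    return resp
        return None

    def put(self, prompt, embedding, response):
        self.exact[self._key(prompt)] = response
        if embedding is not None:
            self.vectors.append((embedding, response))
```

In production the approximate lookup would be an indexed vector search rather than a scan, and the deterministic-task check would gate whether the approximate layer is consulted at all.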
The risks are real. Cache poisoning is the biggest: an incorrect response gets served to every similar future request. We mitigate with confidence-based caching (only high-confidence responses get cached), a 24-hour TTL on all entries, and a nightly validation job that re-runs 5% of cached responses against fresh model calls to detect drift.
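The confidence gate and TTL can be combined in one small wrapper. A minimal sketch follows; the 0.9 confidence cutoff and the injected clock are my own illustrative choices (the article specifies only the 24-hour TTL), and the nightly revalidation job is left out.

```python
import time

CONF_THRESHOLD = 0.9     # illustrative cutoff, not from the article
TTL_SECONDS = 24 * 3600  # 24-hour TTL from the article

class GuardedCache:
    """Cache writes are gated on model confidence; reads honor a TTL
    so a poisoned entry can live at most 24 hours."""

    def __init__(self, clock=time.time):
        self._store = {}    # key -> (response, expires_at)
        self._clock = clock  # injectable for testing

    def put(self, key, response, confidence):
        # Only high-confidence responses are cached at all.
        if confidence >= CONF_THRESHOLD:
            self._store[key] = (response, self._clock() + TTL_SECONDS)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        response, expires_at = entry
        if self._clock() > expires_at:
            del self._store[key]  # expired: evict and miss
            return None
        return response
```

With Redis, the TTL half of this is a single `SETEX`; the confidence gate stays in application code because only the caller sees the model's confidence signal.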
The latency bonus was unexpected. Redis lookups take 1-2ms. Vector searches take 5-15ms. LLM calls take 300-2000ms. For cached requests, response time drops by two orders of magnitude. On user-facing features where AI classification determines routing, this turned a noticeable delay into an imperceptible one.
Our caching infrastructure costs about $40 per month total. It saves roughly $3,000 per month across all clients. Before optimizing prompts for fewer tokens, optimize your architecture for fewer calls. The cheapest LLM call is the one you never make.
About the Author
Fordel Studios
AI-native app development for startups and growing teams. 14+ years of experience shipping production software.
We love talking shop. If this article resonated, let's connect.