The conversation about AI costs typically starts and ends with model API pricing. GPT-4o costs X per million tokens. Claude costs Y. Gemini costs Z. This comparison captures roughly 40% of the actual cost of running AI in production. The other 60% is infrastructure, engineering, evaluation, monitoring, and maintenance — costs that compound over time and are consistently underestimated in planning.
The Full Cost Stack
| Cost Category | Percentage of Total | Often Underestimated? | Scales With |
|---|---|---|---|
| Model inference (API or self-hosted) | 25-40% | No | Request volume |
| Infrastructure (hosting, databases, vector stores) | 15-25% | Yes | Data volume + request volume |
| Engineering time (development + maintenance) | 20-30% | Yes | System complexity |
| Evaluation and testing | 5-10% | Yes, often zero-budgeted | Model change frequency |
| Monitoring and observability | 3-5% | Yes | System complexity |
| Data pipeline (ingestion, embedding, indexing) | 5-10% | Yes | Corpus size + update frequency |
The Hidden Costs
Model API Cost Volatility
Model API pricing changes frequently, sometimes dramatically. When your provider releases a new model version and deprecates the old one, you face a choice: migrate (engineering cost) or accept the new pricing (sometimes higher, sometimes lower). Budget for at least one forced model migration per year per AI feature.
The Evaluation Tax
Every model change, prompt change, or RAG corpus update requires running your evaluation suite. If your evaluation suite uses LLM-as-judge (the most practical approach for quality evaluation at scale), the evaluation itself has inference costs. Organizations running frequent evaluations report that evaluation costs reach 10-15% of production inference costs.
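The evaluation tax is easy to estimate up front. The sketch below is a back-of-envelope calculator for one LLM-as-judge evaluation run; every price and token count in it is an illustrative assumption, not a quoted rate.

```python
# Back-of-envelope estimator for LLM-as-judge evaluation cost.
# All prices and token counts below are illustrative assumptions.

def eval_run_cost(num_cases, judge_in_tokens, judge_out_tokens,
                  price_in_per_m, price_out_per_m):
    """Cost of one full evaluation run, in the same currency as the prices."""
    in_cost = num_cases * judge_in_tokens / 1e6 * price_in_per_m
    out_cost = num_cases * judge_out_tokens / 1e6 * price_out_per_m
    return in_cost + out_cost

# Example: 2,000 eval cases, ~1,500 input tokens per judgment (prompt +
# candidate answer + rubric), ~200 output tokens, at assumed prices of
# $2.50 / $10.00 per million input/output tokens.
per_run = eval_run_cost(2_000, 1_500, 200, 2.50, 10.00)
monthly = per_run * 8  # e.g. eight prompt or model changes per month
print(f"per run: ${per_run:.2f}, monthly: ${monthly:.2f}")
```

Multiplying per-run cost by your expected change frequency is what turns the evaluation tax from a surprise into a budget line.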
Cost Optimization Strategies
- Model routing: use small models for simple tasks, large models for complex ones (40-60% savings)
- Semantic caching: cache responses for similar queries (20-40% savings depending on query diversity)
- Prompt optimization: shorter prompts with same quality reduce token costs proportionally
- Batch processing: aggregate non-real-time requests for volume discounts and efficiency
- Self-hosting for high-volume workloads: when volume exceeds the API cost crossover point, self-hosting saves 50-70%
- Output token limits: constrain response length to what is actually needed
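Model routing, the first strategy above, can be sketched in a few lines. The complexity classifier here is a stand-in heuristic and the model names are placeholders; in practice the router would be a trained classifier or a lightweight LLM call.

```python
# Sketch of cost-aware model routing: send simple requests to a cheap
# model and escalate complex ones. The heuristic and model names are
# illustrative placeholders, not real model IDs.

SMALL_MODEL = "small-model"
LARGE_MODEL = "large-model"

def is_complex(query: str) -> bool:
    """Toy complexity heuristic: long queries or reasoning keywords."""
    keywords = ("explain why", "compare", "step by step", "analyze")
    return len(query.split()) > 40 or any(k in query.lower() for k in keywords)

def route(query: str) -> str:
    return LARGE_MODEL if is_complex(query) else SMALL_MODEL

print(route("What is your refund policy?"))                # small-model
print(route("Compare these two contracts step by step"))   # large-model
```

The savings come from the traffic distribution: if most queries are simple, most tokens are billed at the small model's rate.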
Self-Hosted vs API: The Crossover Point
At low volumes, API-based inference is cheaper: you pay only for what you use, with no infrastructure overhead. As volume grows, cumulative per-request API fees eventually exceed the fixed cost of dedicated GPU capacity, and self-hosting becomes more economical. The crossover point varies with model size and hardware, but for a 70B-parameter model the breakeven is typically around 500,000-1,000,000 requests per month.
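A first-order breakeven estimate is a one-line division. The sketch below deliberately ignores engineering and operations overhead, and both numbers are assumptions for illustration, not quoted rates.

```python
# Illustrative breakeven between API pricing and a dedicated GPU
# deployment. Both figures below are assumptions, not quoted rates.

API_COST_PER_REQUEST = 0.004   # assumed blended API cost per request
GPU_MONTHLY_COST = 3_000.0     # assumed monthly cost of GPU capacity

def breakeven_requests(api_cost_per_request, gpu_monthly_cost):
    """Monthly request volume where self-hosting matches API spend,
    ignoring engineering and operations overhead."""
    return gpu_monthly_cost / api_cost_per_request

print(f"{breakeven_requests(API_COST_PER_REQUEST, GPU_MONTHLY_COST):,.0f} requests/month")
```

With these assumed numbers the breakeven lands at 750,000 requests per month, inside the typical range above; in practice the real threshold is higher once you price in the operational overhead described next.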
Self-hosting introduces significant operational complexity: GPU provisioning, model serving infrastructure (vLLM, TGI, or similar), scaling, monitoring, and model updates. The cost savings are real but require a team with ML infrastructure expertise to realize them. Many organizations find that a hybrid approach works best: self-host high-volume, predictable workloads and use APIs for variable, low-volume, or frontier model workloads.
Building an AI Cost Model
- Instrument everything: track token usage, inference latency, cache hit rate, and cost per request for every AI feature. You cannot optimize what you do not measure.
- Segment by feature: a chatbot, a document processor, and a recommendation engine have different volume patterns, model requirements, and cost-per-value ratios.
- Use business-unit metrics: cost per query is meaningless without context. Cost per support ticket resolved, cost per document processed, cost per lead qualified — these connect AI cost to business value.
- Project forward: model costs at 2x, 5x, and 10x current volume. If cost scales linearly with volume, you need optimization before growth makes the system uneconomical.
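The points above can be combined into a minimal per-feature cost model. Field names and figures here are illustrative; the structure, not the numbers, is the point.

```python
# Minimal per-feature cost model: record per-request cost and project
# spend at higher volumes. All names and numbers are illustrative.

from dataclasses import dataclass

@dataclass
class FeatureCost:
    name: str
    monthly_requests: int
    cost_per_request: float   # inference + amortized infra, per request
    cache_hit_rate: float     # fraction of requests served from cache

    def monthly_cost(self, volume_multiplier: float = 1.0) -> float:
        billable = (self.monthly_requests * volume_multiplier
                    * (1 - self.cache_hit_rate))
        return billable * self.cost_per_request

chatbot = FeatureCost("support-chatbot", 100_000, 0.006, 0.30)
for mult in (1, 2, 5, 10):
    print(f"{mult:>2}x volume: ${chatbot.monthly_cost(mult):,.2f}/month")
```

Extending the model with a volume-dependent cache hit rate or tiered pricing is where the linear-scaling assumption gets tested.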
“The organizations that control AI costs are not the ones that found the cheapest model. They are the ones that built cost visibility into every layer of their AI stack and optimize continuously.”