Every week someone posts a cost comparison on Hacker News showing self-hosted Llama beating Claude or GPT-4o on price per token. The math always looks compelling. The math is always incomplete.
What Does the Internet Say Self-Hosting Costs?
The number everyone quotes is the GPU hourly rate. An NVIDIA H100 on AWS is roughly $3.50 per hour. On Lambda Labs, about $2.50. On RunPod or Vast.ai, sometimes under $2. So you run the napkin math: $2.50/hr times 730 hours in a month equals $1,825 for a dedicated H100. You can serve Llama 3.3 70B at roughly 40-60 tokens per second on that. Sounds like a steal compared to paying Anthropic $15 per million input tokens for Opus.
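That napkin math takes three lines to reproduce (the hourly rates above are illustrative and move with the market):

```python
HOURS_PER_MONTH = 730  # average hours in a month (8,760 / 12)

def monthly_gpu_cost(hourly_rate: float, gpus: int = 1) -> float:
    """The napkin math: hourly rate times hours, and nothing else."""
    return hourly_rate * HOURS_PER_MONTH * gpus

print(monthly_gpu_cost(2.50))  # 1825.0 -- the number everyone quotes
```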
That number is real. It is also only roughly 30-50% of your actual bill.
What Does Self-Hosted Inference Actually Cost Per Month?
Here is a real breakdown from a client project we helped deploy — a 70B parameter model serving a customer-facing product with 99.9% uptime requirements. All figures are monthly USD.
| Line Item | Monthly Cost | Notes |
|---|---|---|
| GPU compute (2x H100, primary) | $5,100 | Lambda Labs reserved, needed for 70B at production throughput |
| GPU compute (failover instance) | $2,550 | 50% hot standby for SLA compliance |
| Inference server (vLLM on Linux) | $340 | c5.4xlarge equivalent for orchestration |
| Load balancer + networking | $180 | Plus egress, which surprises everyone |
| Object storage (model weights, checkpoints) | $85 | S3 or equivalent, ~200GB per model version |
| Monitoring and observability | $420 | Datadog APM + custom GPU metrics, or self-hosted Prometheus stack |
| MLOps engineer time (fractional) | $3,200 | ~20% of a senior ML engineer salary, allocated to this service |
| Model evaluation pipeline | $600 | Nightly eval runs, benchmark regression tests |
| Security patching and updates | $400 | Engineer time for OS, CUDA, driver, and framework updates |
| Guardrails and content filtering | $250 | Llama Guard or equivalent running on separate GPU slice |
| Logging and audit trail | $150 | Compliance requirement for regulated clients |
| Total | $13,275 | vs $5,100 raw GPU cost alone |
That $5,100 GPU cost became $13,275 once you added everything a production deployment actually needs. The multiplier here is 2.6x. In my experience across four projects, it ranges from 2x (if you are cutting corners on monitoring and redundancy) to 3.5x (if you are in a regulated industry with audit requirements).
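The table above reduces to a simple ledger. A minimal sketch, with the figures copied from the table (monthly USD):

```python
# Monthly line items from the deployment above (USD)
line_items = {
    "gpu_primary": 5100,       # 2x H100, reserved
    "gpu_failover": 2550,      # 50% hot standby
    "inference_server": 340,
    "networking": 180,
    "storage": 85,
    "monitoring": 420,
    "mlops_time": 3200,        # fractional senior ML engineer
    "eval_pipeline": 600,
    "security_patching": 400,
    "guardrails": 250,
    "logging": 150,
}

total = sum(line_items.values())
multiplier = total / line_items["gpu_primary"]
print(f"${total:,} total, {multiplier:.1f}x over raw GPU cost")
```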
How Does This Compare to API Providers?
Let us do the comparison that actually matters — not price per token in isolation, but total cost for the same workload.
Assume a workload of 5 million input tokens and 1 million output tokens per day. That is a mid-scale production application — a customer support agent, a document processing pipeline, or a code review tool handling a few hundred requests per hour.
| Provider / Model | Daily Cost | Monthly Cost | Notes |
|---|---|---|---|
| Self-hosted Llama 3.3 70B (full TCO) | ~$443 | ~$13,275 | Includes all line items above |
| Claude 3.5 Sonnet API | ~$30 | ~$900 | $3/M input + $15/M output |
| GPT-4o API | ~$22.50 | ~$675 | $2.50/M input + $10/M output |
| Claude Opus API | ~$150 | ~$4,500 | $15/M input + $75/M output |
| Gemini 2.5 Flash API | ~$1.35 | ~$41 | $0.15/M input + $0.60/M output (under 200K context) |
| Together.ai Llama 3.3 70B | ~$5.40 | ~$162 | $0.90/M input + $0.90/M output (hosted open-source) |
| Groq Llama 3.3 70B | ~$3.74 | ~$112 | $0.59/M input + $0.79/M output |
Read that again. Self-hosting Llama 3.3 70B at full TCO costs more per month than using Claude Opus via API for the same workload. And hosted open-source providers like Together.ai or Groq give you the same model at a fraction of the self-hosting cost because they amortize infrastructure across thousands of customers.
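The per-provider numbers follow mechanically from the posted per-token rates. A sketch using the rates quoted in the table (which may have changed since writing):

```python
def daily_cost(in_rate: float, out_rate: float,
               in_m: float = 5.0, out_m: float = 1.0) -> float:
    """Daily USD cost for in_m million input and out_m million output tokens."""
    return in_rate * in_m + out_rate * out_m

# (input $/M, output $/M) as quoted in the table above
providers = {
    "claude-3.5-sonnet": (3.00, 15.00),
    "gpt-4o": (2.50, 10.00),
    "claude-opus": (15.00, 75.00),
    "groq-llama-3.3-70b": (0.59, 0.79),
}
for name, (i, o) in providers.items():
    d = daily_cost(i, o)
    print(f"{name}: ${d:.2f}/day, ~${d * 30:,.0f}/month")
```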
“The self-hosting cost advantage only materialises at scale most companies never reach. Below 50M tokens per day, you are almost certainly better off with an API.”
Where Is the Actual Crossover Point?
Self-hosting starts winning when several of these conditions hold simultaneously:
- Volume exceeds 50-100M tokens per day consistently — not peak, sustained
- You have existing ML infrastructure and an MLOps team already on payroll
- You have regulatory or data residency requirements that prohibit third-party API calls
- Your latency requirements demand co-located inference (sub-100ms P99)
For a company already running Kubernetes clusters with GPU nodes for training workloads, the marginal cost of adding inference is much lower — the MLOps team, monitoring stack, and security practices already exist. The 2-3x multiplier drops closer to 1.3-1.5x because you are sharing fixed costs.
But if you are spinning up GPU infrastructure specifically for inference, you are paying the full multiplier. And most startups and mid-size companies are in this category.
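One way to locate your own crossover point: solve for the daily token volume at which API spend matches self-hosted TCO. A sketch, where the blended API rate and TCO figures are assumptions you should replace with your own:

```python
def breakeven_tokens_per_day(monthly_tco: float,
                             blended_api_rate_per_m: float) -> float:
    """Daily token volume (in millions) at which API spend equals
    self-hosted total cost of ownership."""
    daily_tco = monthly_tco / 30
    return daily_tco / blended_api_rate_per_m

# e.g. $13,275/month TCO vs a hosted-Llama blended rate of ~$0.90/M tokens
print(round(breakeven_tokens_per_day(13275, 0.90)))  # ~492M tokens/day
```

Against cheap hosted open-source rates, the break-even volume lands well above the 50-100M tokens/day threshold, which is the point of the first condition above.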
What Are the Hidden Costs Nobody Mentions?
Beyond the line items in the table, there are costs that do not show up in any spreadsheet:
Model update lag
When Meta releases Llama 4 (or a critical patch to Llama 3.3), you are responsible for testing, evaluating, and deploying the update. API providers do this for you. In my experience, a major model version upgrade takes 2-4 weeks of engineering time including evaluation, prompt regression testing, and gradual rollout. That is $8,000-$16,000 in engineer time that happens 2-3 times per year.
GPU availability risk
H100 spot instances on AWS have been volatile since 2024. Reserved instances require 1-3 year commitments. If your provider has an outage or price increase, migrating a GPU workload is not a one-day task. You are locked in more than you think.
The talent cost
Finding engineers who can debug CUDA out-of-memory errors, optimize KV-cache configurations, and tune vLLM batching parameters is expensive. According to Levels.fyi data from Q1 2026, ML infrastructure engineers command $180-280K total compensation in the US. Even at Indian market rates, a senior ML engineer with production inference experience runs INR 30-50 LPA. That is not a line item in your GPU cost comparison.
Opportunity cost
Every hour your ML engineer spends optimizing token throughput on a self-hosted model is an hour they are not building features that differentiate your product. For most companies, inference is a commodity. Treat it like one.
How Should You Actually Think About This Decision?
Here is the framework I use with clients:
- Step 1: Calculate your actual monthly token volume (not projected, actual)
- Step 2: Price that volume on 3 API providers including a hosted open-source option like Together.ai or Fireworks
- Step 3: Multiply the raw GPU cost by 2.5x for true TCO — if you do not have existing ML infra, use 3x
- Step 4: Only proceed with self-hosting if the API cost is 2x or more than your true TCO — you need that margin for the hidden costs
- Step 5: If data residency is the driver, look at dedicated API deployments (Azure OpenAI, AWS Bedrock, GCP Vertex) before self-hosting
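Steps 3 and 4 reduce to two comparisons. A minimal sketch, using the rule-of-thumb multipliers above:

```python
def should_self_host(monthly_api_cost: float,
                     raw_gpu_cost: float,
                     has_existing_ml_infra: bool) -> bool:
    """Steps 3 and 4: inflate raw GPU cost to true TCO, then require
    the API cost to be at least 2x that TCO before self-hosting."""
    multiplier = 2.5 if has_existing_ml_infra else 3.0
    true_tco = raw_gpu_cost * multiplier
    return monthly_api_cost >= 2 * true_tco

# The Opus workload from the comparison table: $4,500/month via API
# against $5,100/month raw GPU cost, no existing ML infra.
print(should_self_host(4500, 5100, has_existing_ml_infra=False))  # False
```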
The uncomfortable truth is that self-hosting is a capability decision, not a cost decision, for 90% of companies. You self-host because you need to fine-tune, because you need air-gapped deployment, because you need to control the full stack for compliance. If your only motivation is saving money, you will almost certainly spend more.
What About Running Models Locally for Development?
This is a different question entirely and the answer is straightforward: yes, do it. Running Gemma or Llama 3.1 8B locally via Ollama or LM Studio costs you nothing beyond the hardware you already own. An M3 Pro MacBook can run a 7-8B model at 20+ tokens per second. That is perfectly adequate for development, testing, and prototyping.
The mistake is extrapolating local development costs to production deployment costs. Your MacBook does not need redundancy, monitoring, SLAs, or a pager rotation. Production does.
Is Self-Hosting AI Models Worth It in 2026?
For most teams: no. The API ecosystem is mature, prices have dropped 10-20x since 2023, and hosted open-source options like Together.ai, Groq, and Fireworks give you the same open-weight models without the infrastructure burden. The crossover math favours APIs more than it did a year ago, not less, because API prices are falling faster than GPU prices.
For teams with genuine scale (50M+ tokens/day), existing ML infrastructure, or hard regulatory constraints: yes, but go in with eyes open. Budget 2.5-3x the raw GPU cost, hire or allocate dedicated MLOps capacity, and treat inference infrastructure as a product, not a side project.
The worst outcome is the middle ground — self-hosting at moderate scale without dedicated infrastructure investment. That is where you get the worst of both worlds: higher costs than APIs and lower reliability than managed services.
“Self-hosting AI inference is like running your own email server. You can do it. You can even do it well. But for most organisations, the question is not whether you can — it is whether you should.”