Engineering · 8 min read

What Self-Hosting AI Models Actually Costs (Nobody Tells You This)

Everyone quotes the GPU hourly rate. Nobody mentions the 14 other line items that double your actual cost. Here is the real TCO breakdown from teams running open-weight models in production.

Abhishek Sharma · Head of Engineering @ Fordel Studios

Every week someone posts a cost comparison on Hacker News showing self-hosted Llama beating Claude or GPT-4o on price per token. The math always looks compelling. The math is always incomplete.

What Does the Internet Say Self-Hosting Costs?

The number everyone quotes is the GPU hourly rate. An NVIDIA H100 on AWS is roughly $3.50 per hour. On Lambda Labs, about $2.50. On RunPod or Vast.ai, sometimes under $2. So you run the napkin math: $2.50/hr times 730 hours in a month equals $1,825 for a dedicated H100. You can serve Llama 3.3 70B at roughly 40-60 tokens per second on that. Sounds like a steal compared to paying Anthropic $15 per million input tokens for Opus.

That number is real. It is also roughly 40-50% of your actual bill.

2-3x: true cost multiplier vs raw GPU price. Based on production deployments across 4 client projects running self-hosted inference in 2025-2026.
···

What Does Self-Hosted Inference Actually Cost Per Month?

Here is a real breakdown from a client project we helped deploy — a 70B parameter model serving a customer-facing product with 99.9% uptime requirements. All figures are monthly USD.

| Line Item | Monthly Cost | Notes |
| --- | --- | --- |
| GPU compute (2x H100, primary) | $5,100 | Lambda Labs reserved, needed for 70B at production throughput |
| GPU compute (failover instance) | $2,550 | 50% hot standby for SLA compliance |
| Inference server (vLLM on Linux) | $340 | c5.4xlarge equivalent for orchestration |
| Load balancer + networking | $180 | Plus egress, which surprises everyone |
| Object storage (model weights, checkpoints) | $85 | S3 or equivalent, ~200GB per model version |
| Monitoring and observability | $420 | Datadog APM + custom GPU metrics, or self-hosted Prometheus stack |
| MLOps engineer time (fractional) | $3,200 | ~20% of a senior ML engineer salary, allocated to this service |
| Model evaluation pipeline | $600 | Nightly eval runs, benchmark regression tests |
| Security patching and updates | $400 | Engineer time for OS, CUDA, driver, and framework updates |
| Guardrails and content filtering | $250 | Llama Guard or equivalent running on separate GPU slice |
| Logging and audit trail | $150 | Compliance requirement for regulated clients |
| **Total** | **$13,275** | vs $5,100 raw GPU cost alone |

That $5,100 GPU cost became $13,275 once you added everything a production deployment actually needs. The multiplier here is 2.6x. In my experience across four projects, it ranges from 2x (if you are cutting corners on monitoring and redundancy) to 3.5x (if you are in a regulated industry with audit requirements).
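The table above reduces to a few lines of arithmetic. Here is a minimal sketch using this article's monthly estimates (the figures are project-specific, not universal prices):

```python
# Monthly USD estimates from the TCO table above (project-specific figures).
line_items = {
    "gpu_primary": 5_100,       # 2x H100, Lambda Labs reserved
    "gpu_failover": 2_550,      # 50% hot standby for SLA compliance
    "inference_server": 340,    # orchestration host for vLLM
    "networking": 180,          # load balancer plus egress
    "object_storage": 85,       # model weights and checkpoints
    "observability": 420,       # APM plus GPU metrics
    "mlops_time": 3_200,        # fractional senior ML engineer
    "eval_pipeline": 600,       # nightly evals and regression tests
    "security_patching": 400,   # OS, CUDA, driver, framework updates
    "guardrails": 250,          # content filtering on a GPU slice
    "audit_logging": 150,       # compliance logging
}

total = sum(line_items.values())
multiplier = total / line_items["gpu_primary"]

print(f"Total TCO: ${total:,}/month")               # $13,275/month
print(f"Multiplier vs raw GPU: {multiplier:.1f}x")  # 2.6x
```

Swapping in your own line items makes it easy to see where your deployment sits in the 2x-3.5x range.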

···

How Does This Compare to API Providers?

Let us do the comparison that actually matters — not price per token in isolation, but total cost for the same workload.

Assume a workload of 5 million input tokens and 1 million output tokens per day. That is a mid-scale production application — a customer support agent, a document processing pipeline, or a code review tool handling a few hundred requests per hour.

| Provider / Model | Daily Cost | Monthly Cost | Notes |
| --- | --- | --- | --- |
| Self-hosted Llama 3.3 70B (full TCO) | ~$443 | ~$13,275 | Includes all line items above |
| Claude 3.5 Sonnet API | ~$30 | ~$900 | $3/M input + $15/M output |
| GPT-4o API | ~$22.50 | ~$675 | $2.50/M input + $10/M output |
| Claude Opus API | ~$150 | ~$4,500 | $15/M input + $75/M output |
| Gemini 2.5 Flash API | ~$1.35 | ~$41 | $0.15/M input + $0.60/M output (under 200K context) |
| Together.ai Llama 3.3 70B | ~$5.40 | ~$162 | $0.90/M input + $0.90/M output (hosted open-source) |
| Groq Llama 3.3 70B | ~$3.74 | ~$112 | $0.59/M input + $0.79/M output |
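The API rows follow directly from the per-token prices in the notes column. A quick sketch of that arithmetic, using the assumed 5M input / 1M output daily workload and a 30-day month:

```python
# Daily and monthly API cost from per-million-token prices,
# for the assumed workload of 5M input and 1M output tokens per day.
IN_TOKENS_M, OUT_TOKENS_M, DAYS = 5, 1, 30

pricing = {  # (USD per 1M input tokens, USD per 1M output tokens)
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "GPT-4o": (2.50, 10.00),
    "Claude Opus": (15.00, 75.00),
    "Gemini 2.5 Flash": (0.15, 0.60),
    "Together.ai Llama 3.3 70B": (0.90, 0.90),
    "Groq Llama 3.3 70B": (0.59, 0.79),
}

for name, (p_in, p_out) in pricing.items():
    daily = IN_TOKENS_M * p_in + OUT_TOKENS_M * p_out
    print(f"{name}: ${daily:.2f}/day, ${daily * DAYS:,.2f}/month")
```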

Read that again. Self-hosting Llama 3.3 70B at full TCO costs more per month than using Claude Opus via API for the same workload. And hosted open-source providers like Together.ai or Groq give you the same model at a fraction of the self-hosting cost because they amortize infrastructure across thousands of customers.

The self-hosting cost advantage only materialises at scale most companies never reach. Below 50M tokens per day, you are almost certainly better off with an API. (Based on my experience across 4 production deployments, 2025-2026.)
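You can sanity-check that threshold yourself. A rough break-even sketch, assuming the 5:1 input/output mix above, Sonnet-class API pricing, and (generously) that the fixed 2x H100 setup could absorb the extra volume:

```python
# Break-even sketch: at what sustained daily volume does a fixed
# self-hosting bill undercut pay-per-token API pricing?
SELF_HOST_MONTHLY = 13_275     # full TCO from the table above, USD
API_IN, API_OUT = 3.00, 15.00  # Claude 3.5 Sonnet, USD per 1M tokens

# Blended API price per 1M tokens at a 5:1 input/output ratio
blended = (5 * API_IN + 1 * API_OUT) / 6   # $5.00 per 1M tokens

breakeven_monthly_m = SELF_HOST_MONTHLY / blended  # tokens/month, in millions
breakeven_daily_m = breakeven_monthly_m / 30       # ~88.5M tokens/day
print(f"Break-even: ~{breakeven_daily_m:.0f}M tokens/day")
```

Against cheaper APIs (GPT-4o, or hosted open-source at under $1/M), the break-even volume climbs far higher, which is why the 50-100M tokens/day figure in the next section is a floor, not a ceiling.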
···

Where Is the Actual Crossover Point?

Self-hosting starts winning when three conditions are true simultaneously:

When Self-Hosting Makes Economic Sense
  • Volume exceeds 50-100M tokens per day consistently — not peak, sustained
  • You have existing ML infrastructure and an MLOps team already on payroll
  • You have regulatory or data residency requirements that prohibit third-party API calls
  • Your latency requirements demand co-located inference (sub-100ms P99)

For a company already running Kubernetes clusters with GPU nodes for training workloads, the marginal cost of adding inference is much lower — the MLOps team, monitoring stack, and security practices already exist. The 2-3x multiplier drops closer to 1.3-1.5x because you are sharing fixed costs.

But if you are spinning up GPU infrastructure specifically for inference, you are paying the full multiplier. And most startups and mid-size companies are in this category.

···

What Are the Hidden Costs Nobody Mentions?

Beyond the line items in the table, there are costs that do not show up in any spreadsheet:

Model update lag

When Meta releases Llama 4 (or a critical patch to Llama 3.3), you are responsible for testing, evaluating, and deploying the update. API providers do this for you. In my experience, a major model version upgrade takes 2-4 weeks of engineering time including evaluation, prompt regression testing, and gradual rollout. That is $8,000-$16,000 in engineer time that happens 2-3 times per year.
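Annualised, that upgrade tax is easy to underestimate. Using the article's own figures of 2-3 upgrades per year at $8,000-$16,000 of engineering time each:

```python
# Annualising the model-upgrade cost: 2-3 major upgrades per year,
# each costing $8,000-$16,000 in engineering time.
low = 2 * 8_000    # best case: 2 upgrades at the low end
high = 3 * 16_000  # worst case: 3 upgrades at the high end
print(f"Hidden upgrade cost: ${low:,}-${high:,} per year")  # $16,000-$48,000
```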

GPU availability risk

H100 spot instances on AWS have been volatile since 2024. Reserved instances require 1-3 year commitments. If your provider has an outage or price increase, migrating a GPU workload is not a one-day task. You are locked in more than you think.

The talent cost

Finding engineers who can debug CUDA out-of-memory errors, optimize KV-cache configurations, and tune vLLM batching parameters is expensive. According to Levels.fyi data from Q1 2026, ML infrastructure engineers command $180-280K total compensation in the US. Even at Indian market rates, a senior ML engineer with production inference experience runs INR 30-50 LPA. That is not a line item in your GPU cost comparison.

Opportunity cost

Every hour your ML engineer spends optimizing token throughput on a self-hosted model is an hour they are not building features that differentiate your product. For most companies, inference is a commodity. Treat it like one.

···

How Should You Actually Think About This Decision?

Here is the framework I use with clients:

The Self-Hosting Decision Framework
  • Step 1: Calculate your actual monthly token volume (not projected, actual)
  • Step 2: Price that volume on 3 API providers including a hosted open-source option like Together.ai or Fireworks
  • Step 3: Multiply the raw GPU cost by 2.5x for true TCO — if you do not have existing ML infra, use 3x
  • Step 4: Only proceed with self-hosting if the API cost is 2x or more than your true TCO — you need that margin for the hidden costs
  • Step 5: If data residency is the driver, look at dedicated API deployments (Azure OpenAI, AWS Bedrock, GCP Vertex) before self-hosting
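The five steps above can be sketched as a function. The thresholds and multipliers are this article's rules of thumb, not hard laws, and the parameter names are my own illustration:

```python
# A literal translation of the decision framework above.
# Thresholds (2.5x/3x TCO multiplier, 2x API margin) are the
# article's rules of thumb, not universal constants.

def self_hosting_recommended(
    cheapest_api_monthly: float,  # best quote across 3+ providers, USD/month
    raw_gpu_monthly: float,       # GPU rental cost alone, USD/month
    has_existing_ml_infra: bool,  # MLOps team and monitoring already exist?
    data_residency_driver: bool = False,
) -> str:
    if data_residency_driver:
        # Step 5: dedicated API deployments before self-hosting
        return "Look at dedicated API deployments (Azure OpenAI, Bedrock, Vertex) first"
    # Step 3: true TCO multiplier, higher without existing ML infra
    multiplier = 2.5 if has_existing_ml_infra else 3.0
    true_tco = raw_gpu_monthly * multiplier
    # Step 4: require a 2x margin over true TCO to cover hidden costs
    if cheapest_api_monthly >= 2 * true_tco:
        return "Self-hosting is worth evaluating"
    return "Use an API provider"

# The article's own example: $5,100 raw GPU, no existing ML infra,
# cheapest API quote around $1,350/month.
print(self_hosting_recommended(1_350, 5_100, False))  # Use an API provider
```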

The uncomfortable truth is that self-hosting is a capability decision, not a cost decision, for 90% of companies. You self-host because you need to fine-tune, because you need air-gapped deployment, because you need to control the full stack for compliance. If your only motivation is saving money, you will almost certainly spend more.

90% of companies would save money using API providers over self-hosting. Based on token volumes under 50M/day, which covers most production deployments outside Big Tech.
···

What About Running Models Locally for Development?

This is a different question entirely and the answer is straightforward: yes, do it. Running Gemma or an 8B Llama locally via Ollama or LM Studio costs you nothing beyond the hardware you already own. An M3 Pro MacBook can run a 7-8B model at 20+ tokens per second. That is perfectly adequate for development, testing, and prototyping.

The mistake is extrapolating local development costs to production deployment costs. Your MacBook does not need redundancy, monitoring, SLAs, or a pager rotation. Production does.

···

Is Self-Hosting AI Models Worth It in 2026?

For most teams: no. The API ecosystem is mature, prices have dropped 10-20x since 2023, and hosted open-source options like Together.ai, Groq, and Fireworks give you the same open-weight models without the infrastructure burden. The crossover math favours APIs more than it did a year ago, not less, because API prices are falling faster than GPU prices.

For teams with genuine scale (50M+ tokens/day), existing ML infrastructure, or hard regulatory constraints: yes, but go in with eyes open. Budget 2.5-3x the raw GPU cost, hire or allocate dedicated MLOps capacity, and treat inference infrastructure as a product, not a side project.

The worst outcome is the middle ground — self-hosting at moderate scale without dedicated infrastructure investment. That is where you get the worst of both worlds: higher costs than APIs and lower reliability than managed services.

Self-hosting AI inference is like running your own email server. You can do it. You can even do it well. But for most organisations, the question is not whether you can — it is whether you should.