Every post-mortem I have read lists the same things: what broke, when, who fixed it, and how to prevent recurrence. None of them include a line item for what the incident actually cost the business. That calculation is usually left to finance, if it happens at all. It should be engineering's job.
What Does Everyone Get Wrong About the Cost of Downtime?
The most-cited figure in incident management is $5,600 per minute of downtime, roughly $336,000 per hour. That number comes from a 2014 Gartner study and has been repeated so often it has become engineering folklore. The problem is it is an average across large enterprises — banks, telcos, healthcare systems — where a single minute of downtime can trigger regulatory penalties on top of operational cost.
For a Series A SaaS with 200 customers and $3M ARR, the math looks completely different. The headline number anchors all the wrong conversations. Engineering teams either dismiss it as irrelevant to their scale, or cite it to justify overbuilding redundancy they cannot maintain. Neither response is grounded in your actual numbers.
The Knight Capital incident in 2012 is the extreme case everyone cites: a bad deployment caused $440M in losses in 45 minutes before anyone could stop it. That failure mode is what makes engineers reach for the Gartner figure to justify investment. Real production incidents are quieter, more frequent, and more distributed, and their costs accumulate in ways that are easy to ignore.
How Does the Real Cost Actually Break Down?
Here is what a P1 incident costs for a mid-market SaaS (roughly $3M ARR, 15-25 engineers) versus an enterprise-scale company. These are in-my-experience ranges, not industry surveys, because no one publishes this data at useful granularity.
| Cost Category | What It Covers | Mid-Market SaaS ($3M ARR) | Enterprise ($50M ARR) |
|---|---|---|---|
| Engineering — incident response | War room engineers x avg 4 hrs, fully-loaded at $200/hr | $1,200–$2,400 | $6,000–$12,000 |
| Engineering — post-incident | RCA, post-mortem, action items, follow-up PRs | $800–$1,600 | $4,000–$8,000 |
| Customer support overhead | Inbound tickets, proactive comms, SLA credits | $500–$2,000 | $5,000–$20,000 |
| Direct revenue loss | Downtime x hourly revenue run-rate | $300–$3,000 | $5,000–$50,000 |
| 90-day churn risk | Elevated churn probability x blended ACV | $5,000–$15,000 | $50,000–$200,000 |
| Sprint disruption | Unplanned work killing roadmap velocity for 2–4 days | $3,000–$8,000 | $15,000–$40,000 |
| Blended total per major incident | Sum of the above, excluding reputational lag | $10,800–$32,000 | $85,000–$330,000 |
The numbers that surprise teams most are the last two. Churn risk is real but deferred — you do not see it in the incident ticket, you see it three months later when a customer does not renew and cites reliability concerns in their exit survey. Sprint disruption is the most chronically undercosted: a full P1 on a Wednesday typically means Thursday and Friday are recovery mode, which means the sprint that was supposed to ship the new onboarding flow ships the following week, which means the conversion experiment starts a week late, which compounds into Q3 velocity being down 15% with no single cause anyone can point to.
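If you want to pressure-test these ranges against your own team, the arithmetic behind the table fits in a few lines. Here is a minimal sketch; every default is an illustrative mid-market assumption (war-room size, fully loaded rate, ACV-weighted churn risk), not a benchmark, and you should swap in your own payroll, support, and revenue figures.

```python
# Back-of-the-envelope cost of a single P1 incident.
# Every default below is an illustrative mid-market assumption,
# not a benchmark: swap in your own payroll, support, and ACV data.

ENG_HOURLY_COST = 200                          # fully loaded $/hr for a senior engineer
ANNUAL_REVENUE = 3_000_000                     # ARR
HOURLY_RUN_RATE = ANNUAL_REVENUE / (24 * 365)  # revenue at risk per hour of downtime

def p1_cost(
    responders=2,               # engineers in the war room
    response_hours=4,           # time from page to resolution
    post_incident_hours=6,      # RCA, post-mortem, follow-up PRs (total eng-hours)
    support_overhead=1_000,     # inbound tickets, proactive comms, SLA credits
    downtime_hours=2,           # customer-facing downtime
    churn_risk=10_000,          # elevated 90-day churn probability x blended ACV
    sprint_disruption=5_000,    # roadmap slippage over the following 2-4 days
):
    engineering = (responders * response_hours + post_incident_hours) * ENG_HOURLY_COST
    revenue_loss = downtime_hours * HOURLY_RUN_RATE
    return engineering + support_overhead + revenue_loss + churn_risk + sprint_disruption

print(f"Blended cost of one P1: ${p1_cost():,.0f}")   # ~$19,500 with these defaults
```

With these placeholder inputs the total lands near the middle of the mid-market range in the table, which is the point: the number is knowable for your team, not just for Gartner's enterprises.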
What Are the Hidden Costs Nobody Accounts For?
The table above is at least something you can argue over line by line. What follows is harder to quantify but no less real.
The first is on-call engineer recovery time. A 3am P1 does not end when the incident resolves. The engineer who handled it loses four to eight hours of productive capacity the next day. Over a year of frequent incidents, this compounds into burnout, then turnover. Engineering turnover at a mid-market SaaS costs roughly $80K-$150K per senior engineer — recruiting fees, ramp time, and the institutional knowledge that walks out the door. One departure linked to on-call load can cost more than two years of observability tooling.
The second is the institutional knowledge tax. Every major incident creates implicit knowledge — the root cause, the workaround, the three adjacent services that almost broke — that lives in Slack threads and people's memory rather than documentation. Two years later, a new engineer triggers the same class of failure because nobody recorded what the actual resolution required. I have seen this happen at three different clients.
The third is trust erosion with non-technical stakeholders. Each visible outage pushes enterprise deals out. One month of visible instability can delay a sales cycle by a quarter. I have personally watched a high-profile incident kill a renewal conversation that had been forecast as closed-won for six weeks. That cost does not appear anywhere in the post-mortem.
Is Investing in Incident Prevention Actually Worth It?
“Prevention is not free, but the math almost always works. A $40K investment in observability that prevents two major incidents per year typically pays for itself within the year.”
The ROI calculation on reliability investment is usually framed backwards. Teams ask what better observability costs when the real question is what the absence of observability costs per year. For a team running six or more P1s annually — which is common for growth-stage startups — a blended cost of $20K per incident means burning $120K a year on incidents. A mature observability stack — OpenTelemetry instrumentation, Grafana dashboards, context-rich alerting, runbooks tied to alert IDs — runs $15K-$30K per year all-in for a 20-engineer team on cloud-hosted tooling.
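Spelled out, the comparison is one multiplication and one division. A quick sketch, using the figures from this paragraph as stand-ins for your own:

```python
# Annual incident burn vs. annual observability spend, using the
# illustrative figures above; replace them with your own numbers.
p1s_per_year = 6
blended_cost_per_p1 = 20_000        # per-incident blended cost from the table
observability_per_year = 25_000     # instrumentation, dashboards, alerting, runbooks, all-in

annual_incident_cost = p1s_per_year * blended_cost_per_p1
print(f"Annual incident cost: ${annual_incident_cost:,}")            # $120,000
print(f"Incidents cost {annual_incident_cost / observability_per_year:.1f}x "
      f"the tooling that would reduce them")                         # 4.8x
```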
The math is not complicated. Teams do not do it because incident costs are spread across four budgets: engineering time hits headcount, support overhead hits CS, churn risk sits in revenue forecasting, and sprint disruption disappears into delivery variance. Nobody owns the total number, so nobody defends the investment.
- Pull your MTTR and P1 count for the last 12 months from your incident management tool (PagerDuty, Incident.io, Opsgenie), and multiply total downtime hours by your hourly revenue run-rate to get direct revenue loss
- Multiply engineers pulled into each incident x hours spent x fully-loaded hourly cost ($150-$250/hr for senior engineers including benefits and overhead)
- Pull support ticket volume during incident windows x average handle time x support cost per hour
- Estimate churn risk: compare 90-day retention for customers who experienced downtime against your baseline — incidents typically correlate with 2-3x elevated churn probability for the affected cohort
- Count days of sprint disruption per incident x daily team cost (annual fully-loaded engineering payroll / 250 working days)
- Sum the five numbers. That is your annual incident cost. Now price observability tooling against it; the sketch below shows the same arithmetic with placeholder inputs.
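Here is that five-step sum as a sketch. Every input is a hypothetical placeholder (ticket counts, at-risk ACV, churn multipliers); pull the real values from your incident tool, support queue, CRM, and payroll data before trusting the output.

```python
# Annual incident cost from the five numbers above.
# All inputs are hypothetical placeholders; pull the real values from your
# incident management tool, support queue, CRM, and payroll data.

ENG_HOURLY_COST = 200                          # fully loaded $/hr
HOURLY_RUN_RATE = 3_000_000 / (24 * 365)       # ARR spread across the year

def annual_incident_cost(
    p1_count=6,                  # major incidents in the last 12 months
    downtime_hours=12,           # total downtime (MTTR x incident count)
    engineer_hours=160,          # engineer-hours spent on response and follow-up
    support_tickets=400,         # tickets raised during incident windows
    support_cost_per_ticket=15,  # average handle time x support cost per hour
    at_risk_acv=300_000,         # ACV of the customers who experienced downtime
    baseline_churn=0.05,         # 90-day churn probability for unaffected customers
    churn_multiplier=2.5,        # elevated churn for the affected cohort (2-3x)
    disrupted_days=12,           # sprint days lost to recovery across all incidents
    daily_team_cost=8_000,       # annual fully-loaded engineering payroll / 250 working days
):
    revenue_loss = downtime_hours * HOURLY_RUN_RATE
    engineering = engineer_hours * ENG_HOURLY_COST
    support = support_tickets * support_cost_per_ticket
    churn_risk = at_risk_acv * baseline_churn * (churn_multiplier - 1)
    sprint = disrupted_days * daily_team_cost
    total = revenue_loss + engineering + support + churn_risk + sprint
    print(f"Annual incident cost: ${total:,.0f}")
    print(f"Per major incident:   ${total / p1_count:,.0f}")
    return total

annual_incident_cost()
```

Even with deliberately modest placeholders, the per-incident figure comes out in the $15K-$35K band most teams discover when they run this exercise for real.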
The Uncomfortable Calculation
The $5,600/minute figure is not wrong. It is just not yours. Most teams that run this exercise land somewhere between $15K and $35K per major incident — and that number changes the conversation. Suddenly the $20K/year for a proper on-call platform, structured runbooks, and distributed tracing is not a nice-to-have engineering infrastructure request. It is basic cost of goods.
I have run this calculation with clients who were previously arguing against observability spend because the tooling felt expensive. Every single time, the annual incident cost came back higher than the tooling cost. Not by a little — typically by a factor of four to eight. The tools were not the expensive part. The incidents were.
The infrastructure investment that feels expensive will look different when you have a denominator. Start with what you can measure. The hidden costs will follow from there.



