Engineering · 9 min read

The Time a Silent Deployment Broke Production for 11 Hours and Nobody Noticed

A green CI pipeline, a passing test suite, and a config change that silently routed production writes to the wrong database partition for 11 hours. This is the story of the most dangerous kind of failure — the one that looks like success.

Abhishek Sharma · Head of Engineering @ Fordel Studios

This happened eighteen months ago. I still think about it weekly.

How does a deployment break production without a single alert firing?

We were seven months into a backend rebuild for a Series B fintech client. Payment processing, ledger management, reconciliation — the whole stack. The system handled roughly 40,000 transactions per day across three database partitions, sharded by merchant region.

The architecture was solid. We had a Go service layer talking to PostgreSQL with partition routing based on an environment variable called DB_PARTITION_MAP. This variable held a JSON blob mapping merchant region codes to database connection strings. Simple, readable, battle-tested over five months of production use.

On a Thursday afternoon, one of our engineers pushed a deployment that updated the rate limiting configuration for a specific API endpoint. The change was three lines. The PR had two approvals. CI was green. Integration tests passed. The deployment rolled out via our blue-green pipeline, health checks came back healthy, and traffic shifted over in under ninety seconds.

Everything looked perfect. And everything was broken.

···

What actually went wrong with the configuration?

Here is the part that still makes me uncomfortable. The deployment image was built from a branch that had been rebased earlier that day. During the rebase, a conflict in the Docker environment file was resolved by accepting the incoming changes — which included a DB_PARTITION_MAP value from a feature branch where someone had been testing a fourth partition for an upcoming geographic expansion.

The test partition connection string pointed to a database that existed. It accepted connections. It had the correct schema. It just was not production. It was a staging replica that had been set up weeks earlier for the expansion testing and never torn down.

11h 23m — Time to detection: from deployment to the first human noticing something wrong
~18,400 — Transactions affected: writes that landed in the staging partition instead of production
0 — Alerts triggered: every health check and monitor reported green

The partition routing logic worked exactly as designed. It received a region code, looked it up in the map, found a valid connection string, connected, and wrote. The staging database had the same schema because it was a recent clone. The writes succeeded. The responses came back with 200 status codes. From the application’s perspective, nothing was wrong.

"The most dangerous production failures are the ones that look like everything is working."
— Internal post-mortem document

Why did none of our monitoring catch it?

This is the question that kept me up at night, and the answer is embarrassing in its simplicity: we monitored for failure, not for correctness.

Our monitoring stack — Prometheus, Grafana, PagerDuty — was comprehensive by any reasonable standard. We tracked error rates, latency percentiles, connection pool utilization, query duration, CPU, memory, disk I/O. Every metric you would find in a production monitoring checklist. All of them were green because the system was performing well. It was just performing well against the wrong database.

We had no check that verified the actual destination of writes. No assertion that said: the transaction I just wrote should be queryable from the production read path. No end-to-end verification that a write entering the system could be read back through the normal application flow.

The read path was fine, by the way. Reads still went to the correct production partitions. So the application worked for every existing merchant. Dashboards showed historical data correctly. It was only new transactions from merchants in the affected region — roughly 30% of daily volume — that vanished into the staging database.

···

How did we finally discover the problem?

A merchant support call. At 3:47 AM IST on Friday morning, a merchant in the affected region contacted our client’s support team because their end-of-day reconciliation was showing zero transactions for the past several hours despite their point-of-sale system confirming successful payments.

The support engineer checked the admin dashboard. No transactions. They assumed it was a UI bug and escalated to our on-call engineer, who checked the database directly. No transactions in the production partition for that region since 4:24 PM Thursday — the exact time the deployment completed.

It took our on-call another forty minutes to connect the dots between the deployment and the missing data. The breakthrough came when they ran a query against every known database in our infrastructure and found 18,400 fresh transaction records sitting in what was supposed to be an empty staging database.

What did the recovery look like?

Recovery was a controlled panic. We rolled back the deployment immediately, which restored the correct partition map. Then we had to migrate 18,400 transaction records from the staging database to the correct production partitions. This sounds simple. It was not.

The staging database had accumulated records with auto-incremented IDs that conflicted with the production partition’s ID sequences. The records had timestamps and audit trails referencing the staging database host. Several hundred transactions had triggered downstream webhook notifications to merchants with staging-specific metadata embedded in the payloads.

We wrote a migration script that ran for six hours. It re-sequenced IDs, corrected metadata, replayed the affected webhook notifications with corrected payloads, and generated reconciliation reports for every impacted merchant. Our client’s support team spent the entire weekend handling merchant inquiries.

6h — Migration duration: custom script to move and correct 18,400 records
3 days — Full resolution: including merchant communication and reconciliation
$0 — Financial loss: no transactions were lost, but trust damage was real

No money was actually lost — every transaction was recoverable. But the trust damage with the client was significant. We had positioned ourselves as the team that takes production seriously, and we had just demonstrated a fundamental gap in our deployment verification.

···

What changes did we make after the incident?

We implemented five changes, in order of how much they mattered:

Post-Incident Changes (Ranked by Impact)
  • Write-path verification: A synthetic transaction is written and read back through the full application path every 60 seconds. If the read fails or returns unexpected data, PagerDuty fires immediately.
  • Config fingerprinting: Every deployment now computes a SHA-256 hash of all environment variables and compares it against the expected hash stored in our deployment manifest. A mismatch blocks the deployment.
  • Partition connection labeling: Each database connection is tagged with a purpose label (production-region-1, staging-expansion-test, etc). The application refuses to write to any connection whose label does not match the expected pattern for the environment.
  • Staging infrastructure TTLs: All non-production databases are created with a mandatory TTL. When the TTL expires, the database is locked (not deleted) and an alert is sent. No more zombie staging databases sitting around for weeks.
  • Conflict resolution policy: Merge conflicts in environment and configuration files now require a dedicated review from someone not involved in the original change. We treat config conflicts as potential data-routing incidents.
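To make the config-fingerprinting idea concrete, here is one way it could work. This is a sketch under assumptions: `fingerprint` is an invented name, and the manifest comparison and deployment-blocking logic are not shown.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// fingerprint computes a deterministic SHA-256 hash over a set of
// environment variables by hashing key=value pairs in sorted key
// order (Go map iteration order is randomized, so sorting matters).
func fingerprint(env map[string]string) string {
	keys := make([]string, 0, len(env))
	for k := range env {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		fmt.Fprintf(h, "%s=%s\n", k, env[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	expected := fingerprint(map[string]string{
		"DB_PARTITION_MAP": `{"apac": "postgres://db-apac.internal/prod"}`,
	})
	// A rebase sneaks in a different partition map; the hash no longer matches,
	// and a mismatch would block the deployment before traffic shifts.
	deployed := fingerprint(map[string]string{
		"DB_PARTITION_MAP": `{"apac": "postgres://db-staging.internal/test"}`,
	})
	fmt.Println("fingerprint match:", expected == deployed)
}
```

The point of hashing the whole environment rather than diffing individual keys is that it catches changes nobody thought to watch — including a value swapped silently during a rebase conflict.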

The first change — write-path verification — is the one I wish we had built from day one. It is conceptually simple: write a canary record, read it back, verify it came from the right place. We could have built it in an afternoon. We just never thought to because our monitoring was already good. Good for catching failures, useless for catching misdirection.
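That afternoon's worth of work looks roughly like this. It is a sketch, not our production code: `canaryCheck` and the write/read function signatures are assumptions. In a real deployment the two callbacks would wrap the application's normal write and read paths, and a scheduler would run the check every 60 seconds and page on any error.

```go
package main

import (
	"fmt"
	"time"
)

// canaryCheck writes a synthetic record through the normal write path,
// then reads it back through the normal read path. If the record is
// missing or wrong, writes and reads are not hitting the same place.
func canaryCheck(
	write func(id, value string) error,
	read func(id string) (string, error),
) error {
	id := fmt.Sprintf("canary-%d", time.Now().UnixNano())
	if err := write(id, "ok"); err != nil {
		return fmt.Errorf("canary write failed: %w", err)
	}
	got, err := read(id)
	if err != nil {
		return fmt.Errorf("canary read failed: %w", err)
	}
	if got != "ok" {
		return fmt.Errorf("canary read back %q, expected %q", got, "ok")
	}
	return nil
}

func main() {
	// In-memory stand-ins for the real write and read paths.
	store := map[string]string{}
	err := canaryCheck(
		func(id, value string) error { store[id] = value; return nil },
		func(id string) (string, error) { return store[id], nil },
	)
	fmt.Println("canary:", err)
}
```

The crucial property is that the check exercises the same two paths users exercise. A canary that queries the database directly would have passed during our incident; one that reads through the application's read path would have failed within a minute of the bad deploy.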

"Monitoring for failure is table stakes. Monitoring for correctness is what separates teams that sleep well from teams that get lucky."
— Abhishek Sharma

Is it worth building deployment verification beyond CI checks?

Absolutely, and I am not saying that just because I lived through this. The pattern we missed is common enough to have a name in reliability engineering: silent corruption. Your system does the wrong thing successfully, and every signal tells you it is healthy.

CI and integration tests verify that your code does what it is supposed to do in the test environment. They do not and cannot verify that your production deployment is actually running against the correct infrastructure. That gap — between test environment verification and production environment verification — is where the worst incidents live.

If your application routes data based on configuration, you need a runtime assertion that the configuration is producing the expected behavior in the actual production environment. Not just that the config parsed correctly. Not just that the connections are alive. That the data is going where it should and is retrievable through the path your users will use.

···

What would we do differently if we could start over?

Beyond the specific technical changes, this incident shifted my perspective on what production readiness actually means.

Before this, I would have told you our deployment pipeline was mature. CI with integration tests, blue-green deployments, automated rollbacks, comprehensive monitoring. All the boxes checked. And all of it was useless against a misconfigured environment variable that pointed to a valid but wrong database.

The lesson is not about database partitioning or environment variables. It is about the assumption that a system working without errors is a system working correctly. Those are not the same thing, and the gap between them is where the expensive incidents hide.

If I were starting a new engagement today with the same architecture, I would build the write-path verification canary before I built the monitoring dashboard. I would treat configuration as a data-routing decision with the same rigor we apply to schema migrations. And I would schedule the teardown of every non-production resource on the same day it is created, not when someone remembers to clean it up.

The hardest part of this story is not the technical details. It is admitting that the incident was entirely preventable, that the gap in our monitoring was obvious in hindsight, and that we had simply never stress-tested our assumption that green metrics meant correct behavior.

Eleven hours. Zero alerts. Eighteen thousand transactions in the wrong database. That is what a gap between uptime monitoring and correctness monitoring actually looks like.
