Engineering · 9 min read

We Ran a Zero-Downtime Migration and Lost Four Days of Transactions

A fintech client needed a schema migration with zero downtime. We planned for weeks, tested in staging, and rolled out with confidence. Then the dual-write layer silently dropped four days of payment transactions. This is what actually happened.

Author: Abhishek Sharma · Head of Engineering @ Fordel Studios

This is the story of the most expensive mistake I have made in fourteen years of engineering. Not expensive in dollars — we recovered the data. Expensive in sleep, trust, and the particular kind of shame that comes from doing everything by the book and still getting it catastrophically wrong.

How did a well-planned migration go sideways?

In late 2024, we were working with a Series B fintech client — a payments platform processing about 40,000 transactions a day across three currencies. Their core payments table had grown into a 200-column monster over four years of feature accretion. Every new feature got bolted onto the same table. Payment metadata, fraud signals, compliance flags, reconciliation data — all of it living in one place.

The table had north of 800 million rows. Queries that used to run in 20ms were now taking 900ms. Their nightly reconciliation job, which needed to scan the previous day’s transactions, had grown from 12 minutes to nearly 3 hours. The on-call team was getting paged because the reconciliation was colliding with morning traffic in the APAC region.

The fix was structurally obvious: decompose the table. Pull payment metadata into its own table. Move fraud signals to a separate store. Keep the core transaction table lean — maybe 30 columns of essential payment data. Standard normalize-and-partition work.

The constraint was equally obvious: zero downtime. This was a payments platform. You cannot tell merchants that card payments will fail for six hours on Saturday night while you restructure your database. That is not a thing that happens in fintech.

What was the migration strategy?

We spent three weeks planning. The strategy was a textbook expand-contract migration with a dual-write layer. I had run this pattern successfully four times before at different scales. The plan looked like this:

The Migration Plan
  • Phase 1: Create new tables alongside old ones. Deploy application-level dual-write — every transaction writes to both old and new schema.
  • Phase 2: Backfill historical data from old table to new tables using batched background jobs.
  • Phase 3: Switch reads to new tables. Verify data consistency for 48 hours.
  • Phase 4: Drop writes to old table. Contract phase complete.

We tested it in staging with synthetic load. We ran it against a production-sized dataset on a replica. We had monitoring dashboards for write latency, row counts, and consistency checks. We had runbooks for twelve different failure scenarios.

I was confident. That should have been my first warning sign.

What actually went wrong?

Phase 1 went perfectly. We deployed the dual-write layer on a Tuesday afternoon. Both old and new tables were receiving writes. Latency was up by about 15ms per transaction — within our acceptable window. We high-fived in the Slack channel.

Phase 2 — the backfill — started that night. We were processing 50,000 rows per batch, with a 200ms pause between batches to avoid hammering the primary. The backfill would take about four days to complete.
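Paced batching of that shape can be sketched as follows; the batch size and pause mirror the numbers above, while the function names are illustrative:

```python
import time

BATCH_SIZE = 50_000  # rows per batch, as described above
PAUSE_S = 0.2        # 200 ms pause between batches to spare the primary

def backfill(rows, insert_batch, batch_size=BATCH_SIZE, pause_s=PAUSE_S):
    """Copy rows from the old table into the new tables in paced batches.

    `rows` stands in for a cursor over the old table and `insert_batch` for
    the transform-and-insert step; both are illustrative placeholders.
    """
    for start in range(0, len(rows), batch_size):
        insert_batch(rows[start:start + batch_size])
        time.sleep(pause_s)  # throttle so the backfill never saturates I/O
```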

Here is what we did not account for: the backfill job and the dual-write layer both used the same transaction ID as a deduplication key. The backfill would read a row from the old table, transform it, and insert it into the new table — but only if that transaction ID did not already exist in the new table. This was supposed to prevent duplicates when the backfill caught up to the dual-write window.

The problem was a race condition that is almost embarrassing in how textbook it was.

When a new payment came in during the backfill window, here is what happened: the application wrote to the old table first, then to the new table. But between those two writes — a window of maybe 3–8 milliseconds — the backfill job could read that row from the old table and try to insert it into the new table. The backfill would succeed because the dual-write had not reached the new table yet. Then the dual-write would try to insert and find the row already existed.

Our dual-write layer handled this with an ON CONFLICT DO NOTHING clause. Which is correct. That is the right thing to do for idempotent writes.
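The failure is easy to reproduce in miniature. The sketch below uses SQLite, where `INSERT OR IGNORE` behaves like Postgres's `ON CONFLICT DO NOTHING` for this purpose; the table and column names are illustrative:

```python
import sqlite3

# Reproduce the race in miniature: the backfill's partially transformed row
# lands first, then the dual-write's complete row is silently discarded.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE payments_new "
    "(txn_id TEXT PRIMARY KEY, amount INTEGER, risk_score REAL)"
)

# 1. Backfill wins the race: inserts its transformed row, fraud signal missing.
db.execute("INSERT INTO payments_new VALUES ('txn-123', 500, NULL)")

# 2. Dual-write arrives a few milliseconds later with the complete row, hits
#    the primary-key conflict, and does nothing -- no error, no log line.
#    (INSERT OR IGNORE is SQLite's analogue of ON CONFLICT DO NOTHING.)
db.execute("INSERT OR IGNORE INTO payments_new VALUES ('txn-123', 500, 0.87)")

row = db.execute(
    "SELECT risk_score FROM payments_new WHERE txn_id = 'txn-123'"
).fetchone()
print(row)  # (None,) -- the real-time fraud signal is gone, silently
```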

"The dual-write said DO NOTHING. The backfill wrote a partially transformed row. And we did not notice for four days."
Abhishek Sharma

Except the backfill was not writing identical data. It was writing the transformed, decomposed version of the row. The dual-write was writing the full, un-transformed version. When the backfill won the race, the new table got the backfill’s version of the row — which was correct structurally but was missing the real-time fraud signals and compliance flags that the dual-write was supposed to carry.

For four days, roughly 12% of new transactions were being written to the new table with incomplete data. The 12% was not random — it tracked almost exactly with the backfill’s processing speed. As the backfill moved through the old table chronologically, it created a moving window where the race condition could trigger.

~19,200 transactions with incomplete data: across four days, 12% of new transactions hit the race condition window.
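The figure is a straightforward back-of-envelope product of the numbers quoted earlier:

```python
# ~40,000 transactions/day, over the 4-day backfill, ~12% in the race window
affected = round(40_000 * 4 * 0.12)
print(affected)  # 19200
```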

How did we finally catch it?

We had consistency checks. They ran every six hours, comparing row counts between old and new tables. Row counts matched perfectly — because the rows were there. They just had incomplete data. Our consistency checks were counting rows, not validating content.

A junior engineer on the client’s team caught it. She was debugging an unrelated fraud alert and noticed that a transaction in the new table was missing its risk_score field. She checked another recent transaction. Same thing. She checked ten more. Seven of them had null fraud signals.

She pinged us at 11:47 PM IST on a Saturday. I will never forget the timestamp because I was watching cricket.

The next 72 hours were the worst of my professional career. We had to:

Recovery Steps
  • Immediately halt the backfill job to stop the bleeding.
  • Identify every affected transaction by comparing old and new table data field-by-field for the four-day window.
  • Rebuild the affected rows from the old table (which still had complete data, thankfully).
  • Verify that no downstream systems — fraud detection, reconciliation, compliance reporting — had acted on incomplete data.
  • Have a very honest phone call with the client’s CTO at 6 AM on a Sunday.

The recovery itself was straightforward once we understood the problem. The old table still had complete data. We wrote a targeted repair script that identified affected rows by comparing fraud signal columns between old and new tables, then overwrote the new table’s data from the old table’s authoritative copy. The repair took about four hours to run.
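The shape of that repair, treating the old table as authoritative, can be sketched with SQLite; the schema and column names are illustrative, not the client's:

```python
import sqlite3

# Hedged sketch of the repair: find new-table rows whose fraud signal
# diverges from the old (authoritative) table and overwrite them.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE payments_old (txn_id TEXT PRIMARY KEY, risk_score REAL);
CREATE TABLE payments_new (txn_id TEXT PRIMARY KEY, risk_score REAL);
INSERT INTO payments_old VALUES ('a', 0.9), ('b', 0.1);
INSERT INTO payments_new VALUES ('a', NULL), ('b', 0.1);  -- 'a' lost the race
""")

# Overwrite every new row whose fraud signal disagrees with the old table.
# IS NOT is NULL-safe, so rows with a missing risk_score are caught too.
db.execute("""
UPDATE payments_new
SET risk_score = (SELECT o.risk_score FROM payments_old o
                  WHERE o.txn_id = payments_new.txn_id)
WHERE risk_score IS NOT (SELECT o.risk_score FROM payments_old o
                         WHERE o.txn_id = payments_new.txn_id)
""")

rows = dict(db.execute("SELECT txn_id, risk_score FROM payments_new"))
print(rows)  # {'a': 0.9, 'b': 0.1}
```

In production the comparison covered every fraud signal and compliance column, not one field, and ran only over the four-day incident window.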

The hard part was the audit. This was a fintech company. Regulators care about transaction data integrity. We had to produce a full incident report documenting which transactions were affected, what data was incomplete, whether any compliance decisions were made on incomplete data, and what systemic changes we were making to prevent recurrence.

···

Was any real damage done?

We got lucky. The fraud detection system was still reading from the old table — we had not switched reads yet (Phase 3 had not started). So no fraud decisions were made on incomplete data. The reconciliation system was also still on the old table. The incomplete data in the new table was effectively a shadow copy that nothing was consuming yet.

If we had been two days further into the plan — if we had switched reads to the new table before catching this — real fraud signals would have been missing from real-time decisions. Transactions that should have been flagged might have gone through. The regulatory exposure would have been genuinely serious.

"The margin between a recoverable mistake and a regulatory incident was about 48 hours."
Post-incident review

What would we do differently?

Everything about our process was right except the one thing that mattered. We tested the migration. We monitored the migration. We had runbooks for failures. But our testing and monitoring were both built around the wrong invariant. We checked that data existed. We did not check that data was complete.

What Changed After This Incident
  • Content-level consistency checks, not just row counts. We now hash critical columns and compare checksums between source and destination during any migration.
  • Backfill and dual-write use different conflict resolution strategies. The dual-write always wins. If the backfill encounters a row that already exists, it skips — it does not write its own version.
  • Time-bounded race condition analysis as a formal step in migration planning. We explicitly model the window between dual writes and ask: what happens if another process touches this row during the gap?
  • Canary reads during dual-write phases. Before switching reads officially, we run shadow reads against both tables and diff the results in real-time.
  • No migration goes more than 24 hours without a content-level audit of a random sample from the affected time window.
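The first of those changes, content-level checksums, can be sketched like this; the column list and function names are assumptions for illustration:

```python
import hashlib

# Columns whose content must match between source and destination.
# This list is illustrative; in practice it covers every critical field.
CRITICAL = ("txn_id", "amount", "risk_score", "compliance_flag")

def row_checksum(row: dict) -> str:
    """Hash the critical columns of one row into a comparable digest."""
    payload = "|".join(str(row.get(f)) for f in CRITICAL)
    return hashlib.sha256(payload.encode()).hexdigest()

def diverged(old_rows: dict, new_rows: dict) -> list:
    """Return txn_ids whose content differs, even when row counts match."""
    return [tid for tid in old_rows
            if row_checksum(old_rows[tid]) != row_checksum(new_rows.get(tid, {}))]
```

A check like this would have flagged the incomplete rows on the first six-hour cycle, because the missing risk_score changes the digest even though the row count does not.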

The deeper lesson was about confidence. I had run this pattern before. It had worked. I trusted the pattern more than I interrogated the specifics of this particular system. The dual-write layer was generic — it did not account for the fact that the backfill was writing a structurally different version of the same data. In a migration where old and new schemas are identical, ON CONFLICT DO NOTHING is fine. When the schemas diverge — when the backfill is transforming data — the conflict resolution strategy has to be aware of which writer is authoritative.

···

Is zero-downtime migration worth the risk?

Yes. Unambiguously yes. The alternative — scheduled downtime for a payments platform — carries its own risks: lost revenue, merchant churn, SLA violations. Zero-downtime migration is the right approach. But it demands a level of rigor that I underestimated.

Every migration strategy has a failure mode that the strategy itself makes harder to detect. Expand-contract migrations fail silently because both tables have data — the question is whether they have the right data. Blue-green database switches fail when the cutover timing is wrong. Strangler patterns fail when the routing layer has bugs you cannot see.

The common thread is that migration failures are quiet. Production does not crash. Alerts do not fire. Traffic keeps flowing. The data just slowly, silently diverges from what it should be. And by the time someone notices, the blast radius has been growing for days.

I tell this story to every client now before we start a migration engagement. Not as a sales pitch — the opposite, really. I tell them so they understand why we insist on content-level validation, why we build expensive reconciliation checks, why the migration timeline always has what looks like excessive buffer time. Those are not engineering overhead. They are the lessons we learned at 11:47 PM on a Saturday night, watching incomplete transaction data scroll across a terminal while my cricket match played unwatched in the background.
