Backend Engineering · 2025-10-19 · 7 min read

Zero-Downtime Database Migrations: Our Battle-Tested Playbook

database · migrations · postgresql · zero downtime · devops

In 2023, we took down a client's production system for 47 minutes during what should have been a routine migration. The migration added a NOT NULL column with a default value to a table with 2 million rows. On our development database with 500 rows, it ran in 80 milliseconds. In production, PostgreSQL locked the entire table while it rewrote every row to add the default value. That incident forced us to develop a rigorous migration playbook. Since then, we have run over 300 production migrations with zero unplanned downtime.

Rule one: every migration must be backward compatible. The application currently running must continue to work after the migration executes. Never rename a column, drop a column, or change a column type in a single migration. Use a multi-step process: add the new column, deploy code that writes to both old and new, backfill the new column, deploy code that reads from the new column, then drop the old column in a separate migration.
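As a sketch, here is what that multi-step process looks like for a column rename (the `users` table and the `name` → `full_name` rename are hypothetical examples, not from the incident above):

```sql
-- Step 1 (migration): add the new column alongside the old one.
ALTER TABLE users ADD COLUMN full_name text;

-- Step 2 (deploy): ship code that writes to both "name" and "full_name".

-- Step 3 (migration): backfill the new column; batch this on large tables.
UPDATE users SET full_name = name WHERE full_name IS NULL;

-- Step 4 (deploy): ship code that reads from "full_name" only.

-- Step 5 (separate migration, after verification): drop the old column.
ALTER TABLE users DROP COLUMN name;
```

Between every step, the previously deployed application version keeps working, which is exactly what backward compatibility requires.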

Rule two: never add a NOT NULL constraint with a default to an existing large table. On older PostgreSQL versions this forces a full table rewrite under an ACCESS EXCLUSIVE lock; since PostgreSQL 11 a constant default is a metadata-only change, but a volatile default such as now() still rewrites the table, and SET NOT NULL still scans it under an exclusive lock. Instead, add a nullable column, set the default at the application level, backfill existing rows in batches, then add the NOT NULL constraint once all rows have values.
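One safe sequence, sketched with a hypothetical `orders` table and `status` column (the NOT VALID / VALIDATE trick requires PostgreSQL 12+ for the final step to skip its full scan):

```sql
-- Step 1: add the column nullable, no default -- a metadata-only change.
ALTER TABLE orders ADD COLUMN status text;

-- Step 2: backfill existing rows in batches at the application's pace.

-- Step 3: enforce NOT NULL without a long lock. NOT VALID takes only a
-- brief lock; VALIDATE scans the table without blocking writes.
ALTER TABLE orders
  ADD CONSTRAINT orders_status_not_null CHECK (status IS NOT NULL) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT orders_status_not_null;

-- Step 4 (PostgreSQL 12+): SET NOT NULL uses the validated constraint as
-- proof and skips the table scan; then drop the now-redundant CHECK.
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
ALTER TABLE orders DROP CONSTRAINT orders_status_not_null;
```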

Rule three: always use CREATE INDEX CONCURRENTLY. A standard CREATE INDEX takes a SHARE lock on the table, blocking all writes for the duration of the index build. CONCURRENTLY builds the index without blocking reads or writes, at the cost of a slower build; it cannot run inside a transaction block, and a failed build leaves behind an INVALID index that must be dropped before retrying.
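A minimal sketch, again with hypothetical table and column names:

```sql
-- Must run outside a transaction block (most migration tools have a
-- "disable transaction" flag for this).
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);

-- If the build fails, it leaves an INVALID index; find and drop it:
SELECT c.relname
  FROM pg_index i
  JOIN pg_class c ON c.oid = i.indexrelid
 WHERE NOT i.indisvalid;

DROP INDEX CONCURRENTLY IF EXISTS idx_orders_customer_id;  -- then retry
```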

Rule four: test migrations against production-sized data. We maintain sanitized copies of each client's production database for migration testing. If a migration takes more than 5 seconds on the test database, it gets rewritten.
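One way to time a candidate DDL change against the production-sized copy without actually applying it is a psql session that measures inside a transaction and rolls back (this works for transactional DDL only, not for CREATE INDEX CONCURRENTLY):

```sql
-- psql, connected to the sanitized production copy
\timing on
BEGIN;
ALTER TABLE orders ADD COLUMN status text;  -- candidate migration
ROLLBACK;  -- nothing is applied; \timing reports the elapsed time
```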

Rule five: every migration has a rollback plan documented and tested on staging before running in production.

For large data backfills, we never run a single UPDATE touching millions of rows. We process 1,000 rows at a time with 100ms pauses between batches. A 2-million-row backfill takes about 40 minutes. Slow and safe beats fast and catastrophic.
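The batching pattern can be sketched as a PL/pgSQL DO block; the `users` table and `email_normalized` column are hypothetical, and the COMMIT between batches requires PostgreSQL 11+ with the block run outside an explicit transaction:

```sql
DO $$
DECLARE
  rows_updated integer;
BEGIN
  LOOP
    -- Touch at most 1,000 unprocessed rows per iteration.
    UPDATE users
       SET email_normalized = lower(email)
     WHERE id IN (SELECT id FROM users
                   WHERE email_normalized IS NULL
                   LIMIT 1000);
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    EXIT WHEN rows_updated = 0;
    COMMIT;                 -- release locks after every batch
    PERFORM pg_sleep(0.1);  -- 100 ms pause to let normal traffic through
  END LOOP;
END;
$$;
```

Committing per batch keeps each lock window tiny, and the pause gives autovacuum and regular queries room to breathe.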

Our workflow: write and test locally, run against production-sized data, deploy to staging, schedule during low traffic, execute with real-time monitoring, verify for 30 minutes. The 47 minutes of downtime in 2023 cost our client roughly $12,000 and months of trust. The extra hour we spend making every migration safe is the best investment in our process.

About the Author

Fordel Studios

AI-native app development for startups and growing teams. 14+ years of experience shipping production software.
