In 2023, we took down a client's production system for 47 minutes during what should have been a routine migration. The migration added a NOT NULL column with a default value to a table with 2 million rows. On our development database with 500 rows, it ran in 80 milliseconds. In production, PostgreSQL locked the entire table while it rewrote every row to add the default value. That incident forced us to develop a rigorous migration playbook. Since then, we have run over 300 production migrations with zero unplanned downtime.
Rule one: every migration must be backward compatible. The application currently running must continue to work after the migration executes. Never rename a column, drop a column, or change a column type in a single migration. Use a multi-step process: add the new column, deploy code that writes to both old and new, backfill the new column, deploy code that reads from the new column, then drop the old column in a separate migration.
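As a sketch, a rename spread across releases might look like this (the `orders` table and its `state`/`status` columns are hypothetical names for illustration):

```sql
-- Release 1, migration: add the new column, nullable.
ALTER TABLE orders ADD COLUMN status text;

-- Release 1, code deploy: application writes to BOTH state and status.

-- Release 2, migration: backfill old rows in batches, e.g. one slice at a time.
UPDATE orders SET status = state
WHERE status IS NULL AND id BETWEEN 1 AND 1000;

-- Release 3, code deploy: application reads from status only.

-- Release 4, migration: only now is the old column safe to drop.
ALTER TABLE orders DROP COLUMN state;
```

At every step, the previously deployed application version still finds the columns it expects, so a deploy or migration can be rolled back independently.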
Rule two: never add a NOT NULL constraint with a default to an existing large table. This requires a full table rewrite and holds an ACCESS EXCLUSIVE lock. Instead, add a nullable column, set the default at the application level, backfill existing rows in batches, then add the NOT NULL constraint once all rows have values.
Rule three: always use CREATE INDEX CONCURRENTLY. A standard CREATE INDEX takes a SHARE lock on the table, blocking all writes for the duration of the index build. CONCURRENTLY builds the index without holding a lock.
Rule four: test migrations against production-sized data. We maintain sanitized copies of each client's production database for migration testing. If a migration takes more than 5 seconds on the test database, it gets rewritten.
Rule five: every migration has a rollback plan documented and tested on staging before running in production.
For large data backfills, we never run a single UPDATE touching millions of rows. We process 1,000 rows at a time with 100ms pauses between batches. A 2-million-row backfill takes about 40 minutes. Slow and safe beats fast and catastrophic.
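One way to express that batching entirely in SQL is a PostgreSQL 11+ procedure with transaction control, sketched here for a hypothetical `orders.status` backfill (table, column, and value are illustrative):

```sql
CREATE PROCEDURE backfill_order_status()
LANGUAGE plpgsql AS $$
DECLARE
  rows_updated integer;
BEGIN
  LOOP
    -- Touch at most 1,000 rows per batch.
    UPDATE orders SET status = 'pending'
    WHERE id IN (
      SELECT id FROM orders WHERE status IS NULL LIMIT 1000
    );
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    EXIT WHEN rows_updated = 0;
    COMMIT;                 -- release row locks after every batch
    PERFORM pg_sleep(0.1);  -- 100 ms pause to let other traffic through
  END LOOP;
END $$;

CALL backfill_order_status();
```

Committing inside the loop keeps each transaction tiny, so replication lag stays low and no batch holds locks for more than a moment.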
Our workflow: write and test locally, run against production-sized data, deploy to staging, schedule during low traffic, execute with real-time monitoring, verify for 30 minutes. The 47 minutes of downtime in 2023 cost our client roughly $12,000 and months of trust. The extra hour we spend making every migration safe is the best investment in our process.
About the Author
Fordel Studios
AI-native app development for startups and growing teams. 14+ years of experience shipping production software.