If you have run PostgreSQL in production for more than a year, you have hit at least one of these: a table that is 10x larger than its actual data, a VACUUM that ran for six hours and pegged your I/O, or a mysterious wraparound warning that made your DBA send a 2 AM Slack message. All of these trace back to one design decision.
PostgreSQL chose Multi-Version Concurrency Control (MVCC) as its core concurrency strategy. And that choice has consequences that the docs gloss over.
What problem does MVCC actually solve?
The fundamental tension in any database is this: multiple transactions want to read and write the same data simultaneously. You have three options for handling that.
Option one: lock everything. Reader blocks writer, writer blocks reader. Simple, correct, slow. This is what MySQL with MyISAM did, and it is why MyISAM does not exist in serious production anymore.
Option two: lock at the row level, but still make readers wait for writers to finish. This is what early row-level locking systems did. Better, but readers still block on contested rows during writes.
Option three: never block readers at all. Give every transaction a snapshot of the database as it existed when the transaction started, and let it read from that snapshot while other transactions write freely. This is PostgreSQL’s approach.
The key insight is that PostgreSQL achieves this by keeping multiple physical versions of every row directly in the table’s heap files. There is no separate undo log. There is no rollback segment. The table itself is the version store.
“PostgreSQL does not have an undo log. The table IS the undo log. That single design decision explains 80% of PostgreSQL’s operational behavior.”
How does tuple versioning work?
Every row in PostgreSQL (internally called a tuple) carries hidden system columns that control its visibility. The two critical ones are xmin and xmax.
xmin is the transaction ID (XID) of the transaction that created this tuple version. xmax is the transaction ID that deleted or updated it — if xmax is zero, the tuple has not been deleted yet.
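You can see these columns yourself by naming them explicitly — `SELECT *` hides them. A quick sketch (the table `accounts` and its columns are hypothetical examples):

```sql
-- System columns must be selected by name; SELECT * does not return them.
-- "accounts" is a placeholder table for illustration.
SELECT xmin, xmax, ctid, id, balance
FROM accounts
LIMIT 5;

-- Compare against the current transaction ID:
SELECT txid_current();
```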
Here is what happens when you run an UPDATE:
- 1. PostgreSQL finds the current tuple (the one where xmin is committed and xmax is zero)
- 2. It sets xmax on the old tuple to the current transaction’s XID — this marks it as dead to future transactions
- 3. It inserts an entirely new tuple with xmin set to the current transaction’s XID and xmax set to zero
- 4. It updates the index entries to point to the new tuple (unless HOT optimization applies)
- 5. The old tuple still physically exists on disk. It is not removed. It is not overwritten. It just sits there.
This means an UPDATE in PostgreSQL is really a DELETE plus an INSERT at the storage level. A table that gets updated heavily will accumulate dead tuples at a rate proportional to its write throughput. A row that gets updated 100 times will have 100 dead physical copies sitting in the heap.
DELETE works the same way — it just sets xmax on the existing tuple. The physical space is not reclaimed until VACUUM runs.
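You can watch the delete-plus-insert behavior with a throwaway table: after an UPDATE, the row's `ctid` (its physical position in the heap) changes, because a brand-new tuple was written. A minimal sketch:

```sql
-- Throwaway demo table.
CREATE TABLE mvcc_demo (id int PRIMARY KEY, val int);
INSERT INTO mvcc_demo VALUES (1, 10);

-- Note the row's physical position, e.g. (0,1).
SELECT ctid, xmin, xmax FROM mvcc_demo;

UPDATE mvcc_demo SET val = 20 WHERE id = 1;

-- The row now lives at a new ctid; the old tuple still physically
-- occupies its original slot as a dead tuple until VACUUM removes it.
SELECT ctid, xmin, xmax FROM mvcc_demo;
```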
How does PostgreSQL decide which tuples are visible?
When a transaction runs a SELECT, it does not see every tuple in the heap. It applies visibility rules based on the transaction’s snapshot.
A snapshot is essentially a record of which transactions were committed at the time the current transaction (or statement) began. The rules are roughly:
- VISIBLE if: xmin is committed AND (xmax is zero OR xmax is not yet committed from this transaction’s perspective)
- INVISIBLE if: xmin is not yet committed (inserted by an in-progress or future transaction)
- INVISIBLE if: xmax is committed (deleted or updated by a transaction that finished before our snapshot)
- The actual rules involve about 12 edge cases including subtransactions, combo CIDs, and hint bits
This is why PostgreSQL can provide true snapshot isolation without locks. Every transaction sees a consistent view of the data based on when it started, regardless of what other transactions are doing concurrently. Reads never block writes. Writes never block reads.
But there is a cost. Every single tuple access requires checking these visibility rules. PostgreSQL cannot just read a page from disk and return it — it has to evaluate every tuple on the page against the current snapshot. This is why sequential scans on heavily bloated tables are catastrophically slow: you are reading pages full of dead tuples that all fail the visibility check.
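You can inspect the snapshot a session is holding. A sketch using `pg_current_snapshot()`, available since PostgreSQL 13 (older releases expose the same information as `txid_current_snapshot()`):

```sql
-- Returns xmin:xmax:xip_list. Transactions below xmin are visible if
-- committed; transactions at or above xmax are invisible; the xip list
-- holds transactions that were in progress when the snapshot was taken.
SELECT pg_current_snapshot();
```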
What is the Visibility Map and why does it matter?
Starting with PostgreSQL 8.4, the system maintains a Visibility Map (VM) — a bitmap with two bits per heap page. The first bit tracks whether all tuples on a page are visible to all current transactions. The second bit (added in 9.6) tracks whether the page has been frozen.
When an index-only scan runs, PostgreSQL checks the VM first. If the page is marked all-visible, it can answer the query from the index alone without touching the heap. If the page is not all-visible (because it contains dead tuples or recently modified tuples), it has to fetch the heap page to check visibility — which is dramatically slower.
This is one of the less obvious performance consequences of table bloat. Even if your query uses an index, a bloated table with many not-all-visible pages will force heap fetches that destroy your index-only scan performance. You will see this in EXPLAIN ANALYZE as Heap Fetches going from near-zero to millions.
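The contrib extension pg_visibility lets you look at the map directly. A sketch (`my_table` is a placeholder, and the EXPLAIN assumes an index exists on `id`):

```sql
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- Counts of all-visible and all-frozen pages for the relation.
SELECT * FROM pg_visibility_map_summary('my_table');

-- Watch the Heap Fetches line in the plan output: near-zero means the
-- VM is doing its job, millions means heap fetches are killing you.
EXPLAIN (ANALYZE) SELECT id FROM my_table WHERE id < 1000;
```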
How does VACUUM actually work?
VACUUM is the process that cleans up dead tuples. Without it, your tables grow forever. Understanding how it works — and how it fails — is the difference between running PostgreSQL smoothly and running it into the ground.
There are two kinds of VACUUM: regular VACUUM (which reclaims space within the existing table files) and VACUUM FULL (which rewrites the entire table into a new file, compacting it). Regular VACUUM is online and mostly non-blocking. VACUUM FULL locks the table exclusively for the entire duration and is essentially a last resort.
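In command form (the table name is a placeholder):

```sql
-- Online: reclaims dead tuple space for reuse within the existing files.
-- Takes only a SHARE UPDATE EXCLUSIVE lock; normal reads and writes proceed.
VACUUM (VERBOSE, ANALYZE) my_table;

-- Offline rewrite: compacts the table into new files, but holds an
-- ACCESS EXCLUSIVE lock for the entire run. Last resort.
VACUUM FULL my_table;
```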
How regular VACUUM processes a table:
- 1. Scan the heap page by page, identify dead tuples (xmax committed, not visible to any running transaction)
- 2. Record the tuple IDs (TIDs) of dead tuples in a maintenance_work_mem-sized buffer
- 3. When the buffer fills, sort the TIDs and do an index scan pass to remove index entries pointing to those dead tuples
- 4. Return to the heap and mark the dead tuple space as reusable on the Free Space Map (FSM)
- 5. Update the Visibility Map bits for pages that are now all-visible
- 6. Advance the table’s relfrozenxid (more on this in the wraparound section)
The critical subtlety is step 3. If your table has five indexes and VACUUM finds 100,000 dead tuples, it has to do five complete index scan passes to clean up the references. On a table with large indexes, this is where VACUUM spends most of its time and I/O. This is also why adding more indexes to a high-write table has a multiplicative cost on VACUUM runtime.
Autovacuum is the daemon that triggers VACUUM automatically based on thresholds. The defaults are: vacuum a table when dead tuples exceed autovacuum_vacuum_threshold (50) plus autovacuum_vacuum_scale_factor (0.2) times the table row count. For a 10 million row table, that means VACUUM triggers after 2,000,050 dead tuples. For many workloads, that is far too late.
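You can compute the effective trigger point per table and compare it to the current dead-tuple count. A sketch using the global settings only (per-table reloptions overrides are ignored here):

```sql
-- threshold + scale_factor * live tuples = dead tuples needed to trigger.
SELECT s.relname,
       s.n_dead_tup,
       (SELECT setting::int   FROM pg_settings
         WHERE name = 'autovacuum_vacuum_threshold')
     + (SELECT setting::float FROM pg_settings
         WHERE name = 'autovacuum_vacuum_scale_factor')
       * s.n_live_tup AS autovacuum_trigger_at
FROM pg_stat_user_tables s
ORDER BY s.n_dead_tup DESC;
```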
Where does MVCC get tricky?
| Aspect | PostgreSQL (Heap MVCC) | MySQL InnoDB (Undo Log MVCC) |
|---|---|---|
| Where old versions live | In the main heap file alongside live data | In a separate undo tablespace |
| UPDATE cost | Full new tuple + dead old tuple in heap | Modify in place + write undo record |
| Read perf on bloated tables | Degrades — must skip dead tuples on every page | Stable — undo log is separate from data pages |
| Index maintenance on UPDATE | New index entry per updated tuple (except HOT) | In-place update, index unchanged if key unchanged |
| Space reclamation | VACUUM (async, background) | Purge thread (async, background) |
| Reclamation falling behind risk | High — table bloat is visible and compounding | Moderate — undo log grows but data pages stay clean |
| Long transaction impact | Prevents VACUUM from reclaiming ANY tuples newer than oldest snapshot | Prevents undo purge, but data pages unaffected |
| Max transaction ID | 32 bits (4.2B) — requires freezing to prevent wraparound | 64 bits — wraparound not a practical concern |
The HOT optimization (Heap-Only Tuples) deserves special mention. When an UPDATE does not change any indexed column AND the new tuple fits on the same heap page, PostgreSQL can skip creating new index entries entirely. The old index entries are followed through a redirect chain to the new tuple. This dramatically reduces index bloat for common update patterns like incrementing counters or updating timestamps. But the moment you update an indexed column or the page is full, HOT cannot apply.
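pg_stat_user_tables tracks how many updates were HOT, so you can measure how often the optimization actually applies to your workload:

```sql
-- Fraction of updates that avoided creating new index entries.
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
WHERE n_tup_upd > 0
ORDER BY n_tup_upd DESC;
```

A low hot_pct on a heavy-update table usually means an indexed column is being touched, or pages are too full for same-page inserts (lowering fillfactor can help with the latter).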
What is the transaction ID wraparound problem?
This is the failure mode that has taken down more production PostgreSQL instances than any other MVCC-related issue.
PostgreSQL uses 32-bit unsigned integers for transaction IDs. That gives you roughly 4.2 billion transactions before the counter wraps around to zero. The system uses modular arithmetic to compare transaction IDs — a transaction is in the past if its XID is within 2 billion before the current XID, and in the future if it is within 2 billion after.
The problem: if a tuple’s xmin is so old that it would be interpreted as a future transaction due to wraparound, that tuple becomes invisible. Your data disappears. Not corrupted — invisible. PostgreSQL calls this wraparound failure and it is catastrophic.
To prevent this, VACUUM has a second critical job: freezing. When VACUUM encounters a tuple old enough (controlled by vacuum_freeze_min_age, default 50 million transactions), it replaces the tuple’s xmin with a special FrozenTransactionId that is defined to be in the past for all comparisons. This permanently removes the tuple from the wraparound clock.
The danger arises when VACUUM cannot keep up with freezing. If you have tables where autovacuum is blocked (by long-running transactions, or because it is throttled too aggressively, or because the table is too large for the available I/O), the oldest unfrozen XID in the cluster creeps toward the wraparound boundary. At roughly 10 million XIDs remaining, PostgreSQL starts logging warnings. Within the final safety margin (1 million XIDs in older releases, 3 million since PostgreSQL 14), it refuses to start new transactions entirely — a self-imposed denial of service to prevent data loss.
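Database-level age hides which table is the laggard; pg_class exposes per-table frozen ages:

```sql
-- Tables whose oldest unfrozen XID is furthest in the past.
SELECT c.relname, age(c.relfrozenxid) AS xid_age
FROM pg_class c
WHERE c.relkind = 'r'          -- ordinary tables only
ORDER BY age(c.relfrozenxid) DESC
LIMIT 20;
```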
What are the real-world failure modes?
After running PostgreSQL in production across fintech, SaaS, and e-commerce workloads for over a decade, these are the MVCC-related failures I have seen repeatedly:
The idle-in-transaction session:
- A developer opens a psql session, runs BEGIN, does a SELECT, then goes to lunch
- That open transaction holds a snapshot that prevents VACUUM from reclaiming ANY dead tuples created after the transaction started
- Meanwhile, a batch job updates 5M rows, creating 5M dead tuples that VACUUM cannot touch
- Table bloat compounds. Autovacuum runs but accomplishes nothing. Disk fills up.
- Prevention: Set idle_in_transaction_session_timeout (we use 5 minutes for application connections)
Starved autovacuum workers:
- Default autovacuum settings run 3 workers with conservative cost limits
- On a busy system with 500+ tables, workers spend more time context-switching than vacuuming
- The largest tables (which need VACUUM most) get starved because each worker is throttled
- Prevention: Per-table autovacuum settings for high-write tables, increase autovacuum_max_workers and autovacuum_vacuum_cost_limit
The peak-hours VACUUM FULL:
- Someone runs VACUUM FULL on a 200 GB table during peak traffic
- VACUUM FULL takes an ACCESS EXCLUSIVE lock — no reads, no writes, nothing
- The operation takes 45 minutes. The application throws connection timeout errors the entire time.
- Alternative: Use pg_repack for online table compaction. It creates a shadow table, replays changes via triggers, then does an atomic swap.
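The preventions above are plain configuration. A sketch of the settings mentioned (the specific values are the ones cited in the scenarios, not universal recommendations; `orders` is a hypothetical high-write table):

```sql
-- Kill sessions that sit idle inside an open transaction.
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';

-- Less throttled autovacuum; takes effect on reload.
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;

-- More workers; this one requires a server restart, not just a reload.
ALTER SYSTEM SET autovacuum_max_workers = 6;

SELECT pg_reload_conf();

-- Per-table override: vacuum at 1% dead tuples instead of the default 20%.
ALTER TABLE orders SET (autovacuum_vacuum_scale_factor = 0.01);
```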
How should you monitor MVCC health?
If you run PostgreSQL in production, these are the metrics you should be tracking. Not optional. Not nice-to-have. These should be in your dashboards and alerting.
- Dead tuple ratio: SELECT relname, n_dead_tup, n_live_tup, n_dead_tup::float / NULLIF(n_live_tup, 0) as dead_ratio FROM pg_stat_user_tables ORDER BY n_dead_tup DESC
- Table bloat estimate: Use the pgstattuple extension or the bloat estimation query from the PostgreSQL wiki
- Oldest running transaction: SELECT max(age(backend_xmin)) FROM pg_stat_activity WHERE state <> 'idle'
- XID age (wraparound proximity): SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY age(datfrozenxid) DESC
- Autovacuum activity: SELECT relname, last_autovacuum, autovacuum_count FROM pg_stat_user_tables WHERE n_dead_tup > 10000
- Alert threshold: age(datfrozenxid) > 500 million is a yellow flag. Over 1 billion is a red alert.
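Those thresholds translate directly into an alerting query:

```sql
SELECT datname,
       age(datfrozenxid) AS xid_age,
       CASE
         WHEN age(datfrozenxid) > 1000000000 THEN 'red'    -- page someone
         WHEN age(datfrozenxid) >  500000000 THEN 'yellow' -- investigate
         ELSE 'ok'
       END AS wraparound_status
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
```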
When should you use PostgreSQL vs alternatives?
| Workload | PostgreSQL MVCC Verdict | Consider Instead |
|---|---|---|
| Read-heavy OLTP with few updates | Excellent — minimal dead tuples, VM stays clean | N/A, PostgreSQL is ideal |
| High-write OLTP with frequent row updates | Manageable with aggressive VACUUM tuning | MySQL InnoDB if VACUUM overhead is unacceptable |
| Append-only / time-series data | Good — dead tuples rare, but partition for maintenance | TimescaleDB (on PostgreSQL), ClickHouse for analytics |
| Multi-TB tables with mixed read/write | Painful — VACUUM I/O competes with production queries | CitusDB for sharding, or CockroachDB for distributed |
| Long analytical txns alongside OLTP | Problematic — long snapshots block VACUUM | Separate read replicas or dedicated OLAP system |
What should every engineer take away from this?
PostgreSQL’s MVCC is elegant in theory and brutal in practice. The mental model is simple: every write creates a new version, every read sees a snapshot, and a background process cleans up. But the operational implications of that design — table bloat, index bloat, VACUUM storms, wraparound risk, visibility map invalidation — are what separate teams that run PostgreSQL well from teams that run it into the ground.
“The teams that run PostgreSQL successfully are not the ones with the best schemas. They are the ones that understand VACUUM as a first-class production concern and monitor it accordingly.”
- 1. Dead tuples are inevitable. VACUUM is not optional. Monitor it like you monitor CPU and memory.
- 2. Long-running transactions are the enemy. Set idle_in_transaction_session_timeout. No exceptions.
- 3. Default autovacuum settings are designed for small databases. If your table has more than 10M rows, you need per-table tuning.
Every application framework abstracts this away. Your ORM does not care about dead tuples. Your connection pooler does not warn you about old snapshots. But the heap files on disk are accumulating invisible garbage with every UPDATE you run, and the only thing standing between you and a production incident is a background process that most teams never think about until it is too late.