Engineering · 15 min read

How PostgreSQL MVCC Actually Works Under the Hood

Every senior engineer uses PostgreSQL daily, but few understand the multi-version concurrency control system that makes reads never block writes. Here is what actually happens when you run a query — tuple versioning, visibility maps, vacuum, and the failure modes nobody warns you about until your table is 400 GB of dead rows.

Abhishek Sharma · Head of Engineering @ Fordel Studios

If you have run PostgreSQL in production for more than a year, you have hit at least one of these: a table that is 10x larger than its actual data, a VACUUM that ran for six hours and pegged your I/O, or a mysterious wraparound warning that made your DBA send a 2 AM Slack message. All of these trace back to one design decision.

PostgreSQL chose Multi-Version Concurrency Control (MVCC) as its core concurrency strategy. And that choice has consequences that the docs gloss over.

···

What problem does MVCC actually solve?

The fundamental tension in any database is this: multiple transactions want to read and write the same data simultaneously. You have three options for handling that.

Option one: lock everything. Reader blocks writer, writer blocks reader. Simple, correct, slow. This is what MySQL with MyISAM did, and it is why MyISAM does not exist in serious production anymore.

Option two: lock at the row level, but still make readers wait for writers to finish. This is what early row-level locking systems did. Better, but readers still block on contested rows during writes.

Option three: never block readers at all. Give every transaction a snapshot of the database as it existed when the transaction started, and let it read from that snapshot while other transactions write freely. This is PostgreSQL’s approach.

The key insight is that PostgreSQL achieves this by keeping multiple physical versions of every row directly in the table’s heap files. There is no separate undo log. There is no rollback segment. The table itself is the version store.

PostgreSQL does not have an undo log. The table IS the undo log. That single design decision explains 80% of PostgreSQL’s operational behavior.
···

How does tuple versioning work?

Every row in PostgreSQL (internally called a tuple) carries hidden system columns that control its visibility. The two critical ones are xmin and xmax.

xmin is the transaction ID (XID) of the transaction that created this tuple version. xmax is the transaction ID that deleted or updated it — if xmax is zero, the tuple has not been deleted yet.

Here is what happens when you run an UPDATE:

Anatomy of a PostgreSQL UPDATE
  • 1. PostgreSQL finds the current tuple (the one where xmin is committed and xmax is zero)
  • 2. It sets xmax on the old tuple to the current transaction’s XID — this marks it as dead to future transactions
  • 3. It inserts an entirely new tuple with xmin set to the current transaction’s XID and xmax set to zero
  • 4. It updates the index entries to point to the new tuple (unless HOT optimization applies)
  • 5. The old tuple still physically exists on disk. It is not removed. It is not overwritten. It just sits there.

This means an UPDATE in PostgreSQL is really a DELETE plus an INSERT at the storage level. A table that gets updated heavily will accumulate dead tuples at a rate proportional to its write throughput. A row that gets updated 100 times will have 100 dead physical copies sitting in the heap.
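The five steps above can be sketched as a toy model in Python. This is purely illustrative: real tuple headers carry more fields (ctid, command IDs, infomask hint bits), and the names below are invented for the sketch.

```python
from dataclasses import dataclass

@dataclass
class HeapTuple:
    data: dict
    xmin: int        # XID of the transaction that created this version
    xmax: int = 0    # XID of the deleter/updater; 0 means "still live"

def mvcc_update(heap: list, idx: int, new_data: dict, xid: int) -> int:
    """UPDATE at the storage level: mark the old version dead, append a new one."""
    heap[idx].xmax = xid                          # old tuple marked dead, not removed
    heap.append(HeapTuple(new_data, xmin=xid))    # brand-new tuple version
    return len(heap) - 1

heap = [HeapTuple({"balance": 100}, xmin=1)]
i = mvcc_update(heap, 0, {"balance": 90}, xid=2)
mvcc_update(heap, i, {"balance": 80}, xid=3)
# The heap now holds three physical versions of one logical row;
# only the newest has xmax == 0.
```

Two updates to one logical row leave three physical tuples in the heap, which is exactly the dead-tuple accumulation the rest of this article is about.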

2x–10x: typical bloat ratio on high-write tables. A table with 1M live rows can easily have 5M–10M physical tuples if VACUUM is not keeping up. We have seen production tables at 40x bloat.

DELETE works the same way — it just sets xmax on the existing tuple. The physical space is not reclaimed until VACUUM runs.

···

How does PostgreSQL decide which tuples are visible?

When a transaction runs a SELECT, it does not see every tuple in the heap. It applies visibility rules based on the transaction’s snapshot.

A snapshot is essentially a record of which transactions were committed at the time the current transaction (or statement) began. The rules are roughly:

Tuple Visibility Rules (Simplified)
  • VISIBLE if: xmin is committed AND (xmax is zero OR xmax is not yet committed from this transaction’s perspective)
  • INVISIBLE if: xmin is not yet committed (inserted by an in-progress or future transaction)
  • INVISIBLE if: xmax is committed (deleted or updated by a transaction that finished before our snapshot)
  • The actual rules involve about 12 edge cases including subtransactions, combo CIDs, and hint bits

This is why PostgreSQL can provide true snapshot isolation without locks. Every transaction sees a consistent view of the data based on when it started, regardless of what other transactions are doing concurrently. Reads never block writes. Writes never block reads.
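A minimal sketch of that check, assuming a snapshot is just the set of XIDs that had committed when it was taken (the real code also consults in-progress arrays, hint bits, and the edge cases listed above):

```python
def is_visible(xmin: int, xmax: int, committed: set) -> bool:
    """Simplified tuple visibility against a snapshot (toy model)."""
    if xmin not in committed:
        return False             # creator not committed in our snapshot
    if xmax != 0 and xmax in committed:
        return False             # deleter committed before our snapshot
    return True

snapshot = {1, 2, 5}             # XIDs committed when our transaction started
is_visible(xmin=1, xmax=0, committed=snapshot)   # live tuple: visible
is_visible(xmin=1, xmax=9, committed=snapshot)   # deleter still in flight: visible
is_visible(xmin=7, xmax=0, committed=snapshot)   # inserted after our snapshot: invisible
```

Note that a tuple whose deleter has not committed in our snapshot stays visible to us; that is the whole point of snapshot isolation.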

But there is a cost. Every single tuple access requires checking these visibility rules. PostgreSQL cannot just read a page from disk and return it — it has to evaluate every tuple on the page against the current snapshot. This is why sequential scans on heavily bloated tables are catastrophically slow: you are reading pages full of dead tuples that all fail the visibility check.

···

What is the Visibility Map and why does it matter?

Starting with PostgreSQL 8.4, the system maintains a Visibility Map (VM) — a bitmap with two bits per heap page. The first bit tracks whether all tuples on a page are visible to all current transactions. The second bit (added in 9.6) tracks whether the page has been frozen.

When an index-only scan runs, PostgreSQL checks the VM first. If the page is marked all-visible, it can answer the query from the index alone without touching the heap. If the page is not all-visible (because it contains dead tuples or recently modified tuples), it has to fetch the heap page to check visibility — which is dramatically slower.

This is one of the less obvious performance consequences of table bloat. Even if your query uses an index, a bloated table with many not-all-visible pages will force heap fetches that destroy your index-only scan performance. You will see this in EXPLAIN ANALYZE as Heap Fetches going from near-zero to millions.
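A sketch of that decision loop, with invented helper names (vm_all_visible stands in for the per-page VM bit, heap_is_live for the visibility check that forces a heap page read):

```python
def index_only_scan(index_entries, vm_all_visible, heap_is_live):
    """Toy model of an index-only scan: count heap fetches forced by VM misses.

    index_entries: list of (key, heap_page) pairs coming from the index
    vm_all_visible: dict mapping heap page -> all-visible bit from the VM
    heap_is_live: visibility check that requires reading the heap page
    """
    results, heap_fetches = [], 0
    for key, page in index_entries:
        if vm_all_visible[page]:
            results.append(key)          # VM hit: the index alone answers the query
        else:
            heap_fetches += 1            # VM miss: must read the heap page
            if heap_is_live(key, page):
                results.append(key)
    return results, heap_fetches

entries = [("a", 0), ("b", 0), ("c", 1)]
vm = {0: True, 1: False}                 # page 1 holds a recently updated tuple
results, fetches = index_only_scan(entries, vm, lambda k, p: k != "c")
```

The heap_fetches counter here is the same quantity EXPLAIN ANALYZE reports as Heap Fetches: every not-all-visible page turns a pure index read into an index read plus a heap read.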

100x: the performance difference between an index-only scan (VM hit) and a heap fetch. On a 500M-row table we benchmarked, the same query took 12 ms with clean VM bits and 1.4 seconds when 30% of pages were not all-visible.
···

How does VACUUM actually work?

VACUUM is the process that cleans up dead tuples. Without it, your tables grow forever. Understanding how it works — and how it fails — is the difference between running PostgreSQL smoothly and running it into the ground.

There are two kinds of VACUUM: regular VACUUM (which reclaims space within the existing table files) and VACUUM FULL (which rewrites the entire table into a new file, compacting it). Regular VACUUM is online and mostly non-blocking. VACUUM FULL locks the table exclusively for the entire duration and is essentially a last resort.

How regular VACUUM processes a table:

VACUUM Internals
  • 1. Scan the heap page by page, identify dead tuples (xmax committed, not visible to any running transaction)
  • 2. Record the tuple IDs (TIDs) of dead tuples in a maintenance_work_mem-sized buffer
  • 3. When the buffer fills, sort the TIDs and do an index scan pass to remove index entries pointing to those dead tuples
  • 4. Return to the heap and mark the dead tuple space as reusable on the Free Space Map (FSM)
  • 5. Update the Visibility Map bits for pages that are now all-visible
  • 6. Advance the table’s relfrozenxid (more on this in the wraparound section)

The critical subtlety is step 3. If your table has five indexes and VACUUM finds 100,000 dead tuples, it has to do five complete index scan passes to clean up the references. On a table with large indexes, this is where VACUUM spends most of its time and I/O. This is also why adding more indexes to a high-write table has a multiplicative cost on VACUUM runtime.
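A rough cost model of that multiplicative effect (the buffer size here stands in for how many dead TIDs fit in maintenance_work_mem; all figures are hypothetical):

```python
import math

def index_cleanup_passes(dead_tuples: int, tids_per_buffer: int, n_indexes: int) -> int:
    """Each time the dead-TID buffer fills, VACUUM must sweep every index
    once before it can resume the heap scan (simplified model)."""
    rounds = math.ceil(dead_tuples / tids_per_buffer)
    return rounds * n_indexes

index_cleanup_passes(100_000, 100_000, 5)    # one buffer round x 5 indexes = 5 sweeps
index_cleanup_passes(1_000_000, 100_000, 5)  # ten rounds x 5 indexes = 50 sweeps
```

Five indexes turn one cleanup round into five index sweeps, and a dead-tuple count that overflows the TID buffer multiplies it again. That is the shape of the I/O bill for every index you add to a high-write table.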

Autovacuum is the daemon that triggers VACUUM automatically based on thresholds. The defaults are: vacuum a table when dead tuples exceed autovacuum_vacuum_threshold (50) plus autovacuum_vacuum_scale_factor (0.2) times the table row count. For a 10 million row table, that means VACUUM triggers after 2,000,050 dead tuples. For many workloads, that is far too late.
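That trigger point is simple arithmetic, which makes the "too late" problem easy to see:

```python
def autovacuum_trigger(n_live_tup: int,
                       threshold: int = 50,        # autovacuum_vacuum_threshold default
                       scale_factor: float = 0.2): # autovacuum_vacuum_scale_factor default
    """Dead-tuple count at which autovacuum wakes up for a table."""
    return threshold + int(scale_factor * n_live_tup)

autovacuum_trigger(10_000_000)   # 2,000,050 dead tuples before the first VACUUM
```

A common mitigation is lowering the per-table scale factor on large tables so the trigger fires far earlier, proportional to what VACUUM can actually keep up with.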

···

Where does MVCC get tricky?

Aspect by aspect: PostgreSQL (heap MVCC) vs MySQL InnoDB (undo-log MVCC)
  • Where old versions live: PostgreSQL stores them in the main heap file alongside live data; InnoDB stores them in a separate undo tablespace
  • UPDATE cost: PostgreSQL writes a full new tuple and leaves a dead old tuple in the heap; InnoDB modifies in place and writes an undo record
  • Read performance on bloated tables: PostgreSQL degrades because every page scan must skip dead tuples; InnoDB stays stable because the undo log is separate from data pages
  • Index maintenance on UPDATE: PostgreSQL adds a new index entry per updated tuple (except HOT); InnoDB updates in place, leaving indexes unchanged if the key is unchanged
  • Space reclamation: PostgreSQL relies on VACUUM (async, background); InnoDB relies on the purge thread (async, background)
  • Risk of reclamation falling behind: high in PostgreSQL, where table bloat is visible and compounding; moderate in InnoDB, where the undo log grows but data pages stay clean
  • Long transaction impact: in PostgreSQL, prevents VACUUM from reclaiming ANY tuples newer than the oldest snapshot; in InnoDB, prevents undo purge, but data pages are unaffected
  • Max transaction ID: 32 bits (4.2B) in PostgreSQL, which requires freezing to prevent wraparound; 64 bits in InnoDB, where wraparound is not a practical concern

The HOT optimization (Heap-Only Tuples) deserves special mention. When an UPDATE does not change any indexed column AND the new tuple fits on the same heap page, PostgreSQL can skip creating new index entries entirely. The old index entries are followed through a redirect chain to the new tuple. This dramatically reduces index bloat for common update patterns like incrementing counters or updating timestamps. But the moment you update an indexed column or the page is full, HOT cannot apply.
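The two preconditions are easy to capture in a sketch (a hypothetical helper, not PostgreSQL's actual heap_update logic):

```python
def hot_update_possible(changed_cols: set, indexed_cols: set,
                        new_tuple_bytes: int, page_free_bytes: int) -> bool:
    """HOT applies only if no indexed column changed AND the new
    version fits on the same heap page (simplified model)."""
    if changed_cols & indexed_cols:
        return False             # index key changed: new index entries required
    return new_tuple_bytes <= page_free_bytes

hot_update_possible({"updated_at"}, {"id", "email"}, 64, 512)   # timestamp bump: HOT
hot_update_possible({"email"}, {"id", "email"}, 64, 512)        # indexed column changed: no HOT
```

This is why bumping an unindexed last_seen timestamp is cheap while updating an indexed email column is not, even though both look like a one-column UPDATE.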

···

What is the transaction ID wraparound problem?

This is the failure mode that has taken down more production PostgreSQL instances than any other MVCC-related issue.

PostgreSQL uses 32-bit unsigned integers for transaction IDs. That gives you roughly 4.2 billion transactions before the counter wraps around to zero. The system uses modular arithmetic to compare transaction IDs — a transaction is in the past if its XID is within 2 billion before the current XID, and in the future if it is within 2 billion after.
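That comparison is plain 32-bit modular arithmetic: interpret the difference between two XIDs as a signed 32-bit value, the same trick the backend's TransactionIdPrecedes uses.

```python
def xid_precedes(a: int, b: int) -> bool:
    """True if XID a is 'in the past' relative to b under 32-bit
    modular arithmetic (signed interpretation of the difference)."""
    diff = (a - b) & 0xFFFFFFFF
    return diff >= 0x80000000            # negative when read as signed 32-bit

xid_precedes(100, 200)                   # True: 100 is older
xid_precedes(4_000_000_000, 100)         # True: still "older" across the wrap
```

The wrap case is the important one: an XID numerically near 4.2 billion compares as being in the past of a small XID, because the counter has wrapped. That same arithmetic is what makes an ancient unfrozen xmin suddenly read as "future".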

The problem: if a tuple’s xmin is so old that it would be interpreted as a future transaction due to wraparound, that tuple becomes invisible. Your data disappears. Not corrupted — invisible. PostgreSQL calls this wraparound failure and it is catastrophic.

To prevent this, VACUUM has a second critical job: freezing. When VACUUM encounters a tuple old enough (controlled by vacuum_freeze_min_age, default 50 million transactions), it marks the tuple's xmin as frozen, so it is treated as in the past for all comparisons. (Older releases overwrote xmin with a special FrozenTransactionId; modern releases set a hint bit instead, with the same effect.) This permanently removes the tuple from the wraparound clock.

The danger arises when VACUUM cannot keep up with freezing. If you have tables where autovacuum is blocked (by long-running transactions, or because it is throttled too aggressively, or because the table is too large for the available I/O), the oldest unfrozen XID in the cluster creeps toward the wraparound boundary. At 10 million XIDs remaining, PostgreSQL starts logging warnings. At 1 million remaining, it refuses to start new transactions entirely — a self-imposed denial of service to prevent data loss.

4.2B: total transaction IDs before wraparound. At 1,000 transactions per second, you burn through the entire XID space in about 136 years. At 50,000 TPS (common for busy OLTP systems), you have about 23 hours of headroom if freezing stops completely.
···

What are the real-world failure modes?

After running PostgreSQL in production across fintech, SaaS, and e-commerce workloads for over a decade, these are the MVCC-related failures I have seen repeatedly:

Failure Mode 1: The Forgotten Long Transaction
  • A developer opens a psql session, runs BEGIN, does a SELECT, then goes to lunch
  • That open transaction holds a snapshot that prevents VACUUM from reclaiming ANY dead tuples created after the transaction started
  • Meanwhile, a batch job updates 5M rows, creating 5M dead tuples that VACUUM cannot touch
  • Table bloat compounds. Autovacuum runs but accomplishes nothing. Disk fills up.
  • Prevention: Set idle_in_transaction_session_timeout (we use 5 minutes for application connections)
Failure Mode 2: Autovacuum Cannot Keep Up
  • Default autovacuum settings run 3 workers with conservative cost limits
  • On a busy system with 500+ tables, workers spend more time context-switching than vacuuming
  • The largest tables (which need VACUUM most) get starved because each worker is throttled
  • Prevention: Per-table autovacuum settings for high-write tables, increase autovacuum_max_workers and autovacuum_vacuum_cost_limit
Failure Mode 3: VACUUM FULL During Business Hours
  • Someone runs VACUUM FULL on a 200 GB table during peak traffic
  • VACUUM FULL takes an ACCESS EXCLUSIVE lock — no reads, no writes, nothing
  • The operation takes 45 minutes. The application throws connection timeout errors the entire time.
  • Alternative: Use pg_repack for online table compaction. It creates a shadow table, replays changes via triggers, then does an atomic swap.
···

How should you monitor MVCC health?

If you run PostgreSQL in production, these are the metrics you should be tracking. Not optional. Not nice-to-have. These should be in your dashboards and alerting.

Essential MVCC Monitoring Queries
  • Dead tuple ratio: SELECT relname, n_dead_tup, n_live_tup, n_dead_tup::float / NULLIF(n_live_tup, 0) as dead_ratio FROM pg_stat_user_tables ORDER BY n_dead_tup DESC
  • Table bloat estimate: Use the pgstattuple extension or the bloat estimation query from the PostgreSQL wiki
  • Oldest running transaction: SELECT max(age(backend_xmin)) FROM pg_stat_activity WHERE state <> 'idle'
  • XID age (wraparound proximity): SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database ORDER BY xid_age DESC
  • Autovacuum activity: SELECT relname, last_autovacuum, autovacuum_count FROM pg_stat_user_tables WHERE n_dead_tup > 10000
  • Alert threshold: age(datfrozenxid) > 500 million is a yellow flag. Over 1 billion is a red alert.
···

When should you use PostgreSQL vs alternatives?

PostgreSQL MVCC verdict by workload:
  • Read-heavy OLTP with few updates. Verdict: excellent; minimal dead tuples and the VM stays clean. Consider instead: N/A, PostgreSQL is ideal
  • High-write OLTP with frequent row updates. Verdict: manageable with aggressive VACUUM tuning. Consider instead: MySQL InnoDB if VACUUM overhead is unacceptable
  • Append-only / time-series data. Verdict: good; dead tuples are rare, but partition for maintenance. Consider instead: TimescaleDB (on PostgreSQL), or ClickHouse for analytics
  • Multi-TB tables with mixed read/write. Verdict: painful; VACUUM I/O competes with production queries. Consider instead: CitusDB for sharding, or CockroachDB for distributed SQL
  • Long analytical transactions alongside OLTP. Verdict: problematic; long snapshots block VACUUM. Consider instead: separate read replicas or a dedicated OLAP system
···

What should every engineer take away from this?

PostgreSQL’s MVCC is elegant in theory and brutal in practice. The mental model is simple: every write creates a new version, every read sees a snapshot, and a background process cleans up. But the operational implications of that design — table bloat, index bloat, VACUUM storms, wraparound risk, visibility map invalidation — are what separate teams that run PostgreSQL well from teams that run it into the ground.

The teams that run PostgreSQL successfully are not the ones with the best schemas. They are the ones that understand VACUUM as a first-class production concern and monitor it accordingly.
The Three Rules of PostgreSQL MVCC in Production
  • 1. Dead tuples are inevitable. VACUUM is not optional. Monitor it like you monitor CPU and memory.
  • 2. Long-running transactions are the enemy. Set idle_in_transaction_session_timeout. No exceptions.
  • 3. Default autovacuum settings are designed for small databases. If your table has more than 10M rows, you need per-table tuning.

Every application framework abstracts this away. Your ORM does not care about dead tuples. Your connection pooler does not warn you about old snapshots. But the heap files on disk are accumulating invisible garbage with every UPDATE you run, and the only thing standing between you and a production incident is a background process that most teams never think about until it is too late.
