19 Jun 2026

Index Bloat - The Problem VACUUM Leaves Behind

This is an opening post (of a series) about my attempt at automatic B-tree index compaction integrated into autovacuum.


Index bloat is one of the most common performance problems in production PostgreSQL deployments — and one of the most misunderstood. People assume that a fully vacuumed table has a healthy index - which isn't true. There is a significant gap between an index that has been cleaned and one that has been compacted, and although Postgres is closing that gap (v13 enabled index deduplication & v14 brought bottom-up index deletion), there is a lot more to do here.

This post explains what index bloat actually is, why standard VACUUM cannot fix it, and why the existing manual tools — while useful — are not a substitute for continuous automated maintenance.


What Index Bloat Is (and Isn't)

After a large bulk delete — say, removing 80% of the rows in a table — autovacuum will dutifully remove all 80% of the dead index entries. The index will look clean. But the leaf pages are still there, mostly empty. A table that went from 1 million to 200,000 rows may still have an index file that occupies roughly the same space as before (a leaf page that held 100 index entries, and now holds 20, still occupies 8 kB on disk). Only the entries are gone; the pages remain.

The practical consequence is stark. A fully vacuumed B-tree index can have hundreds of leaf pages each filled to just 20% capacity. A range scan covering 10% of the keyspace now could end up reading five times as many pages as a compact index would require. Every buffer-cache miss costs the same whether the page carries 10 tuples or 100.


Why VACUUM Can't Merge Pages

PostgreSQL's autovacuum already visits every B-tree index after vacuuming a table. It calls btbulkdelete to remove dead index entries, then btvacuumcleanup to perform index-level tidying. What it cannot do is merge two partially-filled leaf pages into one and free the other.

Merging requires moving index entries from one page (the source) to another (the destination), then marking the source as deleted and unlinking it from the sibling chain. The B-tree code already knows how to do this: _bt_pagedel handles the case where a leaf page becomes completely empty after all its entries are removed. But VACUUM never merges pages that are merely sparse. A page at 20% fill is considered live and is left alone indefinitely.

Why can't VACUUM just merge sparse pages the same way it handles empty ones?

The obstacle is locking. B-tree index scans descend to a leaf page, pin it in memory, and then walk right by following btpo_next links to adjacent sibling pages. If a merge moves entries from a right page to a left page while a concurrent scan has already pinned the right page but not yet finished reading from it, the scan will miss the moved entries. That is a correctness bug.

Page-level locks — shared for reading, exclusive for writing — are not enough to prevent this. A concurrent scan can pin the right page before a merge begins, but after the check that would have prevented the merge.

So the option is either (a) Carefully coordinating page-level locks across all concurrent access-paths, OR, (b) Taking a relation-level lock, held across the entire merge operation, which can then guarantee that no scan is mid-flight on the page being moved. This series of blog-posts tries to revisit this locking dilemma (that makes index compaction difficult today) and see what else is possible to do to help the average application workload.


The Existing Solutions and Their Limitations

The standard tool for recovering bloated index space is REINDEX — or its online variant, REINDEX CONCURRENTLY. A full REINDEX rebuilds the index from scratch and produces a perfectly packed result, but it holds an AccessExclusiveLock on the relation for the entire duration of the build. All reads and writes are blocked. On a large production table that can mean minutes of downtime.

REINDEX CONCURRENTLY avoids that lock but still requires building a complete shadow copy of the index: a second, full-sized index is written while the original stays online. This is expensive in I/O and CPU, can run for hours on large indexes, may leave an invalid index (on cancellation) and requires 2x disk-space.

VACUUM FULL and extensions like pg_repack have the same fundamental character. pg_repack is more surgical than VACUUM FULL — it rebuilds the table and its indexes while the original stays accessible, and swaps them in atomically — but it still builds a complete copy before anything is reclaimed.

PostgreSQL 19 (barring last minute rollbacks) introduces a built-in REPACK command that performs online table and index repacking without requiring an external extension. This is genuine progress: it eliminates the pg_repack dependency, integrates with standard tooling, and is maintained by the core team. But the fundamental character of the operation is unchanged. REPACK rebuilds from scratch, and it is a scheduled one-shot command.

The common thread across all of these tools: a DBA has to notice the bloat — typically through a monitoring query or a performance regression — schedule a maintenance window, and issue the command. In practice, indexes frequently stay bloated for months between maintenance events. A table that sees steady churn (a history log, an events queue, a session store) will re-accumulate bloat between scheduled maintenance runs, sometimes faster than it can be addressed manually.

What production systems actually need is continuous, incremental compaction that runs in the background — the way autovacuum continuously reclaims heap space — without ever taking a lock that blocks application traffic.


What This Series Covers

The remaining posts in this series measure what continuous automated compaction delivers across 19 benchmark scenarios: static bulk deletes, steady-state churn, HOT and non-HOT update workloads, UUID-keyed indexes, concurrent readers under load, sawtooth growth patterns, TPS degradation under accumulating bloat, and streaming replication. The scenarios start simple and progressively test harder conditions.

The final post in this series (coming soon) covers the algorithm itself: the multiple failed design attempts, the core insight, the locking design that keeps application traffic unblocked, the implementation details, and operational considerations.


This is personal research, done on my own time and not connected to or endorsed by my employer. This is a relatively complex topic and I am sharing my findings as I explore this space.

No comments:

Index Bloat - The Problem VACUUM Leaves Behind

This is an opening post (of a series) about my attempt at automatic B-tree index compaction integrated into autovacuum. Index bloat is on...