
9 Dec 2025

3x Faster TID Range Scans - Postgres 19

If you've ever had to scrub very large tables—whether updating older records or deleting expired ones—you know the pain of trying to do it efficiently without locking the entire table. One common trick is to iterate over the ctid (the physical location of the tuple) and process the table in chunks. Until now, this was strictly a single-threaded affair. But with the latest commit to Postgres 19, TID Range Scans can now be parallelized, allowing you to harness multiple cores to rip through these maintenance tasks significantly faster.
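
For context, such a cleanup loop typically looks something like this (a rough sketch; the table name, the expires_at column, and the 1000-page batch size are all hypothetical):

-- Process big_table in chunks of ~1000 heap pages, so each statement
-- only touches a small physical slice of the table.
DELETE FROM big_table
WHERE ctid >= '(0,0)' AND ctid < '(1000,0)'
  AND expires_at < now();

-- Next batch: advance the block range and repeat until the end of the table.
DELETE FROM big_table
WHERE ctid >= '(1000,0)' AND ctid < '(2000,0)'
  AND expires_at < now();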

In my tests, up to 3x faster!

The Problem with Linear Scanning

When you perform a batched operation using ctid (e.g., WHERE ctid >= '(0,0)' AND ctid < '(1000,0)'), Postgres uses a Tid Range Scan. Previously, this scan type didn't support parallelism. If you had a 1 TB table and wanted to process ranges of it, a single CPU core would have to do all the heavy lifting for that range, fetching blocks one by one. Even if you had 64 cores gathering dust, they couldn't help with that specific scan node.

The Solution: Parallel TID Range Scan

Commit 0ca3b169, developed by Cary Huang and committed by David Rowley, introduces infrastructure to allow Tid Range Scan to participate in parallel query plans. The logic effectively splits the block range among the available parallel workers. Instead of one process sweeping from block 0 to N, workers can grab chunks of blocks concurrently.

I was following the discussion on the mailing list, where Cary and David collaborated closely to refine the parallel safety logic, ensuring workers handle scan limits correctly—leading to the robust implementation that landed in the master branch.

NOTE
This feature is currently in the master branch (Postgres 19devel). As with all unreleased features, it could theoretically be rolled back or modified before the final release.

Benchmark: Before vs. After

To see this in action, I spun up two Postgres instances: one built just before the commit (Postgres 18) and one after (Postgres 19). I created a table bench_tid_range with 10 million rows and ran a count(*) query over the first 50% of the table using a ctid range condition.

Test Setup:

  • Table Size: 10 Million rows
  • Query: SELECT count(*) FROM bench_tid_range WHERE ctid >= '(0,0)' AND ctid < '(41667,0)'
 Environment    | Workers | Execution Time (Median) | Speedup
----------------+---------+-------------------------+---------
 Before (Pg 18) |       0 | 448 ms                  |   1.00x
 After (Pg 19)  |       0 | 435 ms                  |   1.03x
 After (Pg 19)  |       1 | 238 ms                  |   1.88x
 After (Pg 19)  |       2 | 174 ms                  |   2.58x
 After (Pg 19)  |       3 | 151 ms                  |   2.97x
 After (Pg 19)  |       4 | 150 ms                  |   2.98x
 After (Pg 19)  |       5 | 147 ms                  |   3.05x
 After (Pg 19)  |       6 | 143 ms                  |   3.14x
 After (Pg 19)  |       7 | 147 ms                  |   3.04x
 After (Pg 19)  |       8 | 147 ms                  |   3.04x

*(Each execution time is the median of 9 runs)*

The table above illustrates the performance scaling. Note that Postgres 19 with 0 workers (i.e. serial) is almost identical in speed to Postgres 18; the efficiency gains appear only once parallelism is enabled.

We see a massive drop in execution time just by enabling 1 worker (which effectively gives us 2 processes scanning: the leader + 1 worker). The gains continue robustly up to 3 workers, where execution time settles around 150ms. Beyond that, we hit diminishing returns as the overhead of managing parallel workers and aggregating results begins to dominate the raw scanning speed. The "sweet spot" for this specific workload appears to be around 2-3 workers.

Under the Hood: Deep Dive

Let's look at the actual EXPLAIN (ANALYZE, BUFFERS) output to see *why* it's faster.

Before (Postgres 18)

The legacy behavior forces a serial scan. The single process reads all 41,667 pages.


 Aggregate  (cost=104171.87..104171.88 rows=1 width=8) (actual time=427.723..427.724 rows=1.00 loops=1)
   Buffers: shared read=41667
   ->  Tid Range Scan on bench_tid_range  (cost=0.01..91671.70 rows=5000069 width=0) (actual time=0.130..243.502 rows=5000040.00 loops=1)
         TID Cond: ((ctid >= '(0,0)'::tid) AND (ctid < '(41667,0)'::tid))
         Buffers: shared read=41667

After (Postgres 19)

The new Parallel Tid Range Scan node appears. Notice the Gather node launching 2 workers.


 Finalize Aggregate  (cost=68713.25..68713.26 rows=1 width=8) (actual time=166.873..169.987 rows=1.00 loops=1)
   Buffers: shared read=41667
   ->  Gather  (cost=68713.03..68713.24 rows=2 width=8) (actual time=166.238..169.978 rows=3.00 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared read=41667
         ->  Partial Aggregate  (cost=67713.03..67713.04 rows=1 width=8) (actual time=164.518..164.519 rows=1.00 loops=3)
               Buffers: shared read=41667
               ->  Parallel Tid Range Scan on bench_tid_range  (cost=0.01..62504.63 rows=2083362 width=0)
                     (actual time=0.095..97.492 rows=1666680.00 loops=3)
                     TID Cond: ((ctid >= '(0,0)'::tid) AND (ctid < '(41667,0)'::tid))
                     Buffers: shared read=41667

Key Improvements:

1. Distributed Cost: The cost estimate dropped from 104k to 68k, indicating the planner correctly anticipates the benefit of parallelism.

2. Shared Load: While the total buffers read (41667) is the same, the work is shared across 3 processes (1 leader + 2 workers), reducing the wall-clock time for the scan itself from ~243ms to ~97ms per worker.

Counter-Intuitive Alternative: Forced Parallel Seq Scan?

You might ask: *"Why didn't Postgres v18 just choose a Parallel Sequential Scan? Wouldn't scanning the whole table with 4 workers be faster than scanning half the table with 1?"*

I tested exactly that on the v18 instance by setting enable_tidscan = off and allowing 4 parallel workers.

  • Execution Time: ~230 ms.
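
Concretely, the forcing is just a couple of session settings plus the same query as before (a reconstruction of the setup, not the exact script):

SET enable_tidscan = off;                 -- take the TID Range Scan off the table
SET max_parallel_workers_per_gather = 4;  -- allow up to 4 workers

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM bench_tid_range
WHERE ctid >= '(0,0)' AND ctid < '(41667,0)';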

 Finalize Aggregate  (cost=124959.76..124959.77 rows=1 width=8) (actual time=227.917..229.476 rows=1.00 loops=1)
   Buffers: shared hit=15791 read=67543
   ->  Gather  (cost=124959.34..124959.75 rows=4 width=8) (actual time=227.660..229.469 rows=5.00 loops=1)
         Workers Planned: 4
         Workers Launched: 4
         Buffers: shared hit=15791 read=67543
         ->  Partial Aggregate  (cost=123959.34..123959.35 rows=1 width=8) (actual time=224.154..224.155 rows=1.00 loops=5)
               Buffers: shared hit=15791 read=67543
               ->  Parallel Seq Scan on bench_tid_range  (cost=0.00..120834.30 rows=1250017 width=0) 
                     (actual time=0.150..169.772 rows=1000008.00 loops=5)
                     Filter: ((ctid >= '(0,0)'::tid) AND (ctid < '(41667,0)'::tid))
                     Rows Removed by Filter: 999992
                     Buffers: shared hit=15791 read=67543
 Planning:
   Buffers: shared hit=25
 Planning Time: 0.217 ms
 Execution Time: 229.584 ms

It was faster than the serial TID scan (~448 ms)! However, it had to visit all ~83k pages (15,791 hit + 67,543 read) instead of just the ~41k pages we cared about.

The new Parallel TID Range Scan (~150 ms) is still 35% faster than the brute-force/forced Parallel Seq Scan *and* it generates half the I/O load. It is the best of both worlds: fast execution time (Parallelism) and efficient resource usage (Index-like scoping).

Use Case: The Hotel Analogy

To understand why this matters, let's step away from databases and imagine a massive hotel with 1 million rooms. You need to verify all guests whose names start with a letter before 'S'.

The Logical Approach (Offset Problem)

You tell the front desk: *"Give me a list of the first 100 guests with names A-R."* The clerk hands you 100 people named 'Aaron', 'Abby', etc.

Then you ask: *"Give me the next 100 guests."*

To answer this, the clerk has to skip through the 'A's again just to know where to start counting the 'B's. As you get deeper into the alphabet, the clerk spends more and more time skipping known names just to find the new ones. By the time you are asking for names starting with 'R', they are re-reading 90% of the guest list just to find your batch.

The Physical Approach (TID Scan)

This is the TID Range Scan. You stop caring about the names (the logical value) and organize by Room Number (the physical location).

  • Worker 1: "Go check Rooms 1 to 100."
  • Worker 2: "Go check Rooms 101 to 200."

Worker 2 doesn't care who is in Room 1. They go straight to Room 101. If the person inside is named "Zach", they skip them. If it's "Alice", they verify them.

  • Zero Rescanning: No one re-reads the list of 'A's.
  • Parallelism: Because the rooms are physically distinct, you can send 8 workers to different floors simultaneously without them needing to coordinate or skip over each other's results.

This approach transforms an O(N^2) maintenance nightmare into a linear, parallelizable task.

Reproduce this Test

Want to try it yourself on Postgres 19devel? Here is the SQL to set up the test table and run the benchmark.


-- 1. Create the table
DROP TABLE IF EXISTS bench_tid_range;
CREATE TABLE bench_tid_range (id int, payload text);

-- 2. Insert 10M rows to generate ~41k pages
INSERT INTO bench_tid_range 
SELECT x, 'payload_' || x 
FROM generate_series(1, 10000000) x;

-- 3. Vacuum to set visibility map and freeze (important for stable benchmarks)
VACUUM (ANALYZE, FREEZE) bench_tid_range;

-- 4. Enable Parallelism for the Session
SET max_parallel_workers_per_gather = 4; -- Try 2, 4, 8
SET min_parallel_table_scan_size = 0;    -- Force parallel scan even for smaller tables

-- 5. Run the Query
EXPLAIN (ANALYZE, BUFFERS) 
SELECT count(*) 
FROM bench_tid_range 
WHERE ctid >= '(0,0)' AND ctid < '(41667,0)';

Conclusion

This is a welcome "plumbing" improvement. It might not change your daily ad-hoc queries, but for DBAs and developers building custom data maintenance scripts, the ability to parallelize TID based scans is a powerful new tool in the optimization toolkit.


3 Dec 2025

Speed up JOIN Planning - up to 16x Faster!

The hidden cost of knowing too much. That's one way to describe what happens when your data is skewed, Postgres statistics targets are set high, and the planner tries to estimate a join.

For over 20 years, Postgres used a simple O(N^2) loop to compare (equi-join) Most Common Values (MCVs) during join estimation. It worked fine when statistics targets were small (default_statistics_target defaults to 100). But in the modern era, Postgres best practices often recommend cranking that up, and values of 1000 or even higher are commonly used to handle complex data distributions. Throw a 10-table JOIN into the mix, and this "dumb loop" can easily become a silent performance killer during planning.

That changes in Postgres 19.

The Problem: It's Quadratic!

When you join two tables, the planner needs to estimate how many rows will match. If both columns have MCV lists (lists of the most frequent values), eqjoinsel() tries to match them up to get a precise estimate.

Historically, it did this by comparing every item in list A with every item in list B.

  • If you have 100 MCVs, that's 10,000 comparisons. Fast.
  • If you have 10,000 MCVs (max stats target), that's 100,000,000 comparisons. Not so fast.

This meant that simply *planning* a query could take significantly longer than executing it, especially for simple OLTP queries.

The Solution: Hash It Out

The fix is elegant and effective.

Instead of a nested loop, the planner now:

1. Checks if the total number of MCVs is greater than 200 (100 each side).

2. If so, it builds a Hash Table of the MCVs from one (smaller) side.

3. It then probes this hash table with the MCVs from the other side.

The threshold of 200 was chosen because hashing has a startup cost (allocating memory, computing hashes). For smaller lists, the simple loop is actually faster. But once you cross that threshold, the hash table wins.

This transforms the complexity from O(N^2) to O(N), making the estimation step virtually instantaneous even with the largest statistics targets.

Let's Benchmark

I wanted to verify this myself, so I set up a worst-case scenario: two tables with 3 million rows each, but only 10,000 distinct values, and I cranked the statistics target to the maximum (10,000) to force a massive MCV list.

Setup


CREATE TABLE t1 (x int);
CREATE TABLE t2 (x int);

-- Scenario A: 3M rows, Stats 10000
INSERT INTO t1 SELECT x % 10000 FROM generate_series(1, 3000000) x;
INSERT INTO t2 SELECT x % 10000 FROM generate_series(1, 3000000) x;

-- Maximize statistics target
ALTER TABLE t1 ALTER COLUMN x SET STATISTICS 10000;
ALTER TABLE t2 ALTER COLUMN x SET STATISTICS 10000;

ANALYZE t1;
ANALYZE t2;
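
As a quick sanity check (not part of the original benchmark), you can confirm the MCV lists really reached the target size by peeking at pg_stats; the double cast is just a convenient trick to turn the anyarray column into something countable:

SELECT tablename, attname,
       array_length(most_common_vals::text::text[], 1) AS n_mcv
FROM pg_stats
WHERE tablename IN ('t1', 't2') AND attname = 'x';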

Results

I ran the following query 10 times on both versions, interleaved to ensure fairness, and used the median of the runs:


EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM t1 JOIN t2 ON t1.x = t2.x;

Scenario A: High Stats (10k MCVs)

  • Rows: 3 Million
  • Statistics Target: 10,000
  • MCV Count: 10,000
  • Before (Postgres 18): ~27.8 ms
  • After (Postgres 19): ~1.75 ms
  • Speedup: ~16x

Scenario B: Medium Stats (1k MCVs)

  • Rows: 300,000
  • Statistics Target: 1,000
  • MCV Count: 1,000
  • Before (Postgres 18): ~0.85 ms
  • After (Postgres 19): ~0.60 ms
  • Speedup: ~1.4x (40% speed-up)

Scenario C: Default Stats (100 MCVs)

  • Rows: 30,000
  • Statistics Target: 100 (Default)
  • MCV Count: 100
  • Before (Postgres 18): ~0.40 ms
  • After (Postgres 19): ~0.43 ms
  • Speedup: None (Optimization correctly skipped)

Since the total MCV count in Scenario C (100 + 100 = 200) did not exceed the 200 threshold, Postgres 19 correctly chose the simpler loop, avoiding the overhead of building a hash table. This confirms that the patch is smart enough to only kick in when it matters.

The "Quadratic Curve" in Action

To visualize the O(N^2) vs O(N) difference, I ran a benchmark across a wide range of statistics targets (from 10 to 10,000). In each test, the number of rows was set to 300x the statistics target to ensure the MCV list was fully populated and relevant.

 Stats Target | PG 18 (ms) | PG 19 (ms) | Speedup
--------------+------------+------------+---------
           10 |       0.40 |       0.38 | -
          100 |       0.40 |       0.45 | -
          200 |       0.45 |       0.43 | -
          500 |       0.53 |       0.52 | -
         1000 |       0.85 |       0.63 | 1.3x
         2000 |       1.73 |       0.68 | 2.5x
         5000 |       7.54 |       1.14 | 6.6x
         8000 |      17.97 |       1.57 | 11.4x
        10000 |      27.56 |       1.92 | 14.3x

As you can see, up to a target of 500, the difference is negligible (and the optimization might not even kick in). But as the target grows, the quadratic cost of the old method explodes, while the new hash-based method scales linearly and remains extremely fast.

This patch is a classic example of modernizing legacy assumptions. The code written 20 years ago assumed MCV lists would be short (in all fairness the default was 100 for a long time). Today's hardware and data requirements have pushed those boundaries, and Postgres is evolving to meet them.

Thanks to Ilia Evdokimov for the patch, David Geier for co-authoring, and Tom Lane for the review.

Note: This feature is currently in the master branch for Postgres 19. As with all development features, it is subject to change before the final release.


29 Nov 2025

Teaching Query Planner to See Inside C Functions

The Postgres planner just got a new superpower: X-ray vision for your black-box functions.

For years, Postgres has been able to "inline" simple SQL functions. If you wrote CREATE FUNCTION ... LANGUAGE SQL AS 'SELECT ...', the planner would often rip out the function call and paste the raw SQL directly into the main query. This was magic. It meant that WHERE clauses could be pushed down, indexes could be used, and performance was excellent.
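
As a quick refresher, this is the kind of function that benefits from that inlining (a minimal illustrative example; the users table and its columns are hypothetical):

CREATE TABLE users (id int PRIMARY KEY, name text, active boolean);

CREATE FUNCTION active_users() RETURNS SETOF users
LANGUAGE sql STABLE
AS $$ SELECT * FROM users WHERE active $$;

-- The planner can splice the function body into the calling query, so the
-- id = 42 condition is applied directly to users and can use its primary key index:
SELECT * FROM active_users() WHERE id = 42;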

But if you stepped outside the safety of simple SQL functions—say, into the world of C extensions or complex dynamic SQL—you were often stuck with a "black box." The planner saw a function, executed it blindly, and only *then* applied your filters.

That changes in Postgres 19.

The "Black Box" Problem

Imagine you have a function that returns a large dataset, but you only want a few rows:


SELECT * FROM my_complex_function() WHERE id = 42;

Before Postgres 19, if my_complex_function wasn't a simple SQL function, the database would:

1. Run my_complex_function() to completion (generating, say, 1 million rows).

2. Filter the result to find id = 42.

This is painfully inefficient. You wanted an Index Scan, but you got a full execution followed by a filter.

The Solution: SupportRequestInlineInFrom

With this feature, Postgres now allows function authors (specifically in C extensions) to provide a "support function" that tells the planner: *"Hey, if anyone calls me in a FROM clause, here is the raw SQL query tree that represents what I'm doing."*

The planner can then take that query tree and inline it, just like it does for standard SQL functions.

The Impact

This is a game-changer for:

  • Extensions: Many extensions provide table-like interfaces (e.g., foreign data wrappers, set-returning functions). They can now expose their internal logic to the planner.
  • Dynamic SQL: Functions that generate SQL strings and execute them can now potentially participate in optimization.

Before vs. After

Scenario: A PL/pgSQL function foo_from_bar that executes a dynamic query on text_tbl (containing 1 million rows).


CREATE TABLE text_tbl (f1 text);
INSERT INTO text_tbl SELECT 'row_' || generate_series(1, 1000000);
INSERT INTO text_tbl SELECT 'common_row' FROM generate_series(1, 100);
CREATE INDEX text_tbl_idx ON text_tbl(f1);
ANALYZE text_tbl;

Here is the PL/pgSQL function I used for testing (designed to return 100 rows). Note that it uses dynamic SQL, which is normally opaque to the planner; I then linked it to the C support function test_inline_in_from_support_func using ALTER FUNCTION.


CREATE OR REPLACE FUNCTION foo_from_bar(colname TEXT, tablename TEXT, filter TEXT)
RETURNS SETOF TEXT
LANGUAGE plpgsql
AS $function$
DECLARE
  sql TEXT;
BEGIN
  sql := format('SELECT %I::text FROM %I', colname, tablename);
  IF filter IS NOT NULL THEN
    sql := CONCAT(sql, format(' WHERE %I::text = $1', colname));
  END IF;
  RETURN QUERY EXECUTE sql USING filter;
END;
$function$ STABLE;

ALTER FUNCTION foo_from_bar(TEXT, TEXT, TEXT)
  SUPPORT test_inline_in_from_support_func;

And here is the query I ran:


SELECT * FROM foo_from_bar('f1', 'text_tbl', 'common_row');

Before (PG 18 and earlier)


->  Function Scan on public.foo_from_bar  (cost=0.25..10.25 rows=1000 width=32)
      Output: foo_from_bar
      Function Call: foo_from_bar('f1'::text, 'text_tbl'::text, 'common_row'::text)
      Buffers: shared hit=32 read=4
      Execution Time: 0.353 ms
The function is treated as a black box. The planner estimates 1000 rows (default) and scans the function output.

After (PG 19)


->  Index Only Scan using text_tbl_idx on public.text_tbl  (cost=0.42..8.44 rows=1 width=10)
      Output: text_tbl.f1
      Index Cond: (text_tbl.f1 = 'common_row'::text)
      Buffers: shared hit=1 read=3
      Execution Time: 0.064 ms
The planner "sees through" the function, realizes it's just a query on text_tbl, and uses the index directly. The execution time drops significantly (5.5x faster in this case).

I also ran a pgbench test (1000 iterations) to measure the end-to-end latency improvement:

  • Before: 0.107 ms average latency
  • After: 0.087 ms average latency
  • Improvement: ~19% faster end-to-end (including network overhead)

Bonus: Outer WHERE Pushdown - Orders of magnitude faster!

Next, I ran a test where NULL was passed to the function (returning all 1 million rows) and applied an external WHERE clause:


SELECT * FROM foo_from_bar('f1', 'text_tbl', NULL) WHERE foo_from_bar = 'common_row';

Before (PG 18):


->  Function Scan on public.foo_from_bar  (cost=0.25..12.75 rows=5 width=32)
      Output: foo_from_bar
      Function Call: foo_from_bar('f1'::text, 'text_tbl'::text, NULL::text)
      Filter: (foo_from_bar.foo_from_bar = 'common_row'::text)
      Rows Removed by Filter: 1000000
      Execution Time: 136.267 ms
The database scanned all 1 million rows from the function, then filtered them down to 100.

After (PG 19):


->  Index Only Scan using text_tbl_idx on public.text_tbl  (cost=0.42..8.44 rows=1 width=10)
      Output: text_tbl.f1
      Index Cond: (text_tbl.f1 = 'common_row'::text)
      Execution Time: 0.055 ms
The planner pushed the WHERE clause *inside* the function logic, triggering an index lookup (~2400x speedup).

Similarly, I tested adding a LIMIT 5 clause to the query (shown after the list below) and saw a ~20x speedup!

  • Before: The database scanned the function output until it got 5 rows (~1.091 ms).
  • After: The planner pushed the limit down into the index scan, stopping immediately after finding 5 rows (~0.051 ms).
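
For reference, that test is just the NULL-filter call from above with the limit applied outside:

SELECT * FROM foo_from_bar('f1', 'text_tbl', NULL) LIMIT 5;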

Testing Methodology

The benchmarks above were conducted using pgbench with the following parameters:

  • Iterations: 1000 transactions per client.
  • Concurrency: Single client (1 thread).
  • Data Volume: The target table text_tbl contained 1,000,100 rows.
  • Return Size: The query returned 100 rows per execution.
  • Metric: Average latency reported by pgbench (excluding connection overhead).
  • Commits tested: c0bc9af (before) and b140c8d (after).

The C support function (test_inline_in_from_support_func) used in this example is defined in src/test/regress/regress.c, which is part of the regression tests. To know more about support functions, see the PostgreSQL Documentation on Function Optimization Information.

Discussion & Credits

Coincidentally, a LinkedIn thread mentioned that Set-Returning Functions (SRFs) began their journey in Postgres 7.3 - more than 20 years ago! With this patch, the planner should be able to see through such functions and build much better execution plans around them. Thanks to Paul A. Jungwirth (for initiating the idea with a working patch) and Tom Lane (for the in-depth review).

Note: This feature is currently in the master branch for Postgres 19. As with all development features, it is subject to change before the final release.

27 Nov 2025

Settling COUNT(*) vs COUNT(1) debate in Postgres 19

A recent commit to the PostgreSQL master branch brings a nice quality-of-life optimization for a very common SQL pattern - improving performance by up to 64% for SELECT COUNT(h) where h is a NOT NULL column.

If you've ever wondered whether you should use COUNT(*) or COUNT(1), or if you've been dutifully using COUNT(id) on a non-null column, this change is for you.

Note: This feature is currently on the PostgreSQL master branch (committed in November 2025). As with any commit on master, it is subject to change or even rollback before the final release, although this is relatively rare for committed features. If all goes well, this change will be part of the PostgreSQL 19 major version release.

The Optimization

Historically in PostgreSQL, COUNT(*) has been faster than COUNT(1) or COUNT(column).

  • COUNT(*): The planner knows it just needs to count rows. It doesn't need to check if any specific column is null.
  • COUNT(column): The executor has to fetch the value of the column for each row and check if it is NULL before counting it. This involves "deforming" the tuple (extracting the data), which adds overhead.
  • COUNT(1): Often used as a "faster" alternative by users, but treated similarly to COUNT(column) where the "column" is the constant 1.

The new commit, Have the planner replace COUNT(ANY) with COUNT(*), when possible, changes this behavior.

The planner can now detect when you are counting a generic expression (COUNT(ANY)) that:

1. Cannot be NULL: For example, a constant like 1 or a column defined as NOT NULL.

2. Has no extra clauses: No DISTINCT or ORDER BY inside the aggregate.

In these cases, the planner will automatically transform the query to use COUNT(*).

Performance Impact

This optimization eliminates the overhead of tuple deforming and null checking for these cases.

I ran a benchmark on a local PostgreSQL instance to verify the impact, comparing COUNT(h) (not null column), COUNT(1) (constant), and COUNT(*) (rows) on a table with 10 million rows.
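
Reconstructed from the methodology below, the setup was essentially this (the int column type and the explicit NOT NULL are assumptions, consistent with the width=4 seen in the plans and the "NOT NULL column" description):

CREATE TABLE t (h int NOT NULL);
INSERT INTO t (h) SELECT 1 FROM generate_series(1, 10000000);
VACUUM ANALYZE t;   -- not part of the stated methodology, but helps stabilize results

-- The three variants being compared:
SELECT count(h) FROM t;
SELECT count(1) FROM t;
SELECT count(*) FROM t;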

Before Patch:

  • SELECT count(h) FROM t: ~195 ms
  • SELECT count(1) FROM t: ~126 ms
  • SELECT count(*) FROM t: ~119 ms
  • *Observation:* COUNT(h) is significantly slower (~64% overhead vs count(*)). COUNT(1) is faster than COUNT(h) but still lags behind COUNT(*).

After Patch:

  • SELECT count(h) FROM t: ~117 ms
  • SELECT count(1) FROM t: ~116 ms
  • SELECT count(*) FROM t: ~114 ms
  • *Observation:* COUNT(h) sees the major win, improving dramatically to match COUNT(*). COUNT(1) sees a minor speedup, also converging to the optimal COUNT(*) performance.

This means users who prefer COUNT(1) for stylistic reasons, or who count specific non-null columns, will now get the optimal performance of COUNT(*) automatically.

(The patch author, David Rowley, demonstrated similar speedups in the initial discussion).

What about Nullable Columns?

I also tested COUNT(nullable_col) to see if the optimization applies. As expected, it does not.

  • Before Patch: COUNT(nullable_col): ~162 ms
  • After Patch: COUNT(nullable_col): ~158 ms (No significant change)

Since the planner cannot guarantee the column is free of NULLs, it must still scan and check every value, so the performance remains unchanged. The optimization strictly targets columns where the planner *knows* the value cannot be NULL.
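
If you want to reproduce that case, a minimal sketch looks like this (the column name g is illustrative and not from the original test):

ALTER TABLE t ADD COLUMN g int;   -- no NOT NULL constraint
UPDATE t SET g = h;               -- populate it so the scan does comparable work
VACUUM (FULL, ANALYZE) t;         -- remove the update bloat before timing
SELECT count(g) FROM t;           -- the planner cannot rewrite this to count(*)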

What about DISTINCT or ORDER BY?

I also tested COUNT(DISTINCT h) and COUNT(h ORDER BY h) to see if the optimization applies. As expected, it does not.

  • Before Patch: COUNT(DISTINCT h): ~2663 ms (Median of 10 runs)
  • After Patch: COUNT(DISTINCT h): ~2708 ms (Median of 10 runs)

Similarly for ORDER BY, the executor still needs to process the values to sort them, so no optimization is possible.

Test Methodology

  • Versions:
    • Before: Tested on commit dbdc717ac6743074c3a55fc5c380638c91d24afd.
    • After: Tested on commit 42473b3b31238b15cc3c030b4416b2ee79508d8c.
  • Data: Table t with 10 million rows, populated via INSERT INTO t (h) SELECT 1 FROM generate_series(1,10000000);.
  • Procedure: Each query was executed 10 times in an interleaved pattern (A, B, C, A, B, C...) to ensure fairness and mitigate caching bias. The reported time is the median of those 10 runs.
  • System: Intel Core i7-4770 CPU @ 3.40GHz (4 cores/8 threads), 16GB RAM.

Detailed Analysis

To understand *why* the performance improved, we can look at the EXPLAIN plans and table statistics.

  • Table Size: The table t (10 million rows) is approximately 346 MB.
  • Disk vs. Memory: The benchmark results (median of 10 runs) indicate mostly in-memory / hot-cache performance (~3 GB/s throughput). However, even with some disk I/O involved (as seen in single-run EXPLAIN (ANALYZE, BUFFERS) tests showing ~28k buffer reads), the CPU overhead of deforming the tuple was the dominant differentiator.

The Key Difference:

In the Before scenario, COUNT(h) required the scanner to fetch the column data (width=4), whereas COUNT(*) did not (width=0).


-- Before Patch: COUNT(h)
->  Parallel Seq Scan on t  (... rows=4166657 width=4) ...

In the After scenario, the planner transforms COUNT(h) into COUNT(*), allowing it to skip fetching the column data entirely (width=0).


-- After Patch: COUNT(h)
->  Parallel Seq Scan on t  (... rows=4166657 width=0) ...

This reduction in "width" means the executor doesn't need to extract the integer value from the row, saving significant CPU cycles.

Community Discussion & Trivia

Thanks to David Rowley for working on the patch and Corey Huinker / Matheus Alcantara for reviewing it. The discussions debated the trade-offs between which form is fastest and which of these is an "anti-pattern", but it is great to see that all three are now equivalently optimized.

Future Possibilities:

The infrastructure added (SupportRequestSimplifyAggref) opens the door for further optimizations. For instance, COUNT(NULL) could theoretically be optimized to instantly return 0 without scanning any rows - although that specific change wasn't included in this commit to keep the scope focused.


14 Aug 2025

Testing PostgreSQL on Debian/Hurd: A Windows + QEMU Adventure

Curiosity often leads to the most interesting technical adventures. This time, I decided to explore something off the beaten path: running Debian GNU/Hurd inside a virtual machine on my Windows 11 host and compiling PostgreSQL from source.

This post is part 1 of a multi-part series documenting the process, challenges, and discoveries along the way. Future parts will dive deeper into advanced topics, automation, and ongoing compatibility work—so if you're interested in PostgreSQL, alternative operating systems, or open source testing, stay tuned!

What is Debian?
Debian is one of the oldest and most respected Linux distributions, known for its stability, vast software repositories, and commitment to free software principles. While most people associate Debian with the Linux kernel, it’s actually a complete operating system that can run on different kernels.

What is GNU/Hurd?
GNU/Hurd is an alternative kernel developed by the GNU Project. Unlike Linux, GNU/Hurd is built on a microkernel architecture (specifically GNU Mach), aiming for greater modularity and flexibility. While GNU/Hurd is still experimental and not as mature or widely used as Linux, it represents a fascinating approach to operating system design.

Debian GNU/Hurd combines the familiar Debian userland (tools, package management, etc.) with the GNU/Hurd kernel, offering a unique environment for open source enthusiasts and OS tinkerers.

My goal for this experiment was to see how far I could get with a modern database stack—specifically, compiling and running PostgreSQL—on this unusual platform.



Setting Up the VM

Instead of the CD image, I used the pre-built disk image available here. After downloading and extracting the .img file, I launched the VM with QEMU using the following command:

qemu-system-x86_64.exe -machine type=pc,accel=whpx,kernel-irqchip=off -boot d -m 4096 -usb -display default,show-cursor=on -drive file=".\debian-hurd-i386-20250807.img",cache=writeback

Explanation of the command:

  • qemu-system-x86_64.exe: Runs QEMU for 64-bit x86 systems (works for 32-bit guests too).
  • -machine type=pc,accel=whpx,kernel-irqchip=off: Specifies a PC-type machine, enables Windows Hypervisor Platform acceleration (WHPX), and disables kernel IRQ chip emulation for compatibility.
  • -boot d: Sets the boot order to CD-ROM first; with no CD attached, QEMU falls back to booting from the hard disk image.
  • -m 4096: Allocates 4GB of RAM to the VM.
  • -usb: Enables USB support.
  • -display default,show-cursor=on: Uses the default display and ensures the mouse cursor is visible.
  • -drive file=".\debian-hurd-i386-20250807.img",cache=writeback: Uses the extracted Hurd disk image as the hard drive and enables writeback caching for better disk performance.

This boots directly into the installed Debian/Hurd system with improved performance and usability on a Windows 11 host.

Preparing to Build PostgreSQL

Debian/Hurd is minimal out of the box, so the first step was to install all the build tools and libraries required for compiling PostgreSQL:

sudo apt-get update
sudo apt-get install build-essential git libxml2-dev libxslt-dev autotools-dev automake libreadline-dev zlib1g-dev bison flex libssl-dev libpq-dev ccache

This command installs the compiler, linker, version control tools, XML and SSL libraries, autotools, and all other dependencies PostgreSQL may need for a successful build and test cycle.

Downloading and Compiling PostgreSQL

Instead of downloading a release tarball, I cloned the official PostgreSQL git repository and compiled the master branch:

git clone https://github.com/postgres/postgres.git
cd postgres
./configure --prefix=~/proj/localpg
make
make install

This approach ensures you're building the latest development version of PostgreSQL directly from source, and installs it locally to your user's ~/proj/localpg directory.

Setting Up the Database Cluster

PostgreSQL needs a data directory (cluster) to store its databases. Since the installation was local to my user, I simply initialized the cluster and started the server using the full path to the binaries (since they're not in my PATH):

~/proj/localpg/bin/initdb -D ~/proj/localpg/pgdata
~/proj/localpg/bin/pg_ctl -D ~/proj/localpg/pgdata -l logfile start

Connecting and Creating a Table

With the server running, I connected to the database and created a sample table:

~/proj/localpg/bin/psql -d postgres

Inside psql:

CREATE TABLE test_table (id SERIAL PRIMARY KEY, name TEXT);
INSERT INTO test_table (name) VALUES ('Hello from Debian/Hurd!');
SELECT * FROM test_table;

Example output:

CREATE TABLE
INSERT 0 1
 id |         name         
----+----------------------
  1 | Hello from Debian/Hurd!
(1 row)

Running the Test Suite

To ensure the build was solid, I went back to the source directory and ran:

cd ~/postgres
make check

This runs PostgreSQL's regression tests, verifying that the core features work as expected—even on Hurd. The run was mostly clean (except for a few tests that failed - more research is needed on those failures).

Quick QEMU Tip

When working with QEMU, remember that Ctrl-Alt-G is your friend—it releases the mouse and keyboard from the VM window, making it much easier to switch back to your host system.

Adding a Separate Volume for More Disk Space

The base Debian/Hurd image is quite small and can easily run out of space, especially when compiling large projects or running make check. I frequently hit disk full errors during testing.

Solution:

  1. Shutdown the VM.

  2. Resize the disk image:

    qemu-img resize debian-hurd-i386-20250807.img +10G
    

    This adds 10GB to the existing disk image.

  3. Restart the VM.

  4. Create a new partition:

    • Use fdisk /dev/hd0 (or the appropriate device) to create a new partition in the extra space.
  5. Format the new partition:

    mkfs.ext4 /dev/hd0s3
    

    (Note: On my setup, the original root partition was /dev/hd0s2, so the new partition created for extra space was /dev/hd0s3. Adjust the device name as needed for your configuration.)

    Although the root volume is of ext2 type (!!!), Debian/Hurd works fine with ext4—so feel free to use ext4 for the new partition.

  6. Mount the new volume:

    mkdir -p /mnt/newvol
    mount /dev/hd0s3 /mnt/newvol
    
  7. Grant non-root user access:

    • As root, change ownership:
      chown robins:robins /mnt/newvol
      
    • Now your non-root user (e.g., robins) can use /mnt/newvol for compiling PostgreSQL and running make check without running out of disk space.

Why use a non-root user for PostgreSQL? PostgreSQL is designed to run as a non-root user for security reasons. Running the database server or its tests as root can expose your system to unnecessary risks and may even cause certain operations to fail. Always use a dedicated non-root user for installation, testing, and day-to-day database operations.

This approach made it possible to complete the build and test cycle without disk space issues.

Final Thoughts

Running Debian/Hurd in a VM on Windows 11 was surprisingly smooth, though some packages and features are less mature than on Linux. Compiling PostgreSQL from scratch was a great way to explore the system's capabilities and compatibility. If you're looking for a fun, geeky weekend project, give Debian/Hurd a try!

Next Steps & What's Still Pending

This is only part 1 of a multi-part series. In future installments, I'll cover:

  • Setting up the PostgreSQL buildfarm for automated testing on Debian/Hurd
  • Deeper investigation into SMP/multi-core support (currently not working)
  • More QEMU optimization and compatibility testing
  • Additional performance tuning and disk management strategies
  • Troubleshooting Perl module installation issues (e.g., LWP::Protocol::https, LWP::Simple, Net::SSLeay), which currently fail to install—more research is needed to understand and resolve these problems.
  • Investigating why make check did not complete successfully (failed on a few tests)—this requires further research.

Some features, like multi-core support, full buildfarm integration, reliable Perl module installation, and passing all PostgreSQL regression tests, are not yet working or fully tested. These will be explored in detail in future posts. Stay tuned!

11 May 2025

FOR KEY SHARE optimization and the SLRU Trap

 



Optimizing database concurrency is a constant balancing act. We often tweak locking strategies in PostgreSQL, aiming to allow more simultaneous operations without compromising data integrity. A common scenario involves shifting from stricter row-level locks to more lenient ones. But sometimes, what seems like a straightforward optimization can lead to unexpected performance bottlenecks in less obvious parts of the system.

This post explores one such scenario: moving from SELECT FOR NO KEY UPDATE to SELECT FOR KEY SHARE, the potential for subsequent MultiXactOffsetSLRU wait events, and how PostgreSQL 17 offers a direct solution.


The Locking Strategy Shift: Aiming for Higher Concurrency

Let's start with the locks themselves. Within a transaction, you might initially use:

  • SELECT FOR NO KEY UPDATE: This acquires a moderately strong row lock. It prevents others from deleting the row, updating key columns, or acquiring FOR UPDATE / FOR NO KEY UPDATE locks on it. However, it does allow concurrent non-key updates and weaker FOR SHARE / FOR KEY SHARE locks. Importantly (and we’ll see why later), only one transaction can hold this lock (or FOR UPDATE) on a given row at a time.

To potentially increase concurrency, especially if you only need to prevent key changes or deletions (like ensuring a foreign key reference remains valid), you might switch to:

  • SELECT FOR KEY SHARE: This is a weaker, shared lock. It blocks deletions and key updates but allows concurrent non-key updates and even other concurrent SELECT FOR KEY SHARE (or FOR SHARE) locks on the exact same row.

The intended outcome of switching to FOR KEY SHARE is often to reduce blocking and allow more transactions to proceed in parallel, particularly if the main concern is referential integrity rather than preventing all concurrent modifications.
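
To make the difference concrete, here is a minimal two-session sketch (the accounts table and its values are purely illustrative):

-- Illustrative setup
CREATE TABLE accounts (id int PRIMARY KEY, owner text, balance numeric);
INSERT INTO accounts VALUES (1, 'alice', 100);

-- Session 1
BEGIN;
SELECT 1 FROM accounts WHERE id = 1 FOR KEY SHARE;        -- weak, shared lock

-- Session 2 (running concurrently): neither statement blocks
BEGIN;
SELECT 1 FROM accounts WHERE id = 1 FOR KEY SHARE;        -- a second sharer on the same row
UPDATE accounts SET balance = balance + 10 WHERE id = 1;  -- non-key update: still allowed

-- Postgres now has to remember that multiple transactions hold locks on the same
-- row, so it tags the row with a MultiXact ID instead of a single transaction ID.

-- Session 3: a DELETE (or an update to the key column) blocks here until the
-- sharers above commit or roll back.
DELETE FROM accounts WHERE id = 1;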


The Unforeseen Bottleneck: Enter MultiXacts and SLRU Caches

While the switch does allow higher concurrency at the row-lock level, it can create pressure elsewhere. Here’s the chain reaction:

  1. Increased Shared Lock Concurrency: Your application now has more situations where multiple transactions hold a shared lock (FOR KEY SHARE) on the same row simultaneously.

  2. The MultiXact System: How does PostgreSQL track that multiple transactions (potentially dozens or hundreds) have a shared interest in a single row? It uses a mechanism called MultiXact IDs (Multi-Transaction IDs). Instead of just one transaction ID locking the row, PostgreSQL assigns a special MultiXact ID that represents the group of transactions currently sharing a lock on it.

  3. SLRU Caches: Managing this MultiXact metadata efficiently requires quick access. PostgreSQL uses specialized SLRU (Simple Least Recently Used) caches in shared memory for this. These caches store the mappings (offsets) from rows to their MultiXact member lists (MultiXactOffsetSLRU) and the member lists themselves (MultiXactMemberSLRU).

  4. The Bottleneck (PG 16 and older): Before PostgreSQL 17, these SLRU caches had relatively small, fixed sizes determined at compile time. When the workload switch dramatically increased the demand for MultiXact tracking (due to more concurrent shared locks), these small caches could easily become overwhelmed.

  5. The Symptom (MultiXactOffsetSLRU Waits): An overloaded SLRU cache leads to performance degradation manifesting as specific Wait Events. You might see high MultiXactOffsetSLRU waits, indicating processes are frequently:

  • Waiting for disk I/O because the required MultiXact offset data wasn't found in the small cache (cache miss).

  • Waiting to acquire low-level locks needed to access or update the cache's shared memory buffers, because many processes are trying to use the limited cache concurrently (lock contention).

  • Many backends appear hung - as can be seen in this recent community thread.




So trying to increase concurrency at the row level created a bottleneck in the very mechanism that manages that concurrency!
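
If you suspect this on a running system, a quick look at pg_stat_activity shows how many backends are stuck on these waits:

SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE wait_event IN ('MultiXactOffsetSLRU', 'MultiXactMemberSLRU')
GROUP BY 1, 2
ORDER BY 3 DESC;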



PostgreSQL 17 to the Rescue: Configurable SLRU Buffers

Recognizing this potential bottleneck, the PostgreSQL developers introduced direct solutions in PostgreSQL 17 (released September 2024). These come in the form of new configurable parameters:

  • multixact_offset_buffers:

  • This parameter controls the size (in buffer pages) of the MultiXactOffset SLRU cache in shared memory. The default is very small (16); raising it lets administrators allocate more RAM to cache the crucial row-to-member-list mappings. This significantly increases the cache hit rate, reduces disk I/O for MultiXact offsets, and directly alleviates the pressure causing MultiXactOffsetSLRU waits.

  • multixact_member_buffers:

  • This parameter controls the size of the MultiXactMember SLRU cache, which stores the actual lists of transaction IDs. While it is less directly tied to the Offset wait event, keeping member lists in cache improves the overall performance and throughput of the entire MultiXact lookup process, which is essential when handling high shared-lock concurrency.

You can learn more about these new parameters in this fantastic discussion - https://pganalyze.com/blog/5mins-postgres-17-configurable-slru-cache

These parameters allow DBAs to tune the memory allocated to these critical caches based on their specific workload, moving away from the one-size-fits-all limitation of previous versions.
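
For example (the values are purely illustrative; both settings size shared memory, so they require a server restart to take effect):

ALTER SYSTEM SET multixact_offset_buffers = 128;
ALTER SYSTEM SET multixact_member_buffers = 256;
-- ...then restart the server and watch whether the MultiXactOffsetSLRU waits subside.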

Conclusion

Switching locking strategies in PostgreSQL requires careful consideration not just of the direct blocking rules but also of the potential impact on underlying mechanisms. Moving from SELECT FOR NO KEY UPDATE to the more concurrency-friendly SELECT FOR KEY SHARE can be beneficial, but it increases the load on the MultiXact system. In versions before PostgreSQL 17, this could lead to performance bottlenecks manifesting as MultiXactOffsetSLRU wait events due to contention on small, fixed-size caches.

Thankfully, PostgreSQL 17 provides the tools needed to manage this directly with the multixact_offset_buffers and multixact_member_buffers GUCs. If you encounter these specific wait events after increasing shared lock usage, upgrading to PostgreSQL 17+ and tuning these parameters should be a key part of your resolution strategy. As always, monitor your system's wait events and performance metrics closely when making changes to locking or configuration.


27 Oct 2024

What's in an empty table?


How much storage does an empty table in Postgres take?

This is a post about Postgres tables that store ... well basically ... Nothing.


The idea for this post came from this tweet, which hinted that an empty table on most databases today takes 16kb of storage. Now, admittedly, Franck was probably reminiscing about the good old days, so this is probably quite out of context, but it did get me thinking, and thus this post.


NB: Here's a video showing this in action! - See video.

A "regular" empty table in Production

Here's a regular small table that could be found in Production. It has a Primary Key, a text column and a JSONB column. Let's check the table size using the pg_total_relation_size() postgres function (you can read more about that function here).

db1=# create table t(id bigint primary key, b text, c jsonb);
CREATE TABLE

db1=# select pg_total_relation_size('t');
 pg_total_relation_size
------------------------
                  16384
(1 row)



Hmmm, so that tweet did have a point. Given how low-cost memory has become over the decades, it is easy to understand why databases today choose to optimize for speed over memory efficiency (more on this later) - and so even an empty table in Postgres consumes 16kb.

But "where" is the 16kb being used?


db1=# select pg_relation_size('t');
 pg_relation_size
------------------
                0
(1 row)

The relation itself isn't consuming any space! That is good (again, more on this later), but then where is the space being used?

db1=# \d t
                 Table "public.t"
 Column |  Type  | Collation | Nullable | Default
--------+--------+-----------+----------+---------
 id     | bigint |           | not null |
 b      | text   |           |          |
 c      | jsonb  |           |          |
Indexes:
    "t_pkey" PRIMARY KEY, btree (id)


We see that the table has a Primary Key - and thus an index - t_pkey

Is the index consuming the 16kb?

db1=# select pg_relation_size('t_pkey');
 pg_relation_size
------------------
             8192
(1 row)

So the index is using some of it - 8kb - so that is progress - but who's using the other 8kb?


Let's start cutting the table down

Let's start cutting down the columns and see if the disk-usage goes down. 

db1=# DROP TABLE t; CREATE TABLE t (id BIGINT PRIMARY KEY, b JSONB); select pg_total_relation_size('t');
DROP TABLE
CREATE TABLE
 pg_total_relation_size
------------------------
                  16384
(1 row)

db1=# DROP TABLE t; CREATE TABLE t (id BIGINT PRIMARY KEY, b TEXT); select pg_total_relation_size('t');
DROP TABLE
CREATE TABLE
 pg_total_relation_size
------------------------
                  16384
(1 row)


Hmmm, that didn't help at all. Dropping either the TEXT or the JSONB column didn't help. Let's look at the expanded version of this table to see if there's any similarity between the two columns. (I've clipped the output to make it easier to read)


db1=# \d+ t
                                            Table "public.t"
 Column |  Type  | Collation | Nullable | Default | Storage  | ...
--------+--------+-----------+----------+---------+----------+-...
 id     | bigint |           | not null |         | plain    | ...
 b      | text   |           |          |         | extended | ...
 c      | jsonb  |           |          |         | extended | ...
Indexes:
    "t_pkey" PRIMARY KEY, btree (id)
Access method: heap


Clearly, the two columns are "extended" (you can read more about it here), but basically what happened here is that an extended column type resulted in the creation of a TOAST table (you can read more about TOAST here). Let's find out "how" to find the TOAST table for the table "t", and check whether that could be consuming the remaining 8kb.


db1=# select oid, relname, reltoastrelid, reltoastrelid::regclass from pg_class where relname = 't';
  oid  | relname | reltoastrelid |      reltoastrelid
-------+---------+---------------+-------------------------
 18300 | t       |         18303 | pg_toast.pg_toast_18300
(1 row)

db1=# select pg_relation_size('pg_toast.pg_toast_18300');
 pg_relation_size
------------------
                0
(1 row)

Hmmm, it's not the TOAST table. But just like the main table "t", could it be that the TOAST table has supporting relations that are to blame?


db1=# select pg_total_relation_size('pg_toast.pg_toast_18300');
 pg_total_relation_size
------------------------
                   8192
(1 row)

db1=# \d pg_toast.pg_toast_18300
TOAST table "pg_toast.pg_toast_18300"
   Column   |  Type
------------+---------
 chunk_id   | oid
 chunk_seq  | integer
 chunk_data | bytea
Owning table: "public.t"
Indexes:
    "pg_toast_18300_index" PRIMARY KEY, btree (chunk_id, chunk_seq)

db1=# select pg_relation_size('pg_toast.pg_toast_18300_index');
 pg_relation_size
------------------
             8192
(1 row)

Yes! So we see above that the TOAST table implicitly has a primary key (of its own) that uses an index - which has 1 page (8kb) assigned to it.

In a nutshell, the empty table 't' above, consumes 16kb and the storage allocation takes this shape:

  • Main relation - 0 bytes - 't'
  • Main relation  Index - 8kb - 't_pkey'
  • Toast relation - 0 bytes - 'pg_toast_18300'
  • Toast relation Index - 8kb - 'pg_toast_18300_index'

Cut Cut Cut

Okay, let's see if we can reduce the table size further, by dropping both the Extended columns.


db1=# DROP TABLE t; CREATE TABLE t (id BIGINT PRIMARY KEY); select pg_total_relation_size('t');
DROP TABLE
CREATE TABLE
 pg_total_relation_size
------------------------
                   8192
(1 row)


Okay, that makes sense. The main table (heap) still isn't consuming anything, but the index still is.

Let's reduce further.

db1=# DROP TABLE t; CREATE TABLE t (id BIGINT); select pg_total_relation_size('t');
DROP TABLE
CREATE TABLE
 pg_total_relation_size
------------------------
                      0
(1 row)





0 bytes!! 


Nice! But seriously - 0 bytes?

It kind of makes sense: since Postgres doesn't yet have anything to store - besides the metadata of the table (and that metadata lives in the system catalogs, e.g. the pg_catalog schema) - there isn't anything to store in the main relation (heap) as yet.

Disk Usage - Check filesystem

Hmmm - Nah, what if I don't trust Postgres?

Let's skip Postgres functions and ask the filesystem directly - and see if the table is actually 0 bytes. Here we first find the file path of the table in question using the postgres function pg_relation_filepath() and then ask the file-system for the file-size.


db1=# select pg_relation_filepath('t');
 pg_relation_filepath
----------------------
 base/17727/18356
(1 row)

db1=# \! ls -la /home/robins/proj/localpg/data/base/17727/18356

-rw------- 1 robins robins 0 Sep 23 10:19 /home/robins/proj/localpg/data/base/17727/18356


So the file corresponding to the table, actually is using 0 bytes. Nice!

Now, when a table is created, some entries are added to the system catalogs. Let's see if the database size is significantly more than that of a blank database.


postgres=# create database db1;
CREATE DATABASE

postgres=# \c db1
You are now connected to database "db1" as user "robins".

db1=# CREATE TABLE t (id BIGINT);
CREATE TABLE

db1=# select pg_database_size('db1');
 pg_database_size
------------------
          7482515
(1 row)

db1=# create database db2;
CREATE DATABASE

db1=# \c db2
You are now connected to database "db2" as user "robins".

db2=# select pg_database_size('db2');
 pg_database_size
------------------
          7482515
(1 row)


Good. So this somewhat confirms that a blank database and a database with an "empty" table use the same disk-space. 

Postgres is hiding something

Technically though, I'm lying. Well, actually Postgres is hiding something from the file-system (i.e. every new table does make the database grow logically - a tad - it's just that most of the time the filesystem doesn't get the memo).

Under the covers, when Postgres stores data in a table (a catalog table, or any user table), it allocates a whole page of disk space (often 8kb) even though logically it may be using only a small part of that page. This is very helpful when more rows need to be stored in the table: when more rows come in, Postgres is able to reuse the same (first) page to logically store more data, while as far as the filesystem is concerned, no extra pages were requested. This is what I was hinting at earlier - today's databases use disk space (and thus memory cache) in larger (8kb) chunks and keep using that page until a new page is needed. Further below, I show a brief example of how all of this works.

But to summarize, it is then unfair to say that the system catalog did not grow at all (when a new table was added) - since some catalogs are guaranteed to have grown (for e.g. metadata of the new table is stored as an extra row in pg_class etc.) within the disk blocks already allocated as in-use for that catalog.

Let's squeeze a little more? 


db1=# create table a();
CREATE TABLE


Right off the bat, that might seem completely wrong. Does Postgres allow a table with no Columns?

Yes !! 😎

All databases allow creation of an empty table (obviously), but Postgres allows a new table even if there are no columns! Let's verify this from the Postgres Documentation. Although subtle, we can see that in the syntax section of the CREATE TABLE page, the column_name data_type clause is enclosed in square brackets [] - which implies that columns are, in fact, optional. What's more, this "feature" has been a part of Postgres for at least the past 20 years!

Now the utility of this table is arguable (we'll explore that below), but it is now clear that this syntax is legal and works just fine. 

Let's see how such a table looks with psql \d

db1=# \d a
                Table "public.a"
 Column | Type | Collation | Nullable | Default
--------+------+-----------+----------+---------


That's it! That's the complete output - Since the table has no columns, the output above (rightly) doesn't show anything.


Let's go a little deeper

The obvious next question is - what on earth could a table like this be used for? That is a perfectly good question, and the answer is: probably not much. However, if you instead ask whether "squeezing a little more" means that a no-column table takes less storage space, the answer (depending on the input data) is most probably yes. Let's see how it helps Postgres to know that you don't want to store any columns in the table.


db1=# create table a();
CREATE TABLE
db1=# \dt+ a
                                  List of relations
 Schema | Name | Type  | Owner  | Persistence | Access method |  Size   | Description
--------+------+-------+--------+-------------+---------------+---------+-------------
 public | a    | table | robins | permanent   | heap          | 0 bytes |
(1 row)

db1=# insert into a select;
INSERT 0 1

db1=# \dt+ a
                                    List of relations
 Schema | Name | Type  | Owner  | Persistence | Access method |    Size    | Description
--------+------+-------+--------+-------------+---------------+------------+-------------
 public | a    | table | robins | permanent   | heap          | 8192 bytes |
(1 row)


Nothing new here. We see that an empty (0-column) table consumes 0 bytes of storage, and, like a regular table, as soon as the first row is inserted the table uses 1 page - which in my test database (and probably 99.99% of Postgres databases world-wide) consumes 8192 bytes. This is expected, but do note that the storage of logical rows in a Postgres page is done a little oddly (and for good reason). There is a lllllooooottttt of detail here - but I wouldn't blame you if you'd want to keep that aside for a cold winter morning, when armed with a cup of hot coffee.

For now, we see below that each row that is inserted into the table, consumes 24 bytes - in that 8kb page.

db1=# create extension pageinspect ;
CREATE EXTENSION

db1=# truncate table a;
TRUNCATE TABLE

db1=# insert into a select FROM generate_series(1,3);
INSERT 0 3

db1=# SELECT * FROM heap_page_items(get_raw_page('a', 0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits
| t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------
+-------+--------
  1 |   8168 |        1 |     24 |  85105 |      0 |        0 | (0,1)  |           0 |       2048 |     24 |
|       | \x
  2 |   8144 |        1 |     24 |  85105 |      0 |        0 | (0,2)  |           0 |       2048 |     24 |
|       | \x
  3 |   8120 |        1 |     24 |  85105 |      0 |        0 | (0,3)  |           0 |       2048 |     24 |
|       | \x
(3 rows)



Still going Deeper - but on a Tangent

So does the above imply that adding more columns to a table means Postgres consumes more bytes per row? Let's verify:

db1=# drop table t;
DROP TABLE

db1=# create table t(id bigint);
CREATE TABLE

db1=# truncate table t; insert into t select generate_series(1,3); vacuum full t; \dt+ t
TRUNCATE TABLE
INSERT 0 3
VACUUM
                                    List of relations
 Schema | Name | Type  | Owner  | Persistence | Access method |    Size    | Description
--------+------+-------+--------+-------------+---------------+------------+-------------
 public | t    | table | robins | permanent   | heap          | 8192 bytes |
(1 row)

db1=# SELECT * FROM heap_page_items(get_raw_page('t', 0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |       t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+--------------------
  1 |   8160 |        1 |     32 |  85100 |      0 |        0 | (0,1)  |           1 |       2816 |     24 |        |       | \x0100000000000000
  2 |   8128 |        1 |     32 |  85100 |      0 |        0 | (0,2)  |           1 |       2816 |     24 |        |       | \x0200000000000000
  3 |   8096 |        1 |     32 |  85100 |      0 |        0 | (0,3)  |           1 |       2816 |     24 |        |       | \x0300000000000000
(3 rows)


Here we see that each row now consumes 32 bytes - an extra 8 bytes compared to earlier. Chances are the one column we've added (a bigint) is the reason for those extra 8 bytes. Let's verify that using the pg_column_size() function (you can read more about it here):


db1=# select pg_column_size(1::bigint);
 pg_column_size
----------------
              8
(1 row)
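
That accounts for the difference exactly: the 24-byte header we measured on the zero-column table, plus 8 bytes for the bigint value, gives the 32-byte lp_len reported by pageinspect. As a trivial cross-check in plain SQL (nothing new assumed here, just the two numbers we already measured):

-- 24-byte minimal tuple header (as measured above on the zero-column table)
-- plus the 8 bytes pg_column_size() reports for a bigint
SELECT 24 + pg_column_size(1::bigint) AS expected_lp_len;   -- 32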


But wait, there's one more twist here:

db1=# INSERT INTO t SELECT;
INSERT 0 1

db1=# SELECT * FROM heap_page_items(get_raw_page('t', 0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff |  t_bits  | t_oid |       t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+----------+-------+--------------------
  1 |   8160 |        1 |     32 |  85111 |      0 |        0 | (0,1)  |           1 |       2816 |     24 |          |       | \x0100000000000000
  2 |   8128 |        1 |     32 |  85111 |      0 |        0 | (0,2)  |           1 |       2816 |     24 |          |       | \x0200000000000000
  3 |   8096 |        1 |     32 |  85111 |      0 |        0 | (0,3)  |           1 |       2816 |     24 |          |       | \x0300000000000000
  4 |   8072 |        1 |     24 |  85113 |      0 |        0 | (0,4)  |           1 |       2049 |     24 | 00000000 |       | \x
(4 rows)

So wait - look at row 4. Even though the table has a column, because that column holds no value (it's NULL), the row consumed only 24 bytes (the minimum)?

Is this scalable? I mean, can I have a 5 column table and still have Postgres store a row that consumes only the bare minimum of 24 bytes? Let's see:


db1=# drop table h;
DROP TABLE

db1=# create table h(c1 bigint, c2 bigint, c3 bigint, c4 bigint, c5 bigint);
CREATE TABLE

db1=# insert into h select;
INSERT 0 1

db1=# SELECT * FROM heap_page_items(get_raw_page('h', 0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff |  t_bits  | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+----------+-------+--------
  1 |   8168 |        1 |     24 |  86262 |      0 |        0 | (0,1)  |           5 |       2049 |     24 | 00000000 |       | \x
(1 row)


So yes, that does work and it does scale to "many" columns - but with a minor variation. There's more detail in the code, but basically the tuple header carries a null bitmap with 1 bit per column (rounded up to whole bytes), and the header as a whole is then padded out to an 8-byte boundary - so for a 100 column table (with no data) Postgres consumes ~40 bytes per row.
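
Where does that ~40 come from? Here's the header math as a quick back-of-the-envelope query (assuming the 23-byte fixed tuple header and the 8-byte MAXALIGN of standard builds - properties of the on-disk format, not anything specific to this test):

-- 23-byte fixed header + null bitmap (1 bit per column, rounded up to whole bytes),
-- with the whole header then padded to the next multiple of 8 (MAXALIGN)
SELECT 23 + ceil(100 / 8.0)                   AS header_plus_bitmap,   -- 36
       ceil((23 + ceil(100 / 8.0)) / 8) * 8   AS padded_header;        -- 40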


db1=# drop table h; create table h(); select 'alter table h add column c' || n || ' bigint;' from generate_series(1,100) e(n); \gexec
.
.
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE

db1=# select count(*) from pg_attribute where attrelid = 'h'::regclass;
 count
-------
   106
(1 row)

db1=# insert into h select; vacuum full h; SELECT * FROM heap_page_items(get_raw_page('h', 0));
INSERT 0 1
VACUUM
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff |                                                  t_bits                                                  | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+----------------------------------------------------------------------------------------------------------+-------+--------
  1 |   8152 |        1 |     40 |  88376 |      0 |        0 | (0,1)  |         100 |       2817 |     40 | 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 |       | \x
(1 row)

Given the task at hand, I'd say that's still decently crisp.

So does a zero column table squeeze the max rows per page?


Yes, and no. Going back to the 0 column table, let's try to fill a whole page with rows and see how many can be stuffed onto a single page.

If you're running low on coffee: the page header is 24 bytes, so back-of-the-envelope math suggests the number of rows we should be able to squeeze into a page is roughly (8192 bytes in a page - 24 bytes of page header) / 24 bytes per row = ~340 rows.
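
Or, the same naive math as a one-liner (integer division, and deliberately ignoring any per-row overhead beyond the tuple itself):

-- (page size - 24-byte page header) / 24 bytes per empty tuple
SELECT (8192 - 24) / 24 AS naive_rows_per_page;   -- 340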


db1=# truncate table a; insert into a select FROM generate_series(1,340); vacuum full a; \dt+ a
TRUNCATE TABLE
INSERT 0 340
VACUUM
                                 List of relations
 Schema | Name | Type  | Owner  | Persistence | Access method | Size  | Description
--------+------+-------+--------+-------------+---------------+-------+-------------
 public | a    | table | robins | permanent   | heap          | 16 kB |
(1 row)

Something's not right! That size should have stayed at 8 kB. Why did Postgres use 16 kB (two pages instead of one)?

That's because of a tiny bit of trivia: the maximum number of tuples that Postgres can squeeze onto a page is hard-coded to 291 (for an 8kb page). Interesting to know - though to clarify, Heap-Only Tuples (the HOT feature) can effectively allow a few more rows on a running database, but we'll go deeper into that some other day.
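
For the curious, that 291 isn't arbitrary: it comes from the MaxHeapTuplesPerPage constant in the source, which divides the page size (minus the 24-byte page header) by the smallest possible per-tuple footprint - the 24-byte (aligned) minimal tuple header plus the 4-byte line pointer every tuple needs. That line pointer is exactly what our naive ~340 estimate forgot. Roughly, in SQL:

-- (block size - page header) / (aligned minimal tuple header + line pointer)
SELECT (8192 - 24) / (24 + 4) AS max_heap_tuples_per_page;   -- 291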

So let's go back and confirm that this understanding is correct - that Postgres can in fact squeeze at most 291 rows onto a single page.


db1=# truncate table a; insert into a select FROM generate_series(1,291); vacuum full a; \dt+ a
TRUNCATE TABLE
INSERT 0 291
VACUUM
                                    List of relations
 Schema | Name | Type  | Owner  | Persistence | Access method |    Size    | Description
--------+------+-------+--------+-------------+---------------+------------+-------------
 public | a    | table | robins | permanent   | heap          | 8192 bytes |
(1 row)


db1=# truncate table a; insert into a select FROM generate_series(1,292); vacuum full a; \dt+ a
TRUNCATE TABLE
INSERT 0 292
VACUUM
                                 List of relations
 Schema | Name | Type  | Owner  | Persistence | Access method | Size  | Description
--------+------+-------+--------+-------------+---------------+-------+-------------
 public | a    | table | robins | permanent   | heap          | 16 kB |
(1 row)

Here we see that:
  • When 291 rows are inserted, the table stays at 1 page (8 kB)
  • Whereas when 292 (291+1) rows are inserted, the table expands to 2 pages (16 kB)

Utility

So, that's all fine, but what's this table good for?

Well, beyond understanding Postgres :) not much. This table is unhelpful for most database tasks: it can't store values, there's nothing to index, and nothing to filter a selective delete on (let's avoid ctid hacks for now), etc.

But if I was forced to conjure up an idea: a high-contention ticker app (one that only needs to record +1s) could - maybe (and that's a BIG maybe) - use this table to store those +1s via Postgres functions, with a back-off algorithm on the application side. Chances are that a good DBA would do this much better in other ways (the simplest being at the application end, or as a value in a column, etc.), but it'd be better than nothing.
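
Purely to illustrate that (hypothetical) ticker idea - the hits table and record_hit() function below are my own naming, a sketch rather than anything battle-tested:

-- A zero-column table as a "+1" ticker: every event is one empty 24-byte row,
-- so concurrent writers never contend on a single counter row.
CREATE TABLE hits ();

CREATE FUNCTION record_hit() RETURNS void
LANGUAGE sql AS $$ INSERT INTO hits SELECT; $$;

-- Reading the tally is just a row count
SELECT count(*) AS total_hits FROM hits;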

There are a few other possible use-cases discussed here - e.g. if (for some reason) you'd want to add columns to a table programmatically after a base table is created (like we did above in the 100 column table test), or if you want to reserve a table name (in a multi-user setup) months / years in advance, etc.

Finally

Much Ado About Nothing... This was a good exercise where we learnt something new about how Postgres stores table data - when, ironically, there's nothing to store :)

Hope you had fun! Comments more than welcome.
