22 Apr 2026

Local LLM Update - ThinkStation Meets Blackwell

Finally - an update to the AI workbench, and one hell of an update it is! :)

The home server farm has now been blessed with dedicated silicon, and it is an absolute beast: the NVIDIA RTX 5060 Ti 16GB. The old setup—which was painfully grinding through CPU inference at a mere 1 token/sec on 46GB of RAM—has officially been retired. After months of relying on remote APIs, lightning-fast local agentic workflows are finally a reality, all without seriously over-working my legacy workstation (and waiting eons for responses).

To clarify, my paid subscriptions to cloud LLMs will continue. However, with this capable local setup handling the bulk of my daily tasks, I expect to hit the dreaded "credits have expired" message much later in the month—if ever. At the very least, having this local horsepower certainly pushes off the need to upgrade to a pricier "Ultra" plan just to get more API tokens.

The ultimate objective here is to deploy an agentic setup for the home and family. I initially investigated OpenClaw for the orchestration layer, but its security model was simply too porous for a paranoid systems engineer. IronClaw, on the other hand, looks like a solid, secure candidate to serve as the local agent nexus. Before any of that software could run, I needed an inference engine that did not crawl.

The Hardware: The ThinkStation S30 Meets Blackwell

The host machine is my trusty Lenovo ThinkStation S30 (Machine Type 4351). It is a Sandy/Ivy Bridge Xeon platform, firmly anchored in the PCIe 3.0 era. Dropping a brand-new Blackwell-architecture GPU into a system this old is technically a severe mismatch, but the RTX 5060 Ti 16GB is the perfect fit for this specific niche.

Instead of chasing older, power-hungry professional cards like the RTX A4000, or settling for consumer Turing cards with split memory bottlenecks, the 5060 Ti offered exactly what the S30 needed:

  • 16GB VRAM: The absolute minimum needed to comfortably fit modern 7B-9B parameter models.
  • GDDR7 Bandwidth: Hitting 448 GB/s, drastically improving throughput over older 60-series cards.
  • Native FP8/FP4 Support: Crucial for running highly quantized models efficiently.

Overcoming Legacy Architecture Limits

Getting a 2026 GPU to speak with a 2013 motherboard required some immediate troubleshooting. Initial boots into Linux Mint resulted in a wall of kernel panics and DMAR (DMA Remapping) faults. The modern GPU's memory management completely clashed with the S30's legacy Intel VT-d (IOMMU) implementation.

I resolved this by disabling Intel VT-d in the BIOS and appending intel_iommu=off to the GRUB bootloader parameters. This bypassed the broken firmware tables and allowed the system to boot stably with the proprietary NVIDIA 580.126.09 drivers.
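
For anyone hitting the same wall, the change amounts to the following (a sketch for Mint/Ubuntu-style GRUB setups; append the flag to whatever kernel parameters you already have):

# /etc/default/grub -- add intel_iommu=off to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=off"

# Regenerate the GRUB config and reboot (with VT-d also disabled in the BIOS)
sudo update-grub
sudo reboot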

Another significant bottleneck was the PCIe 3.0 bus itself. When running vLLM, the default CUDA Graph capture and torch.compile phases initially took a grueling 16 minutes to complete due to the slow bus speed. While it worked beautifully after that initial warmup, the long delay posed a problem: because I set up the inference engine as an auto-start systemd service at boot, systemd would assume the process had hung and shoot it down before compilation could finish. To resolve this, I bypassed the compilation overhead by starting vLLM with the --enforce-eager flag, ensuring the service starts up reliably without getting restarted by the OS.
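
For reference, the launch boils down to something like this (a sketch: the flags are vLLM's standard ones, but the Hugging Face repo prefix and serving port are assumptions on my part):

# Launch command behind the systemd unit; --enforce-eager skips the
# CUDA graph capture / torch.compile warmup that tripped systemd's timeout.
# (Model repo prefix and port are placeholders.)
vllm serve RedHatAI/Qwen2.5-Coder-7B-Instruct-FP8-Dynamic \
  --enforce-eager \
  --max-model-len 16384 \
  --port 8000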

Performance: 40-70 Tokens per Second

Despite the older host system, the 5060 Ti excels at small, localized tasks.

I settled on the Qwen2.5-Coder-7B-Instruct-FP8-Dynamic model. Because it leverages FP8 precision, the model weights consume roughly 8.5GB of the available 15.48 GiB VRAM. This leaves plenty of overhead for the KV cache and a 16k context window without spilling over into system RAM (which, across a PCIe 3.0 bus, would throttle performance down to single-digit tokens per second).

With vLLM 0.19.1 managing the inference (I initially started with Ollama but switched to vLLM—I don't have comparison numbers yet, but that is a topic for a future post), the S30 consistently pushes 40 to 70 tokens/sec for generation, and handles prompt processing at over 700 tokens/sec.

Here is a quick snapshot from the vLLM logs confirming these real-world speeds during a typical code completion task:

(APIServer pid=967222) INFO 04-22 19:09:50 [loggers.py:259] Engine 000: 
  Avg prompt throughput: 709.3 tokens/s, Avg generation throughput: 4.3 tokens/s, 
  Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.9%, 
  Prefix cache hit rate: 0.4%

(APIServer pid=967222) INFO:     127.0.0.1:60326 - "POST /v1/chat/completions 
  HTTP/1.1" 200 OK

(APIServer pid=967222) INFO 04-22 19:10:00 [loggers.py:259] Engine 000: 
  Avg prompt throughput: 52.4 tokens/s, Avg generation throughput: 41.0 tokens/s, 
  Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, 
  Prefix cache hit rate: 0.6%

The inference service now runs automatically via systemd. To securely access the engine from my laptop at home, I rely entirely on an SSH port forward. This means there are no firewall ports to open and no network-level security to configure; SSH handles the secure tunnel perfectly. On the client side, this local endpoint works beautifully with VS Code and the Continue extension, providing a seamless and entirely private AI coding experience.
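
The tunnel itself is a one-liner (a sketch; username, hostname, and the serving port are placeholders for my actual setup):

# Forward local port 8000 to the vLLM endpoint on the ThinkStation
ssh -N -L 8000:127.0.0.1:8000 user@thinkstation

# The OpenAI-compatible API now answers as if it were running on the laptop
curl http://127.0.0.1:8000/v1/models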

Power, Thermals, and Footprint

One of the best surprises of this Blackwell upgrade is how remarkably quiet and power-efficient the card is. It idles gracefully at 10W and rarely pushes past 20W during my typical small localized tasks. Because it runs so cool (sitting around 38°C with the fans completely off at 0%), stuffing the entire server rig into a cupboard does not trigger any thermal anxiety whatsoever.

For reference, the footprint while loaded is easy to snapshot straight from the driver:
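
# Query power draw, temperature, fan speed, and VRAM usage in one shot
nvidia-smi --query-gpu=power.draw,temperature.gpu,fan.speed,memory.used,memory.total \
  --format=csv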

With the hardware foundation finally stabilized, incredibly power-efficient, and pushing excellent tokens per second, the runway is completely clear to deploy IronClaw and build out the actual home agent capabilities.

Looking Ahead

I am absolutely thrilled to have reached this stage! Pressing my trusty old ThinkStation into service was a gamble that paid off beautifully, saving a cool $2,000 to $3,000 that would have otherwise gone into an entirely new workstation on top of the GPU cost. Better yet, this setup gives me massive flexibility to stagger future upgrades. If I ever need more VRAM, I can simply drop in a second Blackwell card for a dual-GPU setup—provided I finally upgrade the host machine to support PCIe 5.0 so the interconnect isn't choked by legacy bus speeds.

For now, I am eagerly looking forward to what's next. Having a secure, lightning-fast, and entirely local AI bedrock opens up incredible possibilities for the home network. I'll be diving deep into the IronClaw orchestration very soon, and you can expect more blog posts detailing that agentic journey in the coming weeks!

29 Mar 2026

Understanding Google AI Credits: What You're Actually Paying For


If you've subscribed to Google AI Pro [1] (or are eyeing an upgrade), you've probably noticed the word "credits" appearing everywhere — in your billing dashboard, in the Gemini app, and inside your code editor [5]. But what are they? How do they reset? And if your family is on the plan, who's burning through them? [3]

I spent an afternoon untangling this, and the short version is: Google doesn't have one quota system — it has three, each resetting on a different clock. Here's how it all works.


The Three Clocks

The single biggest source of confusion with Google AI Pro is that there are three independent limits running simultaneously, each with its own reset schedule:

Timer   | What Resets?               | Applies To
--------+----------------------------+---------------------------------------
Monthly | 1,000 AI Credits           | Video / 4K Images / IDE Overages
Weekly  | Antigravity Baseline Quota | "Free" high-tier usage in Antigravity
Daily   | Prompt & Media Limits      | Chat (Thinking/Pro), Images, Music

Understanding which "fuel tank" you're drawing from at any given moment is the key to not running dry at the wrong time.


Monthly AI Credits (The 1,000 Pool)

When you subscribe to AI Pro, Google allocates 1,000 AI Credits [1] at the start of each billing cycle. These are the credits you see in your Google One dashboard with a countdown showing days until reset.

Key facts:

  • No rollover. If you have 380 credits left with 10 days to go, those 380 vanish on your billing date [1]. You start fresh at 1,000.
  • These are for premium "heavy" tasks. High-end video generation (Google Flow/Veo 3.1) [4], advanced image creation (Imagen 4 / Nano Banana 2), and IDE overage billing all draw from this pool.
  • Standard tasks are free. Typing prompts into the Gemini app, generating standard images (Gemini / Imagen 4), and basic music tracks (Lyria 3) do not consume monthly credits up to their daily limits [4].

The Top-Up Exception: If you manually purchase a "Top-up" credit pack, those purchased credits are typically valid for 12 months from the date of purchase and do carry over across billing cycles [1]. Only your subscription credits are use-it-or-lose-it.

What Can You Buy With Credits?

Here are the standard costs for premium generation (2026 pricing) [2, 6]:

Feature                    | Credit Cost (Approx.) | What 380 credits buys you
---------------------------+-----------------------+---------------------------
Nano Banana 2 (Imagen 4)   | 1                     | ~380 high-res assets
Veo 3.1 Fast (via Flow)    | 10                    | ~38 cinematic clips
Veo 3.1 Quality (via Flow) | 100                   | ~3 cinematic clips
IDE Overage (per hour)*    | 15                    | ~25 hours of high-tier use

*Approximation based on typical token consumption in Antigravity.

Tip: If you're close to the end of your billing month with credits to spare, it's the perfect time for high-resolution 4K image generation or experimental Veo video clips. Those leftover credits would vanish at reset anyway, so the experiments cost you nothing "extra".


Antigravity Weekly Baseline Quota (The IDE Clock)

If you use Google Antigravity (the AI-powered code editor), you'll notice a separate "refresh date" in your settings. This date is typically different from your monthly credit reset.

What it is: Google gives AI Pro users a "Free Baseline" of high-performance model usage (Gemini 3.1 Pro, Claude 4.6 Sonnet, etc.) within Antigravity every week [5, 6].

Key facts:

  • 5-Hour Sprints: Your immediate capacity refreshes every 5 hours [5].
  • 7-Day Hard Cap: There is a weekly baseline limit. If you exhaust this, the 5-hour refresh stops working until your next 7-day reset (the "refresh date" you see) [5].
  • This quota is individual — each account on your family plan has its own personal IDE baseline.

The Overage Toggle

In Antigravity settings, the "Use AI Credits for Overages" toggle lets you decide what happens when your weekly baseline runs out:

  • ON: Antigravity draws from your monthly 1,000-credit pool to keep you coding at full speed.
  • OFF: You're limited to the "Flash" model until the weekly refresh hits.

Daily Quotas (Chat, Music, and Deep Research)

Finally, your day-to-day interactions have their own daily caps that reset at midnight Pacific Time. These never touch your 1,000 monthly credits:

Feature               | Limit Per Day
----------------------+--------------
Thinking Model Chat   | 300 prompts
Pro Model Chat        | 100 prompts
Nano Banana 2 (Std)   | 1,000 images
Lyria 3 (Music)       | 50 tracks [4]
Deep Research Reports | 3 reports

Family Plans: What's Shared, What's Not

Feature                  | Shared with Family?   | Reset Cadence
-------------------------+-----------------------+--------------
1,000 AI Credits         | Yes [3] (shared pool) | Monthly
2 TB Storage             | Yes [1] (shared pool) | Monthly
Daily Interaction Limits | No (individual)       | Daily
Antigravity Baseline     | No [5] (individual)   | Weekly

The Credits are the only shared fuel. If your family member generates a few Veo 3.1 videos, they are spending from the same 1,000-credit bucket you use for your IDE overages [3]. You can monitor this in the Google One → AI Credits Activity dashboard.

Important: Family members must be 18+ for high-tier model access. Under-18 accounts are restricted to Gemini Basic.


References

  1. Google One AI Pro: Membership Overview - Plan pricing and 1,000 credit allocation.
  2. Google One - AI Credits Pricing (2026) - Details on credit reset and top-ups.
  3. Google One Help - Managing Family AI Credit Activity - Shared pool tracking and activity history.
  4. Google Cloud - Expanding AI Creativity with Google Flow and Veo 3.1 - Media generation models and limits.
  5. Google DeepMind - Antigravity IDE Baseline Quota Framework - 5-hour refresh and weekly baseline rules.
  6. Anthropic - Claude Sonnet 4.6 on Google Cloud Vertex AI - February 2026 release news.

9 Mar 2026

Display IMDb Ratings on Einthusan


Surfing niche streaming sites without inline film ratings is a recipe for endless tab-opening and "analysis paralysis." To scratch my own itch, I put together a small userscript called Masala Script to fix this exact problem.

For context, Einthusan is a massive streaming directory for South Asian cinema. While it's an excellent digital archive, exploring its thousands of regional films is tedious because it lacks external metadata like IMDb ratings.

To solve this friction, Masala Script (presently a single Tampermonkey userscript) reads the Einthusan page, interfaces with the free OMDB API, and renders IMDb rating badges right next to the title.

Technical Features

While it might seem trivial to inject an API call onto a DOM load, building this script properly required addressing a few interesting technical hurdles:

  • Intelligent Fallback Matching via Wikipedia: South Asian movie titles vary wildly in transliteration. An exact-title search against the OMDB API frequently fails for regional films. However, Einthusan usually provides a Wikipedia link for each movie. I built a transparent scrape fallback: if the direct title/year fetch fails, the script fetches the linked Wikipedia page in the background, extracts the definitive ttXXXXXXX IMDb ID using a regular expression, and then repeats the OMDB query using that exact ID for an unambiguous match (see the sketch after this list).
  • Aggressive Client-Side Local Caching: The free tier of the OMDB API provides 1,000 requests per day. A typical Einthusan browse page can render over 20 movie cards at once. Scrolling through just 50 pages would immediately exhaust the daily allowance and result in rate limits. The script counters this by heavily utilizing the GM_setValue and GM_getValue Tampermonkey APIs—caching successful queries in the browser for 7 days, and failed title lookups for 1 day.
  • Detailed Error Tooltips: Rather than failing silently, any lookup that ultimately misses (or fails due to API config errors) renders a "Fail" or "N/A" UI badge. When hovered, it provides an exact exception traceback or error string so the user knows exactly why the movie metadata wasn't found.
  • Single Page Application Navigation Detection: Einthusan manages categories via AJAX updates, utilizing history.pushState and popstate to load new frames. The userscript actively monkey-patches and listens to these navigation boundaries, injecting the DOM observers properly on dynamic content swapping.
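
To make the fallback concrete, here is roughly what those lookups translate to at the HTTP level (a sketch with a placeholder API key, title, and Wikipedia URL; the script drives the same flow from the browser):

# 1) Direct title/year lookup - often misses for transliterated regional titles
curl "https://www.omdbapi.com/?apikey=YOUR_KEY&t=Movie+Title&y=2024"

# 2) Fallback: extract the definitive IMDb id from the linked Wikipedia page
curl -s "https://en.wikipedia.org/wiki/Some_Film" | grep -oE 'tt[0-9]{7,8}' | head -1

# 3) Repeat the OMDB query by exact id for a guaranteed match
curl "https://www.omdbapi.com/?apikey=YOUR_KEY&i=tt1234567"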

The Genesis of Masala Script

Built using modern AI tools, this project demonstrates how rapidly useful and robust browser enhancements can be coded from scratch with minimal manual boilerplate.

Looking at the initial Git history over the course of the day shows how quickly the script escalated from a basic exact-title scraper into a much more mature and intelligent tool:

commit 70027efcd906586cb46928b6da16c46c2402ae25
docs: replace development instructions with a detailed features and notes section in README.

commit 133dbdf7e6123f3a9ac774d820e1a1aa932051f4
Add caching for imdb ratings

commit acb25cc1187884e771efa9e32cf9836461d403c1
feat: Do a fallback search (via Wikipedia), in case first imdb rating fetch fails.

commit 06f3f05be8f6ebd08dfedffb80cf4d53544786f6
feat: Add movie page to the list of pages where this works

Try It Out

If you want to try it out on your next Einthusan movie night:

  1. Install the Tampermonkey browser extension.
  2. Claim a free OMDB API Key.
  3. Install the script by clicking on imdb-einthusan.user.js.

The next time you navigate to any browse or movie page on Einthusan, the script will prompt you for your OMDB API key (only once), and start displaying ratings next to the movie cards!

Although this extension might only interest a very niche subset of users, for those of us who regularly browse Einthusan (and heavily rely on IMDb to filter through the noise), it's a massive quality-of-life add-on to have.

If anyone is looking for new features, has a bug to report, or wants to contribute, feel free to create an issue or submit a pull request on the repository. Happy watching!

4 Feb 2026

The "Skip Scan" You Already Had Before v18

PostgreSQL 18 introduces native "Skip Scan" for multicolumn B-tree indexes, a major optimizer enhancement. However, a common misconception is that pre-v18 versions always fall back to sequential scans when the leading column isn't filtered. In reality, the cost-based planner has long been capable of leveraging such indexes when efficient.

How it works under the hood:

In pre-v18 versions, if an index is significantly smaller than the table (the heap), scanning the entire index to find matching rows—even without utilizing the tree structure to "skip"—can be cheaper than a sequential table scan. This is sometimes also referred to as a Full Index Scan. While this lacks v18's ability to hop between distinct leading keys, it still speeds up queries on non-leading columns for many common workloads.

Below, we will also see why this v18 improvement is not a game-changer for every workload, and why you shouldn't assume a speed-up across all kinds of datasets.

Benchmarks: v17 vs v18

To understand when the new v18 Skip Scan helps, we tested two distinct scenarios.

Scenario A: Low Cardinality (The "Success" Case)

The Setup: We created an index on (bid, abalance) and ran the following query:

SELECT COUNT(*) FROM pgbench_accounts WHERE abalance = -2187;

Note: We did not run a read-write workload, so abalance is 0 for all rows. The query returns 0 rows.

Data Statistics:

  • Table Rows: 1,000,000
  • Leading Column (bid): ~10 distinct values (Low Cardinality).
  • Filter Column (abalance): 1 distinct value (Uniform).

Results:

Version | Strategy                    | TPS     | Avg Latency | Buffers Hit
--------+-----------------------------+---------+-------------+------------
v17     | Index Only Scan (Full)      | ~2,511  | 0.40 ms     | 845
v18     | Index Only Scan (Skip Scan) | ~14,382 | 0.07 ms     | 39

Why the massive 5.7x gain?

Pre-v18 Postgres sees that bid is not constrained. It decides its only option is to read the entire index (all leaf pages) to find rows where abalance = -2187. It scans 1,000,000 items (845 pages).

Postgres v18 uses Skip Scan. It knows bid comes first. Instead of scanning sequentially, it:

  1. Seeks to the first bid (e.g., 1). Checks for abalance = -2187.
  2. "Skips" to the next unique bid (e.g., 2). Checks for abalance = -2187.
  3. Repeats for all 10 branches.

It effectively turns one giant scan into 10 small seeks. The EXPLAIN output confirms this with Index Searches: 11 (10 branches + 1 stop condition) and only 39 buffer hits.

v18 Skip Scan Plan (Scenario A):

 Index Only Scan using pgbench_accounts_bid_abalance_idx on public.pgbench_accounts ...
   Index Cond: (pgbench_accounts.abalance = '-2187'::integer)
   Heap Fetches: 0
   Index Searches: 11
   Buffers: shared hit=39

Scenario B: High Cardinality (The "Failure" Case)

The Setup: We used the same query, but against an index on (aid, abalance).

Data Statistics:

  • Leading Column (aid): 1,000,000 distinct values (100% Unique / High Cardinality).

Results:

Version | Strategy               | TPS   | Avg Latency | Buffers Hit
--------+------------------------+-------+-------------+------------
v17     | Index Only Scan (Full) | ~55.8 | 17.92 ms    | 2735
v18     | Index Only Scan (Full) | ~51.2 | 19.52 ms    | 2741

Why no improvement? If v18 attempted to "Skip Scan" here, it would have to skip 1,000,000 times (once for every unique aid). Performing 1 million seeks is significantly heavier than simply reading the 1 million index entries linearly (which benefits from sequential I/O and page pre-fetching). The planner correctly estimated this cost and fell back to the standard "Index Only Scan" used in v17.

v18 Plan (Identical to v17):

 Index Only Scan using pgbench_accounts_aid_abalance_idx on public.pgbench_accounts ...
   Index Cond: (pgbench_accounts.abalance = '-2187'::integer)
   Heap Fetches: 0
   Index Searches: 1
   Buffers: shared hit=2741

How to Identify a Skip Scan

You might notice that the node type in the EXPLAIN output remains Index Scan or Index Only Scan in both cases. PostgreSQL 18 does not introduce a special "Skip Scan" node.

Instead, you must look at the Index Searches line (visible with EXPLAIN (ANALYZE)):

  • v17 (and older): The Index Searches line does not appear.
  • v18 (Standard Scan): Index Searches: 1. The scan started at one point and read sequentially (as seen in Scenario B).
  • v18 (Skip Scan): Index Searches: > 1. The engine performed multiple seeks (as seen in Scenario A which had 11).

In our Scenario A (Success), Index Searches: 11 tells us it performed ~11 hops, confirming the feature was used.

How to reproduce:

Note: We use PostgreSQL 13 in this section to demonstrate that even versions long before v18 could efficiently use multicolumn indexes for these kinds of queries, avoiding a Full Table Scan by falling back to a Full Index Scan.

  • PostgreSQL 13 (a running cluster)
  • A pgbench-initialized database (run pgbench -i to create pgbench_accounts, pgbench_branches, pgbench_tellers, and pgbench_history)
  • Reasonable data loaded (default pgbench -i -s 10 or similar)

Steps to reproduce the example:

Follow these steps in your terminal to set up the test environment. This assumes you have PostgreSQL 13 (or a similar version) installed and its command-line tools are in your PATH.

Step 1: Create and Initialize the Database

createdb testdb                # Create a new database
pgbench -i -s 10 testdb         # Initialize it with 1 million rows

Step 2: Run a Quick Read-Write Test

To simulate some database activity, run a standard 10-second read-write benchmark.

pgbench -T 10 testdb

Step 3: Connect and Run the Test Query

Now, connect to the database with psql to run the SQL commands that demonstrate the index scan behavior.

SQL steps (run in psql):

psql testdb
-- create a multicolumn index (first column aid, second column abalance)
testdb=# CREATE INDEX CONCURRENTLY IF NOT EXISTS pgbench_accounts_aid_abalance_idx
  ON pgbench_accounts(aid, abalance);

testdb=# select abalance, count(*) from pgbench_accounts group by abalance order by count(*) desc limit 5;
 abalance | count
----------+--------
        0 | 995151
    -2187 |      4
    -4621 |      4
    -4030 |      4
    -2953 |      4
(5 rows)

-- run an explain on a selection that filters only on the second column (abalance)
testdb=# EXPLAIN (ANALYSE, VERBOSE, COSTS)
SELECT COUNT(*) FROM pgbench_accounts WHERE abalance = -2187;

Aggregate  (cost=18480.77..18480.78 rows=1 width=8) (actual time=18.672..18.674 rows=1 loops=1)
   Output: count(*)
   ->  Index Only Scan using pgbench_accounts_aid_abalance_idx on public.pgbench_accounts  (cost=0.42..18480.69 rows=33 width=0) (actual time=8.583..18.659 rows=4 loops=1)
         Output: aid, abalance
         Index Cond: (pgbench_accounts.abalance = '-2187'::integer)
         Heap Fetches: 0
 Planning Time: 0.328 ms
 Execution Time: 18.732 ms
(8 rows)

Notes on the output above:

  • The important line is Index Only Scan using pgbench_accounts_aid_abalance_idx along with Index Cond: (pgbench_accounts.abalance = '-2187'::integer) — this demonstrates the planner chose an index scan that probes the multicolumn btree for entries where the second column matches the predicate.

Why this is not the v18 "skip-scan" feature in full:

  • The v18 skip-scan adds optimizer logic to efficiently iterate distinct values of leading columns and probe the remainder of the index; it's a targeted algorithmic addition. What we show above is the planner choosing an index scan over a multicolumn index and applying the Index Cond to the second column. That can be effective for many queries and data layouts, but it lacks the specialized skip-scan internals that v18 adds for certain other cases.

Production Tips

  • If you need queries on a non-leading column to be fast on older Postgres versions, create the right index and keep statistics current (ANALYZE). The planner may prefer an index scan over a seq scan when selectivity and costs align.
  • Consider partial or expression indexes when appropriate; they let you make an index that directly serves the filter you need (see the sketch after this list).
  • When portability across versions is important, test on the earliest Postgres version you need to support; planner behavior can vary by release, statistics, and data distribution.
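
For example, an index that directly serves the abalance hunt from earlier might look like this (a sketch; the predicate assumes you only ever filter on non-zero balances):

# Tiny partial index: ~99.5% of the pgbench rows have abalance = 0 and are excluded
psql testdb -c "CREATE INDEX CONCURRENTLY IF NOT EXISTS pgbench_accounts_nonzero_abalance_idx
  ON pgbench_accounts (abalance) WHERE abalance <> 0;"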

Conclusion

Postgres v18's documented skip-scan addition for B-tree indexes is a welcome and useful optimizer enhancement, specifically for low-cardinality leading columns. However, for high-cardinality leading columns (like our Scenario B), the standard Full Index Scan remains optimal, and pre-v18 versions handle them just fine.

18 Jan 2026

Turbocharging LISTEN/NOTIFY with 40x Boost

Unless you've built a massive real-time notification system with thousands of distinct channels, it is easy to miss the quadratic performance bottleneck that Postgres used to have in its notification queue. A recent commit fixes that with a spectacular throughput improvement.

The commit in question is 282b1cde9d, which landed recently and targets a future release (likely Postgres 19, though as with all master branch commits, there's a standard caveat that it could be rolled back before release).

The Problem

Previously, when a backend sent a notification (via NOTIFY), Postgres had to determine which other backends were listening to that channel. The implementation involved essentially a linear search or walk through a list for each notification, which became costly when the number of unique channels grew large.

If you had a scenario with thousands of unique channels (e.g., a "table change" notification system where each entity has its own channel), the cost of dispatching notifications would scale quadratically. For a transaction sending N notifications to N different channels, the effort was roughly O(N^2).
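
A crude way to feel this cliff yourself is to time the COMMIT of a transaction that notifies many distinct channels, since notifications are dispatched at commit (a sketch; have a second session LISTEN on these channels to mimic the benchmark below):

# One transaction, 1,000 NOTIFYs to 1,000 distinct channels; watch the COMMIT timing
{
  echo '\timing on'
  echo 'BEGIN;'
  seq -f 'NOTIFY channel_%g;' 1 1000
  echo 'COMMIT;'
} | psql -p 5432 postgres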

The Fix

The optimization introduces a shared hash table to track listeners. Instead of iterating through a list to find interested backends, the notifying process can now look up the channel in the hash table to instantly find who (if anyone) is listening. This implementation uses Partitioned Hash Locking, allowing concurrent LISTEN/UNLISTEN commands without global lock contention.

Additionally, the patch includes an optimization to "direct advance" the queue pointer for listeners that are not interested in the current batch of messages. This is coupled with a Wake-Only-Tail strategy that signals only the backend at the tail of the queue, avoiding "thundering herd" wake-ups and significantly reducing CPU context switches.

Finally, the patch helps observability by adding specific Wait Events, making it easier to spot contention in pg_stat_activity.
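
Something like this against pg_stat_activity will surface them on a live system (a sketch; the exact event names may still change before release):

# Aggregate current wait events across all backends
psql -c "SELECT wait_event_type, wait_event, count(*)
         FROM pg_stat_activity
         WHERE wait_event IS NOT NULL
         GROUP BY 1, 2
         ORDER BY count(*) DESC;"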

Benchmarking Methodology

Tests were conducted on an Intel Core i7-4770 server with 16GB RAM.

Two Postgres instances were set up:

  1. Before: Commit 23b25586d (the parent of the fix), running on port 5432.
  2. After: Commit 282b1cde9 (the fix), running on port 5433.

The "Sender's Commit Time" (in milliseconds) was measured across three different scenarios to test scalability limits.

Results

Test 1: Channel Scalability

Metric: Time to send 1 NOTIFY to N unique channels.

The "Before" code exhibits clear O(N^2) or similar quadratic degradation as the number of channels increases. The "After" code remains effectively flat.


Channels | Before (ms) | After (ms) | Speedup
---------+-------------+------------+--------
1        | 0.095       | 0.094      | 1x
10       | 0.059       | 0.054      | 1.1x
100      | 0.080       | 0.089      | 0.9x
1,000    | 2.032       | 0.485      | 4.2x
2,000    | 6.195       | 0.791      | 7.8x
3,000    | 13.314      | 1.019      | 13x
4,000    | 24.256      | 1.656      | 15x
5,000    | 35.711      | 1.693      | 21x
6,000    | 54.520      | 2.456      | 22x
7,000    | 70.088      | 3.274      | 21x
8,000    | 93.798      | 3.450      | 27x
9,000    | 128.252     | 4.483      | 29x
10,000   | 140.061     | 3.598      | 39x


Test 2: Listener Scalability (at 1,000 channels)

Metric: Impact of multiple backends listening to the same channels.

Listeners | Before (ms) | After (ms) | Speedup
----------+-------------+------------+--------
1         | 2.468       | 0.452      | 5.5x
10        | 5.085       | 0.536      | 9.5x
50        | 2.728       | 0.522      | 5.2x
100       | 4.342       | 0.620      | 7.0x
200       | 5.485       | 0.783      | 7.0x


Test 3: Idle User Overhead (at 1,000 channels)

Metric: Impact of connected but non-listening backends.

Idle Users | Before (ms) | After (ms) | Speedup
-----------+-------------+------------+--------
0          | 3.065       | 0.425      | 7.2x
50         | 3.123       | 0.474      | 6.6x
100        | 4.219       | 0.353      | 12x
200        | 3.262       | 0.437      | 7.5x
300        | 2.102       | 0.381      | 5.5x
400        | 2.500       | 0.384      | 6.5x
500        | 2.091       | 0.362      | 5.8x


Conclusion

This optimization eliminates a significant scalability cliff in Postgres's messaging system. While "don't create 100,000 channels" is still good advice, it's great to see Postgres becoming more robust against these high-load scenarios.

For those interested in the technical nitty-gritty, reading the discussion thread is highly recommended to see how the solution evolved from a simple list switch to a shared hash map implementation over thirty-five (35!) iterations before finally landing in the codebase.

Credits: The commit was authored by Joel Jacobson and reviewed extensively by Tom Lane (who also committed it) as well as Chao Li.


9 Dec 2025

3x Faster TID Range Scans - Postgres 19

If you've ever had to scrub significantly large tables—whether updating older records or deleting expired ones—you know the pain of trying to do it efficiently without locking the entire table. One common trick is to iterate over the ctid (the physical location of the tuple) to process the table in chunks. Until now, this was strictly a single-threaded affair. But with the latest commit to Postgres 19, TID Range Scans can now be parallelized, allowing you to harness multiple cores to rip through these maintenance tasks significantly faster.

In my tests, up to 3x faster!

The Problem with Linear Scanning

When you perform a batched operation using ctid (e.g., WHERE ctid >= '(0,0)' AND ctid < '(1000,0)'), Postgres uses a Tid Range Scan. Previously, this scan type didn't support parallelism. If you had a 1 TB table and wanted to process ranges of it, a single CPU core would have to do all the heavy lifting for that range, fetching blocks one by one. Even if you had 64 cores gathering dust, they couldn't help with that specific scan node.
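
As a refresher, such a chunked scrub typically looks like the loop below (a sketch: the events table, its expired flag, and the default 8kB block size are assumptions for illustration):

# Hypothetical maintenance loop: delete expired rows 1,000 heap pages at a time
PAGES=$(psql -At -c "SELECT pg_relation_size('events') / 8192;")
for start in $(seq 0 1000 "$PAGES"); do
  psql -c "DELETE FROM events
           WHERE ctid >= '($start,0)'::tid
             AND ctid <  '($((start + 1000)),0)'::tid
             AND expired;"
done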

The Solution: Parallel TID Range Scan

Commit 0ca3b169, developed by Cary Huang with an extensive review (and committed) by David Rowley, introduces infrastructure to allow Tid Range Scan to participate in parallel query plans. The logic effectively splits the block range among the available parallel workers. Instead of one process sweeping from block 0 to N, workers can grab chunks of blocks concurrently.

NOTE
This feature is currently in the master branch (Postgres 19devel). As with all unreleased features, it could theoretically be rolled back or modified before the final release.

Benchmark: Before vs. After

To see this in action, I spun up two Postgres instances: one built just before the commit (Postgres 18) and one after (Postgres 19). I created a table bench_tid_range with 10 million rows and ran a count(*) query over the first 50% of the table using a ctid range condition.

Test Setup:

  • Table Size: 10 Million rows
  • Query: SELECT count(*) FROM bench_tid_range WHERE ctid >= '(0,0)' AND ctid < '(41667,0)'

Environment    | Workers | Execution Time (Median) | Speedup
---------------+---------+-------------------------+--------
Before (Pg 18) | 0       | 448 ms                  | 1.00x
After (Pg 19)  | 0       | 435 ms                  | 1.03x
After (Pg 19)  | 1       | 238 ms                  | 1.88x
After (Pg 19)  | 2       | 174 ms                  | 2.58x
After (Pg 19)  | 3       | 151 ms                  | 2.97x
After (Pg 19)  | 4       | 150 ms                  | 2.98x
After (Pg 19)  | 5       | 147 ms                  | 3.05x
After (Pg 19)  | 6       | 143 ms                  | 3.14x
After (Pg 19)  | 7       | 147 ms                  | 3.04x
After (Pg 19)  | 8       | 147 ms                  | 3.04x

(Each execution time is a Median of 9 runs)

The table above illustrates the performance scaling. Note that Postgres 19 with 0 workers (i.e. serial) is almost identical in speed to Postgres 18; the efficiency gains are visible only when parallelism is enabled.

We see a massive drop in execution time just by enabling 1 worker (which effectively gives us 2 processes scanning: the leader + 1 worker). The gains continue robustly up to 3 workers, where execution time settles around 150ms. Beyond that, we hit diminishing returns as the overhead of managing parallel workers and aggregating results begins to dominate the raw scanning speed. The "sweet spot" for this specific workload appears to be around 2-3 workers.

Under the Hood: Deep Dive

Let's look at the actual EXPLAIN (ANALYZE, BUFFERS) output to see *why* it's faster.

Before (Postgres 18)

The legacy behavior forces a serial scan. The single process reads all 41,667 pages.


 Aggregate  (cost=104171.87..104171.88 rows=1 width=8) (actual time=427.723..427.724 rows=1.00 loops=1)
   Buffers: shared read=41667
   ->  Tid Range Scan on bench_tid_range  (cost=0.01..91671.70 rows=5000069 width=0) (actual time=0.130..243.502 rows=5000040.00 loops=1)
         TID Cond: ((ctid >= '(0,0)'::tid) AND (ctid < '(41667,0)'::tid))
         Buffers: shared read=41667

After (Postgres 19)

The new Parallel Tid Range Scan node appears. Notice the Gather node launching 2 workers.


 Finalize Aggregate  (cost=68713.25..68713.26 rows=1 width=8) (actual time=166.873..169.987 rows=1.00 loops=1)
   Buffers: shared read=41667
   ->  Gather  (cost=68713.03..68713.24 rows=2 width=8) (actual time=166.238..169.978 rows=3.00 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared read=41667
         ->  Partial Aggregate  (cost=67713.03..67713.04 rows=1 width=8) (actual time=164.518..164.519 rows=1.00 loops=3)
               Buffers: shared read=41667
               ->  Parallel Tid Range Scan on bench_tid_range  (cost=0.01..62504.63 rows=2083362 width=0)
                     (actual time=0.095..97.492 rows=1666680.00 loops=3)
                     TID Cond: ((ctid >= '(0,0)'::tid) AND (ctid < '(41667,0)'::tid))
                     Buffers: shared read=41667

Key Improvements:

1. Distributed Cost: The cost estimate dropped from 104k to 68k, indicating the planner correctly anticipates the benefit of parallelism.

2. Shared Load: While the total buffers read (41667) is the same, the work is shared across 3 processes (1 leader + 2 workers), reducing the wall-clock time for the scan itself from ~243ms to ~97ms per worker.

Counter-Intuitive Alternative: Forced Parallel Seq Scan?

You might ask: "Why didn't Postgres v18 just choose a Parallel Sequential Scan? Wouldn't scanning the whole table with 4 workers be faster than scanning half the table with 1?"

I tested exactly that on the v18 instance by forcing enable_tidscan = off with 4 workers.

  • Execution Time: ~230 ms.

 Finalize Aggregate  (cost=124959.76..124959.77 rows=1 width=8) (actual time=227.917..229.476 rows=1.00 loops=1)
   Buffers: shared hit=15791 read=67543
   ->  Gather  (cost=124959.34..124959.75 rows=4 width=8) (actual time=227.660..229.469 rows=5.00 loops=1)
         Workers Planned: 4
         Workers Launched: 4
         Buffers: shared hit=15791 read=67543
         ->  Partial Aggregate  (cost=123959.34..123959.35 rows=1 width=8) (actual time=224.154..224.155 rows=1.00 loops=5)
               Buffers: shared hit=15791 read=67543
               ->  Parallel Seq Scan on bench_tid_range  (cost=0.00..120834.30 rows=1250017 width=0) 
                     (actual time=0.150..169.772 rows=1000008.00 loops=5)
                     Filter: ((ctid >= '(0,0)'::tid) AND (ctid < '(41667,0)'::tid))
                     Rows Removed by Filter: 999992
                     Buffers: shared hit=15791 read=67543
 Planning:
   Buffers: shared hit=25
 Planning Time: 0.217 ms
 Execution Time: 229.584 ms

It was faster than the serial TID scan (~448 ms)! However, it had to visit all ~83k pages (15,791 hit + 67,543 read) instead of just the ~41k pages we cared about.

The new Parallel TID Range Scan (~150 ms) is still 35% faster than the brute-force/forced Parallel Seq Scan *and* it generates half the I/O load. It is the best of both worlds: fast execution time (Parallelism) and efficient resource usage (Index-like scoping).

Use Case: The Hotel Analogy

To understand why this matters, let's step away from databases and imagine a massive hotel with 1 million rooms. You need to verify all guests whose names start with a letter before 'S'.

The Logical Approach (Offset Problem)

You tell the front desk: "Give me a list of the first 100 guests with names A-R." matches 100 people named 'Aaron', 'Abby', etc.

Then you ask: "Give me the next 100 guests."

To answer this, the clerk has to skip through the 'A's again just to know where to start counting the 'B's. As you get deeper into the alphabet, the clerk spends more and more time skipping known names just to find the new ones. By the time you are asking for names starting with 'R', they are re-reading 90% of the guest list just to find your batch.

The Physical Approach (TID Scan)

This is the TID Range Scan. You stop caring about the names (the logical value) and organize by Room Number (the physical location).

  • Worker 1: "Go check Rooms 1 to 100."
  • Worker 2: "Go check Rooms 101 to 200."

Worker 2 doesn't care who is in Room 1. They go straight to Room 101. If the person inside is named "Zach", they skip them. If it's "Alice", they verify them.

  • Zero Rescanning: No one re-reads the list of 'A's.
  • Parallelism: Because the rooms are physically distinct, you can send 8 workers to different floors simultaneously without them needing to coordinate or skip over each other's results.

This approach transforms an O(N^2) maintenance nightmare into a linear, parallelizable task.

Reproduce this Test

Want to try it yourself on Postgres 19devel? Here is the SQL to set up the test table and run the benchmark.


-- 1. Create the table
DROP TABLE IF EXISTS bench_tid_range;
CREATE TABLE bench_tid_range (id int, payload text);

-- 2. Insert 10M rows (~83k heap pages; the query below scans the first ~41k)
INSERT INTO bench_tid_range 
SELECT x, 'payload_' || x 
FROM generate_series(1, 10000000) x;

-- 3. Vacuum to set visibility map and freeze (important for stable benchmarks)
VACUUM (ANALYZE, FREEZE) bench_tid_range;

-- 4. Enable Parallelism for the Session
SET max_parallel_workers_per_gather = 4; -- Try 2, 4, 8
SET min_parallel_table_scan_size = 0;    -- Force parallel scan even for smaller tables

-- 5. Run the Query
EXPLAIN (ANALYZE, BUFFERS) 
SELECT count(*) 
FROM bench_tid_range 
WHERE ctid >= '(0,0)' AND ctid < '(41667,0)';

Conclusion

This is a welcome "plumbing" improvement. It might not change your daily ad-hoc queries, but for DBAs and developers building custom data maintenance scripts, the ability to parallelize TID based scans is a powerful new tool in the optimization toolkit.

3 Dec 2025

Speed up JOIN Planning - up to 16x Faster!

The hidden cost of knowing too much. That's one way to describe what happens when your data is skewed, Postgres statistics targets are set high, and the planner tries to estimate a join.

For over 20 years, Postgres used a simple O(N^2) loop to compare (equi-join) Most Common Values (MCVs) during join estimation. It worked fine when statistics targets were small (default_statistics_target defaults to 100). But in the modern era, Postgres best-practice guides often recommend cranking that up, and customers are known to run far higher values (1,000 and sometimes beyond) to handle complex data distributions. Throw a 10-join query into the mix, and this "dumb loop" can easily become a silent performance killer during planning.

That changes in Postgres 19.

The Problem: It's Quadratic!

When you join two tables, the planner needs to estimate how many rows will match. If both columns have MCV lists (lists of the most frequent values), eqjoinsel() tries to match them up to get a precise estimate.

Historically, it did this by comparing every item in list A with every item in list B.

  • If you have 100 MCVs, that's 10,000 comparisons. Fast.
  • If you have 10,000 MCVs (max stats target), that's 100,000,000 comparisons. Not so fast.

This meant that simply *planning* a query could take significantly longer than executing it, especially for simple OLTP queries.

The Solution: Hash It Out

The fix is elegant and effective.

Instead of a nested loop, the planner now:

1. Checks if the total number of MCVs is greater than 200 (100 each side).

2. If so, it builds a Hash Table of the MCVs from one (smaller) side.

3. It then probes this hash table with the MCVs from the other side.

The threshold of 200 was chosen because hashing has a startup cost (allocating memory, computing hashes). For smaller lists, the simple loop is actually faster. But once you cross that threshold, the hash table wins.

This transforms the complexity from O(N^2) to O(N), making the estimation step virtually instantaneous even with the largest statistics targets.

Let's Benchmark

I wanted to verify this myself, so I set up a worst-case scenario: two tables with 3 million rows each, but only 10,000 distinct values, and I cranked the statistics target to the maximum (10,000) to force a massive MCV list.

Setup


CREATE TABLE t1 (x int);
CREATE TABLE t2 (x int);

-- Scenario A: 3M rows, Stats 10000
INSERT INTO t1 SELECT x % 10000 FROM generate_series(1, 3000000) x;
INSERT INTO t2 SELECT x % 10000 FROM generate_series(1, 3000000) x;

-- Maximize statistics target
ALTER TABLE t1 ALTER COLUMN x SET STATISTICS 10000;
ALTER TABLE t2 ALTER COLUMN x SET STATISTICS 10000;

ANALYZE t1;
ANALYZE t2;
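
Before benchmarking, it's worth confirming the MCV lists actually reached the target (a sketch; the double cast is the usual trick for peeking into pg_stats' anyarray columns):

# Should report n_mcv = 10000 for both tables
psql -c "SELECT tablename,
                array_length(most_common_vals::text::text[], 1) AS n_mcv
         FROM pg_stats
         WHERE tablename IN ('t1', 't2') AND attname = 'x';"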

Results

I ran the following query 10 times on both versions, interleaved to ensure fairness and used median for the results:


EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM t1 JOIN t2 ON t1.x = t2.x;

Scenario A: High Stats (10k MCVs)

  • Rows: 3 Million
  • Statistics Target: 10,000
  • MCV Count: 10,000
  • Before (Postgres 18): ~27.8 ms
  • After (Postgres 19): ~1.75 ms
  • Speedup: ~16x

Scenario B: Medium Stats (1k MCVs)

  • Rows: 300,000
  • Statistics Target: 1,000
  • MCV Count: 1,000
  • Before (Postgres 18): ~0.85 ms
  • After (Postgres 19): ~0.60 ms
  • Speedup: ~1.4x (40% speed-up)

Scenario C: Default Stats (100 MCVs)

  • Rows: 30,000
  • Statistics Target: 100 (Default)
  • MCV Count: 100
  • Before (Postgres 18): ~0.40 ms
  • After (Postgres 19): ~0.43 ms
  • Speedup: None (Optimization correctly skipped)

Since the total MCV count in Scenario C (100 + 100 = 200) did not exceed the 200 threshold, Postgres 19 correctly chose the simpler loop, avoiding the overhead of building a hash table. This confirms that the patch is smart enough to only kick in when it matters.

The "Quadratic Curve" in Action

To visualize the O(N^2) vs O(N) difference, I ran a benchmark across a wide range of statistics targets (from 10 to 10,000). In each test, the number of rows was set to 300x the statistics target to ensure the MCV list was fully populated and relevant.

Stats Target | PG 18 (ms) | PG 19 (ms) | Speedup
-------------+------------+------------+--------
10           | 0.40       | 0.38       | -
100          | 0.40       | 0.45       | -
200          | 0.45       | 0.43       | -
500          | 0.53       | 0.52       | -
1000         | 0.85       | 0.63       | 1.3x
2000         | 1.73       | 0.68       | 2.5x
5000         | 7.54       | 1.14       | 6.6x
8000         | 17.97      | 1.57       | 11.4x
10000        | 27.56      | 1.92       | 14.3x

As you can see, up to a target of 500, the difference is negligible (and the optimization might not even kick in). But as the target grows, the quadratic cost of the old method explodes, while the new hash-based method scales linearly and remains extremely fast.

This patch is a classic example of modernizing legacy assumptions. The code written 20 years ago assumed MCV lists would be short (in all fairness the default was 100 for a long time). Today's hardware and data requirements have pushed those boundaries, and Postgres is evolving to meet them.

Thanks to Ilia Evdokimov for the patch, David Geier for co-authoring, and Tom Lane for the review.

Note: This feature is currently in the master branch for Postgres 19. As with all development features, it is subject to change before the final release.

Postgres May 2026 Security Update: 11 CVEs, All Versions Affected

It's that time again. The Postgres v18.4 release (along with 17.10, 16.14, 15.18, 14.23) has dropped some serious...