robins.in: June 2026

23 Jun 2026

AI Ping-Pong with Polly (and Omnigent)

Sometimes the most entertaining way to test a new orchestration tool isn't to run a standard benchmark, but to lock two state-of-the-art AI models in a virtual room and make them debate historical facts. This week, I got to do exactly that using an interesting new tool called polly within the omniagent ecosystem.

The idea was simple: set up a relay where Claude and Gemini (specifically the Antigravity agent) pass a task back and forth, validating and revalidating each other's work.

However, getting there wasn't entirely smooth sailing. I see recent polly changes that seem to be working on Antigravity support but I presume that it is just heavily in beta and under active development - Simply because I couldn't get it to work cleanly on my local test setup out of the box. In a somewhat ironic twist of AI-assisted development, I ended up having Claude help me debug my environment and write the custom harness around Antigravity to get the toolchain working.

Once the harness was solid, the fun began. I decided to test their interaction on a simple prompt asking for a list of India's Prime Ministers (based on their first stint). I then pass that list back-n-forth between the two AI engines for cross-validation and updation. To avoid an endless loop, I limited the number of iterations to exactly two.

Here is the exact prompt and how it kicked off:



 ❯ ask antigravity to generate the list of prime ministers of india in the
 chronological order (of their first stint) and then pass the list to
 claude for validation and an updated list of the same. and pass that onto
 to antigravity for revalidation with the same request. Do this round trip
 only 2 times and display the resultant list.

   · thinking…
◆ I'll run this as a relay: antigravity generates → 
    Claude validates/updates → antigravity revalidates, repeated for 
    2 round trips. Starting with antigravity's initial list.

◆ Antigravity produced the initial list (15 entries). Round trip 1 —
   passing to Claude for validation + an updated list:

◆ Claude is validating now. Waiting for its result.

.
.
.

Watching the two models interact was fascinating. Antigravity would confidently output the initial sequence, Claude would meticulously review it (often catching nuance around interim Prime Ministers like Gulzarilal Nanda), and then Antigravity would process Claude's corrections. It functioned like a highly pedantic game of ping-pong.

While this specific test was just good fun, the real motivation behind integrating polly is much more practical. I've been working on a complex PostgreSQL B-tree index compaction patch to solve the index bloat problem that VACUUM leaves behind. Developing database internals is notoriously tricky, and I've been relying on Claude (and now Antigravity) to help write and re-write and review the code heavily.

Having two AI agents look at the developing patch from different angles has been invaluable. They continually find subtle bugs, naming inconsistencies, edge-case production gotchas, and deviations from C best practices. Instead of manually passing code snippets back and forth between different chat windows, the idea here is to see if polly (or a similar tool) can allow me to seamlessly automate this cross-validation part without my intervention.

Now that the Antigravity harness is patched up and working, I've proven that this dual-engine review process can work, and be an automated part of my Postgres hacking workflow. That said, because Google's API billing operates separately from the standard Google AI Pro account, this specific Claude-and-Gemini pairing was mostly a test run. For the near future, I may just end up cross-validating between Claude and Codex (or a similar alternative) to keep the pipeline running smoothly and cost-effectively.

For the curious, here is the Relay trace of that conversation, and the final result.



   Relay summary

    Hop               Agent                 Result
    ─────────────────────────────────────────────────────
    Generate          antigravity (Gemini)  15-entry list
    RT1 → validate    Claude                two changes
    RT1 → revalidate  antigravity           no changes
    RT2 → validate    Claude                no changes
    RT2 → revalidate  antigravity           no changes


   Resultant list — Prime Ministers of India (chronological by start of first term)


     1 Jawaharlal Nehru — 1947
     2 Gulzarilal Nanda (Acting) — 1964
     3 Lal Bahadur Shastri — 1964
     4 Indira Gandhi — 1966
     5 Morarji Desai — 1977
     6 Charan Singh — 1979
     7 Rajiv Gandhi — 1984
     8 Vishwanath Pratap Singh — 1989
     9 Chandra Shekhar — 1990
    10 P. V. Narasimha Rao — 1991
    11 Atal Bihari Vajpayee — 1996
    12 H. D. Deve Gowda — 1996
    13 Inder Kumar Gujral — 1997
    14 Manmohan Singh — 2004
    15 Narendra Modi — 2014

   One note worth flagging (the cross-checks agreed, but for completeness): this
   list counts Gulzarilal Nanda once at his first acting stint (1964). He also
   served a second acting term in 1966 after Shastri's death — both models
   consistently treated 'first stint' as the ordering key, so he
   appears a single time at 1964, which matches your request.

Keep having fun !!

19 Jun 2026

Index Bloat - The Problem VACUUM Leaves Behind

This is an opening post (of a series) about my attempt at automatic B-tree index compaction integrated into autovacuum.

Index bloat is one of the most common performance problems in production PostgreSQL deployments — and one of the most misunderstood. People assume that a fully vacuumed table has a healthy index - which isn't true. There is a significant gap between an index that has been cleaned and one that has been compacted, and although Postgres is closing that gap (v13 enabled index deduplication & v14 brought bottom-up index deletion), there is a lot more to do here.

This post explains what index bloat actually is, why standard VACUUM cannot fix it, and why the existing manual tools — while useful — are not a substitute for continuous automated maintenance.

What Index Bloat Is (and Isn't)

After a large bulk delete — say, removing 80% of the rows in a table — autovacuum will dutifully remove all 80% of the dead index entries. The index will look clean. But the leaf pages are still there, mostly empty. A table that went from 1 million to 200,000 rows may still have an index file that occupies roughly the same space as before (a leaf page that held 100 index entries, and now holds 20, still occupies 8 kB on disk). Only the entries are gone; the pages remain.

The practical consequence is stark. A fully vacuumed B-tree index can have hundreds of leaf pages each filled to just 20% capacity. A range scan covering 10% of the keyspace now could end up reading five times as many pages as a compact index would require. Every buffer-cache miss costs the same whether the page carries 10 tuples or 100.

Why VACUUM Can't Merge Pages

PostgreSQL's autovacuum already visits every B-tree index after vacuuming a table. It calls btbulkdelete to remove dead index entries, then btvacuumcleanup to perform index-level tidying. What it cannot do is merge two partially-filled leaf pages into one and free the other.

Merging requires moving index entries from one page (the source) to another (the destination), then marking the source as deleted and unlinking it from the sibling chain. The B-tree code already knows how to do this: _bt_pagedel handles the case where a leaf page becomes completely empty after all its entries are removed. But VACUUM never merges pages that are merely sparse. A page at 20% fill is considered live and is left alone indefinitely.

Why can't VACUUM just merge sparse pages the same way it handles empty ones?

The obstacle is locking. B-tree index scans descend to a leaf page, pin it in memory, and then walk right by following btpo_next links to adjacent sibling pages. If a merge moves entries from a right page to a left page while a concurrent scan has already pinned the right page but not yet finished reading from it, the scan will miss the moved entries. That is a correctness bug.

Page-level locks — shared for reading, exclusive for writing — are not enough to prevent this. A concurrent scan can pin the right page before a merge begins, but after the check that would have prevented the merge.

So the option is either (a) Carefully coordinating page-level locks across all concurrent access-paths, OR, (b) Taking a relation-level lock, held across the entire merge operation, which can then guarantee that no scan is mid-flight on the page being moved. This series of blog-posts tries to revisit this locking dilemma (that makes index compaction difficult today) and see what else is possible to do to help the average application workload.

The Existing Solutions and Their Limitations

The standard tool for recovering bloated index space is REINDEX — or its online variant, REINDEX CONCURRENTLY. A full REINDEX rebuilds the index from scratch and produces a perfectly packed result, but it holds an AccessExclusiveLock on the relation for the entire duration of the build. All reads and writes are blocked. On a large production table that can mean minutes of downtime.

REINDEX CONCURRENTLY avoids that lock but still requires building a complete shadow copy of the index: a second, full-sized index is written while the original stays online. This is expensive in I/O and CPU, can run for hours on large indexes, may leave an invalid index (on cancellation) and requires 2x disk-space.

VACUUM FULL and extensions like pg_repack have the same fundamental character. pg_repack is more surgical than VACUUM FULL — it rebuilds the table and its indexes while the original stays accessible, and swaps them in atomically — but it still builds a complete copy before anything is reclaimed.

PostgreSQL 19 (barring last minute rollbacks) introduces a built-in REPACK command that performs online table and index repacking without requiring an external extension. This is genuine progress: it eliminates the pg_repack dependency, integrates with standard tooling, and is maintained by the core team. But the fundamental character of the operation is unchanged. REPACK rebuilds from scratch, and it is a scheduled one-shot command.

The common thread across all of these tools: a DBA has to notice the bloat — typically through a monitoring query or a performance regression — schedule a maintenance window, and issue the command. In practice, indexes frequently stay bloated for months between maintenance events. A table that sees steady churn (a history log, an events queue, a session store) will re-accumulate bloat between scheduled maintenance runs, sometimes faster than it can be addressed manually.

What production systems actually need is continuous, incremental compaction that runs in the background — the way autovacuum continuously reclaims heap space — without ever taking a lock that blocks application traffic.

What This Series Covers

The remaining posts in this series measure what continuous automated compaction delivers across 19 benchmark scenarios: static bulk deletes, steady-state churn, HOT and non-HOT update workloads, UUID-keyed indexes, concurrent readers under load, sawtooth growth patterns, TPS degradation under accumulating bloat, and streaming replication. The scenarios start simple and progressively test harder conditions.

The final post in this series (coming soon) covers the algorithm itself: the multiple failed design attempts, the core insight, the locking design that keeps application traffic unblocked, the implementation details, and operational considerations.

This is personal research, done on my own time and not connected to or endorsed by my employer. This is a relatively complex topic and I am sharing my findings as I explore this space.

15 Jun 2026

Why Postgres Doesn't Have synchronous_commit=remote_receive?

In distributed database environments, balancing durability and performance is a constant tug-of-war. PostgreSQL’s synchronous_commit parameter sits at the heart of this, giving administrators a dial to choose exactly when a COMMIT returns success to the client.

The idea of remote_receive was born from a simple question: does skipping the standby's disk write yield a measurable, real-world performance benefit? By waiting only for WAL bytes to reach the standby's memory, could we get a meaningful boost over remote_write? I set out to implement and benchmark this feature to find out.

What followed was a journey through network latency, OS page caches, CPU scheduler thrashing, and benchmarking noise. Here is the breakdown of the implementation, the tests, the initial anomalies, and the final results.

1. The Feature: What is `remote_receive`?

Before this branch, PostgreSQL offered four primary synchronous commit modes:

off: Fully asynchronous. (Fastest, least safe)
local: Waits for local disk flush on the primary.
remote_write: Waits for the standby to write the WAL to its OS buffer cache (pwrite).
remote_apply: Waits for the standby to fully replay the WAL. (Slowest, most safe)

remote_receive sits directly between local and remote_write. In this mode, the primary guarantees that the WAL bytes have physically arrived at the standby's walreceiver process buffer. It does not wait for the standby to call pwrite().

The Hypothesis: By completely bypassing the standby's disk I/O, remote_receive should deliver lower latency and higher throughput than remote_write, especially on replica hardware with slow disks.

Implementation Details

To build this, I had to modify both the standby and the primary:

Standby Status Update: I modified the 34-byte wire message that the standby sends to the primary, adding a new 8-byte receivePtr (creating a 42-byte message, backward compatible).
Early Replies: I modified walreceiver.c to send a reply message immediately upon receiving a WAL chunk in memory, before the XLogWalRcvWrite() call executes the pg_pwrite.
Primary Wait Logic: I updated syncrep.c and walsender.c to track the SYNC_REP_WAIT_RECEIVE wait queue, releasing waiting backends as soon as the standby's receivePtr advanced.

2. Scenario 1: The SSD Baseline (Fast Primary, Fast Replica)

To validate the code, I first set up a baseline test using two fast machines on a gigabit LAN.

The Servers:

Primary: lenovo (Intel Core i7-12700 12C/20T, 48GB RAM, NVMe SSD)
Replica: camry (Intel Core i7-4770 4C/8T, 24GB RAM, SATA SSD)

I ran pgbench (Scale 10, 4 clients) for 30 seconds across the different modes.

The Result:

remote_write: 3,944 TPS (Median)
remote_receive: 3,946 TPS (Median)

The performance was virtually identical (a 0.06% difference). Why didn't remote_receive pull ahead?

The Reality of pwrite(): On modern operating systems with free RAM, a standard pwrite() to a buffered file does not write to physical disk immediately. It copies the data into the OS page cache (essentially a memory copy taking mere microseconds), leaving the kernel to flush dirty pages asynchronously.

With network round-trip time (RTT) on a gigabit LAN between 0.2ms and 1.0ms, the 5 microseconds saved by bypassing pwrite() is completely dwarfed by network latency. This makes remote_write and remote_receive perform nearly identically in typical conditions.

This fundamental reality explains why core PostgreSQL developers historically questioned the return on investment (RoI) of a memory-only receive mode. Because a standard pwrite() to the OS page cache is already a RAM-speed operation (taking mere microseconds), remote_write is practically as fast as any network-receive-only mode under normal conditions, but with a major durability advantage: it survives a PostgreSQL process crash on the standby (as long as the OS remains running). Bypassing it would reduce durability without providing any real-world performance benefit in typical conditions. In pgsql-hackers discussions—such as the thread on Measuring Replay Lag where distinct write_lag and flush_lag tracking was introduced—it is clear that the network round-trip time dominates the replication pipeline. To find a measurable benefit for a receive-only mode, I needed a replica where disk I/O was a severe enough bottleneck to cause page cache pressure and slow down the pwrite() call itself.

3. Scenario 2: The HDD Challenge (Asymmetric Hardware)

For the next test, I replaced the fast camry replica with a much weaker machine.

The Servers:

Primary: lenovo (i7-12700, SSD)
Replica: mac (Intel Core i5-2520M 2C/4T, 16GB RAM, 5400RPM HDD)

With a mechanical hard drive and a slow dual-core processor, I expected remote_receive to outpace remote_write. I ran three 30-second runs per mode, but the initial results were unexpected:

The First (Anomalous) Result:

remote_write: 203.6 TPS (Median)
remote_receive: 179.0 TPS (Median)

remote_write was ~20 TPS faster than remote_receive. Tracing the walreceiver and walsender loops ruled out code bugs; instead, the bottleneck came down to four factors:

The OS Cache Illusion: Even on a slow HDD, pwrite() still writes to RAM. The mechanical disk's extreme latency only hits during fsync or when the page cache fills up, meaning the receive-only advantage remained small.
CPU/Scheduler Thrashing: By sending an "early reply" before writing to disk, remote_receive generates twice as many TCP reply packets (one for receive, one later for flush). On the replica's older dual-core CPU, processing this packet flood alongside WAL replay caused high context-switching overhead.
Flow Control: remote_write acted as a natural flow-control mechanism. By waiting to write before replying, it throttled the primary slightly, keeping the replica's CPU out of a thrashing state.
Statistical Noise: The remote_write runs ranged from 177 to 211 TPS (a 19% spread). A 3-run, 30-second test was simply too noisy to yield a reliable median.

Aside: The Raspberry Pi 4 Attempt Before settling on the Mac Mini HDD, I attempted to use a Raspberry Pi 4 (pi4 — Cortex-A72 4C/4T, 4GB RAM, SD card / USB storage) as the slow replica. However, the Pi 4 was a poor fit for this benchmark. The main issue wasn't simply that the CPU maxed out, but that the low-power ARM CPU could not keep pace with the primary's faster rate of WAL generation. This lag cascaded into secondary issues—such as rapidly mounting replication lag, TCP buffer queues, and process starvation—which completely dominated the environment and masked any storage-level performance differences.

4. The Final Test: Eliminating the Noise

In the initial runs, the slow mechanical disk combined with standard kernel buffering created a massive source of noise: the sustained background I/O writes from a just-finished test were still flushing to disk when the next test began. Although an LSN cross-check was already in place (waiting for the replica's replay_lsn to catch up to the primary's current WAL LSN), this only verified database-level catch-up, not physical disk queue clearance. The residual write queue in the OS cache severely penalized the subsequent run, creating artificial variance. To isolate the actual replication performance and eliminate this noise, I overhauled the benchmarking methodology:

Interleaved Runs: I interleaved executions (Write, Receive, Write, Receive...) to average out temporal background OS tasks and thermal states.
Longer Runs and More Iterations: I ran 10 iterations per mode at 60 seconds per run (20 runs total).
Aggressive Cache Flushing: I ran sync on both the primary and replica between runs, followed by a 30-second sleep, to flush the OS page cache to physical disk platters and guarantee clean disk queues.

The Final Results:

Mode	Median TPS	Mean TPS	Median Latency
`remote_write`	201.58	200.74	20.663 ms
`remote_receive`	211.44	206.85	19.434 ms

The Results

With the noise eliminated and disk queues flushed, the true behavior emerged: remote_receive outperformed remote_write by ~10 TPS (~4.9% gain) at the median and ~6 TPS (~3.0% gain) at the mean. The small gap is expected: bypassing a RAM-buffered pwrite() on the replica yields microsecond-level gains, while network round-trip time remains the dominant factor.

Conclusion

The remote_receive implementation successfully introduces a granular durability option that guarantees WAL has crossed the network into the standby's memory before committing.

This exercise highlighted a key rule of database benchmarking: the OS page cache masks physical disk latency until it fills up. To accurately benchmark high-variance storage, one must use interleaved runs, higher iteration counts, and rigorous OS-level cache flushes between tests. Otherwise, you are measuring cache behavior and statistical noise rather than raw database throughput.

A Note on the Development Process & Resources: While I dabble in C and C++ from time to time, development at this level within PostgreSQL internals would not have been possible without AI assistance. Additionally, all resources for this project—including the Claude subscription, development machines, laptop, and time—were entirely personal and completely independent of my employer.