Showing posts with label opensource. Show all posts
Showing posts with label opensource. Show all posts

23 Nov 2025

PostgreSQL Buildfarm Members: A status update

The PostgreSQL Buildfarm is a global network of machines that continuously test PostgreSQL across a wide range of operating systems, architectures, compilers, and branches. Over the past few years, I have created and maintained several buildfarm members, each with its own quirks and strengths. In this post, I’ll share a status-update working on the following animals: alligator, dodo, woodpecker, leafhopper, massasauga, parula, and snakefly.

What is the Buildfarm?

The Buildfarm is essential for PostgreSQL development. It helps catch platform-specific bugs early, ensures code quality, and provides confidence that new changes work everywhere. Each member reports results for multiple branches (like master, REL_18_STABLE, etc.), using different OSes, compilers, and hardware.

About the Architectures

The Open Hardware Frontier: RISC-V

RISC-V is an open standard instruction set architecture (ISA) and unlike most other ISAs, RISC-V is provided under open source licenses that do not require fees to use.

  • ovenbird is my first foray into this architecture, running on a VisionFive 2 board.
  • It (hopefully) represents the future of open hardware, and ensuring PostgreSQL compiles and runs correctly on it is a long-term investment in the open-source ecosystem.

Bridging Windows and Linux: WSL2

Windows Subsystem for Linux (WSL) lets developers run a GNU/Linux environment -- including most command-line tools, utilities, and applications -- directly on Windows, unmodified, without the overhead of a traditional virtual machine or dualboot setup.

  • woodpecker runs inside a Debian container on WSL2.
  • This setup is crucial for verifying that PostgreSQL behaves correctly in this increasingly popular development environment, which bridges the gap between Windows and Linux.

Small Scale, Big Impact: Raspberry Pi

Raspberry Pi revolutionized low-cost computing and is a fantastic platform for edge cases (pun intended) :)

  • dodo runs on a Raspberry Pi 4 Model B.
  • It helps identify performance regressions and race conditions that might be masked by faster hardware. It also ensures PostgreSQL remains viable for low-powered, IoT and edge computing use cases.

The Rise of ARM in the Cloud: Graviton

Several of the buildfarm animals I’ve created run on the Graviton processors. Graviton is Amazon’s custom ARM-based CPU family, designed for high performance and energy efficiency in AWS cloud environments.

  • Graviton1 (first generation) was introduced in 2018, bringing ARM64 to AWS EC2.
  • Graviton2 (second generation) launched in 2020, offering major improvements in performance and scalability.
  • Graviton3 (third generation) arrived in 2022, further boosting compute, memory bandwidth, and energy efficiency—making it ideal for demanding workloads like database regression testing.
  • Graviton4 (fourth generation) is the latest, offering even greater performance and efficiency for cloud-native workloads. The buildfarm animal 'leafhopper' is one of the first to test PostgreSQL on Graviton4.

Testing PostgreSQL on these platforms helps ensure the database runs smoothly on modern cloud hardware and takes advantage of ARM’s growing ecosystem.

Disclosure: The Graviton machines are provided by my employer. All other machines (including the WSL2, RISC-V, and Raspberry Pi instances) are my personal machines.

Meet the Buildfarm Animals

Here’s a quick overview of the machines I have created and recently worked on:

alligator

  • OS: Ubuntu 24.04 LTS
  • Arch: x86_64
  • Compiler: gcc experimental (nightly build)
  • Branches: master, REL_18_STABLE, REL_17_STABLE, REL_16_STABLE, REL_15_STABLE, REL_14_STABLE, REL_13_STABLE
  • Notes: Tracks the latest GCC changes, often finds compiler regressions before anyone else.

dodo

  • OS: Raspbian GNU/Linux 10
  • Arch: armv7l
  • Compiler: gcc experimental (nightly build)
  • Branches: master, REL_18_STABLE, REL_17_STABLE, REL_16_STABLE, REL_15_STABLE, REL_14_STABLE, REL_13_STABLE
  • Notes: ARM platform, useful for catching issues on lower-powered hardware.

woodpecker

  • OS: Debian/WSL2@win11 12 (bookworm)
  • Arch: x86_64
  • Compiler: gcc 12.2.0
  • Branches: master, REL_18_STABLE, REL_17_STABLE, REL_16_STABLE, REL_15_STABLE, REL_14_STABLE, REL_13_STABLE
  • Notes: Runs inside WSL2 on Windows 11, great for testing integration with Windows environments.

leafhopper

  • OS: Amazon Linux 2023
  • Arch: aarch64/graviton4/r8g.2xl
  • Compiler: gcc experimental (hourly build)
  • Branches: master, REL_18_STABLE, REL_17_STABLE, REL_16_STABLE, REL_15_STABLE, REL_14_STABLE, REL_13_STABLE
  • Notes: Created and managed in a work-based environment; leafhopper is one of the first buildfarm animals testing PostgreSQL on Graviton4 hardware.

massasauga

  • OS: Amazon Linux 2
  • Arch: aarch64 (Graviton1)
  • Compiler: gcc experimental (nightly build)
  • Branches: master, REL_18_STABLE, REL_17_STABLE, REL_16_STABLE, REL_15_STABLE, REL_14_STABLE, REL_13_STABLE
  • Notes: Created and managed in a work-based environment; Graviton1 machine—one of the earliest ARM64 regression testers in the buildfarm, still running reliably after several years.

parula

  • OS: Amazon Linux 2 (AL2) / Graviton3
  • Arch: aarch64/Graviton3/c7g.2xl
  • Compiler: gcc experimental (nightly build)
  • Branches: master, REL_18_STABLE, REL_17_STABLE, REL_16_STABLE, REL_15_STABLE, REL_14_STABLE, REL_13_STABLE
  • Notes: Created and managed in a work-based environment; focuses on the third generation of AWS Graviton hardware, useful for performance and compatibility.

snakefly

  • OS: AmazonLinux@Graviton2 AL2
  • Arch: aarch64 (Graviton2)
  • Compiler: gcc experimental (nightly build)
  • Branches: master, REL_18_STABLE, REL_17_STABLE, REL_16_STABLE, REL_15_STABLE, REL_14_STABLE, REL_13_STABLE
  • Notes: Created and managed in a work-based environment; Graviton2-based member, helps ensure ARM64 stability across AWS generations.

ovenbird (newest member)

  • OS: Ubuntu 24.04.3 LTS
  • Arch: riscv64
  • Compiler: gcc 13.3.0
  • Branches: master, REL_18_STABLE, REL_17_STABLE, REL_16_STABLE, REL_15_STABLE, REL_14_STABLE, REL_13_STABLE
  • Notes: The newest addition to the family, ovenbird brings riscv64 architecture to the buildfarm, helping ensure PostgreSQL is tested on cutting-edge open hardware.

Challenges and Rewards

Managing these buildfarm animals means keeping up with OS upgrades, compiler changes, hardware failures, and PostgreSQL branch updates. Some of these machines are especially aggressive about GCC: they check for updates from the GCC git repository every few hours, recompile a fresh GCC, and use it for the next buildfarm run. This helps catch compiler regressions and compatibility issues very early.

If you want to read more about how these GCC compiles work and see the open source repository, check out my blog post: Compiling Latest GCC to Test More.

Some of these machines have been running for 3-4 years, and their logs are a treasure trove for debugging tricky platform-specific issues. The diversity of hardware and software helps the PostgreSQL community maintain its reputation for reliability and portability.

Testing with the latest GCC is especially rewarding: it ensures that upstream GCC changes are in tandem with the expectations of the PostgreSQL community, and that PostgreSQL continues to compile and pass tests without surprises. A good example is an upstream GCC bug that was found, reported, and fixed—making sure that no GCC changes adversely affect PostgreSQL in the long run. Read more about this incident here: PostgreSQL mailing list discussion of a GCC bug.

Here's another email thread that exemplifies why testing gcc experimental is helpful in ensuring that PostgreSQL compiles and tests stay green: PostgreSQL mailing list - GCC experimental thread.

However, it is also important to note that aggressive testing of GCC HEAD needs to be balanced against the time of PostgreSQL developers. The current buildfarm system does not explicitly distinguish between "production" and "bleeding edge" machines, meaning failures on experimental setups can sometimes be distracting. As discussed in this mailing list thread, there is an ongoing conversation about how to best handle these "platform not believed stable" scenarios to ensure that transient failures on experimental toolchains don't unnecessarily burden the community.

Speaking of new architectures, a few months back I wrote about [Testing PostgreSQL on Debian/Hurd](https://www.thatguyfromdelhi.com/2025/08/testing-postgresql-on-debianhurd.html) and planned to add a Hurd machine to the buildfarm. It looks like I've been beaten to the punch! A new member, [fruitcrow](https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=fruitcrow&br=master), is already up and running to test PostgreSQL on GNU/Hurd. This is fantastic news—having "competition" in adding diverse buildfarm members is exactly what we want. It shows that more people recognize that a wide array of test environments leads to a more stable PostgreSQL.

Final Thoughts

If you’re interested in contributing to PostgreSQL, running a buildfarm animal is a great way to help. It’s a hands-on way to learn about PostgreSQL internals, compilers, and operating systems, and it’s rewarding to see your machine’s name in the global test results.

14 Aug 2025

Testing PostgreSQL on Debian/Hurd: A Windows + QEMU Adventure

Curiosity often leads to the most interesting technical adventures. This time, I decided to explore something off the beaten path: running Debian GNU/Hurd inside a virtual machine on my Windows 11 host and compiling PostgreSQL from source.

This post is part 1 of a multi-part series documenting the process, challenges, and discoveries along the way. Future parts will dive deeper into advanced topics, automation, and ongoing compatibility work—so if you're interested in PostgreSQL, alternative operating systems, or open source testing, stay tuned!

What is Debian?
Debian is one of the oldest and most respected Linux distributions, known for its stability, vast software repositories, and commitment to free software principles. While most people associate Debian with the Linux kernel, it’s actually a complete operating system that can run on different kernels.

What is GNU/Hurd?
GNU/Hurd is an alternative kernel developed by the GNU Project. Unlike Linux, GNU/Hurd is built on a microkernel architecture (specifically GNU Mach), aiming for greater modularity and flexibility. While GNU/Hurd is still experimental and not as mature or widely used as Linux, it represents a fascinating approach to operating system design.

Debian GNU/Hurd combines the familiar Debian userland (tools, package management, etc.) with the GNU/Hurd kernel, offering a unique environment for open source enthusiasts and OS tinkerers.

My goal for this experiment was to see how far I could get with a modern database stack—specifically, compiling and running PostgreSQL—on this unusual platform.



Setting Up the VM

Instead of the CD image, I used the pre-built disk image available here. After downloading and extracting the .img file, I launched the VM with QEMU using the following command:

qemu-system-x86_64.exe -machine type=pc,accel=whpx,kernel-irqchip=off -boot d -m 4096 -usb -display default,show-cursor=on -drive file=".\debian-hurd-i386-20250807.img",cache=writeback

Explanation of the command:

  • qemu-system-x86_64.exe: Runs QEMU for 64-bit x86 systems (works for 32-bit guests too).
  • -machine type=pc,accel=whpx,kernel-irqchip=off: Specifies a PC-type machine, enables Windows Hypervisor Platform acceleration (WHPX), and disables kernel IRQ chip emulation for compatibility.
  • -boot d: Boots from the first hard disk.
  • -m 4096: Allocates 4GB of RAM to the VM.
  • -usb: Enables USB support.
  • -display default,show-cursor=on: Uses the default display and ensures the mouse cursor is visible.
  • -drive file=".\debian-hurd-i386-20250807.img",cache=writeback: Uses the extracted Hurd disk image as the hard drive and enables writeback caching for better disk performance.

This boots directly into the installed Debian/Hurd system with improved performance and usability on a Windows 11 host.

Preparing to Build PostgreSQL

Debian/hurd is minimal out of the box, so the first step was to install all the build tools and libraries required for compiling PostgreSQL:

sudo apt-get update
sudo apt-get install build-essential git libxml2-dev libxslt-dev autotools-dev automake libreadline-dev zlib1g-dev bison flex libssl-dev libpq-dev ccache

This command installs the compiler, linker, version control tools, XML and SSL libraries, autotools, and all other dependencies PostgreSQL may need for a successful build and test cycle.

Downloading and Compiling PostgreSQL

Instead of downloading a release tarball, I cloned the official PostgreSQL git repository and compiled the master branch:

git clone https://github.com/postgres/postgres.git
cd postgres
./configure --prefix=~/proj/localpg
make
make install

This approach ensures you're building the latest development version of PostgreSQL directly from source, and installs it locally to your user's ~/proj/localpg directory.

Setting Up the Database Cluster

PostgreSQL needs a data directory (cluster) to store its databases. Since the installation was local to my user, I simply initialized the cluster and started the server using the full path to the binaries (since they're not in my PATH):

~/proj/localpg/bin/initdb -D ~/proj/localpg/pgdata
~/proj/localpg/bin/pg_ctl -D ~/proj/localpg/pgdata -l logfile start

Connecting and Creating a Table

With the server running, I connected to the database and created a sample table:

~/proj/localpg/bin/psql -d postgres

Inside psql:

CREATE TABLE test_table (id SERIAL PRIMARY KEY, name TEXT);
INSERT INTO test_table (name) VALUES ('Hello from Debian/Hurd!');
SELECT * FROM test_table;

Example output:

CREATE TABLE
INSERT 0 1
 id |         name         
----+----------------------
  1 | Hello from Debian/Hurd!
(1 row)

Running the Test Suite

To ensure the build was solid, I went back to the source directory and ran:

cd ~/postgres
make check

This runs PostgreSQL's regression tests, verifying that the core features work as expected—even on Hurd. This ran mostly fine (except for a few tests that failed - more to be researched on that failure).

Quick QEMU Tip

When working with QEMU, remember that Ctrl-Alt-G is your friend—it releases the mouse and keyboard from the VM window, making it much easier to switch back to your host system.

Adding a Separate Volume for More Disk Space

The base Debian/Hurd image is quite small and can easily run out of space, especially when compiling large projects or running make check. I frequently hit disk full errors during testing.

Solution:

  1. Shutdown the VM.

  2. Resize the disk image:

    qemu-img resize debian-hurd-i386-20250807.img +10G
    

    This adds 10GB to the existing disk image.

  3. Restart the VM.

  4. Create a new partition:

    • Use fdisk /dev/hd0 (or the appropriate device) to create a new partition in the extra space.
  5. Format the new partition:

    mkfs.ext4 /dev/hd0s3
    

    (Note: On my setup, the original root partition was /dev/hd0s2, so the new partition created for extra space was /dev/hd0s3. Adjust the device name as needed for your configuration.)

    Although the root volume is of ext2 type (!!!), Debian/Hurd works fine with ext4—so feel free to use ext4 for the new partition.

  6. Mount the new volume:

    mkdir -p /mnt/newvol
    mount /dev/hd0s3 /mnt/newvol
    
  7. Grant non-root user access:

    • As root, change ownership:
      chown robins:robins /mnt/newvol
      
    • Now your non-root user (e.g., robins) can use /mnt/newvol for compiling PostgreSQL and running make check without running out of disk space.

Why use a non-root user for PostgreSQL? PostgreSQL is designed to run as a non-root user for security reasons. Running the database server or its tests as root can expose your system to unnecessary risks and may even cause certain operations to fail. Always use a dedicated non-root user for installation, testing, and day-to-day database operations.

This approach made it possible to complete the build and test cycle without disk space issues.

Final Thoughts

Running Debian/Hurd in a VM on Windows 11 was surprisingly smooth, though some packages and features are less mature than on Linux. Compiling PostgreSQL from scratch was a great way to explore the system's capabilities and compatibility. If you're looking for a fun, geeky weekend project, give Debian/Hurd a try!

Next Steps & What's Still Pending

This is only part 1 of a multi-part series. In future installments, I'll cover:

  • Setting up the PostgreSQL buildfarm for automated testing on Debian/Hurd
  • Deeper investigation into SMP/multi-core support (currently not working)
  • More QEMU optimization and compatibility testing
  • Additional performance tuning and disk management strategies
  • Troubleshooting Perl module installation issues (e.g., LWP::Protocol::https, LWP::Simple, Net::SSLeay), which currently fail to install—more research is needed to understand and resolve these problems.
  • Investigating why make check did not complete successfully (failed on a few tests)—this requires further research.

Some features, like multi-core support, full buildfarm integration, reliable Perl module installation, and passing all PostgreSQL regression tests, are not yet working or fully tested. These will be explored in detail in future posts. Stay tuned!

6 Aug 2025

Pi-hole Part 2: Going Fully Independent with Unbound

After my positive first impressions with Pi-hole, I decided to take the next logical step: eliminating my dependence on external DNS resolvers entirely. While Quad9 served me well, there's something unsettling about routing all my DNS queries through a third party, no matter how trustworthy they appear.

The solution? Unbound - a validating, recursive, caching DNS resolver that can operate completely independently, resolving domain names directly from authoritative sources without relying on upstream DNS providers.

Why Make the Switch?

My motivation goes beyond just privacy paranoia (though that's certainly part of it):

Privacy First: Every DNS query reveals your browsing patterns. Even with Quad9's privacy promises, I prefer keeping that data entirely within my network.

Security Enhancement: Unbound performs DNSSEC validation by default, ensuring the authenticity of DNS responses and protecting against DNS spoofing attacks.

Reduced Latency: While counterintuitive, eliminating the round-trip to external resolvers can actually improve response times for frequently accessed domains through better local caching.

True Independence: No more wondering about logging policies, data retention, or potential government requests to DNS providers.

The Downsides to Consider

Before diving in, it's worth acknowledging the trade-offs:

Increased Complexity: You're now responsible for maintaining and troubleshooting your own DNS infrastructure. When things break, you can't just blame your ISP's DNS servers.

Initial Query Delays: Cold cache scenarios will be slower as Unbound has to walk the entire DNS hierarchy from root servers. First visits to new domains will take noticeably longer.

Resource Usage: While minimal, you're now running an additional service that consumes memory and CPU cycles on your Pi.

Potential Connectivity Issues: If your Pi goes down, your entire network loses DNS resolution. External resolvers provide redundancy that you're giving up.

CDN Sub-optimization: Content delivery networks may not route you to the geographically closest servers, potentially affecting streaming and download performance. Many large DNS providers like Cloudflare and Google use Anycast networks, where the same IP address is announced from multiple geographic locations, automatically routing you to the nearest server. When you run your own recursive resolver, you lose this geographic optimization and might connect to CDN endpoints that are further away.

False Sense of Security: While your DNS queries are now private, it's important to remember that all your actual web traffic (HTTP/HTTPS requests) still flows through your ISP and is visible in their logs. You've privatized the "phone book lookup" but not the actual "conversation" - your ISP can still see which IP addresses you're connecting to, just not the domain names that resolved to them.

Maintenance Overhead: The root hints file should be updated periodically (though it changes infrequently), and you're responsible for keeping your Unbound configuration current and secure.

The Verdict: Benefits Outweigh the Costs

After weighing these trade-offs, the privacy and security benefits still made a compelling case for proceeding. The complexity is manageable for anyone comfortable with basic Linux administration, and the performance impacts are largely theoretical for typical home usage. Most importantly, the peace of mind from knowing exactly where my DNS queries go and how they're handled proved worth the additional overhead.

The Setup Process

Installing and configuring Unbound alongside Pi-hole turned out to be surprisingly straightforward.

Step 1: Install Unbound

sudo apt update
sudo apt install unbound -y

Step 2: Configure Unbound for Security and Performance

I created a custom configuration at /etc/unbound/unbound.conf.d/pi-hole.conf:

sudo nano /etc/unbound/unbound.conf.d/pi-hole.conf

Here's my security-focused configuration:

server:
    # Basic settings
    port: 5335
    do-ip4: yes
    do-ip6: yes
    do-udp: yes
    do-tcp: yes
    
    # Security settings
    trust-anchor-file: "/var/lib/unbound/root.key"
    auto-trust-anchor-file: "/var/lib/unbound/root.key"
    val-clean-additional: yes
    val-permissive-mode: no
    val-log-level: 1
    
    # Privacy settings
    hide-identity: yes
    hide-version: yes
    harden-glue: yes
    harden-dnssec-stripped: yes
    harden-below-nxdomain: yes
    harden-referral-path: yes
    use-caps-for-id: yes
    
    # Performance settings (security-first approach)
    cache-min-ttl: 300
    cache-max-ttl: 86400
    prefetch: yes
    prefetch-key: yes
    
    # Interface settings
    interface: 127.0.0.1
    access-control: 127.0.0.1/32 allow
    access-control: ::1 allow
    
    # Logging
    verbosity: 1
    log-queries: no
    log-replies: no
    
    # Root hints
    root-hints: "/var/lib/unbound/root.hints"

Step 3: Root Hints Management

Root hints are essential files that tell Unbound where to find the DNS root servers - the starting point for all DNS resolution. When installing Unbound via package manager, root hints are included and should be updated automatically through regular system updates.

Note: The root hints file changes infrequently (typically a few times per year when root servers are added, removed, or have IP changes). Rather than manual updates, it's best to keep your system updated through your package manager, which will handle root hints updates appropriately - roughly every 6 months is more than sufficient for most users.

Step 4: Initialize DNSSEC Root Key (Usually Optional)

Modern Unbound packages typically handle DNSSEC root key initialization automatically. However, if you encounter DNSSEC validation errors or want to ensure the key is properly configured, you can initialize it manually:

sudo unbound-anchor -a "/var/lib/unbound/root.key"
sudo chown unbound:unbound /var/lib/unbound/root.key

Note: If this step fails or the file already exists, it's likely already configured correctly by the package installation.

Step 5: Start and Enable Unbound

sudo systemctl enable unbound
sudo systemctl start unbound

Step 6: Test Unbound

Before integrating with Pi-hole, I verified Unbound was working with both valid and invalid domain lookups:

Testing with an existing domain:

robins@pi4:~ $ dig @127.0.0.1 -p 5335 google.com | egrep -w 'status|^google'
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 58713
google.com.             269     IN      A       142.250.70.206

Testing with a non-existent domain:

dig @127.0.0.1 -p 5335 asdfasdsfasfdfsafasd | egrep -w 'status|flags'
robins@pi4:~ $ dig @127.0.0.1 -p 5335 asdfasdsfasfdfsafasd | egrep -w 'status|flags'
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 19691
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
; EDNS: version: 0, flags:; udp: 1232

Both tests confirm Unbound is working correctly. The existing domain test shows successful resolution with NOERROR status, while the non-existent domain test returns NXDOMAIN (Non-eXistent DOMAIN), which is the proper response when a domain doesn't exist. Note the ad flag in the non-existent domain response, indicating that even failed lookups are DNSSEC-validated - Unbound cryptographically verified that this domain truly doesn't exist rather than just accepting an unvalidated negative response.

Step 7: Configure Pi-hole to Use Unbound

In the Pi-hole admin interface:

  1. Navigate to SettingsDNS
  2. Uncheck all upstream DNS servers
  3. Add 127.0.0.1#5335 as a custom DNS server
  4. Enable DNSSEC validation
  5. Save settings

The Results: Worth the Effort

After 24 hours of operation with the new setup, the benefits are clear:

Complete Privacy: All DNS queries now resolve independently without touching external servers (except for the initial root server queries, which reveal no specific domain information).

DNSSEC Validation: Every response is cryptographically verified, providing protection against DNS manipulation.

Improved Cache Efficiency: Unbound's more sophisticated caching algorithm seems to provide better hit rates for our household's browsing patterns.



Performance Impact: Negligible. The Pi 4 handles both Pi-hole and Unbound without breaking a sweat:

robins@pi4:~ $ uptime
 21:22:25 up 2 days,  9:47,  3 users,  load average: 0.00, 0.00, 0.00

Security Considerations

My configuration prioritizes security over raw performance:

  • Conservative TTL settings: Shorter cache times mean more frequent validation
  • Strict DNSSEC validation: No permissive mode that might accept invalid responses
  • Minimal logging: No query logging to preserve privacy even locally
  • Restricted access: Only localhost can query Unbound directly

Final Thoughts

Setting up Unbound with Pi-hole represents the logical conclusion of taking control over your DNS infrastructure. While the privacy and security benefits are the primary motivators, the technical satisfaction of running a completely self-contained DNS resolution system is considerable.

The setup process is more involved than simply pointing Pi-hole at an external resolver, but the configuration is straightforward and well-documented. For anyone concerned about DNS privacy or wanting to reduce external dependencies, this combination provides an excellent solution.

The Pi 4 continues to prove itself as the perfect platform for this type of network infrastructure project - handling both Pi-hole and Unbound with resources to spare. As I mentioned in part 1, even a Pi 2 would likely suffice for most households, making this an accessible upgrade for anyone looking to enhance their home network's privacy and security posture.

30 Jul 2025

Pi-hole First Impressions: Home DNS Caching Done Right

(This is first of a 2 part series, where I explore pi-hole as a DNS caching solution-Here's Part 2)

I've had a Raspberry Pi 4 sitting idle for months, and finally decided to put it to good use by setting up Pi-hole for network-wide DNS caching and ad blocking. After less than a day of operation, I'm genuinely impressed with both the installation process and the results.

The Installation: Surprisingly Smooth

Setting up Pi-hole turned out to be one of the smoothest software installations I've experienced in recent memory. The entire process is handled by a single bash script:


curl -sSL https://install.pi-hole.net | bash


That's it! What impressed me most wasn't just the simplicity, but the thoroughness of the pre-installation checks. Before the script even attempted to install anything, it ran through a comprehensive validation process:

  • Network connectivity tests
  • DNS resolution verification
  • Package manager status checks
  • System requirements validation
  • Port availability confirmation

This rigorous validation gave me real confidence that the installation would succeed. Too often, installation scripts fail halfway through, leaving you with a partially configured mess. Pi-hole's approach of "check everything first, then install" meant that once the green lights were all showing, the installation proceeded flawlessly.

Interestingly, what took the most time during the entire setup wasn't Pi-hole itself, but my initial attempt to use Docker. I'd assumed Docker would be the quick and easy route (as it usually is), but the Pi had some leftover Docker installations that needed cleaning up first - removing old docker, docker.io, and docker-compose packages was a messy affair. Then the Pi seemed to have issues with the Docker installation process itself. In the end, I abandoned the Docker approach and went with the native installation, which ironically turned out to be much faster and cleaner.

One lesser-known tip that made the physical setup even smoother: if you're running a Google mesh network, the routers with both WAN and LAN ports can easily be used as wifi-to-LAN bridges. Rather than connecting the Pi 4 via WiFi, I simply plugged it into the LAN port of one of the mesh routers. This eliminated the need for additional network switches and reduced latency by a few milliseconds by avoiding the WiFi overhead entirely.

Another major convenience: you don't need to update DNS settings on each individual device. A single configuration change at the router level does the job for the entire household. In Google mesh, I simply changed the DNS setting from "Automatic" to "Custom" and set the Pi 4's IP as the primary DNS server. This immediately moved all devices on the network to use Pi-hole without touching any individual device settings. The only exceptions are devices that explicitly force their own DNS (looking at you, certain smart TVs and streaming devices), but those are relatively rare.

The Results: Better Than Expected

After 23 hours of operation protecting our home network, the statistics are eye-opening:



DNS Query Volume: 100,000 requests in 23 hours

  • That's roughly 4,350 DNS queries per hour from our household
  • Averages out to about 72 queries per minute
  • Shows just how chatty modern devices really are

Cache Performance: 70% cache hit rate

  • Pi-hole is successfully caching DNS responses locally
  • Reduces latency for frequently accessed domains
  • Decreases load on upstream DNS servers
  • Currently using Quad9 (9.9.9.9) as the upstream resolver - the only non-US DNS service I still trust in this day and age

Blocking Effectiveness: 8.5% of requests blocked

  • More than 1 in 12 DNS requests were for ad/tracking domains
  • That's 8,500 blocked requests in just 23 hours
  • Represents a significant reduction in unwanted network traffic

System Performance: Negligible Impact

One concern with running Pi-hole on a Pi 4 was whether it would impact system performance. For context, here are the system specs:


robins@pi4:~ $ uname -a
Linux pi4 6.12.34+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.12.34-1+rpt1~bookworm
  (2025-06-26) aarch64 GNU/Linux

robins@pi4:~ $ free -h
               total        used        free      shared  buff/cache   available
Mem:           3.7Gi       463Mi       1.7Gi        34Mi       1.6Gi       3.3Gi
Swap:          511Mi          0B       511Mi


The performance numbers are reassuring:

CPU Usage: Essentially zero load:


robins@pi4:~ $ uptime
 10:39:33 up 23:04,  3 users,  load average: 0.00, 0.00, 0.00

  • Zero computational overhead even after 23+ hours of operation
  • Pi 4 is more than capable of handling the workload
  • Plenty of headroom for additional services

Memory Usage: Minimal footprint

  • Pi-hole runs efficiently even on modest hardware
  • No noticeable impact on system responsiveness

Honestly, the Pi 4 seems like overkill for this task. Given the minimal resource requirements, a Pi 2 (with the older Cortex-A7) would likely handle Pi-hole just fine, leaving the Pi 4 (with the more capable Cortex-A72) free for more power-hungry side projects. The beauty of Pi-hole is that it's so lightweight, you can run it on practically any Raspberry Pi model and still have resources to spare.

The Web Interface: Polished and Responsive

Pi-hole's web dashboard deserves special mention. It's:

  • Fast and Responsive: Page loads are snappy, even on the Pi 4
  • Live Updates: The dashboard shows real-time query statistics
  • Well-Designed: Clean interface that makes data easy to understand
  • Comprehensive: Detailed logs, statistics, and configuration options



The live dashboard is particularly satisfying to watch - seeing blocked queries in real-time gives you a visceral sense of how much unwanted traffic Pi-hole is filtering out.

Real-World Benefits

Beyond the statistics, the practical benefits are noticeable:

  • Faster Browsing: Pages load quicker without ad network delays
  • Cleaner Experience: Websites feel less cluttered
  • Privacy Improvement: Reduced tracking across devices
  • Bandwidth Savings: Less unwanted traffic on the network

One particular benefit I'm looking forward to testing: my network occasionally suffers from stuttering ping latencies. Instead of the usual 10ms responses, I'll sometimes see 100ms or even 1-second ping times that persist for minutes. During these episodes, streaming mostly works and browsing barely functions, but what really hurts is DNS resolution - especially for websites requiring multiple round-trips. With Pi-hole's 70% cache hit rate, those problematic periods should be noticeably less laggy since most DNS queries won't need to traverse the struggling network connection. More testing needed, but the potential is promising.

The ad filtering aspect wasn't my primary motivation, but it doesn't hurt either. For websites I frequent that provide genuinely good content - like Phoronix with their constant stream of quality tech updates - I've set up pass-throughs so they can rightfully earn the advertising revenue they deserve for their work. For projects like Pi-hole itself, I prefer the direct contribution route via donations. It's about supporting the creators and maintainers who provide value, whether through allowing ads or direct financial support.

Worth Supporting

The Pi-hole project has created something genuinely useful that "just works" out of the box. The quality of both the software and the installation experience makes this a project worth supporting financially. I'll definitely be making a donation to help ensure continued development.

Final Thoughts

Pi-hole represents the best kind of open-source software: it solves a real problem elegantly, installs without drama, and delivers measurable benefits immediately. If you have a spare Raspberry Pi lying around, setting up Pi-hole is an excellent way to put it to work improving your entire network's performance and privacy.

The fact that it blocked 8,500 unwanted requests in less than a day while using virtually no system resources makes it a clear win. Sometimes the best technology is the kind you set up once and then forget about - until you look at the statistics and realize how much work it's quietly doing in the background.

27 Oct 2024

What's in an empty table?


How much storage does an empty table in Postgres take?

This is a post about Postgres tables that store ... well basically ... Nothing.


The idea for this post came from this tweet that hinted that an empty table on most databases today takes 16Kb of storage. Now admittedly Franck was probably reminiscing the good-old days so this is probably quite out of context, but it did get me thinking, and thus this post.


NB: Here's a video showing this in action ! - See video .

A "regular" empty table in Production

Here's a regular small table that could be found in Production. It has a Primary Key, a text column and a JSONB column. Let's check the table size using the pg_total_relation_size() postgres function (you can read more about that function here).

db1=# create table t(id bigint primary key, b text, c jsonb);

CREATE TABLE


db1=# select pg_total_relation_size('t');

 pg_total_relation_size

------------------------

                  16384

(1 row)



Hmmm, so that tweet did have a point. Given how low-cost memory has become over the decades, it is easy to understand why databases today chose to optimize speed over memory efficiency (more on this later) and so even an empty table in Postgres, does consume 16 kb. 

But "where" is the 16kb being used?


db1=# select pg_relation_size('t');
 pg_relation_size
------------------
                0
(1 row)

The relation itself isn't consuming any space! That is good (again more on this later) but then where is the space being used then?

db1=# \d t
                 Table "public.t"
 Column |  Type  | Collation | Nullable | Default
--------+--------+-----------+----------+---------
 id     | bigint |           | not null |
 b      | text   |           |          |
 c      | jsonb  |           |          |
Indexes:
    "t_pkey" PRIMARY KEY, btree (id)


We see that the table has a Primary Key - and thus an index - t_pkey

Does the index is consuming 16kb?

db1=# select pg_relation_size('t_pkey');
 pg_relation_size
------------------
             8192
(1 row)

So the index is using some of it - 8kb - so that is progress - but who's using the other 8kb?


Let's start cutting the table down

Let's start cutting down the columns and see if the disk-usage goes down. 

db1=# DROP TABLE t; CREATE TABLE t (id BIGINT PRIMARY KEY, b JSONB); select pg_total_relation_size('t');
DROP TABLE
CREATE TABLE
 pg_total_relation_size
------------------------
                  16384
(1 row)

db1=# DROP TABLE t; CREATE TABLE t (id BIGINT PRIMARY KEY, b TEXT); select pg_total_relation_size('t');
DROP TABLE
CREATE TABLE
 pg_total_relation_size
------------------------
                  16384
(1 row)


Hmmm, that didn't help at all. Dropping either of the TEXT or JSONB column didn't help. Let's look at the expanded version of this table to see if there's any similarity in the two columns. (I've clipped the output to make it easier to read)


db1=# \d+ t
                                            Table "public.t"
 Column |  Type  | Collation | Nullable | Default | Storage  | ...
--------+--------+-----------+----------+---------+----------+-...
 id     | bigint |           | not null |         | plain    | ...
 b      | text   |           |          |         | extended | ...
 c      | jsonb  |           |          |         | extended | ...
Indexes:
    "t_pkey" PRIMARY KEY, btree (id)
Access method: heap


Clearly, the two column's are "extended" (you can read more about it here), but basically what happened here is that an extended column type resulted in the creation of a TOAST table (you can read more about TOAST here). Let's find out "how" to find the TOAST table for the table "t", and check if that could be consuming the remaining 8kb?


db1=# select oid, relname, reltoastrelid, reltoastrelid::regclass from pg_class where relname = 't';
  oid  | relname | reltoastrelid |      reltoastrelid
-------+---------+---------------+-------------------------
 18300 | t       |         18303 | pg_toast.pg_toast_18300
(1 row)

db1=# select pg_relation_size('pg_toast.pg_toast_18300');
 pg_relation_size
------------------
                0
(1 row)

Hmmm, it's not the TOAST table. But just like the main table "t", could it be that the TOAST table has supporting relations that are to blame?


db1=# select pg_total_relation_size('pg_toast.pg_toast_18300');
 pg_total_relation_size
------------------------
                   8192
(1 row)

db1=# \d pg_toast.pg_toast_18300
TOAST table "pg_toast.pg_toast_18300"
   Column   |  Type
------------+---------
 chunk_id   | oid
 chunk_seq  | integer
 chunk_data | bytea
Owning table: "public.t"
Indexes:
    "pg_toast_18300_index" PRIMARY KEY, btree (chunk_id, chunk_seq)

db1=# select pg_relation_size('pg_toast.pg_toast_18300_index');
 pg_relation_size
------------------
             8192
(1 row)

Yes! So we see above, that the TOAST table implicitly has a primary key (of it's own) that uses an index - which has 1 page (8kb) assigned to it.

In a nutshell, the empty table 't' above, consumes 16kb and the storage allocation takes this shape:

  • Main relation - 0 bytes - 't'
  • Main relation  Index - 8kb - 't_pkey'
  • Toast relation - 0 bytes - 'pg_toast_18300'
  • Toast relation Index - 8kb - 'pg_toast_18300_index'

Cut Cut Cut

Okay, lets see if we can reduce the table size further, by dropping both the Extended columns.


db1=# DROP TABLE t; CREATE TABLE t (id BIGINT PRIMARY KEY); select pg_total_relation_size('t');
DROP TABLE
CREATE TABLE
 pg_total_relation_size
------------------------
                   8192
(1 row)


Okay, that makes sense. Now the main table (heap) is still not consuming anything, but the index still does. 

Let's reduce further.

db1=# DROP TABLE t; CREATE TABLE t (id BIGINT); select pg_total_relation_size('t');
DROP TABLE
CREATE TABLE
 pg_total_relation_size
------------------------

                      0
(1 row)





0 bytes!! 


Nice! But seriously - 0 bytes?

It kind of makes sense, that since Postgres doesn't yet have anything to store - besides the metadata of the table (and since the metadata is stored in the system catalogs - for e.g. pg_catalog schema) - there isn't anything to store in the main relation (heap) as yet. 

Disk Usage - Check filesystem

Hmmm - Nah, what if I don't trust Postgres?

Let's skip Postgres functions and ask the filesystem directly - and see if the table is actually 0 bytes. Here we first find the file path of the table in question using the postgres function pg_relation_filepath() and then ask the file-system for the file-size.


db1=# select pg_relation_filepath('t');
 pg_relation_filepath
----------------------
 base/17727/18356
(1 row)

db1=# \! ls -la /home/robins/proj/localpg/data/base/17727/18356

-rw------- 1 robins robins 0 Sep 23 10:19 /home/robins/proj/localpg/data/base/17727/18356


So the file corresponding to the table, actually is using 0 bytes. Nice!

Now, when a table is created, some entries are added to the system catalog. Let see if the database size is signficantly more than a blank database?


postgres=# create database db1;
CREATE DATABASE

postgres=# \c db1
You are now connected to database "db1" as user "robins".

db1=# CREATE TABLE t (id BIGINT);
CREATE TABLE

db1=# select pg_database_size('db1');
 pg_database_size
------------------
          7482515
(1 row)

db1=# create database db2;
CREATE DATABASE

db1=# \c db2
You are now connected to database "db2" as user "robins".

db2=# select pg_database_size('db2');
 pg_database_size
------------------
          7482515
(1 row)


Good. So this somewhat confirms that a blank database and a database with an "empty" table use the same disk-space. 

Postgres is hiding something

Technically though, I'm lying. Well actually Postgres is lying hiding something from the file-system (i.e. every new table does make the database grow logically - a tad little - just that most of the times, the filesystem doesn't get the memo).

Under the cover, the way postgres stores data in a table (catalog table, or any user table), although it consumes a page of disk-space (often 8Kb), logically it may be consuming only a small part of that page. This is very helpful when more rows need to be stored in the table. When more rows come in, Postgres is able to reuse the same (first) page to now logically store more data - although for the filesystem - no extra pages were requested. This is what I was hinting at earlier that today's database use disk space (and thus memory cache) with larger (8kb) chunks and continue to keep using that page (until a new page is needed). Further below, I show a brief example of how all of this works.

But to summarize, it is then unfair to say that the system catalog did not grow at all (when a new table was added) - since some catalogs are guaranteed to have grown (for e.g. metadata of the new table is stored as an extra row in pg_class etc.) within the disk blocks already allocated as in-use for that catalog.

Let's squeeze a little more? 


db1=# create table a();

CREATE TABLE


Right off the bat, that might seem completely wrong. Does Postgres allow a table with no Columns?

Yes !! 😎

All databases allow creation of an empty table (obviously), but Postgres allows a new table even if there are no columns! Let's verify this from the Postgres Documentation. Although suttle, we can see that in the syntax section of the CREATE TABLE page, the column_name data_type is enclosed with a square braces [] - which implies that columns are in fact, optional. What's more, this "feature" is a part of Postgres at least for the past 20 years!

Now the utility of this table is arguable (we'll explore that below), but it is now clear that this syntax is legal and works just fine. 

Let's see how such a table looks like with psql \d 

db1=# \d a
                Table "public.a"
 Column | Type | Collation | Nullable | Default
--------+------+-----------+----------+---------


That's it! That's the complete output - Since the table has no columns, the output above (rightly) doesn't show anything.


Let's go a little deeper

The obvious next question is - What on earth could a table like this be used for? That is a perfectly good question, and the answer is probably not much. However if you instead ask whether "Squeeze a little more" mean that a no column table takes less storage space? The answer (depends on input data but) is most probably yes. Let's see how does it help Postgres if it knows that you don't want to store any column in the table.


db1=# create table a();
CREATE TABLE
db1=# \dt+ a
                                  List of relations
 Schema | Name | Type  | Owner  | Persistence | Access method |  Size   | Description
--------+------+-------+--------+-------------+---------------+---------+-------------
 public | a    | table | robins | permanent   | heap          | 0 bytes |
(1 row)

db1=# insert into a select;
INSERT 0 1

db1=# \dt+ a
                                    List of relations
 Schema | Name | Type  | Owner  | Persistence | Access method |    Size    | Description
--------+------+-------+--------+-------------+---------------+------------+-------------
 public | a    | table | robins | permanent   | heap          | 8192 bytes |
(1 row)


Nothing new here. We see that although an empty (0 column) table consumes 0 byte for storage. And like a regular table, as soon as the first row is inserted, the table uses 1 page - which in my test database (and probably 99.99% of postgres databases world-wide) consumes 8192 bytes. This is expected, but do note that the storage of logical rows in a postgres page, is a little oddly done (and for good reason). There is lllllooooottttt of detail here - but I wouldn't blame you if you'd want to keep that aside for a cold winter morning - when armed with a cup of hot coffee.

For now, we see below that each row that is inserted into the table, consumes 24 bytes - in that 8kb page.

db1=# create extension pageinspect ;
CREATE EXTENSION

db1=# truncate table a;
TRUNCATE TABLE

db1=# insert into a select FROM generate_series(1,3);
INSERT 0 3

db1=# SELECT * FROM heap_page_items(get_raw_page('a', 0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits
| t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------
+-------+--------
  1 |   8168 |        1 |     24 |  85105 |      0 |        0 | (0,1)  |           0 |       2048 |     24 |
|       | \x
  2 |   8144 |        1 |     24 |  85105 |      0 |        0 | (0,2)  |           0 |       2048 |     24 |
|       | \x
  3 |   8120 |        1 |     24 |  85105 |      0 |        0 | (0,3)  |           0 |       2048 |     24 |
|       | \x
(3 rows)



Still going Deeper - but on a Tangent

So does the above imply that adding more columns to a table would mean Postgres consumes more bytres-per-row? Let's verify:

db1=# drop table t;
DROP TABLE

db1=# create table t(id bigint);
CREATE TABLE

db1=# truncate table t; insert into t select generate_series(1,3); vacuum full t; \dt+ t
TRUNCATE TABLE
INSERT 0 3
VACUUM
                                    List of relations
 Schema | Name | Type  | Owner  | Persistence | Access method |    Size    | Description
--------+------+-------+--------+-------------+---------------+------------+-------------
 public | t    | table | robins | permanent   | heap          | 8192 bytes |
(1 row)

db1=# SELECT * FROM heap_page_items(get_raw_page('t', 0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits
| t_oid |       t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------
+-------+--------------------
  1 |   8160 |        1 |     32 |  85100 |      0 |        0 | (0,1)  |           1 |       2816 |     24 |
|       | \x0100000000000000
  2 |   8128 |        1 |     32 |  85100 |      0 |        0 | (0,2)  |           1 |       2816 |     24 |
|       | \x0200000000000000
  3 |   8096 |        1 |     32 |  85100 |      0 |        0 | (0,3)  |           1 |       2816 |     24 |
|       | \x0300000000000000
(3 rows)


Here we see that each row is now consuming 32 bytes - which is an extra 8 bytes from earlier. Good chances the only column we've added is the reason for the extra 8 bytes. Let's verify that using the pg_column_size() function (you can read more about it here):


db1=# select pg_column_size(1::bigint);
 pg_column_size
----------------
              8
(1 row)


But wait, there's one more twist here:

db1=# INSERT INTO t SELECT;
INSERT 0 1

db1=# SELECT * FROM heap_page_items(get_raw_page('t', 0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff |  t_bits
  | t_oid |       t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------
--+-------+--------------------
  1 |   8160 |        1 |     32 |  85111 |      0 |        0 | (0,1)  |           1 |       2816 |     24 |
  |       | \x0100000000000000
  2 |   8128 |        1 |     32 |  85111 |      0 |        0 | (0,2)  |           1 |       2816 |     24 |
  |       | \x0200000000000000
  3 |   8096 |        1 |     32 |  85111 |      0 |        0 | (0,3)  |           1 |       2816 |     24 |
  |       | \x0300000000000000
  4 |   8072 |        1 |     24 |  85113 |      0 |        0 | (0,4)  |           1 |       2049 |     24 | 0000000
0 |       | \x
(4 rows)

So wait, see row 4. Although the table has a column, just because the column didn't have a value, the row actually consumed only 24 bytes (the minimum)?

Is this scalable? I mean can I have a 5 column table and still Postgres stores a row, but only consume the bare minimum 24 bytes? Let's see:


db1=# drop table h;
DROP TABLE

db1=# create table h(c1 bigint, c2 bigint, c3 bigint, c4 bigint, c5 bigint);
CREATE TABLE

db1=# insert into h select;
INSERT 0 1

db1=# SELECT * FROM heap_page_items(get_raw_page('h', 0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff |  t_bits
  | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------
--+-------+--------
  1 |   8168 |        1 |     24 |  86262 |      0 |        0 | (0,1)  |           5 |       2049 |     24 | 0000000
0 |       | \x
(1 row)


So yes, that does work and it does scale for "many" columns - but with a minor variation. There's more detail in code, but basically for regular columns the header contains 1 bit per column which expands in 8 byte chunks - and so say for 100 column table (with no data) - Postgres consumes ~40 bytes per row. 


db1=# drop table h; create table h(); select 'alter table h add column c' || n || ' bigint;' from generate_series(1,100) e(n); \gexec
.
.
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE

db1=# select count(*) from pg_attribute where attrelid = 'h'::regclass;
 count
-------
   106
(1 row)

db1=# insert into h select; vacuum full h; SELECT * FROM heap_page_items(get_raw_page('h', 0));
INSERT 0 1
VACUUM
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff |
                                          t_bits                                                  | t_oid | t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------
--------------------------------------------------------------------------------------------------+-------+--------
  1 |   8152 |        1 |     40 |  88376 |      0 |        0 | (0,1)  |         100 |       2817 |     40 | 0000000
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 |       | \x
(1 row)

Given the task at hand, I'd say that's still decently crisp.

So does a zero column table, squeeze max rows per page?


Yes, and No. So going back to a 0 column table - let's try to fill the whole page with rows and see how many can be stuffed on a single page. 

If you're running low on coffee - the page-header is 24 bytes and so back-of-the-envelope math suggests that number of rows possible to squeeze into a page should be - ( 8192 bytes in a page - some bytes for page header & footer ) / 24 bytes per row = ~340 rows.


db1=# truncate table a; insert into a select FROM generate_series(1,340); vacuum full a; \dt+ a
TRUNCATE TABLE
INSERT 0 340
VACUUM
                                 List of relations
 Schema | Name | Type  | Owner  | Persistence | Access method | Size  | Description
--------+------+-------+--------+-------------+---------------+-------+-------------
 public | a    | table | robins | permanent   | heap          | 16 kB |
(1 row)

Something's not right! That size should have stayed 8Kb. Why did Postgres use 16Kb (two pages - instead of one)?

That's because of this tiny bit of trivia that the maximum number of tuples that Postgres can squeeze onto a page, is hard-coded to 291 (for an 8kb page) - which was interesting to know - but to clarify, Heap-Only-Tuples (HOT feature) can effectively force a few more rows on a running database, but we'll go deeper into that some other day.

So let's go back and confirm if that understanding is correct - that Postgres can in fact squeeze at max only 291 rows onto the same page. 


db1=# truncate table a; insert into a select FROM generate_series(1,291); vacuum full a; \dt+ a
TRUNCATE TABLE
INSERT 0 291
VACUUM
                                    List of relations
 Schema | Name | Type  | Owner  | Persistence | Access method |    Size    | Description
--------+------+-------+--------+-------------+---------------+------------+-------------
 public | a    | table | robins | permanent   | heap          | 8192 bytes |
(1 row)


db1=# truncate table a; insert into a select FROM generate_series(1,292); vacuum full a; \dt+ a
TRUNCATE TABLE
INSERT 0 292
VACUUM
                                 List of relations
 Schema | Name | Type  | Owner  | Persistence | Access method | Size  | Description
--------+------+-------+--------+-------------+---------------+-------+-------------
 public | a    | table | robins | permanent   | heap          | 16 kB |
(1 row)

Here we see that:
  • When 291 rows are inserted (blue), the table stays at 1 page (8kb)
  • Whereas when 292 (291+1) rows are inserted (green), the table expands to 2 pages (16 kb)

Utility

So, that's all fine, but what's this table good for?

Well beyond understanding Postgres :) not much. This table is unhelpful for most database tasks. It can't store values - it wouldn't allow columns / indexes / selective deletes (let's avoid ctid hacks for now) etc. 

But if I was forced to conjure an idea, a high-contention ticker app (that only needs to store +1s) via postgres functions - may be (and that's a BIG may be) this table could be used to store +1s - with a back-off algorithm on the application side. Good chances if there's a good DBA - this is done much better in many other ways (simplest of which is at the application end, or as a value in a column etc.) but it'd be better than nothing.

There are few other possible use-cases discussed here - for e.g. if (for some reason) you'd want to add columns to a table in a programmatic fashion after a base table is created - like we did above in the 100 column table test, OR, if you want to reserve a table name (in a multi-user setup) months / years in advance etc.

Finally

Much Ado About Nothing... This was a good exercise where we learnt something new about how Postgres stores table data - when ironically - there's nothing to store :) 

Hope you had fun! Comments more than welcome.

16 Jun 2024

Compiling latest gcc to test more architectures

Off late, I've had two separate needs to compile GCC by hand and although my first foray into compiling gcc from git took patience, stumbling over the basics was interesting to say the least.

The first time I realised that an old GCC version could matter, was this feedback [1] that one of my buildfarm members was running an old (for its arch) gcc version, something that I almost never paid much attention to. The other being that that led me to newer architectures (more on that below) and how this could repeat itself if / when I end up playing with more architectures.

So finally, I can say I have a framework that frequently checks / recompiles gcc and ensures all my local tests are using the latest and the greatest gcc :) . (I am happy with how this has taken shape on my home server, and once I am able to port it to my other machines, don't see why this shouldn't land on github).

Now admittedly, compiling gcc on a nightly basis was already an overkill, but then what the heck - I went and did this hourly basis just because well-why-not. My personal ask was to:

  1. Incrementally learn how compiling gcc unfolds
  2. Have some fun scripting while at it
  3. ... but most importantly, see whether I could utilize this experience in other experiments where the idea is to forewarn database developers about upcoming changes.

A little more on point 3 above, I oversee a few machines on the postgres buildfarm and they differ in some aspects:

  1. Different archictures:
    1. aarch64: Gravitons
    2. x86-64: A vanilla off-the-shelf dell workstation
    3. armv7l - Raspberry Pi4
  2. Different GCCs:
    1. 8.3.0 (default in pi4)
    2. 7.3.1 (default on most ALs)
    3. 13.2 (default on Ubuntu)
    4. 14.0.1 (naive attempt at compiling whatever cleared make check)
    5. gcc (experimental nightly)
  3. (Internally I also run some different fuzzing workloads but that's besides the point)
  4. (Future plans - add some form of randomizer to test odd combinations of compilation flags, but more on that in an upcoming post)


Now if all goes to plan, I should also add to the mix, 2 new architectures. They wouldn't be the snappiest processors in the market (at least not in the pricing-level I am after), but hey it should be fun to play with!

  1. loongarch64 - (cough) Recent (but sure as butterflies, a promising) entrant - the MIPS64 Loongson has been around for some time now, but my interest has grown off late owing to sporadic reports that its becoming somewhat competitive, which should be interesting to review.
  2. riscv64 - Another interesting arch that should be fun to try out. Again, am not holding my breath that it'd top any charts, but could still end up being interesting nonetheless.

On the GCC front, getting the setup ready and stable, clears the path to now focus upgrading my buildfarm animals one-by-one and basically start focussing beyond this hurdle.

gcseb02 20240615_1700 - git checkout successful.
gcseb02 20240615_1700 - git pull successful.
gcseb02 20240615_1700 - No change in gcc version. Quitting.

gcsa36f 20240615_1800 - git checkout successful.
gcsa36f 20240615_1800 - git pull successful.
gcsa36f 20240615_1800 - gcc has changed - [471fb092601] vs [57af69d56e7]. Recompiling.
gcsa36f 20240615_1800 - make successful
gcsa36f 20240615_1800 - make install successful.
gcsa36f 20240615_1800 - gcc version string has changed from [15.0.0 20240615 (experimental) - 471fb092601] to [15.0.0 20240615 (experimental) - 57af69d56e7]

gcsf66a 20240615_1900 - git checkout successful.
gcsf66a 20240615_1900 - Unable to git pull. Are we connected? Quitting.
gcsf66a 20240615_1900 - git switched back to 57af69d56e7.

gcs629f 20240615_2000 - git checkout successful.
gcs629f 20240615_2000 - git pull successful.
gcs629f 20240615_2000 - gcc has changed - [57af69d56e7] vs [6762d5738b0]. Recompiling.
gcs629f 20240615_2000 - make successful
gcs629f 20240615_2000 - make install successful.
gcs629f 20240615_2000 - gcc version string has changed from [15.0.0 20240615 (experimental) - 57af69d56e7] to [15.0.0 20240615 (experimental) - 6762d5738b0]

.
.

gcsc115 20240616_0400 - git checkout successful.
gcsc115 20240616_0400 - git pull successful.
gcsc115 20240616_0400 - No change in gcc version. Quitting.

Reference

2. Compilation script source - https://github.com/robins/gcc_compile

3x Faster TID Range Scans - Postgres 19

If you've ever had to scrub significantly large tables—whether updating older records or deleting expired ones—you know the pain of tryi...