Showing posts with label ai. Show all posts

22 Apr 2026

Local LLM Update - ThinkStation Meets Blackwell

Finally - an update to the AI workbench, and one hell of an update it is :) !

The home server farm has now been blessed with dedicated silicon, and it is an absolute beast: the NVIDIA RTX 5060 Ti 16GB. The old setup—which was painfully grinding through CPU inference at a mere 1 token/sec on 46GB of RAM—has officially been retired. After months of relying on remote APIs, lightning-fast local agentic workflows are finally a reality, all without seriously over-working my legacy workstation (and waiting eons for responses).

To clarify, my paid subscriptions to cloud LLMs will continue. However, with this capable local setup handling the bulk of my daily tasks, I expect to hit the dreaded "credits have expired" message much later in the month—if ever. At the very least, having this local horsepower certainly pushes off the need to upgrade to a pricier "Ultra" plan just to get more API tokens.

The ultimate objective here is to deploy an agentic setup for the home and family. I initially investigated OpenClaw for the orchestration layer, but its security model was simply too porous for a paranoid systems engineer. IronClaw, on the other hand, looks like a solid, secure candidate to serve as the local agent nexus. Before any of that software could run, I needed an inference engine that did not crawl.

The Hardware: The ThinkStation S30 Meets Blackwell

The host machine is my trusty Lenovo ThinkStation S30 (Machine Type 4351). It is a Sandy/Ivy Bridge Xeon platform, firmly anchored in the PCIe 3.0 era. Dropping a brand-new Blackwell-architecture GPU into a system this old is technically a severe mismatch, but the RTX 5060 Ti 16GB is the perfect fit for this specific niche.

Instead of chasing older, power-hungry professional cards like the RTX A4000, or settling for consumer Turing cards with split memory bottlenecks, the 5060 Ti offered exactly what the S30 needed:

  • 16GB VRAM: The absolute minimum needed to comfortably fit modern 7B-9B parameter models.
  • GDDR7 Bandwidth: Hitting 448 GB/s, drastically improving throughput over older 60-series cards.
  • Native FP8/FP4 Support: Crucial for running highly quantized models efficiently.

Overcoming Legacy Architecture Limits

Getting a 2026 GPU to speak with a 2013 motherboard required some immediate troubleshooting. Initial boots into Linux Mint resulted in a wall of kernel panics and DMAR (DMA Remapping) faults. The modern GPU's memory management completely clashed with the S30's legacy Intel VT-d (IOMMU) implementation.

I resolved this by disabling Intel VT-d in the BIOS and appending intel_iommu=off to the GRUB bootloader parameters. This bypassed the broken firmware tables and allowed the system to boot stably with the proprietary NVIDIA 580.126.09 drivers.
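
For reference, the change amounts to a single kernel parameter in the GRUB defaults file (the surrounding GRUB_CMDLINE contents below are stock Linux Mint defaults and may differ on your system):

```
# /etc/default/grub -- append intel_iommu=off to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=off"

# Then regenerate the GRUB config and reboot:
#   sudo update-grub
```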

Another significant bottleneck was the PCIe 3.0 bus itself. When running vLLM, the default CUDA Graph capture and torch.compile phases initially took a grueling 16 minutes to complete due to the slow bus speed. While it worked beautifully after that initial warmup, the long delay posed a problem: because I set up the inference engine as an auto-start systemd service at boot, systemd would assume the process had hung and shoot it down before compilation could finish. To resolve this, I bypassed the compilation overhead by starting vLLM with the --enforce-eager flag, ensuring the service starts up reliably without getting restarted by the OS.
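
A minimal sketch of what such a unit might look like; the install path and model name are placeholders from my setup, not canonical vLLM defaults:

```
[Unit]
Description=vLLM inference server
After=network-online.target

[Service]
# --enforce-eager skips CUDA graph capture and torch.compile, so startup
# finishes quickly enough that systemd never assumes the process hung.
ExecStart=/opt/vllm/bin/vllm serve Qwen2.5-Coder-7B-Instruct-FP8-Dynamic \
    --max-model-len 16384 --enforce-eager
Restart=on-failure
# Alternative: keep compilation and give the unit a long grace period instead,
# e.g. TimeoutStartSec=1200

[Install]
WantedBy=multi-user.target
```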

Performance: 40-70 Tokens per Second

Despite the older host system, the 5060 Ti excels at small, localized tasks.

I settled on the Qwen2.5-Coder-7B-Instruct-FP8-Dynamic model. Because it leverages FP8 precision, the model weights consume roughly 8.5GB of the available 15.48 GiB VRAM. This leaves plenty of overhead for the KV cache and a 16k context window without spilling over into system RAM (which, across a PCIe 3.0 bus, would throttle performance down to single-digit tokens per second).
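
Those numbers roughly check out on the back of an envelope. A quick sketch in Python; the layer and head counts are taken from Qwen2.5-7B's published config and are approximations, not measurements from my running instance:

```python
# Back-of-the-envelope VRAM budget for Qwen2.5-Coder-7B at FP8.
GIB = 1024**3

# Weights: ~7.6B parameters at 1 byte each under FP8
params = 7.6e9
weights_gib = params / GIB

# KV cache per token: 2 tensors (K and V) x layers x KV heads x head_dim,
# stored at 2 bytes (FP16) by default
layers, kv_heads, head_dim, bytes_per_elem = 28, 4, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # bytes

# A full 16k context window's worth of cache
kv_16k_gib = 16384 * kv_per_token / GIB

print(f"weights ~{weights_gib:.1f} GiB, 16k KV cache ~{kv_16k_gib:.2f} GiB")
# Together well under the card's 15.48 GiB, leaving room for activations
# and the engine's own overhead.
```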

With vLLM 0.19.1 managing the inference (I initially started with Ollama but switched to vLLM—I don't have comparison numbers yet, but that is a topic for a future post), the S30 consistently pushes 40 to 70 tokens/sec for generation, and handles prompt processing at over 700 tokens/sec.

Here is a quick snapshot from the vLLM logs confirming these real-world speeds during a typical code completion task:

(APIServer pid=967222) INFO 04-22 19:09:50 [loggers.py:259] Engine 000: 
  Avg prompt throughput: 709.3 tokens/s, Avg generation throughput: 4.3 tokens/s, 
  Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.9%, 
  Prefix cache hit rate: 0.4%

(APIServer pid=967222) INFO:     127.0.0.1:60326 - "POST /v1/chat/completions 
  HTTP/1.1" 200 OK

(APIServer pid=967222) INFO 04-22 19:10:00 [loggers.py:259] Engine 000: 
  Avg prompt throughput: 52.4 tokens/s, Avg generation throughput: 41.0 tokens/s, 
  Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, 
  Prefix cache hit rate: 0.6%

The inference service now runs automatically via systemd. To access the engine securely from my laptop at home, I rely entirely on an SSH port forward: no firewall ports to open and no extra network-level hardening to configure, since SSH already provides the encrypted tunnel. On the client side, this local endpoint works beautifully with VS Code and the Continue extension, providing a seamless and entirely private AI coding experience.
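
The whole access path is one SSH command plus an OpenAI-compatible HTTP call. A sketch; the hostname, user, and port are illustrative placeholders from my setup:

```python
# On the laptop, forward the server's vLLM port over SSH first, e.g.:
#   ssh -N -L 8000:localhost:8000 user@thinkstation
# After that, any OpenAI-compatible client can talk to localhost:8000.
import json
import urllib.request

url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "Qwen2.5-Coder-7B-Instruct-FP8-Dynamic",
    "messages": [{"role": "user", "content": "Write a Python hello world."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the tunnel up, uncomment to send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```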

Power, Thermals, and Footprint

One of the best surprises of this Blackwell upgrade is how remarkably quiet and power-efficient the card is. It idles gracefully at 10W and rarely pushes past 20W during my typical small localized tasks. Because it runs so cool (sitting around 38°C with the fans completely off at 0%), stuffing the entire server rig into a cupboard does not trigger any thermal anxiety whatsoever.

For reference, here is the current footprint while loaded:

With the hardware foundation finally stabilized, incredibly power-efficient, and pushing excellent tokens per second, the runway is completely clear to deploy IronClaw and build out the actual home agent capabilities.

Looking Ahead

I am absolutely thrilled to have reached this stage! Pressing my trusty old ThinkStation into service was a gamble that paid off beautifully, saving a cool $2,000 to $3,000 that would have otherwise gone into an entirely new workstation on top of the GPU cost. Better yet, this setup gives me plenty of flexibility to stagger future upgrades. If I ever need more VRAM, I can simply drop in a second Blackwell card and run the pair in tandem, provided I first upgrade the host machine to PCIe 5.0 so the interconnect isn't choked by legacy bus speeds.

For now, I am eagerly looking forward to what's next. Having a secure, lightning-fast, and entirely local AI bedrock opens up incredible possibilities for the home network. I'll be diving deep into the IronClaw orchestration very soon, and you can expect more blog posts detailing that agentic journey in the coming weeks!

29 Mar 2026

Understanding Google AI Credits: What You're Actually Paying For


If you've subscribed to Google AI Pro [1] (or are eyeing an upgrade), you've probably noticed the word "credits" appearing everywhere — in your billing dashboard, in the Gemini app, and inside your code editor [5]. But what are they? How do they reset? And if your family is on the plan, who's burning through them? [3]

I spent an afternoon untangling this, and the short version is: Google doesn't have one quota system — it has three, each resetting on a different clock. Here's how it all works.


The Three Clocks

The single biggest source of confusion with Google AI Pro is that there are three independent limits running simultaneously, each with its own reset schedule:

Timer     What Resets?                  Applies To
-------   ---------------------------   -------------------------------------
Monthly   1,000 AI Credits              Video / 4K Images / IDE Overages
Weekly    Antigravity Baseline Quota    "Free" high-tier usage in Antigravity
Daily     Prompt & Media Limits         Chat (Thinking/Pro), Images, Music

Understanding which "fuel tank" you're drawing from at any given moment is the key to not running dry at the wrong time.


Monthly AI Credits (The 1,000 Pool)

When you subscribe to AI Pro, Google allocates 1,000 AI Credits [1] at the start of each billing cycle. These are the credits you see in your Google One dashboard with a countdown showing days until reset.

Key facts:

  • No rollover. If you have 380 credits left with 10 days to go, those 380 vanish on your billing date [1]. You start fresh at 1,000.
  • These are for premium "heavy" tasks. High-end video generation (Google Flow/Veo 3.1) [4], advanced image creation (Imagen 4 / Nano Banana 2), and IDE overage billing all draw from this pool.
  • Standard tasks are free. Typing prompts into the Gemini app, generating standard images (Gemini / Imagen 4), and basic music tracks (Lyria 3) do not consume monthly credits up to their daily limits [4].

The Top-Up Exception: If you manually purchase a "Top-up" credit pack, those purchased credits are typically valid for 12 months from the date of purchase and do carry over across billing cycles [1]. Only your subscription credits are use-it-or-lose-it.
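
To make the reset mechanics concrete, here is a toy model of the two pools (my own illustration of the rules above; the drain order, subscription credits before top-ups, is an assumption):

```python
from dataclasses import dataclass, field

@dataclass
class CreditWallet:
    """Toy model of the two AI Pro credit pools described above."""
    subscription: int = 1000                    # resets monthly, no rollover
    topups: list = field(default_factory=list)  # persist across cycles

    def balance(self) -> int:
        return self.subscription + sum(self.topups)

    def spend(self, amount: int) -> None:
        # Assumed drain order: subscription credits first, then top-ups.
        take = min(amount, self.subscription)
        self.subscription -= take
        amount -= take
        while amount and self.topups:
            take = min(amount, self.topups[0])
            self.topups[0] -= take
            amount -= take
            if self.topups[0] == 0:
                self.topups.pop(0)

    def monthly_reset(self) -> None:
        # Unused subscription credits vanish; top-ups carry over.
        self.subscription = 1000

w = CreditWallet()
w.spend(620)              # e.g. a handful of Veo Quality clips
w.topups.append(200)      # buy a top-up pack mid-cycle
w.monthly_reset()
print(w.balance())        # fresh 1,000 subscription plus the 200 top-up
```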

What Can You Buy With Credits?

Here are the standard costs for premium generation (2026 pricing) [2, 6]:

Feature                     Credit Cost (Approx.)   What 380 credits buys you
--------------------------  ----------------------  --------------------------
Nano Banana 2 (Imagen 4)    1                       ~380 high-res assets
Veo 3.1 Fast (via Flow)     10                      ~38 cinematic clips
Veo 3.1 Quality (via Flow)  100                     ~3 cinematic clips
IDE Overage (per hour)*     15                      ~25 hours of high-tier use

*Approximation based on typical token consumption in Antigravity.

Tip: If you're close to the end of your billing month with credits to spare, it's the perfect time for high-resolution 4K image generation or experimental Veo video clips: unused subscription credits vanish at the reset anyway, so burning them now costs you nothing extra.


Antigravity Weekly Baseline Quota (The IDE Clock)

If you use Google Antigravity (the AI-powered code editor), you'll notice a separate "refresh date" in your settings. This date is typically different from your monthly credit reset.

What it is: Google gives AI Pro users a "Free Baseline" of high-performance model usage (Gemini 3.1 Pro, Claude 4.6 Sonnet, etc.) within Antigravity every week [5, 6].

Key facts:

  • 5-Hour Sprints: Your immediate capacity refreshes every 5 hours [5].
  • 7-Day Hard Cap: There is a weekly baseline limit. If you exhaust this, the 5-hour refresh stops working until your next 7-day reset (the "refresh date" you see) [5].
  • This quota is individual — each account on your family plan has its own personal IDE baseline.

The Overage Toggle

In Antigravity settings, the "Use AI Credits for Overages" toggle lets you decide what happens when your weekly baseline runs out:

  • ON: Antigravity draws from your monthly 1,000-credit pool to keep you coding at full speed.
  • OFF: You're limited to the "Flash" model until the weekly refresh hits.
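
The toggle's effect boils down to a simple decision. A sketch (my own paraphrase of the observed behavior, not documented Google logic):

```python
def pick_model(weekly_baseline_left: int, overage_toggle_on: bool,
               monthly_credits: int) -> str:
    """Which tier serves the next Antigravity request?"""
    if weekly_baseline_left > 0:
        return "high-tier (free weekly baseline)"
    if overage_toggle_on and monthly_credits > 0:
        return "high-tier (billed to monthly AI Credits)"
    return "Flash (until the weekly refresh)"

print(pick_model(0, True, 380))   # baseline exhausted, toggle ON
print(pick_model(0, False, 380))  # toggle OFF: drops to Flash
```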

Daily Quotas (Chat, Music, and Deep Research)

Finally, your day-to-day interactions have their own daily caps that reset at midnight Pacific Time. These never touch your 1,000 monthly credits:

Feature                 Limit Per Day
----------------------  -------------
Thinking Model Chat     300 prompts
Pro Model Chat          100 prompts
Nano Banana 2 (Std)     1,000 images
Lyria 3 (Music)         50 tracks [4]
Deep Research Reports   3 reports

Family Plans: What's Shared, What's Not

Feature                   Shared with Family?     Reset Cadence
------------------------  ----------------------  -------------
1,000 AI Credits          Yes [3] (shared pool)   Monthly
2 TB Storage              Yes [1] (shared pool)   Monthly
Daily Interaction Limits  No (individual)         Daily
Antigravity Baseline      No [5] (individual)     Weekly

The AI Credits are the only shared AI quota. If a family member generates a few Veo 3.1 videos, they are spending from the same 1,000-credit bucket you use for your IDE overages [3]. You can monitor this in the Google One → AI Credits Activity dashboard.

Important: Family members must be 18+ for high-tier model access. Under-18 accounts are restricted to Gemini Basic.


References

  1. Google One AI Pro: Membership Overview - Plan pricing and 1,000 credit allocation.
  2. Google One - AI Credits Pricing (2026) - Details on credit reset and top-ups.
  3. Google One Help - Managing Family AI Credit Activity - Shared pool tracking and activity history.
  4. Google Cloud - Expanding AI Creativity with Google Flow and Veo 3.1 - Media generation models and limits.
  5. Google DeepMind - Antigravity IDE Baseline Quota Framework - 5-hour refresh and weekly baseline rules.
  6. Anthropic - Claude Sonnet 4.6 on Google Cloud Vertex AI - February 2026 release news.

9 Mar 2026

Display IMDb Ratings on Einthusan


Surfing niche streaming sites without inline film ratings is a recipe for endless tab-opening and "analysis paralysis." To scratch my own itch, I put together a small userscript called Masala Script to fix this exact problem.

For context, Einthusan is a massive streaming directory for South Asian cinema. While it's an excellent digital archive, exploring its thousands of regional films is tedious because it lacks external metadata like IMDb ratings.

To solve this friction, Masala Script (presently just a single Tampermonkey extension file) reads the Einthusan page, interfaces with the free OMDB API, and renders IMDb rating badges right next to the title.

Technical Features

While it might seem trivial to inject an API call onto a DOM load, building this script properly required addressing a few interesting technical hurdles:

  • Intelligent Fallback Matching via Wikipedia: South Asian movie titles vary wildly in transliteration. An exact-title search against the OMDB API frequently fails for regional films. However, Einthusan usually provides a Wikipedia link for each movie. I built a transparent scrape fallback: if the direct title/year fetch fails, the script fetches the linked Wikipedia page in the background, extracts the definitive ttXXXXXXX IMDb ID using a regular expression, and then repeats the OMDB query using that exact ID for 100% accuracy.
  • Aggressive Client-Side Local Caching: The free tier of the OMDB API provides 1,000 requests per day. A typical Einthusan browse page can render over 20 movie cards at once. Scrolling through just 50 pages would immediately exhaust the daily allowance and result in rate limits. The script counters this by heavily utilizing the GM_setValue and GM_getValue Tampermonkey APIs—caching successful queries in the browser for 7 days, and failed title lookups for 1 day.
  • Detailed Error Tooltips: Rather than failing silently, any lookup that ultimately misses (or fails due to API config errors) renders a "Fail" or "N/A" UI badge. When hovered, it provides an exact exception traceback or error string so the user knows exactly why the movie metadata wasn't found.
  • Single Page Application Navigation Detection: Einthusan manages categories via AJAX updates, utilizing history.pushState and popstate to load new frames. The userscript actively monkey-patches and listens to these navigation boundaries, injecting the DOM observers properly on dynamic content swapping.
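
The Masala Script itself is JavaScript, but the heart of the Wikipedia fallback, pulling an exact IMDb title ID out of the linked article, can be sketched in a few lines of Python (the `i`, `t`, and `y` query parameters match the public OMDB API; the helper name is mine):

```python
import re

# IMDb title IDs are 'tt' followed by 7-8 digits
IMDB_ID = re.compile(r"tt\d{7,8}")

def extract_imdb_id(wikipedia_html: str):
    """Return the first IMDb title ID found in a Wikipedia page's HTML."""
    m = IMDB_ID.search(wikipedia_html)
    return m.group(0) if m else None

html = '<a href="https://www.imdb.com/title/tt0169102/">IMDb</a>'
print(extract_imdb_id(html))   # tt0169102
# With the exact ID, OMDB is queried via ?i=tt0169102 instead of a fuzzy
# ?t=<title>&y=<year> lookup, sidestepping transliteration mismatches.
```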

The Genesis of Masala Script

Built using modern AI tools, this project demonstrates how rapidly a useful, robust browser enhancement can be coded from scratch with minimal manual boilerplate.

Looking at the initial Git history over the course of the day shows how quickly the script escalated from a basic exact-title scraper into a much more mature and intelligent tool:

commit 70027efcd906586cb46928b6da16c46c2402ae25
docs: replace development instructions with a detailed features and notes section in README.

commit 133dbdf7e6123f3a9ac774d820e1a1aa932051f4
Add caching for imdb ratings

commit acb25cc1187884e771efa9e32cf9836461d403c1
feat: Do a fallback search (via Wikipedia), in case first imdb rating fetch fails.

commit 06f3f05be8f6ebd08dfedffb80cf4d53544786f6
feat: Add movie page to the list of pages where this works

Try It Out

If you want to try it out on your next Einthusan movie night:

  1. Install the Tampermonkey browser extension.
  2. Claim a free OMDB API Key.
  3. Install the script by clicking on imdb-einthusan.user.js.

The next time you navigate to any browse or movie page on Einthusan, the script will prompt you for your OMDB API key (only once), and start displaying ratings next to the movie cards!

Although this extension might only interest a very niche subset of users, for those of us who regularly browse Einthusan (and heavily rely on IMDb to filter through the noise), it's a massive quality-of-life add-on to have.

If anyone is looking for new features, has a bug to report, or wants to contribute, feel free to create an issue or submit a pull request on the repository. Happy watching!
