Notes from my work: what I broke, what I fixed, and what I wish someone had told me sooner. Mostly about Generative AI, RAG, and getting LLM systems to actually behave in production.

GPT-5.1 Unlocked: How I Brought 200-Page Document Generation Down to Under 2 Minutes

March 15, 2026  •  9 min read

The first version of Gov Co-Author took almost 20 minutes to write a 200-page compliance document. Authors hated waiting. We hated shipping it. So I went back in, moved to GPT-5.1, and rebuilt the pipeline piece by piece. It now finishes in under 2 minutes. The model helped, but the real wins were smaller: writing sections in parallel, streaming results as they came, and caching the bits we kept asking for again and again. In this post I walk through what I changed, why, and the numbers behind each call.

GPT-5.1  •  LLM Engineering  •  Latency Optimization  •  Azure OpenAI  •  Production

Building a Citation Engine from Scratch: Taking RAG Source Attribution to Production

July 22, 2025  •  12 min read

Most RAG tools treat citations like a footnote: here is the answer, here are some file names, good luck. For a compliance product, that does not fly. Auditors want to know the exact section, paragraph, even the sentence an answer came from. So I built our own citation engine. It tracks where every chunk lives in the source, links each claim back to its evidence after retrieval, and ships a confidence score with the answer. I will share the design choices I made, the dead ends I ran into, and what it really took to get this running in a regulated setup.

RAG  •  Citation Extraction  •  LangChain  •  Production  •  Compliance AI

How I Cut Document Ingestion Latency by 75% for a Production RAG System

March 8, 2025  •  10 min read

Gov Assistant reads thousands of governing and external regulation documents. Our first ingestion pipeline ran one step at a time, and a full re-index took over 110 minutes. Updating a single document? Same wait. I sat down with a profiler and the answer was clear: embedding API calls were blocking everything. I switched to an async client, added smart batching with retries, moved to delta updates so we only re-process what changed, and cleaned up the chunking. Full re-index now takes under 16 minutes. Small updates finish in seconds. Here is how, with the numbers to back it up.

RAG  •  Ingestion Pipeline  •  Async Python  •  Embeddings  •  Performance
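
The delta-update idea mentioned in the summary above (only re-process what changed) can be sketched with content hashes. This is one common way to do it, not necessarily how Gov Assistant does it: the function names `content_hash` and `plan_delta`, and the idea of storing one hash per document ID, are assumptions for illustration.

```python
import hashlib

def content_hash(text: str) -> str:
    # Stable fingerprint of a document's text.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_delta(
    documents: dict[str, str],
    indexed_hashes: dict[str, str],
) -> tuple[list[str], list[str]]:
    """Compare current documents against hashes stored at the last index run.

    Returns (ids to re-process, ids to delete from the index).
    """
    # New documents have no stored hash; changed documents have a different one.
    to_process = sorted(
        doc_id
        for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    # Documents that were indexed before but no longer exist in the source.
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in documents)
    return to_process, to_delete

if __name__ == "__main__":
    previous = {"a": content_hash("alpha"), "b": content_hash("beta")}
    current = {"a": "alpha", "b": "beta v2", "d": "delta"}
    print(plan_delta(current, previous))
```

Only the IDs in the first list need chunking and embedding again, which is why a small update can finish in seconds while a full re-index still takes minutes.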

RAG vs Fine-Tuning for Enterprise Compliance: Lessons Learned in Production

November 10, 2025  •  8 min read

After shipping two GenAI products in BFSI compliance, the question I keep getting is simple: "Should we use RAG or fine-tune?" My honest answer: it depends, and most blog posts make it sound easier than it is. RAG wins when your data keeps changing, when you need real citations, and when you cannot send sensitive data off for training. Fine-tuning wins when you need a steady tone, a fixed output shape, or reasoning that prompts alone cannot pull off. In this post I share the simple checklist I now run through on every project, with real examples from Gov Assistant and Gov Co-Author.

RAG  •  Fine-Tuning  •  LLM Strategy  •  BFSI  •  Architecture