Where We Started: 20 Minutes for One Document

When Gov Co-Author first went live, one 200-page compliance document took almost 20 minutes to generate. If you have ever waited for a screen to do something, you know how long that feels. People would give up, hit refresh, and start again, which only made the queue longer and the bug reports angrier.

The first version was simple. Take a template with about 40 sections. Ask GPT-4o to write each one, one after the other. Stitch them together. Make a PDF. It worked. It was just slow.

Why It Was Slow

Each section needed its own call to the model. The prompt had to carry the right rule, the policy we were updating, links to other sections, and our formatting notes. None of that was small.

One GPT-4o call took 20 to 45 seconds. The math is not kind:

40 sections × 30 seconds each = 20 minutes
+ Fetching templates       = ~1 minute
+ Building the PDF         = ~30 seconds
Total                      = ~21.5 minutes

The model calls were 93% of the wait. Everything else barely mattered.
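
To make that budget easy to re-run with other numbers, here is a small sketch of the same arithmetic. The constants are the figures quoted above; the function name `latency_budget` is only illustrative and not part of the real pipeline.

```python
# Rough latency budget for the strictly sequential pipeline.
SECTIONS = 40
SECONDS_PER_CALL = 30      # a representative value from the 20 to 45 second range
TEMPLATE_FETCH_S = 60
PDF_BUILD_S = 30


def latency_budget(sections, per_call_s, fetch_s, pdf_s):
    """Return (total_seconds, model_share) when every call runs one after another."""
    model_s = sections * per_call_s
    total_s = model_s + fetch_s + pdf_s
    return total_s, model_s / total_s


total_s, share = latency_budget(SECTIONS, SECONDS_PER_CALL, TEMPLATE_FETCH_S, PDF_BUILD_S)
print(f"total: {total_s / 60:.1f} min, model share: {share:.0%}")
# prints "total: 21.5 min, model share: 93%"
```

With the model at 93% of the wall clock, shaving the template fetch or the PDF step to zero would save at most 90 seconds.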

What I Changed

1. Write Sections in Parallel

The first thing I noticed: most sections do not need each other. "Data Retention" does not have to wait for "Access Control" to finish. They look at different rules, pull different context, and end up in different places in the document.

A few sections do reference earlier ones, so I drew a small dependency map. Sections that did not depend on anyone could run together. Instead of 40 calls in a line, I now had about 6 waves, with the sections inside each wave running side by side.

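# Dependency-aware parallel generation
import asyncio

async def generate_document(template, context):
    # Each wave holds sections whose dependencies were generated in earlier waves.
    waves = build_dependency_waves(template.sections)
    generated = {}

    for wave in waves:
        # Start every section in the wave at once; later waves can read `generated`.
        tasks = [
            generate_section(section, context, generated)
            for section in wave
        ]
        # gather() returns results in the same order as `tasks`, so zip() lines up.
        results = await asyncio.gather(*tasks)
        for section, result in zip(wave, results):
            generated[section.id] = result

    return assemble_document(generated)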

That one change took us from 20 minutes to about 5. A 4x win, and the only thing we did was stop waiting in line.
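
The wave-building step is not shown in this post, so here is a minimal sketch of how a function like `build_dependency_waves` could work. It assumes each section carries an `id` and a `depends_on` tuple of other section ids; the `Section` dataclass and its field names are my illustration, not the real template model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    # Hypothetical shape: the real template sections may look different.
    id: str
    depends_on: tuple = ()


def build_dependency_waves(sections):
    """Group sections into waves; each wave depends only on earlier waves."""
    remaining = {s.id: s for s in sections}
    done = set()
    waves = []
    while remaining:
        # A section is ready once everything it depends on has been placed.
        ready = [s for s in remaining.values()
                 if all(dep in done for dep in s.depends_on)]
        if not ready:
            # Nothing can make progress: a cycle or a dependency that does not exist.
            raise ValueError(f"circular or missing dependency among: {sorted(remaining)}")
        waves.append(ready)
        for s in ready:
            done.add(s.id)
            del remaining[s.id]
    return waves


sections = [
    Section("scope"),
    Section("access_control"),
    Section("data_retention"),
    Section("summary", depends_on=("access_control", "data_retention")),
]
wave_ids = [[s.id for s in wave] for wave in build_dependency_waves(sections)]
print(wave_ids)
# prints [['scope', 'access_control', 'data_retention'], ['summary']]
```

The number of waves equals the length of the longest dependency chain, which is why a mostly flat template collapses into only a handful of waves.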

2. Stream the Output, Don't Wait for It

Even with parallel sections, every call still sat there for 20 to 45 seconds before giving us the full answer. The model is sending tokens the whole time; we just were not listening until the end.

So I switched to streaming. The moment a section starts producing text, we read it. If a later section only needs the title or first paragraph from an earlier one, it can start almost right away instead of waiting for the full thing.

This took another 40% off the clock, mostly by killing the silent waits in dependency chains.
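
The early-dependency idea can be sketched with an `asyncio.Event` that fires as soon as the first paragraph of a section has arrived. Everything named here is an assumption for illustration: `fake_stream` stands in for a real streaming model call, and `StreamingSection` is not a class from our codebase.

```python
import asyncio


class StreamingSection:
    """Collects streamed chunks and signals when the first paragraph is complete."""

    def __init__(self, section_id):
        self.id = section_id
        self.chunks = []
        self.intro = ""
        self.intro_ready = asyncio.Event()

    def feed(self, chunk):
        self.chunks.append(chunk)
        text = "".join(self.chunks)
        # A blank line marks the end of the first paragraph.
        if not self.intro_ready.is_set() and "\n\n" in text:
            self.intro = text.split("\n\n", 1)[0]
            self.intro_ready.set()

    def finish(self):
        # Short sections may never contain a blank line; release waiters anyway.
        if not self.intro_ready.is_set():
            self.intro = "".join(self.chunks)
            self.intro_ready.set()
        return "".join(self.chunks)


async def fake_stream(section_id, hint):
    """Stand-in for a streaming model call (hypothetical); yields text chunks."""
    for chunk in (f"{section_id} title\n\n", f"body of {section_id}", f" (saw: {hint})"):
        await asyncio.sleep(0.01)
        yield chunk


async def generate_streaming(state, upstream=None):
    hint = ""
    if upstream is not None:
        # Wait only for the upstream intro, not for the whole upstream section.
        await upstream.intro_ready.wait()
        hint = upstream.intro
    async for chunk in fake_stream(state.id, hint):
        state.feed(chunk)
    return state.finish()


async def main():
    first = StreamingSection("access_control")
    second = StreamingSection("audit_logging")
    return await asyncio.gather(
        generate_streaming(first),
        generate_streaming(second, upstream=first),
    )


results = asyncio.run(main())
```

In this sketch the dependent section starts after the first chunk of its upstream section, so the two calls overlap for most of their duration instead of running back to back.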

3. Cache the Bits That Never Change

This one surprised me. Every time we generated a document, we were pulling the same template, the same regulatory references, and the same formatting rules from the knowledge base. Multiple lookups, per section, every single time.

I added a small cache for the parts of the prompt that do not really move. The rules do not change between two runs an hour apart. Only the user's content needs to be fresh. The rest can sit in memory.

# Cache structure per template
cache = {
    "template_v3.2": {
        "section_prompts": {
            "access_control": {
                "system_prompt": "...(cached, rarely changes)...",
                "regulatory_context": "...(cached, changes monthly)...",
                "format_rules": "...(cached, changes quarterly)...",
            }
        },
        "last_refreshed": "2026-03-14T10:00:00Z",
        "ttl_hours": 24
    }
}

That cut about 15 lookups per run and brought the per-section overhead from a few seconds down to under 200 milliseconds.
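
A time-to-live cache in front of the knowledge base is enough to get this behaviour. The sketch below is one possible shape, not our production code: `PromptCache`, the `loader` callback, and the injectable `clock` are names I made up for the example.

```python
import time


class PromptCache:
    """In-memory cache that reloads an entry once it is older than the TTL."""

    def __init__(self, ttl_seconds, loader, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._loader = loader          # called on a miss, e.g. a knowledge-base lookup
        self._clock = clock            # injectable so the expiry logic can be tested
        self._entries = {}             # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value


# Usage with a fake clock and a loader that counts how often it is hit.
lookups = []
fake_now = [0.0]


def load_from_knowledge_base(key):
    lookups.append(key)
    return f"prompt parts for {key}"


cache = PromptCache(ttl_seconds=24 * 3600, loader=load_from_knowledge_base,
                    clock=lambda: fake_now[0])
cache.get("access_control")
cache.get("access_control")          # served from memory
fake_now[0] = 25 * 3600              # a day later the entry has expired
cache.get("access_control")          # reloaded
print(len(lookups))
# prints 2
```

Only the slow-moving parts of the prompt belong in a cache like this; the user's own content should still be read fresh on every run.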

Then Came GPT-5.1

Everything above was done on GPT-4o. Together, those changes got us from 20 minutes to about 2.5. Good, but not where we wanted to be.

GPT-5.1 made two big differences for us:

  • It is faster. On Azure, it produced tokens roughly twice as fast as GPT-4o for our prompts.
  • It listens better. We could trim our prompts down. Shorter input means less to process, which means less waiting.

On its own, the new model would have given us maybe a 2x boost. Stacked on top of the parallel generation, streaming, and caching work, the gains piled up. Total: under 2 minutes.

The Numbers

Optimization                   Before     After      Improvement
Sequential → Parallel waves    ~20 min    ~5 min     4x
+ Streaming with early deps    ~5 min     ~3 min     1.7x
+ Template caching             ~3 min     ~2.5 min   1.2x
+ GPT-5.1 upgrade              ~2.5 min   ~1.8 min   1.4x

Total: 20 minutes down to 1.8 minutes. About 11 times faster.
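
The per-stage gains multiply rather than add, which is easy to check. This is just the arithmetic from the figures above, written out as a short sketch.

```python
import math

# (stage, minutes before, minutes after), using the rounded figures from this post.
stages = [
    ("parallel waves", 20.0, 5.0),
    ("streaming with early deps", 5.0, 3.0),
    ("template caching", 3.0, 2.5),
    ("GPT-5.1 upgrade", 2.5, 1.8),
]

speedups = [(name, before / after) for name, before, after in stages]
overall = math.prod(factor for _, factor in speedups)

for name, factor in speedups:
    print(f"{name}: {factor:.1f}x")
print(f"overall: {overall:.1f}x")
```

Because each stage starts where the previous one ended, the product of the four factors is the same as 20 divided by 1.8.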

The Hard Bits

Keeping sections in sync

When sections run in parallel, none of them can see the others. Section 12 might call something one name; Section 3 calls it another. I added a quick check at the end that looks for clashing terms, mismatched definitions, and broken references. It costs about 10 seconds and has saved us a lot of embarrassment.
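
A check of this kind can be quite simple. The sketch below covers two of the three cases, broken references and mismatched definitions, and it rests on assumptions: the `[[ref:...]]` marker and the `"Term" means ...` phrasing are hypothetical conventions chosen for the example, not the format our documents actually use.

```python
import re
from collections import defaultdict

REF_PATTERN = re.compile(r"\[\[ref:([a-z_]+)\]\]")       # hypothetical cross-reference marker
DEF_PATTERN = re.compile(r'"([^"]+)" means ([^.]+)\.')    # hypothetical definition phrasing


def check_consistency(generated):
    """Return a list of human-readable issues found across all generated sections."""
    issues = []
    definitions = defaultdict(dict)   # term -> {definition text: first section using it}
    for section_id, text in generated.items():
        for target in REF_PATTERN.findall(text):
            if target not in generated:
                issues.append(f"{section_id}: broken reference to '{target}'")
        for term, meaning in DEF_PATTERN.findall(text):
            definitions[term.lower()].setdefault(meaning.strip().lower(), section_id)
    for term, meanings in definitions.items():
        if len(meanings) > 1:
            where = ", ".join(sorted(meanings.values()))
            issues.append(f"'{term}' is defined differently in: {where}")
    return issues


generated = {
    "access_control": 'See [[ref:data_retention]]. "Personal data" means any information about a person.',
    "data_retention": '"Personal data" means records held for customers. See [[ref:audit_trail]].',
}
issues = check_consistency(generated)
for issue in issues:
    print(issue)
```

On the sample input this reports the dangling `audit_trail` reference and the two competing definitions of "personal data". Catching sections that use different names for the same thing needs more than regular expressions, so that part is left out here.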

When one section breaks mid-stream

You cannot just retry the whole document because one piece failed. I added per-section retries with backoff. If a section still fails after three tries, the document goes out with a placeholder and a note for the user telling them which part needs a second look.
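
Here is a minimal sketch of that retry policy, assuming the section generator is an async callable that raises on failure. The names `generate_with_retry` and `PLACEHOLDER`, the delay values, and the jitter are illustrative choices, not the exact ones we shipped.

```python
import asyncio
import random

PLACEHOLDER = "[Section could not be generated. Please review: {section_id}]"


async def generate_with_retry(section_id, generate, max_attempts=3,
                              base_delay=1.0, sleep=asyncio.sleep):
    """Return (text, note). `note` is None on success, else a message for the user."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await generate(section_id), None
        except Exception as exc:
            if attempt == max_attempts:
                # Give up on this section only; the rest of the document still ships.
                return PLACEHOLDER.format(section_id=section_id), f"{section_id}: {exc}"
            # Exponential backoff (1x, 2x, 4x the base delay) plus a little jitter.
            delay = base_delay * 2 ** (attempt - 1)
            await sleep(delay + random.uniform(0, delay / 2))
```

Each section in a wave would be wrapped in this call before being handed to `asyncio.gather`, so one bad stream costs a few seconds instead of the whole document. The `sleep` parameter exists only so the backoff can be skipped in tests.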

Staying inside the rate limit

Eight sections in flight means eight calls eating from the same token bucket. I had to track tokens across every running call and slow new ones down when we got close to the cap. Without this we would hit 429s and lose more time than we gained.
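
One way to do that tracking is a shared sliding-window budget that every call must reserve from before it starts. This is a sketch under assumptions: the class name `TokenBudget` is made up, the per-call token figure is an estimate made before sending, and the real service's accounting may differ from a simple 60-second window.

```python
import asyncio
import time
from collections import deque


class TokenBudget:
    """Shared tokens-per-window budget for all in-flight model calls."""

    def __init__(self, tokens_per_window, window_s=60.0,
                 clock=time.monotonic, sleep=asyncio.sleep):
        self._limit = tokens_per_window
        self._window = window_s
        self._clock = clock
        self._sleep = sleep
        self._spent = deque()          # (timestamp, tokens), oldest first
        self._lock = asyncio.Lock()    # create the budget inside a running event loop

    def _used(self, now):
        # Drop reservations that have aged out of the window, then total the rest.
        while self._spent and now - self._spent[0][0] >= self._window:
            self._spent.popleft()
        return sum(tokens for _, tokens in self._spent)

    async def acquire(self, estimated_tokens):
        if estimated_tokens > self._limit:
            raise ValueError("request is larger than the whole budget")
        async with self._lock:         # waiters queue up behind each other
            while True:
                now = self._clock()
                if self._used(now) + estimated_tokens <= self._limit:
                    self._spent.append((now, estimated_tokens))
                    return
                # Sleep until the oldest reservation leaves the window, then re-check.
                await self._sleep(self._window - (now - self._spent[0][0]))
```

Each section would call `await budget.acquire(estimate)` right before its model call. A limiter like this only reduces how often 429s happen; the per-section retry still has to handle the ones that get through.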

What I Would Do Differently

If I were starting over today, I would build for parallel from day one. Going sequential first is easier to reason about, but bolting parallelism on later is always more painful than just starting with it.

I would also add proper timing logs from the very beginning. I lost real days adding instrumentation mid-project. "It feels slow" tells you nothing when you have five possible fixes and need to pick one.

Last lesson, and the one I keep coming back to: the model upgrade was the cheapest win we got. Weeks of architecture work, and then a one-line config change to GPT-5.1 gave us another 1.4x. Always check whether a newer model fixes your problem before you go and build a whole new pipeline.