Building a Citation Engine from Scratch

Why "Source: document.pdf" Just Doesn't Cut It

Most RAG tools handle citations like a chore. You ask a question, get an answer, and at the bottom there is a list of file names the system looked at. For a casual chatbot, that is fine. For a compliance system where someone has to back up every claim with the actual rule? It is useless.

The people using Gov Assistant are compliance officers and auditors. When the system says "KYC documents must be kept for 7 years," they need to know exactly where that comes from. Not just "KYC.pdf": they want Section 4.2, paragraph 3, sentence 2. If they cannot check the source in under 10 seconds, they do not trust the answer. And if they do not trust it, they stop using it.

What I Had to Build

The needs were straightforward:

Every claim must point to a specific spot in a real document.
Citations must land at the sentence, not the file.
Each one needs a confidence score: how sure are we?
It has to handle answers that pull from more than one source.

The Design: Three Layers

Layer 1: Track Where Every Chunk Lives

Good citations start with knowing exactly where each piece of text sits in the original document. So during ingestion, we do not just save the chunk text and embedding. We save the location too.

{
  "chunk_text": "Customer KYC documents must be retained for...",
  "source_doc": "KYC.pdf",
  "section": "4.2 - Document Retention",
  "page": 12,
  "paragraph": 3,
  "char_start": 1847,
  "char_end": 2103,
  "heading_path": ["Chapter 4", "4.2 Document Retention", "4.2.1 Timeframes"]
}

When a chunk comes back from search, we already know where it lives. No second lookup, no fuzzy matching. The index gets a bit bigger, but for compliance work that trade is easy.
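Here is a minimal sketch of how that record could be used at answer time, assuming the character offsets index into the extracted document text. The ChunkLocation class and both helpers are hypothetical names for illustration, not the production code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkLocation:
    """Location metadata stored with each chunk (fields mirror the record above)."""
    source_doc: str
    section: str
    page: int
    paragraph: int
    char_start: int
    char_end: int


def format_citation(loc: ChunkLocation) -> str:
    """Render a short human-readable citation (hypothetical helper)."""
    return f"{loc.source_doc}, Section {loc.section}, p. {loc.page}, para. {loc.paragraph}"


def extract_span(document_text: str, loc: ChunkLocation) -> str:
    """Recover the exact source text from the stored offsets, with no second search."""
    return document_text[loc.char_start:loc.char_end]


loc = ChunkLocation("KYC.pdf", "4.2 - Document Retention", 12, 3, 1847, 2103)
print(format_citation(loc))
# KYC.pdf, Section 4.2 - Document Retention, p. 12, para. 3
```

Because the offsets travel with the chunk, highlighting the cited passage in a viewer is a plain string slice rather than another retrieval call.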

Layer 2: Map Claims to Sources After the Answer Is Written

This is where it gets interesting. Once retrieval gives us the relevant chunks and the model writes the answer, we still have to figure out which chunk supports which claim. That is the attribution step, and I tried three things before one stuck.

Try 1: Just ask the model to cite itself. Add an instruction like "cite your sources with [1], [2]." It worked about 60% of the time. The other 40%, the model would invent a citation, point at the wrong chunk, or skip them on anything it had to combine. Not safe for production.

Try 2: Match on similarity. Split the answer into sentences, then find the closest chunk for each one using cosine similarity on embeddings. Better: about 80%. But it tripped over paraphrasing and broke completely when an answer pulled from more than one source.

Try 3 (the one that worked): two passes with a checker. The model writes the answer with rough source markers. Then a smaller, dedicated model checks each claim against its source using natural language inference. If the score is too low, the citation is flagged as unsure rather than shown as fact.

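# Simplified attribution pipeline.
# split_into_claims, rank_chunks_by_relevance, and verify_entailment are
# project helpers that are not shown here.
SUPPORT_THRESHOLD = 0.7  # below this, the claim is shown as not supported

def attribute_answer(answer, retrieved_chunks):
    # Step 1: Split answer into atomic claims
    claims = split_into_claims(answer)

    # Step 2: For each claim, find supporting chunks
    attributed_claims = []
    for claim in claims:
        candidates = rank_chunks_by_relevance(claim, retrieved_chunks)
        if not candidates:
            # No candidate chunk at all: record the claim as unsupported
            attributed_claims.append({
                "claim": claim,
                "source": None,
                "confidence": 0.0,
                "supported": False
            })
            continue
        best_match = candidates[0]

        # Step 3: Verify entailment
        score = verify_entailment(claim, best_match.text)

        attributed_claims.append({
            "claim": claim,
            "source": best_match.metadata,
            "confidence": score,
            "supported": score >= SUPPORT_THRESHOLD
        })

    return attributed_claims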
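
The ranking step is not spelled out above. One way it could look, reusing the cosine similarity idea from Try 2, is sketched below. It assumes the claim and the chunks have already been embedded by some model; rank_by_cosine and its arguments are hypothetical names, and this is an illustration rather than the production ranking code.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_cosine(
    claim_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (chunk_index, similarity) pairs, most similar chunk first."""
    scored = [(i, cosine_similarity(claim_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-d vectors standing in for real embeddings.
ranked = rank_by_cosine([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print([index for index, _ in ranked])
# [1, 2, 0]
```

Similarity only picks the candidate. The entailment score, not the cosine score, is what decides whether the citation is shown.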

Layer 3: Show the Confidence, Don't Hide It

The last piece was making citations actually useful at a glance. We landed on inline citations with a clear signal of how sure we are:

High (above 0.9): green dot, you can trust it.
Medium (0.7 to 0.9): amber dot, take a quick look before you rely on it.
Low (below 0.7): no citation. The claim is marked as not supported.

Being open about uncertainty is what won people over. Users learned fast that green was solid and amber meant "verify." The system became trustworthy because it was honest when it did not know.
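
The tiers above boil down to a small mapping from score to display state. A minimal sketch follows; the function name and the tier labels are hypothetical, and treating a score of exactly 0.7 or 0.9 as medium is my reading of the ranges, not something the UI spec states.

```python
def confidence_tier(score: float) -> str:
    """Map an entailment score to the display tier used for a citation."""
    if score > 0.9:
        return "high"         # green dot
    if score >= 0.7:
        return "medium"       # amber dot, verify before relying on it
    return "unsupported"      # no citation shown, claim marked as not supported


for s in (0.95, 0.8, 0.4):
    print(s, confidence_tier(s))
```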

Where It Broke (and How I Fixed It)

Answers built from many sources

What do you cite when one sentence pulls from three documents? "Retention is 7 years for KYC (Policy A), 5 for transactions (Policy B), and both must be encrypted (Standard C)." That is one sentence with three sources, so it needs sub-sentence precision.

The fix was breaking each compound sentence into smaller, single claims before attribution. Each one then gets its own citation.
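
To make the shape of that concrete, here is a sketch of what the output could look like once the example sentence has been decomposed. The three claims are split by hand here, since the splitter itself is not shown; AttributedClaim and render_with_citations are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AttributedClaim:
    text: str    # one single claim
    source: str  # the document that supports it


def render_with_citations(claims: List[AttributedClaim]) -> Tuple[str, List[str]]:
    """Rejoin single claims into one sentence, with a numbered marker after each claim."""
    sources: List[str] = []
    parts: List[str] = []
    for claim in claims:
        if claim.source not in sources:
            sources.append(claim.source)
        marker = sources.index(claim.source) + 1
        parts.append(f"{claim.text} [{marker}]")
    body = "; ".join(parts) + "."
    footnotes = [f"[{i}] {s}" for i, s in enumerate(sources, start=1)]
    return body, footnotes


claims = [
    AttributedClaim("Retention is 7 years for KYC", "Policy A"),
    AttributedClaim("retention is 5 years for transactions", "Policy B"),
    AttributedClaim("both must be encrypted", "Standard C"),
]
body, footnotes = render_with_citations(claims)
print(body)
# Retention is 7 years for KYC [1]; retention is 5 years for transactions [2]; both must be encrypted [3].
```

Each marker sits next to the claim it supports, so a reader can verify one part of the sentence without re-reading all three sources.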

The paraphrase gap

The model rarely copies text word for word. "Documents shall be retained for a period not less than seven (7) years" comes back as "KYC documents must be kept for 7 years." Similarity matching gets you in the right neighborhood. The entailment check is what confirms the claim is actually saying the same thing and not quietly inventing details.

Right section, wrong paragraph

Early on we would cite the correct section but the wrong line inside it. "Section 4" of a 50-page policy is not really a citation. We fixed this by saving finer location metadata during ingestion and using smaller chunks (150 to 200 tokens) for compliance documents specifically.
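
A rough sketch of small-chunk ingestion that keeps character offsets is below. It counts whitespace-separated words as a stand-in for tokens, which is an assumption: a real tokenizer would count differently, and a production chunker would likely also respect paragraph and section boundaries. The function name is hypothetical.

```python
import re
from typing import Dict, List, Union


def chunk_with_offsets(text: str, max_tokens: int = 200) -> List[Dict[str, Union[str, int]]]:
    """Split text into chunks of at most max_tokens words, recording character offsets."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    # Each match knows where it starts and ends in the original text.
    tokens = list(re.finditer(r"\S+", text))
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        window = tokens[i:i + max_tokens]
        start, end = window[0].start(), window[-1].end()
        chunks.append({
            "chunk_text": text[start:end],
            "char_start": start,
            "char_end": end,
        })
    return chunks


for chunk in chunk_with_offsets("Documents shall be retained for seven years.", max_tokens=3):
    print(chunk)
```

The offsets always slice back to the exact chunk text, which is what lets a citation point at a specific passage instead of a whole section.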

Running It in Production

The two-pass approach adds time: about 1 to 3 extra seconds per answer. For compliance work that is fine; accuracy beats speed every day. We also cache the attribution when the same question hits against the same source version, so repeats are basically free.
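
The caching idea can be sketched as a lookup keyed on the question plus the source version. This is a minimal in-memory illustration with hypothetical names; normalizing the question by case and whitespace is an assumption, and a real deployment would more likely sit on a shared cache with eviction.

```python
import hashlib
from typing import Any, Callable, Dict


class AttributionCache:
    """In-memory cache of attribution results, keyed by question and source version."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    @staticmethod
    def make_key(question: str, source_version: str) -> str:
        # Assumption: questions that differ only in case or spacing are the same question.
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(f"{source_version}\n{normalized}".encode("utf-8")).hexdigest()

    def get_or_compute(self, question: str, source_version: str,
                       compute: Callable[[], Any]) -> Any:
        """Return the cached result, running the expensive attribution only on a miss."""
        key = self.make_key(question, source_version)
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]
```

Putting the source version in the key means a policy update invalidates old attributions automatically: the new version simply produces a new key.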

The entailment model runs on its own GPU. We use a fine-tuned DeBERTa for NLI: small, cheap to run, sharp enough for our domain. The fine-tuning data was 2,000 claim-source pairs labeled by our own compliance team.

What Changed After Shipping

Once the citation engine was live:

Citation accuracy: 94% of high-confidence citations checked out, up from around 60% with self-citation.
Trust: auditors moved from "I double-check everything" to "I only check the amber ones."
Audit time: verifying an AI-written answer dropped from 5 minutes to under 30 seconds on average.

The takeaway: in a regulated setting, the citation often matters more than the answer. Getting it right is what turned our system from a fun demo into something auditors actually use every day.
