How I'd Build a Production RAG System for Legal Documents
A design essay, not a tutorial. Starting from the constraints a law firm would actually impose — hallucinations as malpractice, jurisdictional context, privilege protection — and working backward to the architecture that survives them.
What this is: a design essay describing how I'd approach a production legal-RAG system from first principles. It's not based on a system I've shipped — it's based on what I learned building QuitTxt, which is another high-stakes domain where confident-but-wrong is the worst failure mode. Read it as architectural reasoning rather than implementation experience.
Most "RAG for legal documents" tutorials get the demo working in 30 minutes by pointing LlamaIndex at a folder of PDFs and calling it done. The demo is impressive until you imagine a first-year associate asking it "can we use this precedent in our 9th Circuit filing next week?" and the system confidently citing a 5th Circuit case from 2018 that was overturned in 2022. That single failure mode is why every demo of legal RAG never ships.
This post is about the architecture I'd build if that scenario were the one thing I had to make impossible. The same principles I used for QuitTxt's clinical RAG apply — refusal as a first-class path, strict grounding, explicit evaluation — but legal adds twists the clinical domain doesn't have.
The constraints that actually matter
Before designing anything, I'd write down what the system has to do and — more importantly — what it has to never do. For legal, the list looks something like this:
Must:
- Cite every factual claim to a specific source (case, statute, regulation, filing).
- Respect jurisdictional boundaries (a 9th Circuit attorney does not want 5th Circuit case law surfaced without it being flagged).
- Respect temporal validity (overturned cases, repealed statutes, and superseded regulations must be visibly marked as stale).
- Preserve attorney-client privilege — internal matter documents must never leak into responses for different clients.
Must never:
- Generate a legal conclusion without traceable authority.
- Present stale or overturned authority as current without a warning.
- Hallucinate a case name or citation (this is not a theoretical failure mode — it has already gotten real lawyers sanctioned in 2023–2024).
- Cross-contaminate between client matters.
Notice how many of these are nevers. In most RAG domains, you're optimizing for recall. In legal, you're optimizing for the absence of specific failure modes. That framing flips a lot of architectural decisions.
Layer 1: corpus hygiene (60% of the work)
The corpus is the system: if the corpus is bad, no amount of prompt engineering will save you. For legal, I'd invest heavily in three corpus-level guarantees before a single embedding gets computed.
Jurisdictional tagging. Every document in the index gets a structured metadata tag for the jurisdiction(s) it applies to — not as free text, but as a controlled vocabulary pulled from a canonical list (US federal circuits, state courts, regulatory bodies). A case in the 9th Circuit gets tagged us.ca9. A California Supreme Court case gets tagged us.ca.supreme. A statute gets tagged with both the enacting jurisdiction and any jurisdictions that have adopted it by reference. This tagging is the filter for every retrieval — you don't search the whole corpus, you search within the jurisdictional scope the user is working in.
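To make the controlled vocabulary concrete, here's a minimal sketch of ingest-time validation. The JURISDICTIONS set is a tiny illustrative subset, not a real canonical list:

```python
# Illustrative subset of a controlled vocabulary. A real system would load
# this from a maintained canonical list of circuits, state courts, and
# regulatory bodies.
JURISDICTIONS = {
    "us.federal", "us.ca2", "us.ca5", "us.ca9",
    "us.ca.supreme", "us.ny.appeals",
}

def validate_jurisdiction_tags(tags: list[str]) -> list[str]:
    """Reject free-text jurisdiction tags at ingest time. A typo like
    'ninth circuit' fails loudly here instead of silently breaking
    retrieval filters later."""
    unknown = [t for t in tags if t not in JURISDICTIONS]
    if unknown:
        raise ValueError(f"unknown jurisdiction tags: {unknown}")
    return tags
```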
Temporal state. Every document gets an effective-date and an optional sunset-date. For case law, there's an additional overturned_by field that points to the overturning case. The retrieval layer treats overturned documents differently: they're not deleted (you might want to cite them historically) but they're never returned as the primary authority for a claim. The prompt tells the LLM: "if you must cite this document, preface the citation with [OVERTURNED by X]."
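Put together, a corpus record might carry this shape (a sketch; field names are my own, not a standard):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class LegalDocument:
    """One corpus record; field names are illustrative."""
    doc_id: int
    matter_id: str                    # privilege partition key
    jurisdictions: list[str]          # controlled-vocabulary codes, e.g. ["us.ca9"]
    effective_date: date
    sunset_date: date | None = None   # None = still in force
    overturned_by: int | None = None  # None = still good law
    body: str = ""

    def is_current(self, as_of: date) -> bool:
        """True iff this document is valid primary authority as of a date."""
        if self.overturned_by is not None:
            return False
        if self.effective_date > as_of:
            return False
        return self.sunset_date is None or self.sunset_date >= as_of
```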
Privilege boundaries. The corpus is physically partitioned by matter. Client A's documents live in a logically separate index from client B's. There is no shared ranker, no shared cache, no shared vocabulary. When a user authenticates, they get a scope — a list of matter IDs they have access to — and every query is executed only against the indexes in that scope. This is the same pattern healthcare uses for HIPAA-compliant search: physical separation is the only guarantee that survives a prompt-injection attack.
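Scope resolution can then be dead simple, because the hard work is in the partitioning. A sketch, assuming a hypothetical acl mapping backed by the firm's identity provider:

```python
def indexes_for_user(user_id: str, acl: dict[str, set[str]]) -> list[str]:
    """Resolve a user's scope to physical per-matter index names.

    `acl` maps user_id -> set of matter IDs (hypothetical; in practice this
    comes from the firm's identity provider). Queries only ever execute
    against the returned indexes, so there is no shared index that a
    prompt-injection attack could widen the search into.
    """
    matters = acl.get(user_id, set())
    return [f"matter_{m}" for m in sorted(matters)]
```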
If I had to rank these by impact: jurisdictional tagging does the most work, temporal state is the most expensive to maintain, and privilege boundaries are the non-negotiable that gates shipping to a real firm at all.
Layer 2: retrieval with hard filters before soft ranking
Standard RAG advice is "retrieve semantically, then re-rank." That's the wrong order for legal. The right order is filter hard, then rank soft.
1. Query arrives with (user_id, jurisdiction, matter_id, question)
2. Hard filter:
- indexes accessible to user_id
- documents tagged with jurisdiction (or adopted_by jurisdiction)
- documents effective as of today
- NOT overturned
→ candidate set, possibly empty
3. If candidate set is empty → REFUSE with explanation
4. Soft rank on candidate set:
- semantic similarity (embedding)
- BM25 lexical match (legal queries have lots of exact phrases)
- citation graph weight (cases that cite the query-relevant statute get a boost)
5. Return top-k with hard metadata attached
The hard filter is non-negotiable and runs before any embedding call. It is literally a WHERE clause in Postgres — fast, deterministic, explainable. If no documents survive the filter, the system refuses the query and tells the user why ("no precedent found in your jurisdiction within current temporal scope"), not just that it refused.
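Under the illustrative schema from the corpus section, that filter might look like the following. The SQL runs once per matter index in the user's scope, which keeps privilege partitioning out of reach of the query text entirely:

```python
# Parameterized SQL (psycopg-style placeholders); table and column names
# follow the illustrative schema sketched earlier, not a standard.
HARD_FILTER_SQL = """
    SELECT id FROM documents
    WHERE %(jurisdiction)s = ANY(jurisdictions)            -- jurisdictional scope
      AND effective_date <= %(as_of)s                      -- already in force
      AND (sunset_date IS NULL OR sunset_date >= %(as_of)s)
      AND overturned_by IS NULL                            -- still good law
"""

def hard_filter(scoped_conns, jurisdiction, as_of):
    """Deterministic candidate selection; runs before any embedding call.

    `scoped_conns` holds one connection per matter index in the user's
    scope, so privilege is enforced by which databases are reachable at
    all, not by a predicate that could be bypassed.
    """
    candidates = []
    for conn in scoped_conns:
        with conn.cursor() as cur:
            cur.execute(HARD_FILTER_SQL,
                        {"jurisdiction": jurisdiction, "as_of": as_of})
            candidates.extend(row[0] for row in cur.fetchall())
    return candidates
```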
The ranker is where I'd experiment. Pure semantic search is weaker than you'd expect in legal — lawyers often know the exact case name they want to verify, and "Miranda v. Arizona" as a query should return Miranda as the #1 result. So I'd use hybrid retrieval: BM25 + dense embeddings + a citation-graph signal (cases that already cite the statutes involved in the query get a boost because they're more likely to be the thing the user wants).
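A sketch of how the three signals could blend over the hard-filtered candidate set. The weights are made up, rank_bm25 stands in for whatever lexical scorer you'd actually use, and citation_boost is a hypothetical precomputed signal:

```python
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def hybrid_rank(query, query_vec, chunks, chunk_vecs, citation_boost, k=5):
    """Blend lexical, semantic, and citation-graph scores. All weights
    are illustrative, not tuned."""
    bm25 = BM25Okapi([c.split() for c in chunks])
    lexical = np.array(bm25.get_scores(query.split()))
    lexical = lexical / (lexical.max() or 1.0)  # normalize to [0, 1]

    # Cosine similarity against precomputed chunk embeddings.
    semantic = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )

    # citation_boost[i]: weight for chunks whose source already cites the
    # statutes implicated by the query (hypothetical precomputed signal).
    score = 0.4 * lexical + 0.4 * semantic + 0.2 * np.array(citation_boost)
    return np.argsort(score)[::-1][:k]  # indices of the top-k chunks
```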
Layer 3: generation with a citation schema
Once retrieval is solid, generation is almost the easy part — but only if you design the prompt around a structured citation contract.
The model is told: "Every factual claim you make must be followed by an inline citation referencing one of the retrieved chunks by its numeric ID. If you cannot ground a claim in a retrieved chunk, either omit the claim or refuse to answer. Do not synthesize citation strings; use the exact chunk IDs provided."
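As a sketch, assembling that contract might look like this. The wording is a paraphrase of the contract above, not a tested prompt:

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Number the retrieved chunks and bind the model to them."""
    sources = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks))
    return (
        "Answer using ONLY the numbered sources below. Every factual claim "
        "must end with an inline citation like [2] naming a source by its "
        "numeric ID. If a claim cannot be grounded in a source, omit it or "
        "refuse to answer. Never synthesize citation strings.\n\n"
        f"SOURCES:\n{sources}\n\nQUESTION: {question}"
    )
```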
The output is then post-processed: every citation in the response is validated against the retrieved chunks. If the model hallucinated a chunk ID (which happens even under strict instructions), that sentence gets flagged. If more than, say, 5% of sentences are flagged, the whole response is rejected and the system refuses with a message like "I was unable to produce a grounded answer to this question."
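The validation pass is mechanical. A minimal sketch, with the 5% threshold as a parameter:

```python
import re

CITATION = re.compile(r"\[(\d+)\]")

def validate_citations(response: str, num_chunks: int,
                       max_flagged: float = 0.05) -> str | None:
    """Flag sentences citing chunk IDs that were never retrieved; reject
    the whole response if too many sentences are flagged."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    flagged = sum(
        1 for s in sentences
        if any(int(i) >= num_chunks for i in CITATION.findall(s))
    )
    if flagged / max(len(sentences), 1) > max_flagged:
        return None  # caller falls through to the refusal path
    return response
```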
On top of this, I'd add a separate "confidence" pass — a cheaper model reads the response and the retrieved chunks and estimates whether the response's claims are actually supported. If the confidence pass disagrees with the main response, the system either falls back to a more conservative output or refuses. This is belt-and-suspenders: two independent checks that catch different failure modes.
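The confidence pass can be as blunt as a yes/no verdict. In this sketch, call_cheap_model is a placeholder for any text-in/text-out client, and the prompt wording is illustrative:

```python
def confidence_check(response: str, chunks: list[str], call_cheap_model) -> bool:
    """Independent grounding check by a second, cheaper model."""
    prompt = (
        "Below are source passages and a drafted answer. Reply SUPPORTED "
        "only if every claim in the answer is backed by the passages; "
        "otherwise reply UNSUPPORTED.\n\n"
        "PASSAGES:\n" + "\n---\n".join(chunks) +
        f"\n\nANSWER:\n{response}"
    )
    return call_cheap_model(prompt).strip().upper().startswith("SUPPORTED")
```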
Layer 4: observable refusal as the safety valve
I wrote about this in the QuitTxt RAG deep-dive and it applies even more here: refusal is a first-class output, not an error.
A legal RAG system that refuses 20% of queries is a better system than one that answers 99% confidently with a 2% hallucination rate. The former creates friction and keeps users honest — they have to look things up when the tool won't. The latter creates a malpractice incident every few thousand queries.
The refusal path has its own UX requirements, sketched in code after the list:
- Explain why. "I couldn't find authority on this specific question within the US-9th-Circuit scope you're working in. I did find related authority in US-2nd-Circuit — would you like me to surface those with a cross-jurisdiction warning?"
- Log every refusal. Refusals are the most informative signal for improving the corpus — they tell you exactly where your jurisdictional or temporal coverage is thin.
- Offer a human escalation path. For a firm, this might be "send this question to your knowledge management librarian." Refusal should feel like a handoff, not a dead end.
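A minimal structured shape for that refusal (field names and reason codes are illustrative):

```python
from dataclasses import dataclass, field
from enum import Enum

class RefusalReason(Enum):
    NO_AUTHORITY_IN_SCOPE = "no_authority_in_scope"
    STALE_AUTHORITY_ONLY = "stale_authority_only"
    UNGROUNDED_GENERATION = "ungrounded_generation"

@dataclass
class Refusal:
    """First-class refusal output, not an error."""
    reason: RefusalReason
    explanation: str                  # tells the user *why*, in plain language
    related_scopes: list[str] = field(default_factory=list)  # e.g. ["us.ca2"]
    escalation: str = "knowledge-management librarian"        # the handoff

def log_refusal(logger, refusal: Refusal) -> None:
    # Refusals are the best corpus-gap signal, so log them structured.
    logger.info("refusal", extra={
        "reason": refusal.reason.value,
        "related_scopes": refusal.related_scopes,
    })
```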
Layer 5: evaluation against adversarial queries
The eval set for a legal RAG system cannot just be "question, expected answer" pairs. It has to include:
- Out-of-jurisdiction traps. Queries designed to be answerable in one jurisdiction but asked from a different one. The system must refuse or cross-flag.
- Temporal traps. Queries where the relevant authority has been overturned. The system must either refuse or prepend the [OVERTURNED] warning.
- Fabrication traps. Queries designed to tempt the model into inventing a case name or citation. Success is a refusal; failure is any fabricated citation.
- Privilege bleed traps. Queries designed to probe whether one client's data leaks into another client's responses. Any leak is a critical failure.
- The normal happy-path eval. Just so you know the system still does the thing it's supposed to do.
I'd want 80%+ recall on the happy-path eval, 99%+ refusal on the fabrication traps, and zero tolerance on the privilege bleed traps — a single leak on a 500-query eval is a ship-blocker, not a regression.
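Those thresholds are easiest to keep honest as an explicit release gate. A sketch, with result keys named for the trap categories above:

```python
def gate_release(results: dict[str, float]) -> bool:
    """Ship/no-ship decision over the adversarial eval run. Key names are
    illustrative; the thresholds mirror the targets stated above."""
    return (
        results["happy_path_recall"] >= 0.80
        and results["fabrication_refusal_rate"] >= 0.99
        and results["privilege_leaks"] == 0  # zero tolerance, ship-blocker
    )
```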
What I wouldn't do
I wouldn't use LangChain for this. The abstractions are too leaky for a domain where you need exact control over the retrieval-filter, prompt, and citation-validation pipeline. I'd write it as a straightforward FastAPI service with Postgres + pgvector for retrieval, a structured schema for chunks and metadata, and explicit Python code for each step. More code to own, but every failure is traceable.
I wouldn't rely on a single model. Frontier models get better and worse in ways that affect legal response quality unpredictably. The generation path should accept any compatible model and the eval set should gate upgrades — no new model goes live until it clears the refusal and fabrication-trap thresholds.
I wouldn't ship without a "red team" period. Before any real lawyer touches it, I'd spend two weeks adversarially probing the system with the fabrication and privilege-bleed traps. Every failure I find becomes a new test in the eval set. This is non-negotiable — it's cheaper than a malpractice incident.
The meta-lesson
The architecture here isn't exotic. It's the same RAG pattern you'd use for any other domain — hard filter, semantic + lexical ranking, strict grounding, refusal-as-feature, adversarial eval. What changes is the threshold for acceptable failure, and that threshold changes the design decisions.
Clinical RAG: "don't hallucinate medical advice" → 0.72 cosine threshold, strict refusal, explicit citation-only generation.
Legal RAG: "don't cite nonexistent cases, don't leak privilege, don't miss jurisdiction" → hard filters before retrieval, citation schema validation, partitioned indexes, 99%+ fabrication refusal rate.
Same shape, different severity. The useful skill isn't knowing the pattern — it's knowing how to tune the thresholds when the cost of a false positive becomes malpractice.
If you're building a RAG system for a high-stakes domain (legal, medical, finance, safety-critical) and want a second set of eyes on your architecture, reach out via the about page.