Production AI pipeline for legal document analysis

April 2026 · Active · 6 min read

Compressing days of credit-analyst review on 300-page agreements into a 20-minute pipeline — and the research loop that keeps it state-of-the-art

Input 200–400 pages · Runtime ~20 min · Analyst baseline 1–3 days

TL;DR: We turn 300-page financial contracts into structured covenant data in ~20 minutes. A human credit analyst takes 1–3 days. The pipeline is multi-agent and stochastic, so the interesting work isn't the pipeline itself — it's the research loop that keeps it state-of-the-art: every run leaves Logfire spans + a local JSONL trace, and a Claude sub-agent reads them, walks the trace upstream-first, files GitHub issues with span IDs attached, and proposes the next experiment. Production stays deterministic; research stays expressive.

Stack: Python · FastAPI · Claude / OpenAI · Logfire + OpenTelemetry · Pydantic · PostgreSQL · Claude Code sub-agents

Research loop architecture

Context — SOTA AI for legal-document analysis

We build AI that reads complex financial contracts — the long-form legal agreements that govern corporate debt — and emits the structured data a credit analyst actually needs to make a decision: financial constraints, conditional clauses, numeric thresholds, cross-referenced definitions. Documents routinely run 200–400 pages of dense legal prose with internal references that resolve only by walking the document tree.

What used to be 1–3 days of senior-analyst review per agreement is a 20-minute pipeline run. That's the stake. It's also why correctness matters more than throughput: the wrong number on a leverage ratio or basket size isn't a typo, it's a mispriced trade.

The core pipeline is a multi-agent DAG — section discovery → parallel classifier agents → iterative extractor → reconciliation → cross-reference validation, with optional retrieval against external data sources. The hard part is staying ahead of the frontier: every model release shifts the cost / correctness frontier, every prompt edit creates regression risk somewhere across the dozen stages. SOTA isn't a release, it's a rate of improvement — and the rate is set by the research loop.
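
For orientation, the stage graph reads roughly like this as plain data; the stage names paraphrase the prose above and are not the production identifiers:

# Simplified dependency list for the document-analysis DAG described above (illustrative only).
PIPELINE_STAGES = [
    ("section_discovery", []),
    ("classifier_agents", ["section_discovery"]),           # fan out in parallel per section
    ("iterative_extractor", ["classifier_agents"]),
    ("reconciliation", ["iterative_extractor"]),
    ("cross_reference_validation", ["reconciliation"]),
    ("external_retrieval", ["classifier_agents"]),           # optional, hits external data sources
]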

Our north-star KPI is a single ratio:

(correctness × coverage) / (time × dollars)

Every change must improve this ratio. A prompt that boosts correctness 4% but doubles cost is a regression unless we accept the trade explicitly. A change that flattens cost but loses 2% coverage is a regression too. Optimizing one number is easy. Optimizing the joint ratio is the whole job.
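
As a sanity check, the gate is literally this ratio computed for baseline and candidate. The numbers below are invented to mirror the "correctness up, cost doubled" example:

def north_star(correctness: float, coverage: float, minutes: float, dollars: float) -> float:
    # (correctness x coverage) / (time x dollars): higher is better
    return (correctness * coverage) / (minutes * dollars)

baseline  = north_star(correctness=0.91, coverage=0.88, minutes=20, dollars=14.0)
candidate = north_star(correctness=0.95, coverage=0.88, minutes=20, dollars=28.0)   # +4 pts correctness, 2x cost
assert candidate < baseline   # a regression under the joint ratio unless we accept the trade explicitly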

The problem — SOTA is a rate, not a release

A single document run fans out into hundreds of LLM calls across a dozen stages. Regressions hide as +3% cost here, –2% coverage there — by the time you spot them in aggregate metrics, you've shipped two more changes on top. Multiply that by frequent model releases, prompt edits, and corpus additions, and the only way to stay state-of-the-art is to make every change reviewable in minutes, not days.

What we needed wasn't a dashboard. We needed every experiment to leave enough trace artifacts that a reviewer — human or agent — can attribute the delta, and that review to be cheap enough to run after every meaningful change.

Instrument every LLM call, exactly once

The first thing the pipeline does on startup is wrap the LLM SDKs:

anthropic_client = wrap_anthropic_client(Anthropic(...))
openai_client = wrap_openai_client(OpenAI(...))

Those wrappers emit a canonical llm_call span with the same shape regardless of provider:

  • model, prompt hash, cached + completion tokens, dollar cost
  • input summary (message count, system-prompt snippet, user-message preview)
  • output summary (text preview, tool calls, finish reason)
  • OpenTelemetry GenAI semantic-convention attributes for the UI to render token icons
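
A minimal sketch of the wrapping idea, using Logfire's span API; the monkey-patch and the attribute names are illustrative, not the production wrapper:

import logfire
from anthropic import Anthropic

def wrap_anthropic_client(client: Anthropic) -> Anthropic:
    original_create = client.messages.create

    def create(**kwargs):
        # One canonical 'llm_call' span per provider call, regardless of SDK.
        with logfire.span("llm_call", model=kwargs.get("model")) as span:
            response = original_create(**kwargs)
            span.set_attribute("usage.input_tokens", response.usage.input_tokens)
            span.set_attribute("usage.output_tokens", response.usage.output_tokens)
            span.set_attribute("finish_reason", response.stop_reason)
            return response

    client.messages.create = create
    return client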

Logfire's auto-instrumentation still runs, but its spans nest underneath our wrapper as detail; they don't compete with it. Cost math reads only spans where name = 'llm_call'. No double counting, no rollup join.

This is the load-bearing detail. If you let two different layers both claim authorship over "this is an LLM call," your aggregate cost is silently wrong forever.

Dual export — same data, two readers

Every run writes to two sinks:

  • Logfire cloud — the analyst UI. Humans drill into nested spans, search by attribute, build SQL views.
  • Local OTLP JSONL, written to output/traces/<run_id>/spans.jsonl. Plain newline-delimited JSON, no service required.

The local file is the agent-consumable artifact. Sub-agents read it directly with jq or a Python query API; no API key, no network, no rate limit. It's also what we commit alongside experiments for offline replay months later, after the cloud retention window has rolled off.
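
For example, a few lines of Python are enough to roll up dollar cost per model from the local file. Field names like cost_usd are assumptions about our span schema, not an OTLP standard:

import json
from collections import defaultdict

cost_by_model: dict[str, float] = defaultdict(float)
with open("output/traces/<run_id>/spans.jsonl") as f:
    for line in f:
        span = json.loads(line)
        if span.get("name") != "llm_call":          # only the canonical wrapper spans count
            continue
        attrs = span.get("attributes", {})
        cost_by_model[attrs.get("model", "unknown")] += float(attrs.get("cost_usd", 0.0))

for model, dollars in sorted(cost_by_model.items(), key=lambda kv: -kv[1]):
    print(f"{model}: ${dollars:.2f}")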

Failure taxonomy on top of OTel status

OTel gives you OK / ERROR. That's not enough to tell whether the same bug is striking 30 docs or 30 different bugs are striking 30 docs.

Each pipeline stage layers on a short failure code (timeout, validation, unclaimed-anchor, schema-violation, …) emitted as a span attribute. A failure-mode histogram becomes a one-line SQL query:

SELECT failure_code, COUNT(*)
FROM spans
WHERE otel_status_code = 'ERROR'
GROUP BY failure_code
ORDER BY 2 DESC

When a regression lights up, the histogram tells you which class of bug got worse, before you open a single trace.
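
On the emitting side, a stage just tags the active span before re-raising. A rough sketch using the OpenTelemetry API; the generic stage wrapper and the exception set here are illustrative:

from opentelemetry import trace
from pydantic import ValidationError

def run_stage(stage_fn, *args):
    span = trace.get_current_span()
    try:
        return stage_fn(*args)
    except TimeoutError:
        span.set_attribute("failure_code", "timeout")
        raise
    except ValidationError:
        span.set_attribute("failure_code", "validation")
        raise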

Reproducible experiments

Every meaningful change runs as a numbered experiment:

evals/<domain>/experiments/E-0021-<slug>/
├── manifest.yaml          # hypothesis, corpus subset, git SHA + dirty flag, config hash, contributor
├── metrics.yaml           # tier-1 structural / tier-2 reviewer / cost / speed
├── per_doc/<doc_id>/
│   └── trace/
│       ├── spans.jsonl
│       ├── rollup.json
│       └── summary.md     # auto-generated triage artifact
├── findings/<doc_id>.md   # reviewer notes (human or agent)
└── diff_vs_<baseline>.md  # optional comparison

Everything is plain YAML / JSONL / Markdown. It commits cleanly, diffs in PRs, and you can re-derive any aggregate from the raw spans months later.
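
Since Pydantic is already in the stack, loading a manifest with validation takes a handful of lines. The field names below mirror the comments in the layout; the exact schema is illustrative:

from pathlib import Path
import yaml
from pydantic import BaseModel

class ExperimentManifest(BaseModel):
    hypothesis: str
    corpus_subset: list[str]
    git_sha: str
    git_dirty: bool
    config_hash: str
    contributor: str

def load_manifest(experiment_dir: str) -> ExperimentManifest:
    raw = yaml.safe_load(Path(experiment_dir, "manifest.yaml").read_text())
    return ExperimentManifest(**raw)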

The research loop — two tracks

The loop runs on two strict tracks. Crossing them is a bug.

Track 1 — code only. The experiment runner runs the corpus, emits spans, generates a summary.md per doc with aggregate metrics, stage breakdown, error list, top-10 cost calls, and model mix. No LLMs. Deterministic. Reproducible from a git SHA.

Track 2 — agentic. A Claude Code sub-agent reads summary.md + spans.jsonl + the pipeline output, walks the trace upstream-first, writes a markdown triage report, and optionally files tagged GitHub issues with repro info. The agent calls a small Python query API rather than learning jq:

query.list_experiments()
query.load_experiment("E-0021-foo")
query.compare_experiments("E-0021-foo", "E-0019-baseline")
query.span_query(sql, backend="logfire")  # or backend="local"

Both data planes — the filesystem snapshot and the cloud spans — sit behind the same surface. The agent doesn't care which.
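
In practice a triage pass chains those calls. The usage below is illustrative, reusing the failure-code histogram from earlier against either backend:

# Hypothetical usage of the query surface shown above.
exp   = query.load_experiment("E-0021-foo")
delta = query.compare_experiments("E-0021-foo", "E-0019-baseline")

histogram = query.span_query(
    """
    SELECT failure_code, COUNT(*)
    FROM spans
    WHERE otel_status_code = 'ERROR'
    GROUP BY failure_code
    ORDER BY 2 DESC
    """,
    backend="local",   # the same query runs unchanged with backend="logfire"
)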

The split matters. No auto-LLM QA inside the pipeline. The production path stays deterministic: code produces the artifacts, agents review them. We get expressive review without conceding correctness to non-determinism in the hot path.

Why upstream-first matters

Late-stage failures usually have early-stage causes. A section misclassified at discovery → a downstream extractor sees the wrong chunk → reconciliation fails → the agent sees a "reconciliation error" at the bottom of the trace.

Walking the trace bottom-up surfaces symptoms. Walking it top-down surfaces causes. The sub-agent's first job is to enforce that direction so reviewers don't waste time chasing the loudest error in the log.
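
Concretely, the rule is "sort errors by when they started, not by where they surfaced." A tiny sketch over the local trace; start_time as a sortable field is an assumption about the exported schema, while otel_status_code and failure_code match the SQL view above:

import json

with open("output/traces/<run_id>/spans.jsonl") as f:
    spans = [json.loads(line) for line in f]

errors = [s for s in spans if s.get("otel_status_code") == "ERROR"]
errors.sort(key=lambda s: s.get("start_time", 0))        # upstream-first: earliest failure first
for s in errors[:5]:
    print(s.get("name"), s.get("attributes", {}).get("failure_code"))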

What it lets us do

  • Detect cost regressions during the run, not at the invoice. The wrapper emits dollars per span; the rollup catches outliers immediately.
  • Compare two experiments and read out per-stage Δ for cost, error rate, and token efficiency without writing a single line of analysis code.
  • File GitHub issues with the spans pre-attached, in the time it takes to run the experiment.
  • Ship prompt changes as ablation studies, not vibes. Every prompt edit is an E-NNNN with a hypothesis and a baseline diff.

What I'd do differently

  • Local OTLP JSONL is great for agent reads but expensive on disk for big corpora. We need rotation for older runs.
  • More of Track 2 should graduate to Track 1 once a review pattern stabilizes. The discipline is: an LLM gets to do something agentically until we know how to do it deterministically — then it moves to code. Automation that earned its place.
  • The failure taxonomy started as ad-hoc and got formalized halfway through. Worth doing earlier next time — it's cheap to define and pays for itself the first time you regress.

The general lesson: in multi-agent systems, the artifacts you leave behind matter more than the model you picked. Pick a tracing layer that emits a clean canonical span, give your agents a query surface they can actually use, and the research loop closes itself.

  • LLM
  • legal-tech
  • observability
  • agents
  • evals