Design Notes
Key decisions and tradeoffs
API target: own implementation
Instead of wrapping a third-party fake API, the client wraps this project's own FastAPI backend. This means the client and the API are co-designed: the typed models on both sides stay in sync by design. The tradeoff: less realistic than wrapping an external API you don't control, but the test surface is richer and the integration tests verify real business logic, not just HTTP plumbing.
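A minimal sketch of what that co-design looks like, assuming a hypothetical `QueryResponse` model (the real schema lives in the project's typed models):

```python
# Sketch only: one Pydantic model is both the FastAPI response_model and
# the client's parse target, so the two sides cannot drift apart.
# `QueryResponse` and its fields are illustrative, not the actual schema.
from pydantic import BaseModel

class QueryResponse(BaseModel):
    answer: str
    flagged: bool

# Server side:  @app.post("/query", response_model=QueryResponse)
# Client side:  QueryResponse.model_validate(resp.json())
```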
Two-layer evaluation (L1 live / L2 batch)
L1 runs on every query inline (~1-2s overhead). L2 runs offline against a golden dataset. The split is a deliberate latency/depth tradeoff: LLM-judged metrics (contextual precision, reverse-question relevancy) add 30+ seconds per pair, which is unacceptable live but fine in batch. The golden dataset is the contract; L2 is the regression gate.
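A sketch of the split; the function names and metric keys are illustrative, and the judges are passed in as callables rather than matching the repo's actual layout:

```python
# L1: deterministic + fast checks, run inline on every query (~1-2s budget).
def run_l1(answer: str, checks: dict) -> dict:
    return {name: check(answer) for name, check in checks.items()}

# L2: LLM-judged metrics, run offline against the golden dataset.
# 30+ seconds per pair is acceptable here; this is the regression gate.
def run_l2(golden_dataset: list[dict], llm_judge) -> list[dict]:
    return [
        {
            "id": pair["id"],
            "contextual_precision": llm_judge("contextual_precision", pair),
            "reverse_question_relevancy": llm_judge(
                "reverse_question_relevancy", pair
            ),
        }
        for pair in golden_dataset
    ]
```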
Deterministic chain_terminology over LLM judge
The terminology check is a dict lookup, not a model call. Zero latency, zero cost, zero false negatives on known mappings. The tradeoff: it only catches terms in the catalog; novel terminology drift goes undetected. An LLM judge would catch drift but would introduce latency and non-determinism into a metric that must be auditable.
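A minimal sketch of the dict-lookup approach; the catalog entries and function name are illustrative:

```python
# Illustrative catalog: forbidden term -> preferred term. The real mapping
# lives in the project's terminology catalog.
TERM_CATALOG = {
    "merge request": "pull request",
    "repo": "repository",
}

def check_terminology(answer: str) -> list[str]:
    """Return forbidden terms found in the answer. Pure string/dict work:
    zero latency, zero cost, same input always gives the same output."""
    lowered = answer.lower()
    return [term for term in TERM_CATALOG if term in lowered]
```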
In-memory retrieval over vector database
KB size is 8-9 docs per domain. Encoding them at startup and doing cosine search at query time adds ~2ms retrieval overhead with no infrastructure dependency. A vector DB (Chroma, pgvector) would add operational complexity with zero retrieval quality gain at this scale.
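A sketch of the encode-at-startup, cosine-at-query-time pattern, assuming sentence-transformers; the model name and document list are placeholders:

```python
from sentence_transformers import SentenceTransformer, util

DOCS = ["doc one ...", "doc two ..."]  # ~8-9 KB docs per domain in practice

# Encode once at startup; at this scale the whole index fits in memory.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = model.encode(DOCS, convert_to_tensor=True)

def retrieve(query: str, top_k: int = 3) -> list[tuple[str, float]]:
    # Cosine search over a handful of vectors: ~2ms, no external service.
    q = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(q, doc_embeddings)[0]
    top = scores.topk(min(top_k, len(DOCS)))
    return [(DOCS[int(i)], float(s)) for s, i in zip(top.values, top.indices)]
```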
httpx + tenacity for the client
httpx is the modern alternative to requests: native async support if needed later, a cleaner timeout API, better type annotations. tenacity cleanly separates retry policy from request logic: the retry decorator is readable and testable independently from the HTTP code.
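A sketch of the pattern; the endpoint, timeout, and retry settings are illustrative, not the client's actual configuration:

```python
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=0.5, max=8),
    reraise=True,
)
def post_query(base_url: str, query: str) -> dict:
    # The function body is plain HTTP logic; the retry policy lives entirely
    # in the decorator and can be unit-tested on its own.
    with httpx.Client(base_url=base_url, timeout=10.0) as client:
        resp = client.post("/query", json={"query": query})
        resp.raise_for_status()
        return resp.json()
```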
Integration tests are read-only by design
The API has no mutable state: queries don't persist, no records are created or deleted. Cleanup is therefore trivially satisfied; there is nothing to clean up. This is called out explicitly because it's a deliberate architectural choice, not an oversight. A stateful API (task creation, deletion) would require explicit teardown fixtures.
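A sketch of what a read-only integration test looks like here; the base URL and endpoint path are assumptions:

```python
import httpx

BASE_URL = "http://localhost:8000"  # assumed local dev address

def test_query_needs_no_teardown():
    with httpx.Client(base_url=BASE_URL, timeout=10.0) as client:
        # Two identical queries: nothing persists, so no fixture cleanup.
        for _ in range(2):
            resp = client.post("/query", json={"query": "what is a repo?"})
            assert resp.status_code == 200
```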
Alternative judge approaches considered
Ollama (local LLM judge)
Ollama can run Llama 3 / Mistral locally, making it a zero-cost alternative to the HF Inference API for both generation and LLM-as-judge evaluation. Tradeoffs: it requires a local GPU or accepting slower CPU inference; there are no external API rate limits; and outputs are fully reproducible since the model version is pinned. For the faithfulness judge specifically, a local llama3 via Ollama would remove the dependency on the HF token entirely and allow offline eval runs.
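A sketch of what that swap could look like against Ollama's REST API; the prompt and score parsing are illustrative, and it assumes `ollama pull llama3` has already been run:

```python
import json
import httpx

def judge_faithfulness(answer: str, context: str) -> float:
    prompt = (
        f"Context:\n{context}\n\nAnswer:\n{answer}\n\n"
        "Is every claim in the answer supported by the context? "
        'Reply with JSON only: {"score": <float between 0 and 1>}'
    )
    resp = httpx.post(
        "http://localhost:11434/api/generate",  # Ollama's default port
        json={"model": "llama3", "prompt": prompt,
              "stream": False, "format": "json"},
        timeout=120.0,  # CPU inference can be slow
    )
    resp.raise_for_status()
    return float(json.loads(resp.json()["response"])["score"])
```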
Prometheus (LLM eval framework)
Prometheus-2 is a 7B model fine-tuned specifically for evaluation tasks: it outputs a score plus a rationale in a structured format designed for rubric-based grading. It's a drop-in replacement for GPT-4/Claude as an eval judge, runs via Ollama or HF Inference, and is purpose-built for the kind of faithfulness + relevancy scoring done in `eval/metrics.py`. The tradeoff vs. the current Vectara HHEM v2 approach: Prometheus is slower (a 7B model versus a purpose-built cross-encoder) but produces a human-readable rationale alongside the score, which is more interpretable for audit and debugging.
Why not used here: HHEM v2 runs faster and requires no prompt engineering. Prometheus would be the right choice if rationale logging were a compliance requirement.
What another 4 hours would add
- L2 LLM metrics in `eval/metrics.py`: contextual precision (chunk ranking), contextual recall (coverage), and answer correctness against full reference answers. Currently only keyphrase coverage is used as a proxy.
- Async client: an `httpx.AsyncClient` variant for high-concurrency load testing.
- Property-based tests: `hypothesis` to fuzz `check_terminology` and the graders with generated strings, catching edge cases the golden dataset doesn't cover (see the sketch after this list).
- CI pipeline: GitHub Actions running `make lint`, `make type-check`, and `make test` on every PR, with integration tests gated on a self-hosted runner with the API running.
- Threshold calibration report: `eval/calibrate.py` exists and runs graders against golden-dataset expected answers, so threshold calibration is now a single command, not a missing feature. Actual threshold adjustments require reviewing the output against real query distributions.
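The property-based item above could look like this sketch, reusing the illustrative `check_terminology` from the earlier sketch (the import path is hypothetical):

```python
from hypothesis import given, strategies as st

from graders import check_terminology  # hypothetical import path

@given(st.text())
def test_check_terminology_is_total(answer: str) -> None:
    # Property: the grader never raises on arbitrary unicode input and
    # always returns a list, even for strings far outside the golden set.
    result = check_terminology(answer)
    assert isinstance(result, list)
```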
Gate 5 audit gaps addressed
- Faithfulness false negatives on refusals: `_is_refusal()` detects "I don't have enough information" responses and returns score=1.0; with no factual claims, the response is trivially faithful.
- Partial grounding blind spot: faithfulness now uses sentence-level min-score (weakest link wins) instead of max-score across chunks, so a response with one hallucinated sentence fails even if the other sentences are grounded (see the sketch after this list).
- No escalation path: `overall_pass=False` now emits a structured `EVAL_FAIL` WARNING log entry and sets `flagged: true` in the response payload; the UI shows a red banner.
- Cold-start latency: the embedder and NLI model are pre-warmed at startup in the FastAPI lifespan.
- Happy-path-only golden dataset: 4 adversarial pairs added (vague query, rival-term prompt injection, multi-doc synthesis, hallucination bait).
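A sketch of the weakest-link aggregation from the second item; the sentence splitter here is naive, and `nli_score` stands in for the HHEM v2 entailment call:

```python
def faithfulness(answer: str, chunks: list[str], nli_score) -> float:
    # Naive split for illustration; the real pipeline would use a proper
    # sentence segmenter.
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 1.0  # no factual claims -> trivially faithful (refusal path)
    # Each sentence may be supported by its best chunk (max over chunks),
    # but the response score is its weakest sentence (min over sentences):
    # one hallucinated sentence sinks the whole response.
    return min(
        max(nli_score(chunk, sentence) for chunk in chunks)
        for sentence in sentences
    )
```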
Where LLM assistance helped and where it misled
Helped:
- Scaffolding the full project structure (backend, client, tests, config) in a single session without losing consistency across files.
- Writing the faithfulness prompt in a way that reliably returns structured JSON; the few-shot JSON format in the prompt was a suggested pattern that works.
- Catching that `except Exception` in the faithfulness grader was too broad and replacing it with `(json.JSONDecodeError, anthropic.APIError)`.
- Identifying that `_build_index_by_domain` was defined twice in `pipeline.py` (a duplicate introduced during an edit session); caught during code review.
Misled or required correction:
- Initially used `lru_cache` on a function that takes a `SentenceTransformer` instance as an argument; the instance is unhashable, so the cache silently failed. Required switching to a module-level dict cache (see the sketch after this list).
- Generated a dead loop in `rosetta.py` (iterating over terms with `continue` but no code after the `continue` branch) that did nothing; the logic existed in a comment describing intent but was never implemented. Caught in review.
- Suggested a fictional client name that conflicted with a real company. Required renaming before the repo went public.
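The fix, as a sketch: key the module-level cache on the text rather than on the unhashable model instance (names are illustrative):

```python
_EMBED_CACHE: dict[str, object] = {}  # module-level, survives across calls

def embed_cached(model, text: str):
    # `model` is the SentenceTransformer instance; it never enters the
    # cache key, so its unhashability no longer matters.
    if text not in _EMBED_CACHE:
        _EMBED_CACHE[text] = model.encode(text, convert_to_tensor=True)
    return _EMBED_CACHE[text]
```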
rosetta.py(iterating over terms withcontinuebut no code after the continue branch) that did nothing. The logic existed in a comment describing intent but was never implemented. Caught in review. - Suggested a fictional client name that conflicted with a real company. Required renaming before the repo went public.