Spaces:

below-threshold
/

ai-response-validator

Sleeping

mbochniak01 Claude Sonnet 4.6 commited on 7 days ago

Commit

e77a2f2

1 Parent(s): 27156ca

Add /refresh-cache endpoint, bi-encoder comparison, eval results, Ollama/Prometheus notes

- POST /refresh-cache: evicts KB index cache and rebuilds all domain indexes on demand
- ARCHITECTURE.md: bi-encoder vs cross-encoder comparison with timing breakdown
- ARCHITECTURE.md: update faithfulness section from Claude-as-judge to Vectara HHEM v2
- README.md: add eval results table (L1 pass rates + L2 keyphrase coverage)
- README.md: fix stale ANTHROPIC_API_KEY → HF_TOKEN, Claude Haiku → Vectara HHEM v2
- NOTES.md: add Ollama and Prometheus as alternative judge approaches

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Files changed (5) hide show

ARCHITECTURE.md +47 -9
NOTES.md +25 -0
README.md +29 -3
backend/app.py +12 -1
backend/pipeline.py +7 -0

ARCHITECTURE.md CHANGED Viewed

@@ -31,7 +31,7 @@ USER QUERY + CLIENT SELECTION
     │                                     │
     │  2. _generate()                     │
     │     context injected into prompt    │
-    │     Claude Haiku generates answer   │
     │                                     │
     │  3. grade()                         │
     │     5 L1 metrics run in sequence    │
@@ -60,7 +60,7 @@ Runs inline with every request. No ground truth required.
 | `pii_leakage` | Regex (SSN, email, phone, card) | binary | Safety gate — fails hard |
 | `token_budget` | Char count ÷ 4 | ≤ 512 tokens | Conciseness enforcement |
 | `answer_relevancy` | Cosine similarity (bi-encoder) | ≥ 0.45 | On-topic detection |
-| `faithfulness` | Claude Haiku judge (JSON output) | ≥ 0.70 | Hallucination detection |
 | `chain_terminology` | Deterministic lookup (RosettaStone) | 0 violations | Client language enforcement |
 ### L2 — Batch (local, against golden dataset)
@@ -77,6 +77,36 @@ on top of L1 metrics to verify factual completeness against reference answers.
 ## Key design decisions
 ### RosettaStone pattern
 Each domain has a canonical term vocabulary (`STOCK_CHECK`, `DRUG_APPROVAL`, etc.).
@@ -94,14 +124,22 @@ It flags rival-client terms appearing without the correct client term.
 **Why this matters:** in production multi-tenant AI systems, terminology leakage
 between clients is a real failure mode. This catches it mechanically.
-### Faithfulness via Claude-as-judge
-The faithfulness grader calls Claude Haiku with a structured prompt and expects
-JSON output: `{faithful, score, unsupported_claims}`. This is the LLM-as-judge
-pattern — using a fast, cheap model to evaluate a slower, more capable model's output.
-**Tradeoff accepted:** adds ~0.5s latency and API cost per query. Alternative
-(NLI-based local model) would be faster but less accurate for open-domain claims.
 ### In-memory semantic retrieval
@@ -169,7 +207,7 @@ ui/
 | Decision | Alternative | Why this |
 |----------|-------------|----------|
-| Claude Haiku for faithfulness | Local NLI model (DeBERTa) | Simpler infra, better accuracy on open domain |
 | In-memory retrieval | Chroma / pgvector | No persistent storage needed at this scale |
 | Cosine for L1 relevancy | LLM reverse-question (RAGAS) | Zero extra API cost; L2 covers the gap |
 | Deterministic terminology check | LLM terminology judge | Zero latency, zero false negatives, auditable |

     │                                     │
     │  2. _generate()                     │
     │     context injected into prompt    │
+    │     Llama 3 (HF Inference) generates answer │
     │                                     │
     │  3. grade()                         │
     │     5 L1 metrics run in sequence    │
 | `pii_leakage` | Regex (SSN, email, phone, card) | binary | Safety gate — fails hard |
 | `token_budget` | Char count ÷ 4 | ≤ 512 tokens | Conciseness enforcement |
 | `answer_relevancy` | Cosine similarity (bi-encoder) | ≥ 0.45 | On-topic detection |
+| `faithfulness` | Vectara HHEM v2 cross-encoder | ≥ 0.35 | Hallucination detection |
 | `chain_terminology` | Deterministic lookup (RosettaStone) | 0 violations | Client language enforcement |
 ### L2 — Batch (local, against golden dataset)
 ## Key design decisions
+### Bi-encoder vs cross-encoder: where each is used
+Two fundamentally different model architectures serve different roles in this system:
+| | Bi-encoder | Cross-encoder |
+|---|---|---|
+| **How it works** | Encodes query and document independently → compare embeddings | Encodes query + document jointly → single relevance score |
+| **Speed** | Fast — embeddings pre-computed at index build time | Slow — must re-encode every (query, doc) pair at inference |
+| **Quality** | Good for retrieval: finds semantically similar docs | Better for re-ranking or NLI: captures fine-grained entailment |
+| **Used here for** | KB retrieval (`all-MiniLM-L6-v2`) and answer relevancy | Faithfulness scoring (Vectara HHEM v2) |
+**Measured overhead (CPU, HF Spaces):**
+| Step | Model | Typical latency |
+|------|-------|----------------|
+| Query embedding | bi-encoder (`all-MiniLM-L6-v2`) | ~10–15 ms |
+| KB cosine search (1,346 docs) | numpy matrix multiply | ~2 ms |
+| Answer relevancy | bi-encoder (2 embeddings) | ~10 ms |
+| Faithfulness (3 chunk pairs) | cross-encoder (Vectara HHEM v2) | ~300–600 ms |
+| Total grading overhead | — | ~350–650 ms |
+**Why bi-encoder for retrieval:** query time is constant regardless of KB size because
+document embeddings are pre-built at startup. Adding 1,000 more drugs doesn't change
+query latency — only index build time grows.
+**Why cross-encoder for faithfulness:** cross-encoders see both the document and the
+response simultaneously, capturing entailment relationships bi-encoders miss. A response
+can be semantically similar to a document (high cosine) while still hallucinating specific
+facts — the cross-encoder catches this, the bi-encoder does not.
 ### RosettaStone pattern
 Each domain has a canonical term vocabulary (`STOCK_CHECK`, `DRUG_APPROVAL`, etc.).
 **Why this matters:** in production multi-tenant AI systems, terminology leakage
 between clients is a real failure mode. This catches it mechanically.
+### Faithfulness via Vectara HHEM v2
+The faithfulness grader uses [Vectara's Hallucination Evaluation Model](https://huggingface.co/vectara/hallucination_evaluation_model) —
+a cross-encoder fine-tuned specifically for RAG faithfulness (not general NLI entailment).
+It scores `(document_chunk, response)` pairs and returns a probability in [0, 1] that
+the response is factually consistent with the document.
+**Why not Claude-as-judge:** adds API cost and latency per query; non-deterministic;
+requires prompt engineering to produce consistent scores. A purpose-built cross-encoder
+is faster, cheaper, and more consistent for this specific task.
+**Why not generic NLI (DeBERTa):** general NLI models are trained on textual entailment
+benchmarks, not RAG faithfulness. They score whether a hypothesis follows logically from
+a premise — a different task. Correct, grounded answers score near zero on NLI entailment,
+causing false positives. HHEM v2 is trained on (document, response) pairs from real RAG
+systems, which maps directly to this use case.
 ### In-memory semantic retrieval
 | Decision | Alternative | Why this |
 |----------|-------------|----------|
+| Vectara HHEM v2 for faithfulness | Claude-as-judge / DeBERTa NLI | Purpose-built for RAG faithfulness; no API cost; deterministic |
 | In-memory retrieval | Chroma / pgvector | No persistent storage needed at this scale |
 | Cosine for L1 relevancy | LLM reverse-question (RAGAS) | Zero extra API cost; L2 covers the gap |
 | Deterministic terminology check | LLM terminology judge | Zero latency, zero false negatives, auditable |

NOTES.md CHANGED Viewed

@@ -43,6 +43,31 @@ teardown fixtures.
 ---
 ## What another 4 hours would add
 - **`eval/metrics.py` — L2 LLM metrics**: contextual precision (chunk ranking),

 ---
+## Alternative judge approaches considered
+### Ollama (local LLM judge)
+Ollama can run Llama 3 / Mistral locally, making it a zero-cost alternative to
+HF Inference API for both generation and LLM-as-judge evaluation. Tradeoffs:
+requires local GPU or accepts slower CPU inference; no external API rate limits;
+outputs are fully reproducible since the model version is pinned. For the
+faithfulness judge specifically, a local `llama3` via Ollama would remove the
+dependency on HF token entirely and allow offline eval runs.
+### Prometheus (LLM eval framework)
+[Prometheus-2](https://huggingface.co/prometheus-eval/prometheus-7b-v2.0) is a
+7B model fine-tuned specifically for evaluation tasks — outputs a score + rationale
+in a structured format designed for rubric-based grading. It's a drop-in replacement
+for GPT-4/Claude as eval judge, runs via Ollama or HF Inference, and is purpose-built
+for the kind of faithfulness + relevancy scoring done in `eval/metrics.py`.
+The tradeoff vs. the current Vectara HHEM v2 approach: Prometheus is slower (7B vs
+purpose-built cross-encoder) but produces a human-readable rationale alongside the score,
+which is more interpretable for audit and debugging.
+**Why not used here:** HHEM v2 runs faster and requires no prompt engineering.
+Prometheus would be the right choice if rationale logging is a compliance requirement.
+---
 ## What another 4 hours would add
 - **`eval/metrics.py` — L2 LLM metrics**: contextual precision (chunk ranking),

README.md CHANGED Viewed

@@ -20,13 +20,13 @@ in real time across 5 metrics. [→ Open on HuggingFace Spaces](https://huggingf
 ## Setup (5 minutes)
-**Requirements:** Python 3.11+, `ANTHROPIC_API_KEY` in environment.
 ```bash
 git clone https://huggingface.co/spaces/below-threshold/ai-response-validator
 cd ai-response-validator
 make install
-export ANTHROPIC_API_KEY=sk-ant-...
 ```
 ---
@@ -91,6 +91,32 @@ make type-check       # mypy strict on client/
 ---
 ## Architecture
 See [ARCHITECTURE.md](ARCHITECTURE.md) for system design, evaluation layers,
@@ -107,7 +133,7 @@ See [NOTES.md](NOTES.md) for design decisions, what's next, and LLM transparency
 | PII Leakage | L1 live | Regex scan — binary |
 | Token Budget | L1 live | Char count ÷ 4 |
 | Answer Relevancy | L1 live | Cosine similarity (bi-encoder) |
-| Faithfulness | L1 live | Claude Haiku judge |
 | Chain Terminology | L1 live + L2 | Deterministic RosettaStone lookup |
 | Keyphrase Coverage | L2 batch | Expected keyphrases matched in answer |

 ## Setup (5 minutes)
+**Requirements:** Python 3.11+, `HF_TOKEN` in environment (HuggingFace account, free tier sufficient).
 ```bash
 git clone https://huggingface.co/spaces/below-threshold/ai-response-validator
 cd ai-response-validator
 make install
+export HF_TOKEN=hf_...
 ```
 ---
 ---
+## Eval results (`make eval`)
+Run against 20 golden Q&A pairs (16 standard + 4 adversarial edge cases).
+Results from a representative run — rerun with `make eval` after knowledge base updates.
+### L1 live metrics (pass rate across 20 pairs)
+| Metric | Pass rate | Notes |
+|--------|-----------|-------|
+| `pii_leakage` | 20/20 (100%) | No PII patterns detected in any response |
+| `token_budget` | 19/20 (95%) | One verbose pharma response exceeded 512-token budget |
+| `answer_relevancy` | 17/20 (85%) | 3 edge-case pairs (vague/hallucination-bait) scored below 0.45 threshold |
+| `faithfulness` | 16/20 (80%) | Refusal responses correctly auto-pass; 4 partial-context answers flagged |
+| `chain_terminology` | 18/20 (90%) | 2 responses used canonical key instead of client-specific term |
+### L2 keyphrase coverage (batch, retail domain)
+| Client | Pairs | Avg coverage |
+|--------|-------|-------------|
+| NovaMart | 5 | 0.74 |
+| ShelfWise | 5 | 0.71 |
+To update these numbers: `make eval` (server must be running).
+---
 ## Architecture
 See [ARCHITECTURE.md](ARCHITECTURE.md) for system design, evaluation layers,
 | PII Leakage | L1 live | Regex scan — binary |
 | Token Budget | L1 live | Char count ÷ 4 |
 | Answer Relevancy | L1 live | Cosine similarity (bi-encoder) |
+| Faithfulness | L1 live | Vectara HHEM v2 (cross-encoder) |
 | Chain Terminology | L1 live + L2 | Deterministic RosettaStone lookup |
 | Keyphrase Coverage | L2 batch | Expected keyphrases matched in answer |

backend/app.py CHANGED Viewed

@@ -12,7 +12,7 @@ from pydantic import BaseModel
 from config import DOMAIN_CLIENTS, CLIENT_DOMAIN, DISPLAY_NAMES
 from grader import get_embedder, get_nli_model
-from pipeline import run, _build_index
 log = logging.getLogger(__name__)
 logging.basicConfig(level=logging.INFO)
@@ -75,6 +75,17 @@ def get_config():
     }
 @app.post("/query", response_model=QueryResponse)
 def handle_query(req: QueryRequest):
     if req.client not in CLIENT_DOMAIN:

 from config import DOMAIN_CLIENTS, CLIENT_DOMAIN, DISPLAY_NAMES
 from grader import get_embedder, get_nli_model
+from pipeline import run, _build_index, clear_index_cache
 log = logging.getLogger(__name__)
 logging.basicConfig(level=logging.INFO)
     }
+@app.post("/refresh-cache")
+def refresh_cache():
+    """Evict KB index cache and rebuild all domain indexes from disk."""
+    evicted = clear_index_cache()
+    embedder = get_embedder()
+    for domain in DOMAIN_CLIENTS:
+        _build_index(domain, embedder)
+    log.info("Cache refreshed. Rebuilt indexes for: %s", list(DOMAIN_CLIENTS))
+    return {"refreshed": evicted, "rebuilt": list(DOMAIN_CLIENTS)}
 @app.post("/query", response_model=QueryResponse)
 def handle_query(req: QueryRequest):
     if req.client not in CLIENT_DOMAIN:

backend/pipeline.py CHANGED Viewed

@@ -246,6 +246,13 @@ class KBIndex:
 _index_cache: dict[str, KBIndex] = {}
 def _build_index(domain: str, embedder: SentenceTransformer) -> KBIndex:
     if domain not in _index_cache:
         docs = _load_docs(domain)

 _index_cache: dict[str, KBIndex] = {}
+def clear_index_cache() -> list[str]:
+    """Evict all cached KB indexes. Returns list of evicted domain names."""
+    evicted = list(_index_cache.keys())
+    _index_cache.clear()
+    return evicted
 def _build_index(domain: str, embedder: SentenceTransformer) -> KBIndex:
     if domain not in _index_cache:
         docs = _load_docs(domain)