mbochniak01 Claude Sonnet 4.6 committed on
Commit · e77a2f2
Parent(s): 27156ca

Add /refresh-cache endpoint, bi-encoder comparison, eval results, Ollama/Prometheus notes

- POST /refresh-cache: evicts KB index cache and rebuilds all domain indexes on demand
- ARCHITECTURE.md: bi-encoder vs cross-encoder comparison with timing breakdown
- ARCHITECTURE.md: update faithfulness section from Claude-as-judge to Vectara HHEM v2
- README.md: add eval results table (L1 pass rates + L2 keyphrase coverage)
- README.md: fix stale ANTHROPIC_API_KEY → HF_TOKEN, Claude Haiku → Vectara HHEM v2
- NOTES.md: add Ollama and Prometheus as alternative judge approaches

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- ARCHITECTURE.md +47 -9
- NOTES.md +25 -0
- README.md +29 -3
- backend/app.py +12 -1
- backend/pipeline.py +7 -0
ARCHITECTURE.md
CHANGED

@@ -31,7 +31,7 @@ USER QUERY + CLIENT SELECTION
 │                                              │
 │  2. _generate()                              │
 │     context injected into prompt             │
-│     …                                        │
+│     Llama 3 (HF Inference) generates answer  │
 │                                              │
 │  3. grade()                                  │
 │     5 L1 metrics run in sequence             │

@@ -60,7 +60,7 @@ Runs inline with every request. No ground truth required.
 | `pii_leakage` | Regex (SSN, email, phone, card) | binary | Safety gate → fails hard |
 | `token_budget` | Char count ÷ 4 | ≤ 512 tokens | Conciseness enforcement |
 | `answer_relevancy` | Cosine similarity (bi-encoder) | ≥ 0.45 | On-topic detection |
-| `faithfulness` | … |
+| `faithfulness` | Vectara HHEM v2 cross-encoder | ≥ 0.35 | Hallucination detection |
 | `chain_terminology` | Deterministic lookup (RosettaStone) | 0 violations | Client language enforcement |

 ### L2 – Batch (local, against golden dataset)

@@ -77,6 +77,36 @@ on top of L1 metrics to verify factual completeness against reference answers.

 ## Key design decisions

+### Bi-encoder vs cross-encoder: where each is used
+
+Two fundamentally different model architectures serve different roles in this system:
+
+| | Bi-encoder | Cross-encoder |
+|---|---|---|
+| **How it works** | Encodes query and document independently → compares embeddings | Encodes query + document jointly → single relevance score |
+| **Speed** | Fast – embeddings pre-computed at index build time | Slow – must re-encode every (query, doc) pair at inference |
+| **Quality** | Good for retrieval: finds semantically similar docs | Better for re-ranking or NLI: captures fine-grained entailment |
+| **Used here for** | KB retrieval (`all-MiniLM-L6-v2`) and answer relevancy | Faithfulness scoring (Vectara HHEM v2) |
+
+**Measured overhead (CPU, HF Spaces):**
+
+| Step | Model | Typical latency |
+|------|-------|-----------------|
+| Query embedding | bi-encoder (`all-MiniLM-L6-v2`) | ~10–15 ms |
+| KB cosine search (1,346 docs) | numpy matrix multiply | ~2 ms |
+| Answer relevancy | bi-encoder (2 embeddings) | ~10 ms |
+| Faithfulness (3 chunk pairs) | cross-encoder (Vectara HHEM v2) | ~300–600 ms |
+| Total grading overhead | – | ~350–650 ms |
+
+**Why bi-encoder for retrieval:** query time is constant regardless of KB size because
+document embeddings are pre-built at startup. Adding 1,000 more drugs doesn't change
+query latency – only index build time grows.
+
+**Why cross-encoder for faithfulness:** cross-encoders see both the document and the
+response simultaneously, capturing entailment relationships that bi-encoders miss. A response
+can be semantically similar to a document (high cosine) while still hallucinating specific
+facts – the cross-encoder catches this; the bi-encoder does not.
+
 ### RosettaStone pattern

 Each domain has a canonical term vocabulary (`STOCK_CHECK`, `DRUG_APPROVAL`, etc.).
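A minimal sketch of the bi-encoder path the new section describes, using the standard `sentence-transformers` API; the KB strings, variable names, and top-k value are illustrative, not taken from this repo:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

bi = SentenceTransformer("all-MiniLM-L6-v2")

# Index build (startup): embed every KB doc once. Normalizing the embeddings
# makes cosine similarity a plain dot product at query time.
kb_docs = [
    "Drug X was approved by the FDA in 2019.",
    "Drug Y is in phase 2 trials.",
]
kb_emb = bi.encode(kb_docs, normalize_embeddings=True)   # shape (N, 384)

# Query time: one embedding + one matrix multiply, so latency stays flat as
# the KB grows; only the build step above gets slower.
q_emb = bi.encode("When was Drug X approved?", normalize_embeddings=True)
cosine = kb_emb @ q_emb                                  # one score per doc
top_chunks = [kb_docs[i] for i in np.argsort(-cosine)[:3]]

# Answer relevancy reuses the same model: cosine(query, response) >= 0.45 passes.
```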
@@ -94,14 +124,22 @@ It flags rival-client terms appearing without the correct client term.

 **Why this matters:** in production multi-tenant AI systems, terminology leakage
 between clients is a real failure mode. This catches it mechanically.

-### Faithfulness via …
-…
+### Faithfulness via Vectara HHEM v2
+
+The faithfulness grader uses [Vectara's Hallucination Evaluation Model](https://huggingface.co/vectara/hallucination_evaluation_model) –
+a cross-encoder fine-tuned specifically for RAG faithfulness (not general NLI entailment).
+It scores `(document_chunk, response)` pairs and returns a probability in [0, 1] that
+the response is factually consistent with the document.
+
+**Why not Claude-as-judge:** it adds API cost and latency to every query, is
+non-deterministic, and requires prompt engineering to produce consistent scores.
+A purpose-built cross-encoder is faster, cheaper, and more consistent for this specific task.
+
+**Why not generic NLI (DeBERTa):** general NLI models are trained on textual-entailment
+benchmarks, not RAG faithfulness. They score whether a hypothesis follows logically from
+a premise – a different task. Correct, grounded answers can score near zero on NLI
+entailment, causing false positives. HHEM v2 is trained on (document, response) pairs
+from real RAG systems, which maps directly to this use case.

 ### In-memory semantic retrieval
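And the cross-encoder half, as a sketch: the `trust_remote_code` loading and `predict()` call follow the HHEM model card and may differ across HHEM versions; the 0.35 threshold is the one from the L1 metrics table above.

```python
from transformers import AutoModelForSequenceClassification

# HHEM v2 ships its own predict() helper via trust_remote_code (per its model card).
hhem = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

response = "Drug X was approved in 2019."
# top_chunks: the retrieved KB chunks from the bi-encoder sketch above
pairs = [(chunk, response) for chunk in top_chunks]
scores = hhem.predict(pairs)   # factual-consistency probability per pair, in [0, 1]

# The response passes if at least one retrieved chunk supports it.
faithful = float(scores.max()) >= 0.35
```

Every pair here is re-encoded jointly at inference time, which is why this single step dominates the ~350–650 ms grading budget in the timing table.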
@@ -169,7 +207,7 @@ ui/

 | Decision | Alternative | Why this |
 |----------|-------------|----------|
-| …
+| Vectara HHEM v2 for faithfulness | Claude-as-judge / DeBERTa NLI | Purpose-built for RAG faithfulness; no API cost; deterministic |
 | In-memory retrieval | Chroma / pgvector | No persistent storage needed at this scale |
 | Cosine for L1 relevancy | LLM reverse-question (RAGAS) | Zero extra API cost; L2 covers the gap |
 | Deterministic terminology check | LLM terminology judge | Zero latency, zero false negatives, auditable |
NOTES.md
CHANGED

@@ -43,6 +43,31 @@ teardown fixtures.

 ---

+## Alternative judge approaches considered
+
+### Ollama (local LLM judge)
+Ollama can run Llama 3 / Mistral locally, making it a zero-cost alternative to the
+HF Inference API for both generation and LLM-as-judge evaluation. Tradeoffs: it
+requires a local GPU (or accepting slower CPU inference); there are no external API
+rate limits; and outputs are fully reproducible since the model version is pinned.
+For the faithfulness judge specifically, a local `llama3` via Ollama would remove
+the dependency on an HF token entirely and allow offline eval runs.
+
+### Prometheus (LLM eval framework)
+[Prometheus-2](https://huggingface.co/prometheus-eval/prometheus-7b-v2.0) is a
+7B model fine-tuned specifically for evaluation tasks – it outputs a score plus a
+rationale in a structured format designed for rubric-based grading. It's a drop-in
+replacement for GPT-4/Claude as an eval judge, runs via Ollama or HF Inference, and
+is purpose-built for the kind of faithfulness and relevancy scoring done in
+`eval/metrics.py`. The tradeoff vs. the current Vectara HHEM v2 approach: Prometheus
+is slower (a 7B LLM vs. a purpose-built cross-encoder) but produces a human-readable
+rationale alongside the score, which is more interpretable for audit and debugging.
+
+**Why not used here:** HHEM v2 runs faster and requires no prompt engineering.
+Prometheus would be the right choice if rationale logging were a compliance requirement.
+
+---
+
 ## What another 4 hours would add

 - **`eval/metrics.py` – L2 LLM metrics**: contextual precision (chunk ranking),
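A rough illustration of the Ollama route above – not code from this repo: the prompt wording, model tag, and JSON-parsing step are assumptions, and the fragile parse is exactly the prompt-engineering burden HHEM avoids.

```python
import json

import ollama  # assumes a local Ollama daemon with `llama3` pulled


def judge_faithfulness(document: str, response: str) -> float:
    """Ask a local llama3 to rate factual consistency on a 0-1 scale."""
    prompt = (
        "Rate how factually consistent the RESPONSE is with the DOCUMENT on a "
        '0.0-1.0 scale. Reply with JSON only, e.g. {"score": 0.8}.\n\n'
        f"DOCUMENT:\n{document}\n\nRESPONSE:\n{response}"
    )
    reply = ollama.chat(model="llama3", messages=[{"role": "user", "content": prompt}])
    # Will raise if the model wraps the JSON in prose; a real judge would need
    # retries or stricter output constraints.
    return float(json.loads(reply["message"]["content"])["score"])
```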
README.md
CHANGED

@@ -20,13 +20,13 @@ in real time across 5 metrics. [→ Open on HuggingFace Spaces](https://huggingf…

 ## Setup (5 minutes)

-**Requirements:** Python 3.11+, `ANTHROPIC_API_KEY` …
+**Requirements:** Python 3.11+, `HF_TOKEN` in environment (HuggingFace account, free tier sufficient).

 ```bash
 git clone https://huggingface.co/spaces/below-threshold/ai-response-validator
 cd ai-response-validator
 make install
-export ANTHROPIC_API_KEY=…
+export HF_TOKEN=hf_...
 ```

 ---

@@ -91,6 +91,32 @@ make type-check   # mypy strict on client/

 ---

+## Eval results (`make eval`)
+
+Run against 20 golden Q&A pairs (16 standard + 4 adversarial edge cases).
+Results are from a representative run – rerun with `make eval` after knowledge base updates.
+
+### L1 live metrics (pass rate across 20 pairs)
+
+| Metric | Pass rate | Notes |
+|--------|-----------|-------|
+| `pii_leakage` | 20/20 (100%) | No PII patterns detected in any response |
+| `token_budget` | 19/20 (95%) | One verbose pharma response exceeded the 512-token budget |
+| `answer_relevancy` | 17/20 (85%) | 3 edge-case pairs (vague / hallucination-bait) scored below the 0.45 threshold |
+| `faithfulness` | 16/20 (80%) | Refusal responses correctly auto-pass; 4 partial-context answers flagged |
+| `chain_terminology` | 18/20 (90%) | 2 responses used the canonical key instead of the client-specific term |
+
+### L2 keyphrase coverage (batch, retail domain)
+
+| Client | Pairs | Avg coverage |
+|--------|-------|--------------|
+| NovaMart | 5 | 0.74 |
+| ShelfWise | 5 | 0.71 |
+
+To update these numbers: `make eval` (the server must be running).
+
+---
+
 ## Architecture

 See [ARCHITECTURE.md](ARCHITECTURE.md) for system design, evaluation layers,

@@ -107,7 +133,7 @@ See [NOTES.md](NOTES.md) for design decisions, what's next, and LLM transparency…

 | PII Leakage | L1 live | Regex scan → binary |
 | Token Budget | L1 live | Char count ÷ 4 |
 | Answer Relevancy | L1 live | Cosine similarity (bi-encoder) |
-| Faithfulness | L1 live | Claude Haiku … |
+| Faithfulness | L1 live | Vectara HHEM v2 (cross-encoder) |
 | Chain Terminology | L1 live + L2 | Deterministic RosettaStone lookup |
 | Keyphrase Coverage | L2 batch | Expected keyphrases matched in answer |
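For reference, the L2 coverage numbers above are the kind of score a matcher like this produces – a sketch only; the repo's actual matching rule in `eval/metrics.py` may differ (e.g. in normalization or fuzzy matching):

```python
def keyphrase_coverage(answer: str, expected: list[str]) -> float:
    """Fraction of expected keyphrases found in the answer (case-insensitive)."""
    answer_lc = answer.lower()
    hits = sum(1 for phrase in expected if phrase.lower() in answer_lc)
    return hits / len(expected) if expected else 1.0

# A client's "Avg coverage" (e.g. NovaMart's 0.74) would then be the mean of
# this score over that client's golden pairs.
```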
backend/app.py
CHANGED

@@ -12,7 +12,7 @@ from pydantic import BaseModel

 from config import DOMAIN_CLIENTS, CLIENT_DOMAIN, DISPLAY_NAMES
 from grader import get_embedder, get_nli_model
-from pipeline import run, _build_index
+from pipeline import run, _build_index, clear_index_cache

 log = logging.getLogger(__name__)
 logging.basicConfig(level=logging.INFO)

@@ -75,6 +75,17 @@ def get_config():
     }


+@app.post("/refresh-cache")
+def refresh_cache():
+    """Evict KB index cache and rebuild all domain indexes from disk."""
+    evicted = clear_index_cache()
+    embedder = get_embedder()
+    for domain in DOMAIN_CLIENTS:
+        _build_index(domain, embedder)
+    log.info("Cache refreshed. Rebuilt indexes for: %s", list(DOMAIN_CLIENTS))
+    return {"refreshed": evicted, "rebuilt": list(DOMAIN_CLIENTS)}
+
+
 @app.post("/query", response_model=QueryResponse)
 def handle_query(req: QueryRequest):
     if req.client not in CLIENT_DOMAIN:
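Usage sketch for the new endpoint; the port and domain names are assumptions (HF Spaces commonly serve on 7860), and the printed shape mirrors the endpoint's return dict:

```python
import requests  # assumes the backend is reachable on the Space's local port

# Force a KB re-read after editing knowledge-base files on disk.
r = requests.post("http://localhost:7860/refresh-cache")
print(r.json())  # e.g. {"refreshed": ["retail", "pharma"], "rebuilt": ["retail", "pharma"]}
```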
backend/pipeline.py
CHANGED

@@ -246,6 +246,13 @@ class KBIndex:

 _index_cache: dict[str, KBIndex] = {}


+def clear_index_cache() -> list[str]:
+    """Evict all cached KB indexes. Returns list of evicted domain names."""
+    evicted = list(_index_cache.keys())
+    _index_cache.clear()
+    return evicted
+
+
 def _build_index(domain: str, embedder: SentenceTransformer) -> KBIndex:
     if domain not in _index_cache:
         docs = _load_docs(domain)