# Production-Grade RAG: Developer Documentation Assistant
## Implementation Plan

---

## Project Overview

A domain-specific "Ask my Docs" system built for developer documentation.
Users can query any SDK/API/framework documentation and receive accurate answers with citations.

**Initial Target Domain:** FastAPI Documentation

## Locked MVP Choices

The MVP is intentionally constrained to one implementation path. Any alternatives listed in earlier notes are out of scope until the baseline system is working and evaluated.

| Area | MVP Choice |
|---|---|
| Documentation source | Official FastAPI documentation pages only |
| Ingestion method | `BeautifulSoup` crawler + local normalized markdown/json output |
| Extra sources | No GitHub `/docs` sync, no PDF upload, no arbitrary `.md` ingestion in MVP |
| Chunking | Structure-aware chunking with LangChain `RecursiveCharacterTextSplitter` |
| Embeddings | OpenAI `text-embedding-3-small` |
| Vector store | `ChromaDB` only |
| Keyword retrieval | `rank_bm25` |
| Fusion | Reciprocal Rank Fusion (RRF) |
| Re-ranker | Local `cross-encoder/ms-marco-MiniLM-L-6-v2` |
| Answer model | OpenAI `gpt-4o-mini` |
| API layer | `FastAPI` |
| Evaluation | `Ragas` + a hand-written FastAPI eval set |
| CI gate | Lightweight retrieval/regression checks on PRs; full eval run on demand or nightly |

## MVP Scope Boundaries

- Index only the official FastAPI docs corpus.
- Treat the MVP as a single-tenant local/dev system.
- Optimize for answer quality and citation correctness before UI polish.
- Defer multi-doc search, version-aware retrieval, PDF ingestion, and production vector DB migration until after MVP validation.

---

## Phase 1: Fundamentals (Data Ingestion Pipeline)

### Step 1: Document Collection
- Crawl official FastAPI documentation using `BeautifulSoup`
- Normalize each page into local markdown/json before chunking
- Store raw HTML plus cleaned content locally for reproducibility
- Exclude GitHub docs sync, PDF upload, and arbitrary file ingestion from MVP
- Acceptance criteria:
  - All target FastAPI pages are fetched successfully
  - Each page is stored with `source_url`, `page_title`, and crawl timestamp
  - Re-running ingestion updates changed pages without duplicating records
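
The "update without duplicating" criterion can be sketched as an upsert keyed by `source_url`, with a content hash to detect changes. This is a minimal illustration, not the project's storage layer: the `PageRecord` type, the `upsert_page` helper, and the in-memory `dict` store are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class PageRecord:
    """Hypothetical record shape: one entry per crawled page."""
    source_url: str
    page_title: str
    content: str
    content_hash: str
    crawled_at: str  # ISO 8601 crawl timestamp


def upsert_page(store: dict[str, PageRecord], source_url: str,
                page_title: str, content: str) -> str:
    """Insert or update one page, keyed by source_url.

    Returns "created", "updated", or "unchanged".
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    existing = store.get(source_url)
    if existing is not None and existing.content_hash == digest:
        # Same cleaned content as last crawl: keep the old record as-is.
        return "unchanged"
    store[source_url] = PageRecord(
        source_url=source_url,
        page_title=page_title,
        content=content,
        content_hash=digest,
        crawled_at=datetime.now(timezone.utc).isoformat(),
    )
    return "created" if existing is None else "updated"
```

Because the URL is the key, re-crawling a page can only overwrite its record, never add a second one. A real implementation would persist the store to disk alongside the raw HTML.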

### Step 2: Chunking Strategy
- Use LangChain `RecursiveCharacterTextSplitter`
- Chunk size target: **700 tokens** (use a token-based length function; the splitter counts characters by default)
- Overlap: **100 tokens**
- Preserve code blocks and heading boundaries whenever possible
- Tag each chunk with metadata:
  - `source_url`
  - `page_title`
  - `section_title`
  - `chunk_id`
  - `doc_version`
- Acceptance criteria:
  - No chunk splits in the middle of fenced code blocks
  - Every chunk can be traced back to an exact source page and section
  - Chunk output is deterministic for unchanged source documents
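
The code-block and determinism criteria can be illustrated with a small pre-pass that keeps fenced blocks atomic and derives `chunk_id` from stable inputs. This is a sketch under stated assumptions, not the LangChain splitter: `split_blocks`, `pack_chunks`, and `chunk_id` are hypothetical helpers, whitespace-separated words stand in for tokens, and overlap is omitted for brevity.

```python
import hashlib

FENCE = "`" * 3  # Markdown code-fence marker


def split_blocks(markdown: str) -> list[str]:
    """Split on blank lines, but keep each fenced code block as one block."""
    blocks, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        if not line.strip() and not in_fence:
            # Blank line outside a fence closes the current block.
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def pack_chunks(blocks: list[str], max_words: int = 700) -> list[str]:
    """Greedily pack blocks into chunks; a block is never split.

    A single block larger than max_words becomes its own oversized chunk.
    """
    chunks, current, count = [], [], 0
    for block in blocks:
        n = len(block.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(block)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def chunk_id(source_url: str, index: int, text: str) -> str:
    """Deterministic ID: the same URL, position, and text give the same ID."""
    raw = f"{source_url}\n{index}\n{text}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]
```

Since the ID depends only on the source URL, chunk position, and chunk text, re-chunking an unchanged page reproduces the same IDs, which also makes them usable as vector store keys in Step 3.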

### Step 3: Embeddings + Vector Store
- Embedding model: OpenAI `text-embedding-3-small`
- Store embeddings in **ChromaDB**
- Each vector carries full metadata from Step 2
- Acceptance criteria:
  - Full FastAPI corpus is embedded successfully
  - Re-indexing does not create duplicate vectors for unchanged chunks
  - Querying Chroma returns chunk metadata required for citation display

---

## Phase 2: Hybrid Retrieval

### Step 4: BM25 + Semantic Search
- BM25 via `rank_bm25` library for keyword matching
- Vector similarity search via ChromaDB
- Merge both result sets using **Reciprocal Rank Fusion (RRF)**
- Retrieve top 10 BM25 hits and top 10 vector hits before fusion
- Acceptance criteria:
  - Known keyword-heavy queries surface exact-term matches
  - Known semantic queries surface relevant conceptual matches
  - Fused retrieval performs better than vector-only on the seed eval set
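
For reference, RRF scores each chunk by summing `1 / (k + rank)` over every ranked list in which it appears. The sketch below assumes each retriever returns chunk IDs ordered best first; the function name and the constant `k = 60` (a commonly used default) are assumptions, not values fixed by this plan.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]],
                           k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists; returns (chunk_id, score) sorted best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_hits = ["a", "b", "c"]    # hypothetical BM25 ranking
vector_hits = ["b", "d", "a"]  # hypothetical vector ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print([doc_id for doc_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

RRF uses only ranks, so BM25 scores and vector distances never need to be normalized onto a common scale.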

### Step 5: Cross Encoder Re-ranker
- Use local `cross-encoder/ms-marco-MiniLM-L-6-v2`
- Re-rank the top 20 fused results, then pass the top 5 to the LLM
- Expected to be the largest single quality gain in the pipeline; confirm this against the seed eval set
- Acceptance criteria:
  - Re-ranking is applied on every answer path
  - Top 5 contexts are logged for debugging and eval review
  - Reranked results improve answer grounding on the seed eval set
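
The re-rank step reduces to scoring each (query, chunk) pair and keeping the best five. The sketch below takes the scorer as a parameter so it runs without model downloads; `rerank`, `PairScorer`, and the toy `overlap_scorer` are hypothetical names. In the real pipeline the scorer would presumably be the cross-encoder's pair-scoring call (for example `CrossEncoder.predict` in `sentence-transformers`), which is an assumption about wiring, not something this plan specifies.

```python
from typing import Callable, Sequence

# Scorer signature: takes (query, passage) pairs, returns one score per pair.
PairScorer = Callable[[list[tuple[str, str]]], Sequence[float]]


def rerank(query: str, candidates: Sequence[dict], score_pairs: PairScorer,
           top_n: int = 5) -> list[dict]:
    """Score every (query, chunk text) pair and keep the top_n chunks."""
    pairs = [(query, c["text"]) for c in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
    # Return copies so the input candidates are not mutated.
    return [{**c, "rerank_score": float(s)} for c, s in ranked[:top_n]]


def overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Toy stand-in for the cross-encoder: counts shared lowercase words."""
    return [float(len(set(q.lower().split()) & set(t.lower().split())))
            for q, t in pairs]
```

Keeping `rerank_score` on each returned chunk makes it straightforward to log the top 5 contexts for debugging and eval review.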

### Step 6: Citation Enforcement
- Each retrieved chunk carries `source_url` + `section_title`
- Prompt the LLM to answer only using provided context
- Force structured output: answer + list of cited chunks
- If no relevant chunk is found, return "I don't know" rather than an unsupported answer
- Validate that every cited chunk ID exists in the retrieved context set
- Acceptance criteria:
  - API response returns answer text plus machine-readable citations
  - Unsupported answers are rejected or converted to "I don't know"
  - Manual spot checks confirm citations map to relevant evidence
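
The validation rule can be sketched as a post-processing check on the model's structured output. This shows one possible policy, assumed here for illustration: reject the answer if it cites nothing or cites any chunk ID outside the retrieved set. The function name and response shape are hypothetical.

```python
IDK_ANSWER = "I don't know"


def enforce_citations(answer: str, cited_ids: list[str],
                      retrieved_ids: set[str]) -> dict:
    """Return the answer only if it cites at least one chunk and every
    cited chunk ID is in the retrieved context set; otherwise fall back."""
    unknown = [cid for cid in cited_ids if cid not in retrieved_ids]
    if not cited_ids or unknown:
        return {"answer": IDK_ANSWER, "citations": [], "rejected_ids": unknown}
    # dict.fromkeys drops duplicate IDs while preserving citation order.
    return {"answer": answer, "citations": list(dict.fromkeys(cited_ids)),
            "rejected_ids": []}
```

This check only proves that cited IDs exist in the context; whether the cited text actually supports the answer still needs the manual spot checks and the `faithfulness` metric.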

---

## Phase 3: Evaluation Pipeline

### Step 7: Build Eval Dataset
- Manually write **30-50 Q&A pairs** from FastAPI docs for the initial eval set
- Cover: factual questions, code questions, comparison questions
- Store as JSON/CSV in the repo
- Expand to 100+ only after the first reliable baseline is in place
- Acceptance criteria:
  - Eval set covers multiple sections of the docs
  - Each question has an expected answer and supporting source reference
  - The dataset is stable enough to compare runs over time
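
One way to keep the dataset stable is to validate its shape before any eval run. The field names (`question`, `expected_answer`, `source_url`, `category`) and the category labels below are assumptions chosen to mirror the bullets above, not a schema fixed by this plan.

```python
import json

REQUIRED_FIELDS = ("question", "expected_answer", "source_url", "category")
CATEGORIES = {"factual", "code", "comparison"}


def validate_eval_set(raw_json: str) -> list[str]:
    """Return a list of human-readable problems; empty means the set is valid."""
    problems = []
    for i, item in enumerate(json.loads(raw_json)):
        for field in REQUIRED_FIELDS:
            if not str(item.get(field, "")).strip():
                problems.append(f"item {i}: missing {field}")
        if item.get("category") not in CATEGORIES:
            problems.append(f"item {i}: unknown category {item.get('category')!r}")
    return problems


# Illustrative record only; the expected answer is a placeholder.
sample = json.dumps([{
    "question": "How do I declare a path parameter?",
    "expected_answer": "<expected answer written from the docs>",
    "source_url": "https://fastapi.tiangolo.com/tutorial/path-params/",
    "category": "code",
}])
print(validate_eval_set(sample))  # []
```

Running this check in the PR workflow is cheap and catches malformed entries before they skew a full evaluation run.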

### Step 8: Ragas Evaluation Metrics
| Metric | What it measures |
|---|---|
| `faithfulness` | Is every claim in the answer supported by the retrieved context? |
| `answer_relevancy` | Does the answer address the question that was asked? |
| `context_precision` | Are the relevant chunks ranked near the top of the retrieved set? |
| `context_recall` | Does the retrieved context cover what the reference answer needs? |

### Step 9: CI Integration
- Run lightweight regression checks on every PR via **GitHub Actions**
- Reserve full Ragas evaluation for scheduled or manually triggered runs
- Use an initial target such as `faithfulness >= 0.85` as a baseline, not a permanent hard-coded gate
- Store eval results over time to track regression
- Acceptance criteria:
  - PR workflow catches obvious retrieval or API regressions quickly
  - Full eval workflow produces repeatable metrics artifacts
  - Baseline metrics are documented before strict CI thresholds are enforced
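
A baseline-relative gate, as opposed to a hard-coded threshold, might look like the following sketch. The function name, the metric dictionaries, and the `tolerance` of 0.02 are assumptions for illustration; the actual tolerance should come from observed run-to-run variance.

```python
def check_regressions(current: dict[str, float], baseline: dict[str, float],
                      tolerance: float = 0.02) -> list[str]:
    """Compare a run against the documented baseline.

    Returns one message per metric that is missing or has dropped by more
    than `tolerance`; an empty list means the run passes.
    """
    failures = []
    for metric, base in baseline.items():
        value = current.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from current run")
        elif value < base - tolerance:
            failures.append(
                f"{metric}: {value:.3f} is below baseline {base:.3f} "
                f"(tolerance {tolerance})"
            )
    return failures
```

A CI job could exit non-zero when the returned list is non-empty and upload the messages with the metrics artifact, so that regressions are tracked over time.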

---

## Tech Stack

| Layer | Tool |
|---|---|
| Orchestration | LangChain |
| Vector Store (dev) | ChromaDB |
| Vector Store (prod) | Deferred until post-MVP |
| Reranker | `cross-encoder/ms-marco-MiniLM-L-6-v2` |
| Evaluation | Ragas |
| LLM | `gpt-4o-mini` |
| Scraping | BeautifulSoup |
| CI/CD | GitHub Actions |
| API Layer | FastAPI |
| Frontend (optional) | Deferred until post-MVP |

---

## Suggested Build Timeline

| Week | Focus |
|---|---|
| Week 1 | Phase 1: Ingest, chunk, embed, store |
| Week 2 | Phase 2: BM25 + rerank + citation enforcement |
| Week 3 | Phase 3: Eval dataset + Ragas + CI pipeline |
| Week 4 | API hardening + README + baseline evaluation review |

---

## Bonus Features (Post-MVP)

- **Version-aware retrieval**: "In v2 vs v3, how does X work?"
- **Code snippet extraction**: return relevant code blocks alongside answers
- **Multi-doc search**: compare two frameworks side by side
- **Query rewriting**: auto-expand vague queries before retrieval

---

*Based on project notes from January 10–11, 2026*