# Production-Grade RAG: Developer Documentation Assistant
## Implementation Plan
---
## Project Overview
A domain-specific "Ask my Docs" system built for developer documentation.
Users can query any SDK/API/framework documentation and receive accurate answers with citations.
**Initial Target Domain:** FastAPI Documentation
## Locked MVP Choices
The MVP is intentionally constrained to one implementation path. Any alternatives listed in earlier notes are out of scope until the baseline system is working and evaluated.
| Area | MVP Choice |
|---|---|
| Documentation source | Official FastAPI documentation pages only |
| Ingestion method | `BeautifulSoup` crawler + local normalized markdown/json output |
| Extra sources | No GitHub `/docs` sync, no PDF upload, no arbitrary `.md` ingestion in MVP |
| Chunking | Structure-aware chunking with LangChain `RecursiveCharacterTextSplitter` |
| Embeddings | OpenAI `text-embedding-3-small` |
| Vector store | `ChromaDB` only |
| Keyword retrieval | `rank_bm25` |
| Fusion | Reciprocal Rank Fusion (RRF) |
| Re-ranker | Local `cross-encoder/ms-marco-MiniLM-L-6-v2` |
| Answer model | OpenAI `gpt-4o-mini` |
| API layer | `FastAPI` |
| Evaluation | `Ragas` + a hand-written FastAPI eval set |
| CI gate | Lightweight retrieval/regression checks on PRs; full eval run on demand or nightly |
## MVP Scope Boundaries
- Index only the official FastAPI docs corpus.
- Treat the MVP as a single-tenant local/dev system.
- Optimize for answer quality and citation correctness before UI polish.
- Defer multi-doc search, version-aware retrieval, PDF ingestion, and production vector DB migration until after MVP validation.
---
## Phase 1: Fundamentals (Data Ingestion Pipeline)
### Step 1: Document Collection
- Crawl official FastAPI documentation using `BeautifulSoup`
- Normalize each page into local markdown/json before chunking
- Store raw HTML plus cleaned content locally for reproducibility
- Exclude GitHub docs sync, PDF upload, and arbitrary file ingestion from MVP
- Acceptance criteria:
  - All target FastAPI pages are fetched successfully
  - Each page is stored with `source_url`, `page_title`, and crawl timestamp
  - Re-running ingestion updates changed pages without duplicating records
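
The "update without duplicating" criterion can be met by keying records on URL and comparing a content hash. The following is a minimal sketch under stated assumptions: the `PageRecord` fields beyond `source_url`, `page_title`, and the timestamp, the `upsert_page` helper, and the in-memory `dict` store are all hypothetical stand-ins for whatever local storage the crawler ends up using.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class PageRecord:
    source_url: str
    page_title: str
    crawled_at: str    # ISO-8601 UTC time of the crawl that last changed the record
    content_hash: str  # SHA-256 of the cleaned content, used for change detection
    content: str


def upsert_page(store: dict[str, PageRecord], source_url: str,
                page_title: str, content: str) -> str:
    """Insert or update a page keyed by URL.

    Returns 'added', 'updated', or 'unchanged'. Only the cleaned content is
    hashed, so a title-only change is treated as 'unchanged' in this sketch.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    existing = store.get(source_url)
    if existing is not None and existing.content_hash == digest:
        return "unchanged"
    status = "added" if existing is None else "updated"
    store[source_url] = PageRecord(
        source_url=source_url,
        page_title=page_title,
        crawled_at=datetime.now(timezone.utc).isoformat(),
        content_hash=digest,
        content=content,
    )
    return status
```

Because the URL is the key, re-running ingestion can only overwrite a record, never append a second one for the same page.
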
### Step 2: Chunking Strategy
- Use LangChain `RecursiveCharacterTextSplitter`
- Chunk size target: **700 tokens**, measured with a token-based length function (the splitter counts characters by default)
- Overlap: **100 tokens**
- Preserve code blocks and heading boundaries whenever possible
- Tag each chunk with metadata:
  - `source_url`
  - `page_title`
  - `section_title`
  - `chunk_id`
  - `doc_version`
- Acceptance criteria:
  - No chunk splits in the middle of fenced code blocks
  - Every chunk can be traced back to an exact source page and section
  - Chunk output is deterministic for unchanged source documents
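
To make the "no split inside a fenced code block" and "deterministic output" criteria concrete, here is a rough standard-library sketch. It is not the LangChain splitter: `split_blocks`, `pack_chunks`, and `chunk_id` are hypothetical helpers, token counts are approximated by whitespace-separated words, and overlap is omitted for brevity.

```python
import hashlib

FENCE = "`" * 3  # markdown code-fence marker


def split_blocks(markdown: str) -> list[str]:
    """Split markdown into paragraph blocks, keeping each fenced code block whole."""
    blocks, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            if not in_fence and current:   # flush prose before an opening fence
                blocks.append("\n".join(current))
                current = []
            current.append(line)
            if in_fence:                   # closing fence completes the code block
                blocks.append("\n".join(current))
                current = []
            in_fence = not in_fence
        elif not in_fence and not line.strip():
            if current:                    # blank line ends a prose block
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)           # includes blank lines inside code
    if current:
        blocks.append("\n".join(current))
    return blocks


def pack_chunks(blocks: list[str], max_tokens: int = 700) -> list[str]:
    """Greedily pack whole blocks into chunks of roughly max_tokens.

    A single block larger than the budget becomes its own oversized chunk
    rather than being split.
    """
    chunks, current, used = [], [], 0
    for block in blocks:
        size = len(block.split())          # crude token proxy (assumption)
        if current and used + size > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(block)
        used += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def chunk_id(source_url: str, section_title: str, text: str) -> str:
    """Derive a stable ID from the chunk's origin and content."""
    payload = "\x1f".join([source_url, section_title, text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Deriving `chunk_id` from content means an unchanged source document always yields the same IDs, which is what later re-indexing relies on.
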
### Step 3: Embeddings + Vector Store
- Embedding model: OpenAI `text-embedding-3-small`
- Store embeddings in **ChromaDB**
- Each vector carries full metadata from Step 2
- Acceptance criteria:
  - Full FastAPI corpus is embedded successfully
  - Re-indexing does not create duplicate vectors for unchanged chunks
  - Querying Chroma returns chunk metadata required for citation display
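
One way to avoid duplicate vectors on re-indexing is to diff chunk IDs before calling the embedding API. The sketch below assumes chunk IDs are derived from chunk content (so an edited chunk gets a new ID) and that the set of already-indexed IDs can be read back from the vector store; `plan_reindex` is a hypothetical helper, not part of ChromaDB.

```python
def plan_reindex(indexed_ids: set[str], current_ids: set[str]) -> dict[str, set[str]]:
    """Compare IDs already in the vector store with IDs from the latest chunking run.

    Only 'to_embed' chunks need an embedding call; 'to_delete' IDs belong to
    chunks that no longer exist in the source documents.
    """
    return {
        "to_embed": current_ids - indexed_ids,
        "to_delete": indexed_ids - current_ids,
        "unchanged": indexed_ids & current_ids,
    }
```

With this plan in hand, unchanged chunks cost nothing on a re-run, and stale vectors are removed instead of lingering alongside their replacements.
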
---
## Phase 2: Hybrid Retrieval
### Step 4: BM25 + Semantic Search
- BM25 via `rank_bm25` library for keyword matching
- Vector similarity search via ChromaDB
- Merge both result sets using **Reciprocal Rank Fusion (RRF)**
- Retrieve top 10 BM25 hits and top 10 vector hits before fusion
- Acceptance criteria:
  - Known keyword-heavy queries surface exact-term matches
  - Known semantic queries surface relevant conceptual matches
  - Fused retrieval performs better than vector-only on the seed eval set
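
Reciprocal Rank Fusion scores each document as the sum of `1 / (k + rank)` over every ranking it appears in, so it needs only rank positions and not comparable scores from BM25 and the vector store. A minimal sketch follows; the constant `k = 60` is a commonly used default and an assumption here, since the plan does not fix a value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs into one list sorted by RRF score.

    Each input list is ordered best-first. Ties are broken by chunk ID so the
    output is deterministic.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Usage: fuse a BM25 ranking with a vector-search ranking (IDs are illustrative).
bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Chunks that appear in both lists ("a" and "b" above) accumulate two terms and rise above chunks found by only one retriever.
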
### Step 5: Cross-Encoder Re-ranker
- Use local `cross-encoder/ms-marco-MiniLM-L-6-v2`
- Re-rank the fused candidates (up to 20), then pass the top 5 to the LLM
- Expected to deliver the largest single quality gain in the pipeline; confirm this against the seed eval set
- Acceptance criteria:
  - Re-ranking is applied on every answer path
  - Top 5 contexts are logged for debugging and eval review
  - Reranked results improve answer grounding on the seed eval set
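
The re-ranking step can be sketched independently of the model by injecting the scoring function. In the MVP, `score_pairs` would presumably wrap the `predict` method of a `sentence_transformers` `CrossEncoder` loaded with `cross-encoder/ms-marco-MiniLM-L-6-v2`; that wiring, the `rerank` helper, and the `text` / `rerank_score` keys are assumptions for illustration.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    candidates: Sequence[dict],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int = 5,
) -> list[dict]:
    """Score (query, chunk text) pairs and return the top_n candidates, best first.

    Each candidate is a dict with at least a 'text' key. The returned dicts are
    copies with an added 'rerank_score', which is convenient for logging.
    """
    if not candidates:
        return []
    pairs = [(query, candidate["text"]) for candidate in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [{**candidate, "rerank_score": float(score)} for candidate, score in ranked[:top_n]]
```

Keeping the scorer injectable lets the lightweight PR checks run with a stub while the full pipeline loads the real model.
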
### Step 6: Citation Enforcement
- Each retrieved chunk carries `source_url` + `section_title`
- Prompt the LLM to answer only using provided context
- Force structured output: answer + list of cited chunks
- If no relevant chunk is found, return "I don't know" instead of a speculative answer
- Validate that every cited chunk ID exists in the retrieved context set
- Acceptance criteria:
  - API response returns answer text plus machine-readable citations
  - Unsupported answers are rejected or converted to "I don't know"
  - Manual spot checks confirm citations map to relevant evidence
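
The validation rule above (every cited chunk ID must exist in the retrieved set) could be enforced after the LLM call roughly as follows. This is a sketch under assumptions: the `enforce_citations` helper and the response shape are hypothetical, and the policy of rejecting the whole answer when any cited ID is unknown is one strict reading of "rejected or converted".

```python
IDK = "I don't know"


def enforce_citations(answer: str, cited_ids: list[str], retrieved: dict[str, dict]) -> dict:
    """Return the answer with resolved citations, or fall back to "I don't know".

    `retrieved` maps chunk_id -> metadata containing 'source_url' and
    'section_title' for the chunks that were actually passed to the LLM.
    """
    unique_ids = list(dict.fromkeys(cited_ids))           # de-duplicate, keep order
    has_unknown = any(cid not in retrieved for cid in unique_ids)
    if not answer.strip() or not unique_ids or has_unknown:
        return {"answer": IDK, "citations": []}
    citations = [
        {
            "chunk_id": cid,
            "source_url": retrieved[cid]["source_url"],
            "section_title": retrieved[cid]["section_title"],
        }
        for cid in unique_ids
    ]
    return {"answer": answer, "citations": citations}
```

An answer with no citations at all is treated the same as one with invalid citations, so uncited text never reaches the API response.
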
---
## Phase 3: Evaluation Pipeline
### Step 7: Build Eval Dataset
- Manually write **30-50 Q&A pairs** from FastAPI docs for the initial eval set
- Cover: factual questions, code questions, comparison questions
- Store as JSON/CSV in the repo
- Expand to 100+ only after the first reliable baseline is in place
- Acceptance criteria:
  - Eval set covers multiple sections of the docs
  - Each question has an expected answer and supporting source reference
  - The dataset is stable enough to compare runs over time
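
A small loader can check that each eval record carries an expected answer and a source reference before any run. The field names (`question`, `expected_answer`, `source_url`, `category`) and the `validate_eval_set` helper below are assumptions for illustration; the plan only requires that this information be present in some form.

```python
import json

REQUIRED_FIELDS = ("question", "expected_answer", "source_url", "category")
CATEGORIES = {"factual", "code", "comparison"}  # question types named in the plan


def validate_eval_set(raw_json: str) -> list[dict]:
    """Parse a JSON array of eval records and raise ValueError on the first problem."""
    records = json.loads(raw_json)
    seen_questions = set()
    for index, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
        if missing:
            raise ValueError(f"record {index} is missing fields: {missing}")
        if record["category"] not in CATEGORIES:
            raise ValueError(f"record {index} has unknown category: {record['category']}")
        if record["question"] in seen_questions:
            raise ValueError(f"record {index} duplicates an earlier question")
        seen_questions.add(record["question"])
    return records
```

Rejecting duplicates and incomplete records up front keeps the dataset comparable from one run to the next.
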
### Step 8: Ragas Evaluation Metrics
| Metric | What it measures |
|---|---|
| `faithfulness` | Are the answer's claims supported by the retrieved context? |
| `answer_relevancy` | Is the answer on-topic? |
| `context_precision` | Are the right chunks being retrieved? |
| `context_recall` | Are all relevant chunks being found? |
### Step 9: CI Integration
- Run lightweight regression checks on every PR via **GitHub Actions**
- Reserve full Ragas evaluation for scheduled or manually triggered runs
- Use an initial target such as `faithfulness >= 0.85` as a baseline, not a permanent hard-coded gate
- Store eval results over time to track regression
- Acceptance criteria:
  - PR workflow catches obvious retrieval or API regressions quickly
  - Full eval workflow produces repeatable metrics artifacts
  - Baseline metrics are documented before strict CI thresholds are enforced
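
Treating the target as a baseline instead of a hard-coded gate suggests comparing each run against stored baseline metrics with some tolerance. The sketch below shows one possible check; `check_metrics`, the `tolerance` value of 0.02, and the metric dictionaries are hypothetical and would come from the stored eval artifacts in practice.

```python
def check_metrics(current: dict[str, float], baseline: dict[str, float],
                  tolerance: float = 0.02) -> list[str]:
    """Return a list of regression messages; an empty list means the run passes.

    A metric fails when it is missing from the current run or falls more than
    `tolerance` below its documented baseline.
    """
    failures = []
    for name, base in baseline.items():
        value = current.get(name)
        if value is None:
            failures.append(f"{name}: missing from current run")
        elif value < base - tolerance:
            failures.append(f"{name}: {value:.3f} is below baseline {base:.3f} (tolerance {tolerance})")
    return failures
```

A CI script could exit non-zero when the returned list is non-empty and upload the messages as part of the metrics artifact.
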
---
## Tech Stack
| Layer | Tool |
|---|---|
| Orchestration | LangChain |
| Vector Store (dev) | ChromaDB |
| Vector Store (prod) | Deferred until post-MVP |
| Reranker | `cross-encoder/ms-marco-MiniLM-L-6-v2` |
| Evaluation | Ragas |
| LLM | `gpt-4o-mini` |
| Scraping | BeautifulSoup |
| CI/CD | GitHub Actions |
| API Layer | FastAPI |
| Frontend (optional) | Deferred until post-MVP |
---
## Suggested Build Timeline
| Week | Focus |
|---|---|
| Week 1 | Phase 1: Ingest, chunk, embed, store |
| Week 2 | Phase 2: BM25 + rerank + citation enforcement |
| Week 3 | Phase 3: Eval dataset + Ragas + CI pipeline |
| Week 4 | API hardening + README + baseline evaluation review |
---
## Bonus Features (Post-MVP)
- **Version-aware retrieval**: "In v2 vs v3, how does X work?"
- **Code snippet extraction**: return relevant code blocks alongside answers
- **Multi-doc search**: compare two frameworks side by side
- **Query rewriting**: auto-expand vague queries before retrieval
---
*Based on project notes from January 10-11, 2026*