Dev-Docs-Rag / dev_docs_rag_implementation_plan.md
rishitbhowmick's picture
feat: MVP for a developer documents analyzer.
7312837
# Production Grade RAG β€” Developer Documentation Assistant
## Implementation Plan
---
## Project Overview
A domain-specific "Ask my Docs" system built for developer documentation.
Users can query any SDK/API/framework documentation and receive accurate answers with citations.
**Initial Target Domain:** FastAPI Documentation
## Locked MVP Choices
The MVP is intentionally constrained to one implementation path. Any alternatives listed in earlier notes are out of scope until the baseline system is working and evaluated.
| Area | MVP Choice |
|---|---|
| Documentation source | Official FastAPI documentation pages only |
| Ingestion method | `BeautifulSoup` crawler + local normalized markdown/json output |
| Extra sources | No GitHub `/docs` sync, no PDF upload, no arbitrary `.md` ingestion in MVP |
| Chunking | Structure-aware chunking with LangChain `RecursiveCharacterTextSplitter` |
| Embeddings | OpenAI `text-embedding-3-small` |
| Vector store | `ChromaDB` only |
| Keyword retrieval | `rank_bm25` |
| Fusion | Reciprocal Rank Fusion (RRF) |
| Re-ranker | Local `cross-encoder/ms-marco-MiniLM-L-6-v2` |
| Answer model | OpenAI `gpt-4o-mini` |
| API layer | `FastAPI` |
| Evaluation | `Ragas` + a hand-written FastAPI eval set |
| CI gate | Lightweight retrieval/regression checks on PRs; full eval run on demand or nightly |
## MVP Scope Boundaries
- Index only the official FastAPI docs corpus.
- Treat the MVP as a single-tenant local/dev system.
- Optimize for answer quality and citation correctness before UI polish.
- Defer multi-doc search, version-aware retrieval, PDF ingestion, and production vector DB migration until after MVP validation.
---
## Phase 1: Fundamentals (Data Ingestion Pipeline)
### Step 1 β€” Document Collection
- Crawl official FastAPI documentation using `BeautifulSoup`
- Normalize each page into local markdown/json before chunking
- Store raw HTML plus cleaned content locally for reproducibility
- Exclude GitHub docs sync, PDF upload, and arbitrary file ingestion from MVP
- Acceptance criteria:
- All target FastAPI pages are fetched successfully
- Each page is stored with `source_url`, `page_title`, and crawl timestamp
- Re-running ingestion updates changed pages without duplicating records
### Step 2 β€” Chunking Strategy
- Use LangChain `RecursiveCharacterTextSplitter`
- Chunk size target: **700 tokens**
- Overlap: **100 tokens**
- Preserve code blocks and heading boundaries whenever possible
- Tag each chunk with metadata:
- `source_url`
- `page_title`
- `section_title`
- `chunk_id`
- `doc_version`
- Acceptance criteria:
- No chunk splits in the middle of fenced code blocks
- Every chunk can be traced back to an exact source page and section
- Chunk output is deterministic for unchanged source documents
### Step 3 β€” Embeddings + Vector Store
- Embedding model: OpenAI `text-embedding-3-small`
- Store embeddings in **ChromaDB**
- Each vector carries full metadata from Step 2
- Acceptance criteria:
- Full FastAPI corpus is embedded successfully
- Re-indexing does not create duplicate vectors for unchanged chunks
- Querying Chroma returns chunk metadata required for citation display
---
## Phase 2: Hybrid Retrieval
### Step 4 β€” BM25 + Semantic Search
- BM25 via `rank_bm25` library for keyword matching
- Vector similarity search via ChromaDB
- Merge both result sets using **Reciprocal Rank Fusion (RRF)**
- Retrieve top 10 BM25 hits and top 10 vector hits before fusion
- Acceptance criteria:
- Known keyword-heavy queries surface exact-term matches
- Known semantic queries surface relevant conceptual matches
- Fused retrieval performs better than vector-only on the seed eval set
### Step 5 β€” Cross Encoder Re-ranker
- Use local `cross-encoder/ms-marco-MiniLM-L-6-v2`
- Re-rank top 20 results β†’ pass top 5 to LLM
- Biggest quality boost in the pipeline
- Acceptance criteria:
- Re-ranking is applied on every answer path
- Top 5 contexts are logged for debugging and eval review
- Reranked results improve answer grounding on the seed eval set
### Step 6 β€” Citation Enforcement
- Each retrieved chunk carries `source_url` + `section_title`
- Prompt the LLM to answer only using provided context
- Force structured output: answer + list of cited chunks
- If no relevant chunk found β†’ return "I don't know" (no hallucination)
- Validate that every cited chunk ID exists in the retrieved context set
- Acceptance criteria:
- API response returns answer text plus machine-readable citations
- Unsupported answers are rejected or converted to "I don't know"
- Manual spot checks confirm citations map to relevant evidence
---
## Phase 3: Evaluation Pipeline
### Step 7 β€” Build Eval Dataset
- Manually write **30-50 Q&A pairs** from FastAPI docs for the initial eval set
- Cover: factual questions, code questions, comparison questions
- Store as JSON/CSV in the repo
- Expand to 100+ only after the first reliable baseline is in place
- Acceptance criteria:
- Eval set covers multiple sections of the docs
- Each question has an expected answer and supporting source reference
- The dataset is stable enough to compare runs over time
### Step 8 β€” Ragas Evaluation Metrics
| Metric | What it measures |
|---|---|
| `faithfulness` | Does the answer match retrieved context? |
| `answer_relevancy` | Is the answer on-topic? |
| `context_precision` | Are the right chunks being retrieved? |
| `context_recall` | Are all relevant chunks being found? |
### Step 9 β€” CI Integration
- Run lightweight regression checks on every PR via **GitHub Actions**
- Reserve full Ragas evaluation for scheduled or manually triggered runs
- Use an initial target such as `faithfulness >= 0.85` as a baseline, not a permanent hard-coded gate
- Store eval results over time to track regression
- Acceptance criteria:
- PR workflow catches obvious retrieval or API regressions quickly
- Full eval workflow produces repeatable metrics artifacts
- Baseline metrics are documented before strict CI thresholds are enforced
---
## Tech Stack
| Layer | Tool |
|---|---|
| Orchestration | LangChain |
| Vector Store (dev) | ChromaDB |
| Vector Store (prod) | Deferred until post-MVP |
| Reranker | `cross-encoder/ms-marco-MiniLM-L-6-v2` |
| Evaluation | Ragas |
| LLM | `gpt-4o-mini` |
| Scraping | BeautifulSoup |
| CI/CD | GitHub Actions |
| API Layer | FastAPI |
| Frontend (optional) | Deferred until post-MVP |
---
## Suggested Build Timeline
| Week | Focus |
|---|---|
| Week 1 | Phase 1 β€” Ingest, chunk, embed, store |
| Week 2 | Phase 2 β€” BM25 + rerank + citation enforcement |
| Week 3 | Phase 3 β€” Eval dataset + Ragas + CI pipeline |
| Week 4 | API hardening + README + baseline evaluation review |
---
## Bonus Features (Post-MVP)
- **Version-aware retrieval** β€” "In v2 vs v3, how does X work?"
- **Code snippet extraction** β€” return relevant code blocks alongside answers
- **Multi-doc search** β€” compare two frameworks side by side
- **Query rewriting** β€” auto-expand vague queries before retrieval
---
*Based on project notes from January 10–11, 2026*