# Production Grade RAG: Developer Documentation Assistant

## Implementation Plan

---

## Project Overview

A domain-specific "Ask my Docs" system built for developer documentation. Users can query any SDK/API/framework documentation and receive accurate answers with citations.

**Initial Target Domain:** FastAPI Documentation

## Locked MVP Choices

The MVP is intentionally constrained to one implementation path. Any alternatives listed in earlier notes are out of scope until the baseline system is working and evaluated.

| Area | MVP Choice |
|---|---|
| Documentation source | Official FastAPI documentation pages only |
| Ingestion method | `BeautifulSoup` crawler + local normalized markdown/json output |
| Extra sources | No GitHub `/docs` sync, no PDF upload, no arbitrary `.md` ingestion in MVP |
| Chunking | Structure-aware chunking with LangChain `RecursiveCharacterTextSplitter` |
| Embeddings | OpenAI `text-embedding-3-small` |
| Vector store | `ChromaDB` only |
| Keyword retrieval | `rank_bm25` |
| Fusion | Reciprocal Rank Fusion (RRF) |
| Re-ranker | Local `cross-encoder/ms-marco-MiniLM-L-6-v2` |
| Answer model | OpenAI `gpt-4o-mini` |
| API layer | `FastAPI` |
| Evaluation | `Ragas` + a hand-written FastAPI eval set |
| CI gate | Lightweight retrieval/regression checks on PRs; full eval run on demand or nightly |

## MVP Scope Boundaries

- Index only the official FastAPI docs corpus.
- Treat the MVP as a single-tenant local/dev system.
- Optimize for answer quality and citation correctness before UI polish.
- Defer multi-doc search, version-aware retrieval, PDF ingestion, and production vector DB migration until after MVP validation.

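
One way to keep the locked choices from drifting is to hold them in a single immutable settings object that the pipeline code reads from. The sketch below is only an illustration: the class name `MVPConfig` and its field names are hypothetical, while the values are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MVPConfig:
    """Locked MVP choices. Field names are illustrative; values come from the plan."""

    docs_source: str = "official FastAPI documentation"
    embedding_model: str = "text-embedding-3-small"
    vector_store: str = "chromadb"
    keyword_retriever: str = "rank_bm25"
    fusion: str = "rrf"
    reranker_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"
    answer_model: str = "gpt-4o-mini"


# A single shared instance; frozen=True makes accidental reassignment raise an error.
CONFIG = MVPConfig()
```

Freezing the dataclass means a later change to any choice has to be an explicit edit to this one definition, which matches the intent of a single implementation path.
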

---

## Phase 1: Fundamentals (Data Ingestion Pipeline)

### Step 1: Document Collection

- Crawl official FastAPI documentation using `BeautifulSoup`
- Normalize each page into local markdown/json before chunking
- Store raw HTML plus cleaned content locally for reproducibility
- Exclude GitHub docs sync, PDF upload, and arbitrary file ingestion from MVP
- Acceptance criteria:
  - All target FastAPI pages are fetched successfully
  - Each page is stored with `source_url`, `page_title`, and crawl timestamp
  - Re-running ingestion updates changed pages without duplicating records

### Step 2: Chunking Strategy

- Use LangChain `RecursiveCharacterTextSplitter` with a token-based length function (it counts characters by default)
- Chunk size target: **700 tokens**
- Overlap: **100 tokens**
- Preserve code blocks and heading boundaries whenever possible
- Tag each chunk with metadata:
  - `source_url`
  - `page_title`
  - `section_title`
  - `chunk_id`
  - `doc_version`
- Acceptance criteria:
  - No chunk splits in the middle of fenced code blocks
  - Every chunk can be traced back to an exact source page and section
  - Chunk output is deterministic for unchanged source documents

### Step 3: Embeddings + Vector Store

- Embedding model: OpenAI `text-embedding-3-small`
- Store embeddings in **ChromaDB**
- Each vector carries full metadata from Step 2
- Acceptance criteria:
  - Full FastAPI corpus is embedded successfully
  - Re-indexing does not create duplicate vectors for unchanged chunks
  - Querying Chroma returns chunk metadata required for citation display

---

## Phase 2: Hybrid Retrieval

### Step 4: BM25 + Semantic Search

- BM25 via `rank_bm25` library for keyword matching
- Vector similarity search via ChromaDB
- Merge both result sets using **Reciprocal Rank Fusion (RRF)**
- Retrieve top 10 BM25 hits and top 10 vector hits before fusion
- Acceptance criteria:
  - Known keyword-heavy queries surface exact-term matches
  - Known semantic queries surface relevant conceptual matches
  - Fused retrieval performs better than vector-only on the seed eval set

### Step 5: Cross-Encoder Re-ranker

- Use local `cross-encoder/ms-marco-MiniLM-L-6-v2`
- Re-rank the fused candidates from Step 4 (up to 20 before de-duplication) → pass top 5 to LLM
- Expected to be the biggest quality boost in the pipeline; confirm this against the seed eval set
- Acceptance criteria:
  - Re-ranking is applied on every answer path
  - Top 5 contexts are logged for debugging and eval review
  - Re-ranked results improve answer grounding on the seed eval set

### Step 6: Citation Enforcement

- Each retrieved chunk carries `source_url` + `section_title`
- Prompt the LLM to answer only using provided context
- Force structured output: answer + list of cited chunks
- If no relevant chunk is found → return "I don't know" instead of an unsupported answer
- Validate that every cited chunk ID exists in the retrieved context set
- Acceptance criteria:
  - API response returns answer text plus machine-readable citations
  - Unsupported answers are rejected or converted to "I don't know"
  - Manual spot checks confirm citations map to relevant evidence

---

## Phase 3: Evaluation Pipeline

### Step 7: Build Eval Dataset

- Manually write **30-50 Q&A pairs** from FastAPI docs for the initial eval set
- Cover: factual questions, code questions, comparison questions
- Store as JSON/CSV in the repo
- Expand to 100+ only after the first reliable baseline is in place
- Acceptance criteria:
  - Eval set covers multiple sections of the docs
  - Each question has an expected answer and supporting source reference
  - The dataset is stable enough to compare runs over time

### Step 8: Ragas Evaluation Metrics

| Metric | What it measures |
|---|---|
| `faithfulness` | Are the answer's claims supported by the retrieved context? |
| `answer_relevancy` | Does the answer address the question that was asked? |
| `context_precision` | Are the relevant chunks ranked near the top of the retrieved set? |
| `context_recall` | Does the retrieved context cover what the expected answer needs? |

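
Alongside the Ragas metrics, a cheap deterministic check can reuse the supporting source reference that Step 7 requires for every question. The sketch below computes a hit rate at k: the share of questions whose expected source page appears among the top-k retrieved chunks. It is an assumption-laden illustration, not part of the plan: `hit_rate_at_k`, the `question` and `source_url` record keys, and the `retrieve` callable are hypothetical names, and the sample URLs are placeholders.

```python
from typing import Callable, Iterable


def hit_rate_at_k(
    eval_set: Iterable[dict],
    retrieve: Callable[[str], list],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected source URL appears in the top-k chunks."""
    records = list(eval_set)
    if not records:
        raise ValueError("eval_set is empty")

    hits = 0
    for record in records:
        # `retrieve` is assumed to return chunk dicts ordered best-first.
        top_chunks = retrieve(record["question"])[:k]
        retrieved_urls = {chunk["source_url"] for chunk in top_chunks}
        if record["source_url"] in retrieved_urls:
            hits += 1
    return hits / len(records)


# Placeholder eval records and a stand-in retriever, for illustration only.
sample_eval_set = [
    {"question": "q1", "source_url": "https://example.com/docs/a"},
    {"question": "q2", "source_url": "https://example.com/docs/b"},
]


def fake_retrieve(question: str) -> list:
    # Always returns the same single page, so only q1 counts as a hit.
    return [{"source_url": "https://example.com/docs/a", "chunk_id": "a-0"}]


sample_hit_rate = hit_rate_at_k(sample_eval_set, fake_retrieve, k=5)
```

Because it needs no LLM calls, a check of this shape is fast enough to run on every PR, leaving the full Ragas run for scheduled or manual triggers.
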

### Step 9: CI Integration

- Run lightweight regression checks on every PR via **GitHub Actions**
- Reserve full Ragas evaluation for scheduled or manually triggered runs
- Use an initial target such as `faithfulness >= 0.85` as a baseline, not a permanent hard-coded gate
- Store eval results over time to track regression
- Acceptance criteria:
  - PR workflow catches obvious retrieval or API regressions quickly
  - Full eval workflow produces repeatable metrics artifacts
  - Baseline metrics are documented before strict CI thresholds are enforced

---

## Tech Stack

| Layer | Tool |
|---|---|
| Orchestration | LangChain |
| Vector Store (dev) | ChromaDB |
| Vector Store (prod) | Deferred until post-MVP |
| Reranker | `cross-encoder/ms-marco-MiniLM-L-6-v2` |
| Evaluation | Ragas |
| LLM | `gpt-4o-mini` |
| Scraping | BeautifulSoup |
| CI/CD | GitHub Actions |
| API Layer | FastAPI |
| Frontend (optional) | Deferred until post-MVP |

---

## Suggested Build Timeline

| Week | Focus |
|---|---|
| Week 1 | Phase 1: Ingest, chunk, embed, store |
| Week 2 | Phase 2: BM25 + rerank + citation enforcement |
| Week 3 | Phase 3: Eval dataset + Ragas + CI pipeline |
| Week 4 | API hardening + README + baseline evaluation review |

---

## Bonus Features (Post-MVP)

- **Version-aware retrieval**: "In v2 vs v3, how does X work?"
- **Code snippet extraction**: return relevant code blocks alongside answers
- **Multi-doc search**: compare two frameworks side by side
- **Query rewriting**: auto-expand vague queries before retrieval

---

*Based on project notes from January 10–11, 2026*
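
For reference, the Reciprocal Rank Fusion merge named in Step 4 can be sketched in a few lines. Each chunk receives `1 / (k + rank)` from every ranked list it appears in, and the summed scores decide the fused order. The function name `reciprocal_rank_fusion` is hypothetical, and the constant `k = 60` is a commonly used default that the plan does not specify, so treat it as an assumption to tune.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of chunk IDs into one ranked list.

    rankings: iterable of lists, e.g. [bm25_ids, vector_ids].
    k: smoothing constant (assumed default, not taken from the plan).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by chunk ID so output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Toy chunk IDs standing in for the top BM25 and top vector hits.
bm25_ids = ["a", "b", "c"]
vector_ids = ["b", "d", "a"]
fused_ids = reciprocal_rank_fusion([bm25_ids, vector_ids])
# fused_ids == ["b", "a", "d", "c"]
```

Chunks found by both retrievers ("a" and "b") rise above chunks found by only one, which is the behavior the fused-versus-vector-only acceptance criterion is meant to test. RRF uses only ranks, so BM25 scores and vector distances never need to be normalized against each other.
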