	# Production Grade RAG — Developer Documentation Assistant
	## Implementation Plan

	---

	## Project Overview

	A domain-specific "Ask my Docs" system built for developer documentation.
	Users can query any SDK/API/framework documentation and receive accurate answers with citations.

	Initial Target Domain: FastAPI Documentation

	## Locked MVP Choices

	The MVP is intentionally constrained to one implementation path. Any alternatives listed in earlier notes are out of scope until the baseline system is working and evaluated.

	| Area | MVP Choice |
	|---|---|
	| Documentation source | Official FastAPI documentation pages only |
	| Ingestion method | `BeautifulSoup` crawler + local normalized markdown/json output |
	| Extra sources | No GitHub `/docs` sync, no PDF upload, no arbitrary `.md` ingestion in MVP |
	| Chunking | Structure-aware chunking with LangChain `RecursiveCharacterTextSplitter` |
	| Embeddings | OpenAI `text-embedding-3-small` |
	| Vector store | `ChromaDB` only |
	| Keyword retrieval | `rank_bm25` |
	| Fusion | Reciprocal Rank Fusion (RRF) |
	| Re-ranker | Local `cross-encoder/ms-marco-MiniLM-L-6-v2` |
	| Answer model | OpenAI `gpt-4o-mini` |
	| API layer | `FastAPI` |
	| Evaluation | `Ragas` + a hand-written FastAPI eval set |
	| CI gate | Lightweight retrieval/regression checks on PRs; full eval run on demand or nightly |

	## MVP Scope Boundaries

	- Index only the official FastAPI docs corpus.
	- Treat the MVP as a single-tenant local/dev system.
	- Optimize for answer quality and citation correctness before UI polish.
	- Defer multi-doc search, version-aware retrieval, PDF ingestion, and production vector DB migration until after MVP validation.

	---

	## Phase 1: Fundamentals (Data Ingestion Pipeline)

	### Step 1 — Document Collection
	- Crawl official FastAPI documentation using `BeautifulSoup`
	- Normalize each page into local markdown/json before chunking
	- Store raw HTML plus cleaned content locally for reproducibility
	- Exclude GitHub docs sync, PDF upload, and arbitrary file ingestion from MVP
	- Acceptance criteria:
	  - All target FastAPI pages are fetched successfully
	  - Each page is stored with `source_url`, `page_title`, and crawl timestamp
	  - Re-running ingestion updates changed pages without duplicating records
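
The last acceptance criterion (re-runs update changed pages without duplicating records) can be met by keying each record on its URL and comparing content hashes. The following is a minimal sketch of that idea using only the standard library; the function name `upsert_page`, the one-file-per-URL layout, and the `crawled_at` and `content_hash` fields are assumptions for illustration, not part of the plan.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def upsert_page(store_dir: Path, source_url: str, page_title: str, content: str) -> str:
    """Write one normalized page record; return 'created', 'updated', or 'unchanged'."""
    store_dir.mkdir(parents=True, exist_ok=True)
    # One file per URL (named by a hash of the URL), so a re-crawl overwrites
    # the existing record instead of adding a second one.
    url_key = hashlib.sha256(source_url.encode("utf-8")).hexdigest()[:16]
    record_path = store_dir / f"{url_key}.json"
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()

    status = "created"
    if record_path.exists():
        previous = json.loads(record_path.read_text(encoding="utf-8"))
        if previous["content_hash"] == content_hash:
            return "unchanged"  # skip the write so the stored timestamp stays stable
        status = "updated"

    record = {
        "source_url": source_url,
        "page_title": page_title,
        "crawled_at": datetime.now(timezone.utc).isoformat(),
        "content_hash": content_hash,  # hypothetical field used for change detection
        "content": content,
    }
    record_path.write_text(json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8")
    return status
```

Only pages reported as `created` or `updated` would need to be re-chunked and re-embedded in the later steps.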

	### Step 2 — Chunking Strategy
	- Use LangChain `RecursiveCharacterTextSplitter`
	- Chunk size target: 700 tokens (the splitter counts characters by default, so token-based sizing needs a token length function)
	- Overlap: 100 tokens
	- Preserve code blocks and heading boundaries whenever possible
	- Tag each chunk with metadata:
	  - `source_url`
	  - `page_title`
	  - `section_title`
	  - `chunk_id`
	  - `doc_version`
	- Acceptance criteria:
	  - No chunk splits in the middle of fenced code blocks
	  - Every chunk can be traced back to an exact source page and section
	  - Chunk output is deterministic for unchanged source documents
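
To make the acceptance criteria concrete, here is a simplified, standard-library stand-in for the chunker. It is not `RecursiveCharacterTextSplitter`: it only illustrates keeping fenced code blocks whole, starting a new chunk at each heading, and deriving a deterministic `chunk_id` from a content hash. The function names, the character budget of 2800 (a rough stand-in for 700 tokens at about four characters per token), and the absence of overlap are all assumptions of this sketch.

```python
import hashlib

FENCE = "`" * 3  # markdown code-fence marker


def split_blocks(markdown: str) -> list[str]:
    """Split on blank lines, but never inside a fenced code block."""
    blocks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        if line.strip() or in_fence:
            current.append(line)
        elif current:
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks


def chunk_page(markdown: str, source_url: str, page_title: str,
               doc_version: str, max_chars: int = 2800) -> list[dict]:
    """Pack blocks into chunks; a block larger than max_chars stays whole."""
    chunks: list[dict] = []
    buffer: list[str] = []
    section = page_title  # used until the first heading appears

    def flush() -> None:
        if not buffer:
            return
        text = "\n\n".join(buffer)
        key = f"{source_url}\n{section}\n{text}"
        chunks.append({
            # Same source text always yields the same ID.
            "chunk_id": hashlib.sha256(key.encode("utf-8")).hexdigest()[:16],
            "source_url": source_url,
            "page_title": page_title,
            "section_title": section,
            "doc_version": doc_version,
            "text": text,
        })
        buffer.clear()

    for block in split_blocks(markdown):
        is_heading = block.startswith("#")
        too_big = sum(len(b) for b in buffer) + len(block) > max_chars
        if buffer and (is_heading or too_big):
            flush()
        if is_heading:
            section = block.splitlines()[0].lstrip("#").strip()
        buffer.append(block)
    flush()
    return chunks
```

Because the ID depends only on the URL, section, and text, unchanged pages produce identical chunk IDs across runs, which is what lets the indexing step skip or overwrite them.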

	### Step 3 — Embeddings + Vector Store
	- Embedding model: OpenAI `text-embedding-3-small`
	- Store embeddings in ChromaDB
	- Each vector carries full metadata from Step 2
	- Acceptance criteria:
	  - Full FastAPI corpus is embedded successfully
	  - Re-indexing does not create duplicate vectors for unchanged chunks
	  - Querying Chroma returns chunk metadata required for citation display

	---

	## Phase 2: Hybrid Retrieval

	### Step 4 — BM25 + Semantic Search
	- BM25 via `rank_bm25` library for keyword matching
	- Vector similarity search via ChromaDB
	- Merge both result sets using Reciprocal Rank Fusion (RRF)
	- Retrieve top 10 BM25 hits and top 10 vector hits before fusion
	- Acceptance criteria:
	  - Known keyword-heavy queries surface exact-term matches
	  - Known semantic queries surface relevant conceptual matches
	  - Fused retrieval performs better than vector-only on the seed eval set
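
Reciprocal Rank Fusion scores each chunk as the sum of `1 / (k + rank)` over every ranked list it appears in, so it needs only ranks, not comparable scores. A minimal sketch follows; the function name and the `top_n` parameter are illustrative, and `k = 60` is a commonly used default rather than a value taken from this plan.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60,
                           top_n: int = 20) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs into one list of (chunk_id, score)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID so output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_n]


# Toy example: in the plan each list would hold the top 10 hits of one retriever.
bm25_hits = ["a", "b", "c"]
vector_hits = ["c", "a", "d"]
fused_ids = [chunk_id for chunk_id, _ in reciprocal_rank_fusion([bm25_hits, vector_hits])]
```

Chunks found by both retrievers (`a` and `c` here) rise above chunks found by only one, which is the behaviour the vector-only comparison in the acceptance criteria is meant to test.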

	### Step 5 — Cross Encoder Re-ranker
	- Use local `cross-encoder/ms-marco-MiniLM-L-6-v2`
	- Re-rank top 20 results → pass top 5 to LLM
	- Expected to give the largest single quality gain in the pipeline; confirm this against the seed eval set
	- Acceptance criteria:
	  - Re-ranking is applied on every answer path
	  - Top 5 contexts are logged for debugging and eval review
	  - Reranked results improve answer grounding on the seed eval set
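
The re-rank step reduces to scoring each (query, chunk text) pair and keeping the best five. The sketch below keeps the scorer pluggable so it runs without model downloads; `rerank`, `score_pairs`, and the toy word-overlap scorer are hypothetical names for illustration. In the real pipeline the scorer would presumably be the `predict` method of a sentence-transformers `CrossEncoder` loaded with `cross-encoder/ms-marco-MiniLM-L-6-v2`, which accepts a list of text pairs and returns one score per pair.

```python
from typing import Callable, Sequence


def rerank(query: str, candidates: Sequence[dict],
           score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
           top_k: int = 5) -> list[dict]:
    """Score every (query, chunk text) pair and return the top_k chunks."""
    pairs = [(query, candidate["text"]) for candidate in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    # Attach the score so the top contexts can be logged for eval review.
    return [{**chunk, "rerank_score": float(score)} for chunk, score in ranked[:top_k]]


def toy_overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Stand-in scorer (shared word count), NOT a cross-encoder."""
    return [float(len(set(q.lower().split()) & set(t.lower().split()))) for q, t in pairs]


candidates = [
    {"chunk_id": "c1", "text": "query parameters in FastAPI"},
    {"chunk_id": "c2", "text": "path parameters are declared in the route"},
    {"chunk_id": "c3", "text": "background tasks"},
]
top = rerank("path parameters", candidates, toy_overlap_scorer, top_k=2)
```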

	### Step 6 — Citation Enforcement
	- Each retrieved chunk carries `source_url` + `section_title`
	- Prompt the LLM to answer only using provided context
	- Force structured output: answer + list of cited chunks
	- If no relevant chunk is found → return "I don't know" instead of generating an unsupported answer
	- Validate that every cited chunk ID exists in the retrieved context set
	- Acceptance criteria:
	  - API response returns answer text plus machine-readable citations
	  - Unsupported answers are rejected or converted to "I don't know"
	  - Manual spot checks confirm citations map to relevant evidence
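
The validation rule above (every cited chunk ID must exist in the retrieved context set) can be enforced after the model responds. This is a minimal sketch that assumes the structured output is a dict of the form `{"answer": str, "citations": [chunk_id, ...]}`; that shape and the function name `enforce_citations` are assumptions, since the plan does not fix a schema.

```python
IDK_ANSWER = "I don't know"


def enforce_citations(llm_output: dict, retrieved_ids: set[str]) -> dict:
    """Return the answer only if it cites at least one chunk and all cited IDs were retrieved."""
    answer = str(llm_output.get("answer") or "").strip()
    cited = llm_output.get("citations")
    if not answer or not isinstance(cited, list) or not cited:
        return {"answer": IDK_ANSWER, "citations": []}
    if any(chunk_id not in retrieved_ids for chunk_id in cited):
        # A citation that was never retrieved means the answer is not grounded.
        return {"answer": IDK_ANSWER, "citations": []}
    # Drop repeated IDs while keeping the model's citation order.
    return {"answer": answer, "citations": list(dict.fromkeys(cited))}
```

This check only proves that citations point at retrieved chunks; whether the cited text actually supports the answer still needs the manual spot checks and the faithfulness metric.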

	---

	## Phase 3: Evaluation Pipeline

	### Step 7 — Build Eval Dataset
	- Manually write 30-50 Q&A pairs from FastAPI docs for the initial eval set
	- Cover: factual questions, code questions, comparison questions
	- Store as JSON/CSV in the repo
	- Expand to 100+ only after the first reliable baseline is in place
	- Acceptance criteria:
	  - Eval set covers multiple sections of the docs
	  - Each question has an expected answer and supporting source reference
	  - The dataset is stable enough to compare runs over time
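
A small validator can keep the hand-written eval set consistent as it grows. The sketch below assumes each record has `question`, `expected_answer`, `source_url`, and `category` fields, with categories matching the three question types listed above; the field names and the sample record are illustrative, not a schema defined by this plan.

```python
REQUIRED_FIELDS = ("question", "expected_answer", "source_url", "category")
CATEGORIES = {"factual", "code", "comparison"}


def validate_eval_set(records: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the set is valid."""
    errors: list[str] = []
    seen_questions: set[str] = set()
    for index, record in enumerate(records):
        for field in REQUIRED_FIELDS:
            if not str(record.get(field) or "").strip():
                errors.append(f"record {index}: missing '{field}'")
        category = record.get("category")
        if category and category not in CATEGORIES:
            errors.append(f"record {index}: unknown category '{category}'")
        question = str(record.get("question") or "").strip().lower()
        if question:
            if question in seen_questions:
                errors.append(f"record {index}: duplicate question")
            seen_questions.add(question)
    return errors


# Illustrative record only; real entries are written by hand from the docs.
sample = {
    "question": "How do you declare a path parameter in FastAPI?",
    "expected_answer": "Put the name in braces in the route path and accept it as a function argument.",
    "source_url": "https://fastapi.tiangolo.com/tutorial/path-params/",
    "category": "code",
}
problems = validate_eval_set([sample])
```

Running this as part of the lightweight PR checks would catch malformed entries before they skew a full evaluation run.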

	### Step 8 — Ragas Evaluation Metrics
	| Metric | What it measures |
	|---|---|
	| `faithfulness` | Are the claims in the answer supported by the retrieved context? |
	| `answer_relevancy` | Does the answer address the question that was asked? |
	| `context_precision` | Are the relevant chunks ranked near the top of the retrieved set? |
	| `context_recall` | Does the retrieved context cover the information in the reference answer? |

	### Step 9 — CI Integration
	- Run lightweight regression checks on every PR via GitHub Actions
	- Reserve full Ragas evaluation for scheduled or manually triggered runs
	- Use an initial target such as `faithfulness >= 0.85` as a baseline, not a permanent hard-coded gate
	- Store eval results over time to track regression
	- Acceptance criteria:
	  - PR workflow catches obvious retrieval or API regressions quickly
	  - Full eval workflow produces repeatable metrics artifacts
	  - Baseline metrics are documented before strict CI thresholds are enforced
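
Treating the threshold as a baseline rather than a hard-coded gate suggests comparing each run against stored baseline metrics with some tolerance. The following is one possible sketch; `check_regression`, the metric dictionaries, and the tolerance of 0.02 are assumptions chosen for illustration.

```python
def check_regression(current: dict[str, float], baseline: dict[str, float],
                     tolerance: float = 0.02) -> list[str]:
    """Return one failure message per metric that is missing or dropped below baseline - tolerance."""
    failures: list[str] = []
    for metric, base in sorted(baseline.items()):
        value = current.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from current run")
        elif value < base - tolerance:
            failures.append(f"{metric}: {value:.3f} is below baseline {base:.3f} (tolerance {tolerance})")
    return failures


# Hypothetical numbers for illustration; real values come from stored eval artifacts.
baseline_metrics = {"faithfulness": 0.85, "context_recall": 0.80}
current_metrics = {"faithfulness": 0.86, "context_recall": 0.70}
failures = check_regression(current_metrics, baseline_metrics)
```

A CI job could print the failures and exit with a non-zero status when the list is non-empty, while the baseline file is updated deliberately after a reviewed improvement.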

	---

	## Tech Stack

	| Layer | Tool |
	|---|---|
	| Orchestration | LangChain |
	| Vector Store (dev) | ChromaDB |
	| Vector Store (prod) | Deferred until post-MVP |
	| Reranker | `cross-encoder/ms-marco-MiniLM-L-6-v2` |
	| Evaluation | Ragas |
	| LLM | `gpt-4o-mini` |
	| Scraping | BeautifulSoup |
	| CI/CD | GitHub Actions |
	| API Layer | FastAPI |
	| Frontend (optional) | Deferred until post-MVP |

	---

	## Suggested Build Timeline

	| Week | Focus |
	|---|---|
	| Week 1 | Phase 1: Ingest, chunk, embed, store |
	| Week 2 | Phase 2: BM25 + rerank + citation enforcement |
	| Week 3 | Phase 3: Eval dataset + Ragas + CI pipeline |
	| Week 4 | API hardening + README + baseline evaluation review |

	---

	## Bonus Features (Post-MVP)

	- Version-aware retrieval — "In v2 vs v3, how does X work?"
	- Code snippet extraction — return relevant code blocks alongside answers
	- Multi-doc search — compare two frameworks side by side
	- Query rewriting — auto-expand vague queries before retrieval

	---

	Based on project notes from January 10–11, 2026