Spaces:

rishitbhowmick
/

Dev-Docs-Rag

Paused

App Files Files Community

Dev-Docs-Rag / dev_docs_rag_implementation_plan.md

rishitbhowmick

feat: MVP for a developer documents analyzer.

7312837 about 2 months ago

preview code

raw

history blame contribute delete

7.14 kB

Production Grade RAG — Developer Documentation Assistant

Implementation Plan

Project Overview

A domain-specific "Ask my Docs" system built for developer documentation. Users can query any SDK/API/framework documentation and receive accurate answers with citations.

Initial Target Domain: FastAPI Documentation

Locked MVP Choices

The MVP is intentionally constrained to one implementation path. Any alternatives listed in earlier notes are out of scope until the baseline system is working and evaluated.

Area	MVP Choice
Documentation source	Official FastAPI documentation pages only
Ingestion method	`BeautifulSoup` crawler + local normalized markdown/json output
Extra sources	No GitHub `/docs` sync, no PDF upload, no arbitrary `.md` ingestion in MVP
Chunking	Structure-aware chunking with LangChain `RecursiveCharacterTextSplitter`
Embeddings	OpenAI `text-embedding-3-small`
Vector store	`ChromaDB` only
Keyword retrieval	`rank_bm25`
Fusion	Reciprocal Rank Fusion (RRF)
Re-ranker	Local `cross-encoder/ms-marco-MiniLM-L-6-v2`
Answer model	OpenAI `gpt-4o-mini`
API layer	`FastAPI`
Evaluation	`Ragas` + a hand-written FastAPI eval set
CI gate	Lightweight retrieval/regression checks on PRs; full eval run on demand or nightly

MVP Scope Boundaries

Index only the official FastAPI docs corpus.
Treat the MVP as a single-tenant local/dev system.
Optimize for answer quality and citation correctness before UI polish.
Defer multi-doc search, version-aware retrieval, PDF ingestion, and production vector DB migration until after MVP validation.

Phase 1: Fundamentals (Data Ingestion Pipeline)

Step 1 — Document Collection

Crawl official FastAPI documentation using BeautifulSoup
Normalize each page into local markdown/json before chunking
Store raw HTML plus cleaned content locally for reproducibility
Exclude GitHub docs sync, PDF upload, and arbitrary file ingestion from MVP
Acceptance criteria:
- All target FastAPI pages are fetched successfully
- Each page is stored with source_url, page_title, and crawl timestamp
- Re-running ingestion updates changed pages without duplicating records

Step 2 — Chunking Strategy

Use LangChain RecursiveCharacterTextSplitter
Chunk size target: 700 tokens
Overlap: 100 tokens
Preserve code blocks and heading boundaries whenever possible
Tag each chunk with metadata:
- source_url
- page_title
- section_title
- chunk_id
- doc_version
Acceptance criteria:
- No chunk splits in the middle of fenced code blocks
- Every chunk can be traced back to an exact source page and section
- Chunk output is deterministic for unchanged source documents

Step 3 — Embeddings + Vector Store

Embedding model: OpenAI text-embedding-3-small
Store embeddings in ChromaDB
Each vector carries full metadata from Step 2
Acceptance criteria:
- Full FastAPI corpus is embedded successfully
- Re-indexing does not create duplicate vectors for unchanged chunks
- Querying Chroma returns chunk metadata required for citation display

Phase 2: Hybrid Retrieval

Step 4 — BM25 + Semantic Search

BM25 via rank_bm25 library for keyword matching
Vector similarity search via ChromaDB
Merge both result sets using Reciprocal Rank Fusion (RRF)
Retrieve top 10 BM25 hits and top 10 vector hits before fusion
Acceptance criteria:
- Known keyword-heavy queries surface exact-term matches
- Known semantic queries surface relevant conceptual matches
- Fused retrieval performs better than vector-only on the seed eval set

Step 5 — Cross Encoder Re-ranker

Use local cross-encoder/ms-marco-MiniLM-L-6-v2
Re-rank top 20 results → pass top 5 to LLM
Biggest quality boost in the pipeline
Acceptance criteria:
- Re-ranking is applied on every answer path
- Top 5 contexts are logged for debugging and eval review
- Reranked results improve answer grounding on the seed eval set

Step 6 — Citation Enforcement

Each retrieved chunk carries source_url + section_title
Prompt the LLM to answer only using provided context
Force structured output: answer + list of cited chunks
If no relevant chunk found → return "I don't know" (no hallucination)
Validate that every cited chunk ID exists in the retrieved context set
Acceptance criteria:
- API response returns answer text plus machine-readable citations
- Unsupported answers are rejected or converted to "I don't know"
- Manual spot checks confirm citations map to relevant evidence

Phase 3: Evaluation Pipeline

Step 7 — Build Eval Dataset

Manually write 30-50 Q&A pairs from FastAPI docs for the initial eval set
Cover: factual questions, code questions, comparison questions
Store as JSON/CSV in the repo
Expand to 100+ only after the first reliable baseline is in place
Acceptance criteria:
- Eval set covers multiple sections of the docs
- Each question has an expected answer and supporting source reference
- The dataset is stable enough to compare runs over time

Step 8 — Ragas Evaluation Metrics

Metric	What it measures
`faithfulness`	Does the answer match retrieved context?
`answer_relevancy`	Is the answer on-topic?
`context_precision`	Are the right chunks being retrieved?
`context_recall`	Are all relevant chunks being found?

Step 9 — CI Integration

Run lightweight regression checks on every PR via GitHub Actions
Reserve full Ragas evaluation for scheduled or manually triggered runs
Use an initial target such as faithfulness >= 0.85 as a baseline, not a permanent hard-coded gate
Store eval results over time to track regression
Acceptance criteria:
- PR workflow catches obvious retrieval or API regressions quickly
- Full eval workflow produces repeatable metrics artifacts
- Baseline metrics are documented before strict CI thresholds are enforced

Tech Stack

Layer	Tool
Orchestration	LangChain
Vector Store (dev)	ChromaDB
Vector Store (prod)	Deferred until post-MVP
Reranker	`cross-encoder/ms-marco-MiniLM-L-6-v2`
Evaluation	Ragas
LLM	`gpt-4o-mini`
Scraping	BeautifulSoup
CI/CD	GitHub Actions
API Layer	FastAPI
Frontend (optional)	Deferred until post-MVP

Suggested Build Timeline

Week	Focus
Week 1	Phase 1 — Ingest, chunk, embed, store
Week 2	Phase 2 — BM25 + rerank + citation enforcement
Week 3	Phase 3 — Eval dataset + Ragas + CI pipeline
Week 4	API hardening + README + baseline evaluation review

Bonus Features (Post-MVP)

Version-aware retrieval — "In v2 vs v3, how does X work?"
Code snippet extraction — return relevant code blocks alongside answers
Multi-doc search — compare two frameworks side by side
Query rewriting — auto-expand vague queries before retrieval

Based on project notes from January 10–11, 2026