# Production Grade RAG — Developer Documentation Assistant
## Implementation Plan

---

## Project Overview

A domain-specific "Ask my Docs" system for developer documentation.
The long-term goal is to let users query any SDK, API, or framework documentation and receive answers grounded in the source text, with citations.

**Initial Target Domain:** FastAPI Documentation

## Locked MVP Choices

The MVP is intentionally constrained to one implementation path. Any alternatives listed in earlier notes are out of scope until the baseline system is working and evaluated.

| Area | MVP Choice |
|---|---|
| Documentation source | Official FastAPI documentation pages only |
| Ingestion method | `BeautifulSoup` crawler + local normalized markdown/json output |
| Extra sources | No GitHub `/docs` sync, no PDF upload, no arbitrary `.md` ingestion in MVP |
| Chunking | Structure-aware chunking with LangChain `RecursiveCharacterTextSplitter` |
| Embeddings | OpenAI `text-embedding-3-small` |
| Vector store | `ChromaDB` only |
| Keyword retrieval | `rank_bm25` |
| Fusion | Reciprocal Rank Fusion (RRF) |
| Re-ranker | Local `cross-encoder/ms-marco-MiniLM-L-6-v2` |
| Answer model | OpenAI `gpt-4o-mini` |
| API layer | `FastAPI` |
| Evaluation | `Ragas` + a hand-written FastAPI eval set |
| CI gate | Lightweight retrieval/regression checks on PRs; full eval run on demand or nightly |

## MVP Scope Boundaries

- Index only the official FastAPI docs corpus.
- Treat the MVP as a single-tenant local/dev system.
- Optimize for answer quality and citation correctness before UI polish.
- Defer multi-doc search, version-aware retrieval, PDF ingestion, and production vector DB migration until after MVP validation.

---

## Phase 1: Fundamentals (Data Ingestion Pipeline)

### Step 1 — Document Collection
- Crawl official FastAPI documentation using `BeautifulSoup`
- Normalize each page into local markdown/json before chunking
- Store raw HTML plus cleaned content locally for reproducibility
- Exclude GitHub docs sync, PDF upload, and arbitrary file ingestion from MVP
- Acceptance criteria:
  - All target FastAPI pages are fetched successfully
  - Each page is stored with `source_url`, `page_title`, and crawl timestamp
  - Re-running ingestion updates changed pages without duplicating records
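
The "no duplicate records" criterion above can be met by keying each record on `source_url` and comparing a content hash between crawls. A minimal sketch, assuming an in-memory dict stands in for the local store; `upsert_page` and the `content_hash` and `crawled_at` field names are hypothetical and not part of the plan:

```python
import hashlib
from datetime import datetime, timezone


def upsert_page(store: dict, source_url: str, page_title: str, html: str) -> str:
    """Insert or update one crawled page; return 'created', 'updated', or 'unchanged'."""
    content_hash = hashlib.sha256(html.encode("utf-8")).hexdigest()
    existing = store.get(source_url)
    if existing is not None and existing["content_hash"] == content_hash:
        # Same content as the last crawl: keep the old record and timestamp.
        return "unchanged"
    store[source_url] = {
        "source_url": source_url,
        "page_title": page_title,
        "raw_html": html,
        "content_hash": content_hash,  # hypothetical field used for change detection
        "crawled_at": datetime.now(timezone.utc).isoformat(),
    }
    return "created" if existing is None else "updated"
```

Because the store is keyed by URL, re-running ingestion can only overwrite a record, never add a second one for the same page.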

### Step 2 — Chunking Strategy
- Use LangChain `RecursiveCharacterTextSplitter`
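- Chunk size target: **700 tokens**
- Overlap: **100 tokens**
- Measure length in tokens, not characters: `RecursiveCharacterTextSplitter` counts characters by default, so configure a token-based length function (for example via `from_tiktoken_encoder`)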
- Preserve code blocks and heading boundaries whenever possible
- Tag each chunk with metadata:
  - `source_url`
  - `page_title`
  - `section_title`
  - `chunk_id`
  - `doc_version`
- Acceptance criteria:
  - No chunk splits in the middle of fenced code blocks
  - Every chunk can be traced back to an exact source page and section
  - Chunk output is deterministic for unchanged source documents
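
The fence-preservation and determinism criteria above can be illustrated without any library. The sketch below is not the LangChain splitter: it is a plain-Python stand-in that assumes blank-line-separated markdown blocks, uses whitespace word counts as a crude proxy for tokens, and omits overlap. `split_blocks` and `pack_chunks` are hypothetical helper names:

```python
import hashlib

FENCE = "`" * 3  # markdown code-fence marker


def split_blocks(markdown: str) -> list[str]:
    """Split on blank lines, but never inside a fenced code block."""
    blocks, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.strip().startswith(FENCE):
            in_fence = not in_fence
        if not in_fence and line.strip() == "":
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def pack_chunks(blocks: list[str], source_url: str, section_title: str,
                max_tokens: int = 700) -> list[dict]:
    """Greedily pack whole blocks into chunks of at most max_tokens words.

    A single block larger than the budget becomes its own oversized chunk
    rather than being split.
    """
    texts, current, size = [], [], 0
    for block in blocks:
        n = len(block.split())  # word count as a rough token estimate
        if current and size + n > max_tokens:
            texts.append("\n\n".join(current))
            current, size = [], 0
        current.append(block)
        size += n
    if current:
        texts.append("\n\n".join(current))
    return [
        {
            # Content-derived ID: unchanged input always yields the same ID.
            "chunk_id": hashlib.sha256(
                f"{source_url}\n{section_title}\n{text}".encode("utf-8")
            ).hexdigest()[:16],
            "source_url": source_url,
            "section_title": section_title,
            "text": text,
        }
        for text in texts
    ]
```

Deriving `chunk_id` from the source URL, section, and text means two runs over the same document produce identical IDs, which later steps can rely on to detect unchanged chunks.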

### Step 3 — Embeddings + Vector Store
- Embedding model: OpenAI `text-embedding-3-small`
- Store embeddings in **ChromaDB**
- Each vector carries full metadata from Step 2
- Acceptance criteria:
  - Full FastAPI corpus is embedded successfully
  - Re-indexing does not create duplicate vectors for unchanged chunks
  - Querying Chroma returns chunk metadata required for citation display
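
One way to satisfy the "no duplicate vectors" criterion above is to diff chunk IDs before calling the embedding API. A minimal sketch, assuming each chunk has a stable, content-derived `chunk_id`; `plan_reindex` is a hypothetical helper and does not touch Chroma itself:

```python
def plan_reindex(existing_ids: set[str], new_chunks: list[dict]) -> dict:
    """Decide which chunks need embedding and which stored vectors are stale."""
    new_ids = {chunk["chunk_id"] for chunk in new_chunks}
    return {
        # Only chunks whose ID is not stored yet need an embedding call.
        "to_embed": [c for c in new_chunks if c["chunk_id"] not in existing_ids],
        # Stored IDs that no longer appear in the corpus should be removed.
        "to_delete": sorted(existing_ids - new_ids),
        "unchanged": sorted(existing_ids & new_ids),
    }
```

With Chroma, `to_embed` would typically feed `collection.upsert(...)` and `to_delete` would feed `collection.delete(ids=...)`. That wiring is an assumption; the plan does not specify it.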

---

## Phase 2: Hybrid Retrieval

### Step 4 — BM25 + Semantic Search
- BM25 via `rank_bm25` library for keyword matching
- Vector similarity search via ChromaDB
- Merge both result sets using **Reciprocal Rank Fusion (RRF)**
- Retrieve top 10 BM25 hits and top 10 vector hits before fusion
- Acceptance criteria:
  - Known keyword-heavy queries surface exact-term matches
  - Known semantic queries surface relevant conceptual matches
  - Fused retrieval performs better than vector-only on the seed eval set
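
Reciprocal Rank Fusion scores each chunk by summing `1 / (k + rank)` over every ranked list in which it appears. A minimal sketch, assuming each retriever returns chunk IDs ordered best-first; the constant `k = 60` is a commonly used default, not a value the plan specifies, and `reciprocal_rank_fusion` is a hypothetical function name:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several best-first rankings of chunk IDs into one scored ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by chunk ID so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_hits = ["a", "b", "c"]    # illustrative IDs only
vector_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

A chunk that appears in both lists accumulates two terms, so agreement between keyword and semantic retrieval is rewarded without having to compare raw BM25 scores against cosine similarities.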

### Step 5 — Cross Encoder Re-ranker
- Use local `cross-encoder/ms-marco-MiniLM-L-6-v2`
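- Re-rank the fused candidates (at most 20, fewer when BM25 and vector hits overlap) → pass top 5 to LLM
- Expected to be the largest single quality gain in the pipeline; confirm this against the seed eval set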
- Acceptance criteria:
  - Re-ranking is applied on every answer path
  - Top 5 contexts are logged for debugging and eval review
  - Reranked results improve answer grounding on the seed eval set
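
The re-ranking step can be sketched independently of the model by passing the scorer in as a function. In the sketch below, `rerank` and the `rerank_score` field are hypothetical names, and candidates are assumed to be dicts with a `text` key. With `sentence-transformers` installed, `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict` could serve as `score_pairs`:

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    candidates: list[dict],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int = 5,
) -> list[dict]:
    """Score (query, chunk text) pairs and keep the top_n highest-scoring chunks."""
    if not candidates:
        return []
    pairs = [(query, candidate["text"]) for candidate in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    # Attach the score so the top contexts can be logged for eval review.
    return [{**candidate, "rerank_score": float(score)} for candidate, score in ranked[:top_n]]
```

Keeping the scorer injectable lets unit tests run with a trivial stand-in, while the real cross-encoder is loaded only in the serving path.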

### Step 6 — Citation Enforcement
- Each retrieved chunk carries `source_url` + `section_title`
- Prompt the LLM to answer only using provided context
- Force structured output: answer + list of cited chunks
- If no relevant chunk is found → return "I don't know" instead of an unsupported answer
- Validate that every cited chunk ID exists in the retrieved context set
- Acceptance criteria:
  - API response returns answer text plus machine-readable citations
  - Unsupported answers are rejected or converted to "I don't know"
  - Manual spot checks confirm citations map to relevant evidence
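
The citation check above can be enforced after generation, before the API responds. A minimal sketch, assuming the LLM is prompted to return JSON of the form `{"answer": str, "citations": [chunk_id, ...]}`; that schema and the `validate_answer` name are assumptions, not something the plan fixes:

```python
import json


def _fallback() -> dict:
    return {"answer": "I don't know", "citations": []}


def validate_answer(raw_llm_output: str, retrieved_ids: set[str]) -> dict:
    """Accept the answer only if it cites at least one chunk and every
    cited chunk ID was actually in the retrieved context set."""
    try:
        parsed = json.loads(raw_llm_output)
    except json.JSONDecodeError:
        return _fallback()
    if not isinstance(parsed, dict):
        return _fallback()
    answer, citations = parsed.get("answer"), parsed.get("citations")
    if not isinstance(answer, str) or not answer.strip():
        return _fallback()
    if not isinstance(citations, list) or not citations:
        return _fallback()  # an answer with no citations counts as unsupported
    if any(not isinstance(c, str) or c not in retrieved_ids for c in citations):
        return _fallback()  # cites a chunk that was never retrieved
    return {"answer": answer, "citations": citations}
```

This only verifies that citations point at retrieved chunks. Whether a cited chunk actually supports the claim still needs the manual spot checks and the faithfulness metric.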

---

## Phase 3: Evaluation Pipeline

### Step 7 — Build Eval Dataset
- Manually write **30-50 Q&A pairs** from FastAPI docs for the initial eval set
- Cover: factual questions, code questions, comparison questions
- Store as JSON/CSV in the repo
- Expand to 100+ only after the first reliable baseline is in place
- Acceptance criteria:
  - Eval set covers multiple sections of the docs
  - Each question has an expected answer and supporting source reference
  - The dataset is stable enough to compare runs over time
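
A small loader can enforce the per-record criteria above whenever the eval set changes. A minimal sketch, assuming the set is stored as a JSON array; the field names (`question`, `expected_answer`, `source_url`, `category`) and the `load_eval_set` helper are hypothetical:

```python
import json

REQUIRED_FIELDS = ("question", "expected_answer", "source_url", "category")
CATEGORIES = {"factual", "code", "comparison"}  # the three question types in the plan


def load_eval_set(text: str) -> list[dict]:
    """Parse the eval set and fail fast on incomplete or duplicate records."""
    records = json.loads(text)
    seen_questions = set()
    for index, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
        if missing:
            raise ValueError(f"record {index}: missing {missing}")
        if record["category"] not in CATEGORIES:
            raise ValueError(f"record {index}: unknown category {record['category']!r}")
        if record["question"] in seen_questions:
            raise ValueError(f"record {index}: duplicate question")
        seen_questions.add(record["question"])
    return records
```

Rejecting duplicates and incomplete rows at load time keeps metric changes between runs attributable to the pipeline rather than to drift in the dataset.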

### Step 8 — Ragas Evaluation Metrics
| Metric | What it measures |
|---|---|
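| `faithfulness` | Are the claims in the answer supported by the retrieved context? |
| `answer_relevancy` | Does the answer address the question that was asked? |
| `context_precision` | Are relevant chunks ranked above irrelevant ones in the retrieved set? |
| `context_recall` | Does the retrieved context cover the information in the reference answer? |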

### Step 9 — CI Integration
- Run lightweight regression checks on every PR via **GitHub Actions**
- Reserve full Ragas evaluation for scheduled or manually triggered runs
- Use an initial target such as `faithfulness >= 0.85` as a baseline, not a permanent hard-coded gate
- Store eval results over time to track regression
- Acceptance criteria:
  - PR workflow catches obvious retrieval or API regressions quickly
  - Full eval workflow produces repeatable metrics artifacts
  - Baseline metrics are documented before strict CI thresholds are enforced
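
Treating the target as a baseline rather than a fixed gate suggests comparing each run against stored metrics with some tolerance. A minimal sketch, assuming metrics are available as plain name-to-score dicts; `check_metrics` and the `max_drop` tolerance of 0.05 are hypothetical choices:

```python
def check_metrics(current: dict[str, float], baseline: dict[str, float],
                  max_drop: float = 0.05) -> list[str]:
    """Return a human-readable failure for each metric that regressed past the tolerance."""
    failures = []
    for name, base in sorted(baseline.items()):
        value = current.get(name)
        if value is None:
            failures.append(f"{name}: missing from current run")
        elif value < base - max_drop:
            failures.append(
                f"{name}: {value:.3f} is more than {max_drop:.2f} below baseline {base:.3f}"
            )
    return failures
```

A CI job could print the returned list and exit non-zero when it is non-empty, so the threshold lives in the stored baseline file instead of being hard-coded in the workflow.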

---

## Tech Stack

| Layer | Tool |
|---|---|
| Orchestration | LangChain |
| Vector Store (dev) | ChromaDB |
| Vector Store (prod) | Deferred until post-MVP |
| Reranker | `cross-encoder/ms-marco-MiniLM-L-6-v2` |
| Evaluation | Ragas |
| LLM | `gpt-4o-mini` |
| Scraping | BeautifulSoup |
| CI/CD | GitHub Actions |
| API Layer | FastAPI |
| Frontend (optional) | Deferred until post-MVP |

---

## Suggested Build Timeline

| Week | Focus |
|---|---|
| Week 1 | Phase 1 — Ingest, chunk, embed, store |
| Week 2 | Phase 2 — BM25 + rerank + citation enforcement |
| Week 3 | Phase 3 — Eval dataset + Ragas + CI pipeline |
| Week 4 | API hardening + README + baseline evaluation review |

---

## Bonus Features (Post-MVP)

- **Version-aware retrieval** — "In v2 vs v3, how does X work?"
- **Code snippet extraction** — return relevant code blocks alongside answers
- **Multi-doc search** — compare two frameworks side by side
- **Query rewriting** — auto-expand vague queries before retrieval

---

*Based on project notes from January 10–11, 2026*