---
title: DocsQA Smart Research Assistant
emoji: 📄
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
app_port: 7860
---

DocsQA Smart Research Assistant

This is my take-home submission for the ABSTRABIT AI/ML Engineer assignment: a RAG-powered assistant where users upload PDFs, ask questions, and get grounded answers with citations.

Live Project

  • Live app (Railway): https://docsbot-web-production.up.railway.app
  • GitHub: https://github.com/KBaba7/DocsBot
  • Loom walkthrough: add your link here

What I Built

The app supports authentication, PDF upload (up to 5 files and 10 pages per file), document chunking + vector indexing, and a chat experience that answers from uploaded documents first.
If the uploaded documents are not enough, the agent falls back to web search and cites those sources too.

Stack

  • FastAPI + SQLAlchemy
  • LangGraph agent
  • Groq chat model
  • Jina embeddings + Jina reranker
  • Supabase Postgres + pgvector
  • Railway deployment

How Retrieval Works

Uploaded PDFs are parsed page by page and split into chunks.
Each chunk is stored with metadata (document, page number, chunk index) and embedded into pgvector.

At question time:

  1. LLM-based document filtering selects relevant documents from the user's library
  2. Vector search retrieves relevant chunks from selected documents
  3. Jina reranking reorders the retrieved chunks for better final relevance
  4. The agent answers from those chunks when possible
  5. If evidence is weak, the agent uses web search and cites external URLs
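The five steps above can be sketched end to end. This is a minimal illustration, not the actual implementation: every helper is a trivial stand-in for the real component (the LLM filter, pgvector search, the Jina reranker, and the web search tool).

```python
# Illustrative sketch of the question-time pipeline; every helper is a
# trivial stand-in, not the real component.

def llm_filter_documents(question, docs):
    # step 1 stand-in: keep docs sharing at least one word with the question
    words = set(question.lower().split())
    return [d for d in docs if words & set(d["text"].lower().split())]

def vector_search(question, docs, k=4):
    # step 2 stand-in for pgvector cosine search over chunk embeddings
    return [d["text"] for d in docs[:k]]

def rerank(question, chunks):
    # step 3 stand-in for the Jina reranker
    return sorted(chunks, key=len)

def answer_question(question, docs):
    ranked = rerank(question, vector_search(question, llm_filter_documents(question, docs)))
    if ranked:
        # step 4: answer from document chunks when evidence exists
        return {"answer": f"Based on your documents: {ranked[0]}", "sources": ranked}
    # step 5: evidence is weak, fall back to web search (stubbed out here)
    return {"answer": "Answered from web search", "sources": []}
```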

Chunking Strategy

  • Splitter: LangChain RecursiveCharacterTextSplitter
  • Chunk size: 1000
  • Overlap: 150

Why this setup:

  • It prefers breaking on paragraphs and sentence boundaries before falling back to smaller separators.
  • It preserves more coherent chunks for contracts, specs, and structured PDFs.
  • A smaller overlap keeps recall while reducing duplicated context in retrieval.
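To make the overlap behaviour concrete, here is a heavily simplified pure-Python sketch of chunking with these numbers. The real app uses LangChain's `RecursiveCharacterTextSplitter`, which additionally prefers paragraph and sentence separators before falling back to a hard cut; this sketch shows only the sliding window with character overlap.

```python
def split_text(text, chunk_size=1000, overlap=150):
    """Simplified sketch: fixed-size windows with `overlap` characters of
    shared context between consecutive chunks. LangChain's splitter also
    prefers breaking on paragraph/sentence separators first."""
    step = max(1, chunk_size - overlap)  # guard against a non-advancing window
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```

With the production settings (1000/150), consecutive chunks share 150 characters, so a sentence cut at a chunk boundary is still retrievable from the neighbouring chunk.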

Retrieval Approach

I use cosine similarity search in pgvector, then apply Jina reranking for better final ordering.
The system uses an LLM-based retrieval planner to choose:

  • the final number of chunks to keep
  • the candidate pool to rerank

Those planner outputs are clamped to safe bounds before retrieval runs, so a bad LLM suggestion cannot blow up the context window or the rerank cost.
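The clamping can be sketched as follows. The lower bounds echo the `RETRIEVAL_K` and `RERANK_CANDIDATE_K` defaults documented later; the upper bounds here are illustrative assumptions, not the actual values.

```python
def clamp(value, lo, hi):
    return max(lo, min(hi, value))

def plan_retrieval(planned_k, planned_pool, k_bounds=(4, 10), pool_bounds=(12, 40)):
    """Whatever the LLM planner proposes, keep the final context size and
    the rerank candidate pool inside fixed bounds before retrieval runs.
    Upper bounds are illustrative; lower bounds mirror the env defaults."""
    k = clamp(planned_k, *k_bounds)
    pool = clamp(planned_pool, *pool_bounds)
    return k, max(pool, k)  # the candidate pool must at least cover the final k
```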

For each retrieved document source, the UI shows:

  • document name
  • page number
  • chunk excerpt

Agent Routing Logic

The agent is prompted to prefer document context first.

  • If the retrieved document context is sufficient: answer from the documents, with citations.
  • If it is not sufficient: say clearly that the documents are insufficient and use the web search tool.

This is implemented as tool-based behavior in LangGraph rather than a static fallback message.
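In LangGraph terms, this kind of behaviour maps to a conditional edge after retrieval. The sketch below is illustrative only: the node names and the sufficiency heuristic are assumptions, not the app's actual graph.

```python
# Hedged sketch of document-first routing as a LangGraph-style conditional
# edge. Node names and the 0.5 evidence threshold are illustrative.

def route_after_retrieval(state):
    """Return the next node: answer from documents when the retrieved
    context looks sufficient, otherwise call the web search tool."""
    if state.get("doc_context") and state.get("evidence_score", 0.0) >= 0.5:
        return "answer_from_documents"
    return "web_search_tool"
```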

Source Citations

Each turn stores/returns source metadata separately from the answer body.

  • Vector source cards include:
    • document name
    • page number
    • snippet (a short excerpt from the retrieved chunk)
  • Web source cards include:
    • title
    • URL
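The two card shapes could be modelled roughly as follows; the field names are illustrative and may not match the actual response schema.

```python
from dataclasses import dataclass

# Illustrative sketch of the per-turn source payloads; field names
# are assumptions, not the app's actual schema.

@dataclass
class VectorSource:
    document_name: str
    page_number: int
    snippet: str  # short excerpt from the retrieved chunk

@dataclass
class WebSource:
    title: str
    url: str

card = VectorSource("contract.pdf", 3, "Termination requires 30 days notice...")
```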

Conversation Memory

Conversation history is maintained within session scope, so follow-ups like “tell me more about that” work as expected. The frontend also preserves the visible chat thread per session, so upload-triggered page refreshes do not wipe the current conversation view.
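A minimal sketch of session-scoped memory, assuming history is keyed by session and replayed to the model on each turn (the real app's storage layer differs):

```python
from collections import defaultdict

# Illustrative only: session-scoped history keyed by session id,
# appended per turn and passed back to the model on follow-ups.
histories = defaultdict(list)

def remember(session_id, role, content):
    histories[session_id].append({"role": role, "content": content})
    return histories[session_id]
```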

Streaming UX

Answers are streamed into the chat UI progressively.

  • the visible response is rendered chunk by chunk
  • source cards are attached after the answer completes
  • a slight pacing delay is added so the stream feels live to the user

The streaming route is separate from the standard JSON /ask response path.
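The streaming behaviour described above can be sketched as a generator: text is yielded chunk by chunk with a pacing delay, and the source cards are emitted only after the text completes. The event shapes, chunk size, and delay value are illustrative assumptions.

```python
import time

# Illustrative sketch of the streaming route: tokens first, sources last.
# Event shapes, chunk size, and delay are assumptions, not the real wire format.

def stream_answer(answer, sources, delay=0.02, size=20):
    for i in range(0, len(answer), size):
        time.sleep(delay)  # slight pacing so the stream feels live
        yield {"type": "token", "text": answer[i:i + size]}
    yield {"type": "sources", "items": sources}  # attached after the answer completes
```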

Bonus Feature

I added hash-based deduplicated ingestion:

  • If the same PDF is uploaded again, processing/indexing is reused.
  • Access control is still user-scoped via ownership mapping.

Why I chose this:

  • saves compute/time,
  • avoids duplicate indexing,
  • keeps retrieval secure per user.
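The core of the dedup idea can be sketched with a content hash over the uploaded bytes: identical PDFs hash to the same key, so the expensive indexing runs once, while a separate ownership mapping keeps access user-scoped. Names and structures here are illustrative.

```python
import hashlib

# Illustrative sketch of hash-based deduplicated ingestion.
index_by_hash = {}  # content hash -> indexed document id (indexing runs once)
ownership = set()   # (user_id, content hash) pairs keep access user-scoped

def ingest(user_id, pdf_bytes):
    h = hashlib.sha256(pdf_bytes).hexdigest()
    if h not in index_by_hash:
        index_by_hash[h] = f"doc-{len(index_by_hash) + 1}"  # parse + embed only here
    ownership.add((user_id, h))
    return index_by_hash[h]
```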

I also implemented LLM-based document filtering:

  • The system sends all user documents (filename, summary, preview) to the LLM
  • The LLM analyzes them semantically and selects only the documents truly relevant to the query
  • Returns a JSON array of relevant file hashes
  • It is not forced to return a capped number of documents
  • Fallback returns all candidate document hashes if the LLM call fails
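Parsing that reply with the fallback can be sketched as below; the exact prompt and response format are assumptions, but the shape (JSON array of hashes, validated against the candidates, all candidates on failure) follows the list above.

```python
import json

# Illustrative sketch: expect a JSON array of file hashes from the LLM,
# validate against the candidates, and fall back to all candidates on failure.

def parse_filter_reply(reply, candidate_hashes):
    try:
        selected = json.loads(reply)
        if isinstance(selected, list) and all(isinstance(h, str) for h in selected):
            return [h for h in selected if h in candidate_hashes]
    except (json.JSONDecodeError, TypeError):
        pass
    return list(candidate_hashes)  # fallback: keep every candidate document
```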

Challenges I Ran Into

  1. Heavy embedding dependencies made deployment images too large.
    • I standardized on Jina API embeddings/reranking to keep the runtime lighter while preserving retrieval quality.
  2. Source rendering got messy across multiple chat turns.
    • I separated answer text from source payloads and extracted sources per turn.
  3. Intermittent DB DNS/pooler issues during deployment.
    • I improved connection handling and standardized Supabase transaction-pooler config.
  4. UI state was getting lost after document uploads.
    • I persisted the active chat thread in session storage so the current conversation remains visible after refresh.

If I Had More Time

  • Add conversation history UI to display past chat sessions
  • Add automated citation-faithfulness checks
  • Add Alembic migrations for cleaner schema evolution
  • Add stronger eval/observability for routing and retrieval quality

Local Setup

```bash
cp .env.example .env
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
uvicorn app.main:app --reload
```

Open: http://127.0.0.1:8000

Important Environment Variables

Required:

  • GROQ_API_KEY
  • SECRET_KEY
  • DATABASE_URL
  • JINA_API_KEY

Embeddings:

  • JINA_API_BASE (default: https://api.jina.ai/v1/embeddings)
  • JINA_EMBEDDING_MODEL (default: jina-embeddings-v3)
  • JINA_RERANKER_API_BASE (default: https://api.jina.ai/v1/rerank)
  • JINA_RERANKER_MODEL (default: jina-reranker-v3)
  • EMBEDDING_DIMENSIONS (default: 1024)
  • RETRIEVAL_K (default minimum final context size: 4)
  • RERANK_CANDIDATE_K (default minimum rerank candidate pool: 12)

Storage:

  • STORAGE_BACKEND=local|supabase
  • SUPABASE_URL
  • SUPABASE_SERVICE_ROLE_KEY
  • SUPABASE_STORAGE_BUCKET
  • SUPABASE_STORAGE_PREFIX

Web search:

  • WEB_SEARCH_PROVIDER=duckduckgo|tavily
  • TAVILY_API_KEY (if using Tavily)

Auth:

  • ACCESS_TOKEN_EXPIRE_MINUTES (default: 720)
  • For local development, lowering this can make login/logout testing easier

API Endpoints

  • POST /register
  • POST /login
  • POST /logout
  • POST /upload
  • GET /documents
  • DELETE /documents/{document_id}
  • GET /documents/{document_id}/pdf
  • POST /ask
  • POST /ask/stream

Sample Documents

As requested in the assignment, sample PDFs are included in test_documents/.

Railway Deployment

```bash
railway login
railway link
railway up
```

Set the same env vars in Railway service settings before deploying.