---
title: DocsQA Smart Research Assistant
emoji: 📄
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
app_port: 7860
---

DocsQA Smart Research Assistant

This is my take-home submission for the ABSTRABIT AI/ML Engineer assignment: a RAG-powered assistant where users upload PDFs, ask questions, and get grounded answers with citations.

Live Project

  • Live app (Railway): https://docsbot-web-production.up.railway.app
  • GitHub: https://github.com/KBaba7/DocsBot
  • Loom walkthrough: add your link here

What I Built

The app supports authentication, PDF upload (up to 5 files and 10 pages per file), document chunking + vector indexing, and a chat experience that answers from uploaded documents first.
If the uploaded documents are not enough, the agent falls back to web search and cites those sources too.

Stack

  • FastAPI + SQLAlchemy
  • LangGraph agent
  • Groq chat model
  • Jina embeddings + Jina reranker
  • Supabase Postgres + pgvector
  • Railway deployment

How Retrieval Works

Uploaded PDFs are parsed page by page and split into chunks.
Each chunk is stored with metadata (document, page number, chunk index) and embedded into pgvector.

At question time:

  1. LLM-based document filtering selects relevant documents from the user's library
  2. Vector search retrieves relevant chunks from selected documents
  3. Jina reranking reorders the retrieved chunks for better final relevance
  4. The agent answers from those chunks when possible
  5. If evidence is weak, the agent uses web search and cites external URLs
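The five steps above can be sketched end to end. This is a minimal illustration, not the actual implementation: every helper is a trivial stand-in for the real component (the LLM filter, pgvector search, the Jina reranker, and the web search tool).

```python
# Illustrative sketch of the question-time pipeline; every helper is a
# trivial stand-in, not the real component.

def llm_filter_documents(question, docs):
    # step 1 stand-in: keep docs sharing at least one word with the question
    words = set(question.lower().split())
    return [d for d in docs if words & set(d["text"].lower().split())]

def vector_search(question, docs, k=4):
    # step 2 stand-in for pgvector cosine search over chunk embeddings
    return [d["text"] for d in docs[:k]]

def rerank(question, chunks):
    # step 3 stand-in for the Jina reranker
    return sorted(chunks, key=len)

def answer_question(question, docs):
    ranked = rerank(question, vector_search(question, llm_filter_documents(question, docs)))
    if ranked:
        # step 4: answer from document chunks when evidence exists
        return {"answer": f"Based on your documents: {ranked[0]}", "sources": ranked}
    # step 5: evidence is weak, fall back to web search (stubbed out here)
    return {"answer": "Answered from web search", "sources": []}
```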

Chunking Strategy

  • Splitter: LangChain RecursiveCharacterTextSplitter
  • Chunk size: 1000
  • Overlap: 150

Why this setup:

  • It prefers breaking on paragraphs and sentence boundaries before falling back to smaller separators.
  • It preserves more coherent chunks for contracts, specs, and structured PDFs.
  • A smaller overlap keeps recall while reducing duplicated context in retrieval.
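To make the overlap behaviour concrete, here is a heavily simplified pure-Python sketch of chunking with these numbers. The real app uses LangChain's `RecursiveCharacterTextSplitter`, which additionally prefers paragraph and sentence separators before falling back to a hard cut; this sketch shows only the sliding window with character overlap.

```python
def split_text(text, chunk_size=1000, overlap=150):
    """Simplified sketch: fixed-size windows with `overlap` characters of
    shared context between consecutive chunks. LangChain's splitter also
    prefers breaking on paragraph/sentence separators first."""
    step = max(1, chunk_size - overlap)  # guard against a non-advancing window
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```

With the production settings (1000/150), consecutive chunks share 150 characters, so a sentence cut at a chunk boundary is still retrievable from the neighbouring chunk.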

Retrieval Approach

I use cosine similarity search in pgvector, then apply Jina reranking for better final ordering.
The system uses an LLM-based retrieval planner to choose:

  • the final number of chunks to keep
  • the candidate pool to rerank

Those planner outputs are clamped to safe bounds before retrieval runs, so a bad LLM suggestion cannot blow up the context window or the rerank cost.
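The clamping can be sketched as follows. The lower bounds echo the `RETRIEVAL_K` and `RERANK_CANDIDATE_K` defaults documented later; the upper bounds here are illustrative assumptions, not the actual values.

```python
def clamp(value, lo, hi):
    return max(lo, min(hi, value))

def plan_retrieval(planned_k, planned_pool, k_bounds=(4, 10), pool_bounds=(12, 40)):
    """Whatever the LLM planner proposes, keep the final context size and
    the rerank candidate pool inside fixed bounds before retrieval runs.
    Upper bounds are illustrative; lower bounds mirror the env defaults."""
    k = clamp(planned_k, *k_bounds)
    pool = clamp(planned_pool, *pool_bounds)
    return k, max(pool, k)  # the candidate pool must at least cover the final k
```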

For each retrieved document source, the UI shows:

  • document name
  • page number
  • chunk excerpt

Agent Routing Logic

The agent is prompted to prefer document context first.

  • If the retrieved document context is sufficient: answer from the documents, with citations.
  • If it is not sufficient: say clearly that the documents are insufficient and use the web search tool.

This is implemented as tool-based behavior in LangGraph rather than a static fallback message.
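In LangGraph terms, this kind of behaviour maps to a conditional edge after retrieval. The sketch below is illustrative only: the node names and the sufficiency heuristic are assumptions, not the app's actual graph.

```python
# Hedged sketch of document-first routing as a LangGraph-style conditional
# edge. Node names and the 0.5 evidence threshold are illustrative.

def route_after_retrieval(state):
    """Return the next node: answer from documents when the retrieved
    context looks sufficient, otherwise call the web search tool."""
    if state.get("doc_context") and state.get("evidence_score", 0.0) >= 0.5:
        return "answer_from_documents"
    return "web_search_tool"
```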

Source Citations

Each turn stores/returns source metadata separately from the answer body.

  • Vector source cards include:
    • document name
    • page number
    • snippet (a short excerpt from the retrieved chunk)
  • Web source cards include:
    • title
    • URL
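The two card shapes could be modelled roughly as follows; the field names are illustrative and may not match the actual response schema.

```python
from dataclasses import dataclass

# Illustrative sketch of the per-turn source payloads; field names
# are assumptions, not the app's actual schema.

@dataclass
class VectorSource:
    document_name: str
    page_number: int
    snippet: str  # short excerpt from the retrieved chunk

@dataclass
class WebSource:
    title: str
    url: str

card = VectorSource("contract.pdf", 3, "Termination requires 30 days notice...")
```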

Conversation Memory

Conversation history is maintained within session scope, so follow-ups like “tell me more about that” work as expected. The frontend also preserves the visible chat thread per session, so upload-triggered page refreshes do not wipe the current conversation view.
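A minimal sketch of session-scoped memory, assuming history is keyed by session and replayed to the model on each turn (the real app's storage layer differs):

```python
from collections import defaultdict

# Illustrative only: session-scoped history keyed by session id,
# appended per turn and passed back to the model on follow-ups.
histories = defaultdict(list)

def remember(session_id, role, content):
    histories[session_id].append({"role": role, "content": content})
    return histories[session_id]
```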

Streaming UX

Answers are streamed into the chat UI progressively.

  • the visible response is rendered chunk by chunk
  • source cards are attached after the answer completes
  • a slight pacing delay is added so the stream feels live to the user

The streaming route is separate from the standard JSON /ask response path.
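The streaming behaviour described above can be sketched as a generator: text is yielded chunk by chunk with a pacing delay, and the source cards are emitted only after the text completes. The event shapes, chunk size, and delay value are illustrative assumptions.

```python
import time

# Illustrative sketch of the streaming route: tokens first, sources last.
# Event shapes, chunk size, and delay are assumptions, not the real wire format.

def stream_answer(answer, sources, delay=0.02, size=20):
    for i in range(0, len(answer), size):
        time.sleep(delay)  # slight pacing so the stream feels live
        yield {"type": "token", "text": answer[i:i + size]}
    yield {"type": "sources", "items": sources}  # attached after the answer completes
```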

Bonus Feature

I added hash-based deduplicated ingestion:

  • If the same PDF is uploaded again, processing/indexing is reused.
  • Access control is still user-scoped via ownership mapping.

Why I chose this:

  • saves compute/time,
  • avoids duplicate indexing,
  • keeps retrieval secure per user.
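The core of the dedup idea can be sketched with a content hash over the uploaded bytes: identical PDFs hash to the same key, so the expensive indexing runs once, while a separate ownership mapping keeps access user-scoped. Names and structures here are illustrative.

```python
import hashlib

# Illustrative sketch of hash-based deduplicated ingestion.
index_by_hash = {}  # content hash -> indexed document id (indexing runs once)
ownership = set()   # (user_id, content hash) pairs keep access user-scoped

def ingest(user_id, pdf_bytes):
    h = hashlib.sha256(pdf_bytes).hexdigest()
    if h not in index_by_hash:
        index_by_hash[h] = f"doc-{len(index_by_hash) + 1}"  # parse + embed only here
    ownership.add((user_id, h))
    return index_by_hash[h]
```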

I also implemented LLM-based document filtering:

  • The system sends all user documents (filename, summary, preview) to the LLM
  • The LLM analyzes them semantically and selects only the documents truly relevant to the query
  • Returns a JSON array of relevant file hashes
  • It is not forced to return a capped number of documents
  • Fallback returns all candidate document hashes if the LLM call fails
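Parsing that reply with the fallback can be sketched as below; the exact prompt and response format are assumptions, but the shape (JSON array of hashes, validated against the candidates, all candidates on failure) follows the list above.

```python
import json

# Illustrative sketch: expect a JSON array of file hashes from the LLM,
# validate against the candidates, and fall back to all candidates on failure.

def parse_filter_reply(reply, candidate_hashes):
    try:
        selected = json.loads(reply)
        if isinstance(selected, list) and all(isinstance(h, str) for h in selected):
            return [h for h in selected if h in candidate_hashes]
    except (json.JSONDecodeError, TypeError):
        pass
    return list(candidate_hashes)  # fallback: keep every candidate document
```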

Challenges I Ran Into

  1. Heavy embedding dependencies made deployment images too large.
    • I standardized on Jina API embeddings/reranking to keep the runtime lighter while preserving retrieval quality.
  2. Source rendering got messy across multiple chat turns.
    • I separated answer text from source payloads and extracted sources per turn.
  3. Intermittent DB DNS/pooler issues during deployment.
    • I improved connection handling and standardized Supabase transaction-pooler config.
  4. UI state was getting lost after document uploads.
    • I persisted the active chat thread in session storage so the current conversation remains visible after refresh.

If I Had More Time

  • Add conversation history UI to display past chat sessions
  • Add automated citation-faithfulness checks
  • Add Alembic migrations for cleaner schema evolution
  • Add stronger eval/observability for routing and retrieval quality

Local Setup

```bash
cp .env.example .env
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
uvicorn app.main:app --reload
```

Open: http://127.0.0.1:8000

Important Environment Variables

Required:

  • GROQ_API_KEY
  • SECRET_KEY
  • DATABASE_URL
  • JINA_API_KEY

Embeddings:

  • JINA_API_BASE (default: https://api.jina.ai/v1/embeddings)
  • JINA_EMBEDDING_MODEL (default: jina-embeddings-v3)
  • JINA_RERANKER_API_BASE (default: https://api.jina.ai/v1/rerank)
  • JINA_RERANKER_MODEL (default: jina-reranker-v3)
  • EMBEDDING_DIMENSIONS (default: 1024)
  • RETRIEVAL_K (default minimum final context size: 4)
  • RERANK_CANDIDATE_K (default minimum rerank candidate pool: 12)

Storage:

  • STORAGE_BACKEND=local|supabase
  • SUPABASE_URL
  • SUPABASE_SERVICE_ROLE_KEY
  • SUPABASE_STORAGE_BUCKET
  • SUPABASE_STORAGE_PREFIX

Web search:

  • WEB_SEARCH_PROVIDER=duckduckgo|tavily
  • TAVILY_API_KEY (if using Tavily)

Auth:

  • ACCESS_TOKEN_EXPIRE_MINUTES (default: 720)
  • For local development, lowering this can make login/logout testing easier

API Endpoints

  • POST /register
  • POST /login
  • POST /logout
  • POST /upload
  • GET /documents
  • DELETE /documents/{document_id}
  • GET /documents/{document_id}/pdf
  • POST /ask
  • POST /ask/stream

Sample Documents

As requested in the assignment, sample PDFs are included in test_documents/.

Railway Deployment

```bash
railway login
railway link
railway up
```

Set the same env vars in Railway service settings before deploying.