BabaK07 committed
Commit 90b7bb0 · 1 Parent(s): 8c66fe6

Integrate Jina embeddings and refresh assignment README

Files changed (5):
  1. .env.example +3 -0
  2. README.md +111 -133
  3. app/config.py +4 -1
  4. app/services/vector_store.py +60 -14
  5. pyproject.toml +1 -2
.env.example CHANGED
@@ -11,5 +11,8 @@ ACCESS_TOKEN_EXPIRE_MINUTES=720
  MODEL_NAME=llama-3.1-8b-instant
  EMBEDDING_MODEL=mixedbread-ai/mxbai-embed-large-v1
  EMBEDDING_DIMENSIONS=1024
+ JINA_API_KEY=
+ JINA_API_BASE=https://api.jina.ai/v1/embeddings
+ JINA_EMBEDDING_MODEL=jina-embeddings-v3
  WEB_SEARCH_PROVIDER=duckduckgo
  TAVILY_API_KEY=
README.md CHANGED
@@ -1,112 +1,108 @@
- # DocsQA LangGraph Assignment
-
- RAG-powered research assistant with:
-
- - Auth (register/login/logout) using HTTP-only cookie sessions
- - Multi-file PDF upload (up to 5 files/request, max 10 pages/file)
- - Duplicate detection by SHA-256 hash with cross-user document reuse
- - Vector indexing in Supabase Postgres + `pgvector`
- - LangGraph agent with document retrieval + web search fallback
- - Session conversation memory for follow-up questions
- - Source citations in answers for both document and web evidence
- - Chat-style UI with markdown rendering
-
- ## Architecture
-
- - Backend: FastAPI + SQLAlchemy
- - Agent: LangGraph ReAct agent
- - LLM: Groq chat model
- - Vector store: Supabase Postgres with `pgvector`
- - Search fallback: Tavily (preferred) or DuckDuckGo when available
-
  ## Chunking Strategy

- - Splitter: recursive character splitter (`chunk_size=1200`, `chunk_overlap=200`)
- - Why:
-   - 1200 keeps enough local context for legal/business clauses
-   - 200 overlap reduces boundary loss between adjacent chunks
-   - good balance for retrieval accuracy vs. embedding cost
- - Indexing is page-aware: each stored chunk carries `page_number` metadata.
-
  ## Retrieval Approach

- - Retrieval method: cosine similarity search in `pgvector`
- - Pipeline:
-   - determine relevant user-owned document hashes
-   - embed query
-   - retrieve top-k chunks across selected docs
- - Returned evidence includes:
-   - document filename
-   - page number
-   - excerpt text
- - Final assistant answer is instructed to cite these in a human-friendly source section.
-
  ## Agent Routing Logic

- - Default behavior: prefer `vector_search` for questions answerable from uploaded docs.
- - If document evidence is insufficient, agent can call `web_search` tool.
- - Web search output is normalized to citation-friendly rows (title, URL, snippet).
- - Prompt requires:
-   - vector citations: document + page + excerpt
-   - web citations: website title + URL
-
- ## Bonus Feature
-
- **Implemented bonus:** User-scoped retrieval with automatic document dedup reuse.
-
- - If two users upload the same file, processing/indexing is reused by file hash.
- - Ownership is still enforced via `user_documents` mapping, so retrieval stays user-scoped.
- - Why chosen: materially improves performance/cost while preserving access boundaries.
-
- ## Problems Faced and Fixes
-
- - Dependency mismatch (`transformers`/`sentence-transformers`/`torch`) causing startup errors.
-   - Added robust local fallback embedding path to keep app functional.
- - Optional web-search dependency (`ddgs`) missing.
-   - Added graceful web tool fallback and Tavily direct tool support.
- - Passlib bcrypt backend issues.
-   - Switched new password hashing to `pbkdf2_sha256` while retaining bcrypt verify compatibility.
- - Template/render and response UX issues.
-   - Reworked frontend into a stable chat-style UI with clean result handling.
-
- ## If I Had More Time
-
- - Add proper migration tooling (Alembic) instead of startup `ALTER TABLE`.
- - Add reranking for higher retrieval precision on long multi-document queries.
- - Add persistent server-side conversation storage (Redis/Postgres) for multi-worker deployments.
- - Add automated evaluation suite for citation faithfulness and retrieval quality.
-
- ## Environment Setup
-
- ```bash
- cp .env.example .env
- ```
-
- Required:
-
- - `GROQ_API_KEY`
- - `SECRET_KEY`
- - `DATABASE_URL` (Supabase transaction pooler recommended)
-
- Optional:
-
- - `TAVILY_API_KEY` (for Tavily web search)
-
- Storage (optional, recommended for deployment):
-
- - `STORAGE_BACKEND=local` or `supabase`
- - `SUPABASE_URL`
- - `SUPABASE_SERVICE_ROLE_KEY`
- - `SUPABASE_STORAGE_BUCKET` (default: `documents`)
- - `SUPABASE_STORAGE_PREFIX` (default: `docsqa`)
-
- Recommended `DATABASE_URL` format:
-
- `postgresql+psycopg://<user>:<password>@<pooler-host>:6543/postgres?sslmode=require`
-
- ## Install and Run
-
  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  pip install -e .
@@ -115,10 +111,29 @@ uvicorn app.main:app --reload

  Open: `http://127.0.0.1:8000`

- ## File Storage Mode
-
- - Local dev default: `STORAGE_BACKEND=local` (writes under `UPLOAD_DIRECTORY`).
- - Deployment recommendation: `STORAGE_BACKEND=supabase` to store PDFs in Supabase Storage instead of local disk.
-
  ## API Endpoints

@@ -127,57 +142,20 @@ Open: `http://127.0.0.1:8000`
  - `POST /logout`
  - `POST /upload`
  - `GET /documents`
  - `POST /ask`

- ## test_documents
-
- Sample PDFs used during development are in `test_documents/`.
-
- ## Deployment and Loom
-
- - Live deployed URL: _add your deployed link here_
- - Loom walkthrough (<5 min): _add your Loom link here_
-
- ## Deploy on Render
-
- This repo now includes a `render.yaml` Blueprint.
-
- 1. Push the latest `main` branch to GitHub.
- 2. In Render, click **New +** -> **Blueprint**.
- 3. Connect GitHub and select this repository.
- 4. Render will detect `render.yaml` and create a `docsbot` web service.
- 5. Set required secret env vars in Render:
-    - `SECRET_KEY`
-    - `DATABASE_URL`
-    - `GROQ_API_KEY`
-    - `SUPABASE_URL`
-    - `SUPABASE_SERVICE_ROLE_KEY`
-    - optionally `TAVILY_API_KEY`
- 6. Deploy and open the generated Render URL.
-
- Render uses:
- - Build command: `pip install -e .`
- - Start command: `uvicorn app.main:app --host 0.0.0.0 --port $PORT`
-
- ## Deploy on Fly.io
-
- This repo includes `Dockerfile` and `fly.toml`.
-
- 1. Install Fly CLI:
-    - macOS: `brew install flyctl`
- 2. Login:
-    - `fly auth login`
- 3. If app name `docsbot-kbaba7` is unavailable, change `app` in `fly.toml`.
- 4. Create app (first time only):
-    - `fly apps create docsbot-kbaba7`
- 5. Set secrets:
-    - `fly secrets set SECRET_KEY=...`
-    - `fly secrets set DATABASE_URL=...`
-    - `fly secrets set GROQ_API_KEY=...`
-    - `fly secrets set SUPABASE_URL=...`
-    - `fly secrets set SUPABASE_SERVICE_ROLE_KEY=...`
-    - optional: `fly secrets set TAVILY_API_KEY=...`
- 6. Deploy:
-    - `fly deploy`
- 7. Open app:
-    - `fly open`
+ # DocsQA Smart Research Assistant
+
+ This is my take-home submission for the ABSTRABIT AI/ML Engineer assignment: a RAG-powered assistant where users upload PDFs, ask questions, and get grounded answers with citations.
+
+ ## Live Project
+
+ - Live app (Railway): `https://docsbot-web-production.up.railway.app`
+ - GitHub: `https://github.com/KBaba7/DocsBot`
+ - Loom walkthrough: _add your link here_
+ ## What I Built
+
+ The app supports authentication, PDF upload (up to 5 files and 10 pages per file), document chunking + vector indexing, and a chat experience that answers from uploaded documents first.
+ If the uploaded documents are not enough, the agent falls back to web search and cites those sources too.
+
+ ## Stack
+
+ - FastAPI + SQLAlchemy
+ - LangGraph agent
+ - Groq chat model
+ - Supabase Postgres + `pgvector`
+ - Railway deployment
+
+ ## How Retrieval Works
+
+ Uploaded PDFs are parsed page by page and split into chunks.
+ Each chunk is stored with metadata (document, page number, chunk index) and embedded into `pgvector`.
+
+ At question time:
+ 1. The app searches for relevant chunks among the user's accessible documents.
+ 2. The agent answers from those chunks when possible.
+ 3. If the evidence is weak, the agent uses web search and cites external URLs.
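The three steps above can be sketched with toy vectors. This is illustrative only: the metadata keys and the tiny two-dimensional embeddings are made up for the sketch, and the real app runs this search inside `pgvector` rather than in Python.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity; pgvector's cosine distance is 1 minus this value.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list[float], chunks: list[tuple[list[float], dict]], k: int = 2) -> list[dict]:
    # chunks: (embedding, metadata) rows, one per stored chunk.
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[0]), reverse=True)
    return [meta for _, meta in ranked[:k]]

chunks = [
    ([1.0, 0.0], {"document": "contract.pdf", "page": 3, "chunk": 0}),
    ([0.0, 1.0], {"document": "contract.pdf", "page": 7, "chunk": 4}),
    ([0.9, 0.1], {"document": "notes.pdf", "page": 1, "chunk": 2}),
]
hits = top_k([1.0, 0.0], chunks, k=2)
```

The metadata carried with each hit is what later becomes the citation (document, page, excerpt).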
  ## Chunking Strategy

+ - Chunk size: `1200`
+ - Overlap: `200`
+
+ Why this setup:
+ - Long, structured documents need enough contiguous context.
+ - Overlap helps avoid missing content around chunk boundaries.
+ - It gives a practical quality/cost balance for retrieval.
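The sliding-window idea behind those two numbers can be sketched as follows. This is a simplified stand-in, not the repo's actual `SimpleTextSplitter`: each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so neighbours share 200 characters.

```python
def split_text(text: str, chunk_size: int = 1200, chunk_overlap: int = 200) -> list[str]:
    # Sliding window over the raw text; consecutive chunks overlap by chunk_overlap chars.
    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = split_text("x" * 3000)
```

A 3000-character page yields three chunks (1200, 1200, 1000 characters), with each boundary covered twice.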
  ## Retrieval Approach

+ I use cosine similarity search in `pgvector` (no reranker yet).
+ The top matches are turned into readable citations (document name + page + snippet), which are shown per answer in the UI.
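Turning a match row into a readable citation can be sketched like this; the row keys and the truncation length are hypothetical, not the app's actual schema.

```python
def format_citation(row: dict, max_chars: int = 120) -> str:
    # One retrieved chunk row -> one human-readable source line:
    # 'document, p. N: "excerpt…"'
    doc = row["document"]
    page = row["page"]
    excerpt = row["text"][:max_chars].rstrip()
    suffix = "…" if len(row["text"]) > max_chars else ""
    return f'{doc}, p. {page}: "{excerpt}{suffix}"'

citation = format_citation(
    {"document": "contract.pdf", "page": 3, "text": "Termination requires 30 days notice."}
)
```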
  ## Agent Routing Logic

+ The agent is prompted to prefer document context first.
+
+ - If the retrieved document context is sufficient: answer from the documents with citations.
+ - If not: state that the documents are insufficient and call the web search tool.
+
+ This is implemented as tool-based behavior in LangGraph rather than a static fallback message.
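As a rough illustration of that decision, here is a threshold-based gate. The threshold and scoring are invented for the sketch; the real agent makes this call through LangGraph tool use, not a hard-coded rule.

```python
def choose_tool(doc_scores: list[float], min_score: float = 0.75, min_hits: int = 1) -> str:
    # Toy router: if enough retrieved chunks clear a similarity threshold,
    # answer from documents; otherwise fall back to web search.
    strong = [s for s in doc_scores if s >= min_score]
    return "vector_search" if len(strong) >= min_hits else "web_search"

route_a = choose_tool([0.91, 0.42])  # strong document evidence
route_b = choose_tool([0.30, 0.12])  # weak evidence, go to the web
```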
+ ## Source Citations
+
+ Each turn stores and returns source metadata separately from the answer body.
+
+ - Vector source cards include:
+   - document name
+   - page number
+   - excerpt (a short snippet from the retrieved chunk)
+ - Web source cards include:
+   - title
+   - URL
+ ## Conversation Memory
+
+ Conversation history is maintained within session scope, so follow-ups like "tell me more about that" work as expected.
+
+ ## Bonus Feature
+
+ I added hash-based deduplicated ingestion:
+
+ - If the same PDF is uploaded again, processing and indexing are reused.
+ - Access control is still user-scoped via an ownership mapping.
+
+ Why I chose this:
+ - it saves compute and time,
+ - it avoids duplicate indexing,
+ - it keeps retrieval secure per user.
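The dedup flow can be sketched as a toy in-memory model; the real app keeps the file hashes and the ownership mapping in Postgres, and the class and field names here are made up.

```python
import hashlib

class DedupIndex:
    # Toy hash-based dedup: each unique file is indexed once,
    # while ownership is tracked per user.
    def __init__(self) -> None:
        self.indexed: dict[str, str] = {}          # SHA-256 hash -> stored doc id
        self.ownership: dict[str, set[str]] = {}   # user -> hashes they may query

    def upload(self, user: str, content: bytes) -> tuple[str, bool]:
        file_hash = hashlib.sha256(content).hexdigest()
        reused = file_hash in self.indexed
        if not reused:
            # First sighting: this is the only time chunking/embedding would run.
            self.indexed[file_hash] = f"doc-{len(self.indexed) + 1}"
        self.ownership.setdefault(user, set()).add(file_hash)
        return self.indexed[file_hash], reused

index = DedupIndex()
doc1, reused1 = index.upload("alice", b"%PDF same bytes")
doc2, reused2 = index.upload("bob", b"%PDF same bytes")
```

Bob's upload reuses Alice's index, but retrieval for each user still consults only that user's ownership set.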
+ ## Challenges I Ran Into
+
+ 1. Heavy embedding dependencies made deployment images too large.
+    - I switched to lightweight embeddings for deployment and added Jina API embedding support.
+ 2. Source rendering got messy across multiple chat turns.
+    - I separated answer text from source payloads and extracted sources per turn.
+ 3. Intermittent DB DNS/pooler issues during deployment.
+    - I improved connection handling and standardized the Supabase transaction-pooler config.
+
+ ## If I Had More Time
+
+ - Add reranking (cross-encoder) for better precision on long multi-doc queries.
+ - Add automated citation-faithfulness checks.
+ - Add Alembic migrations for cleaner schema evolution.
+ - Add stronger eval/observability for routing and retrieval quality.
+ ## Local Setup
+
  ```bash
+ cp .env.example .env
  python3 -m venv .venv
  source .venv/bin/activate
  pip install -e .
  uvicorn app.main:app --reload
  ```

  Open: `http://127.0.0.1:8000`

+ ## Important Environment Variables
+
+ Required:
+ - `GROQ_API_KEY`
+ - `SECRET_KEY`
+ - `DATABASE_URL`
+
+ Embeddings (recommended):
+ - `JINA_API_KEY`
+ - `JINA_API_BASE` (default: `https://api.jina.ai/v1/embeddings`)
+ - `JINA_EMBEDDING_MODEL` (default: `jina-embeddings-v3`)
+ - `EMBEDDING_DIMENSIONS` (default: `1024`)
+
+ Storage:
+ - `STORAGE_BACKEND=local|supabase`
+ - `SUPABASE_URL`
+ - `SUPABASE_SERVICE_ROLE_KEY`
+ - `SUPABASE_STORAGE_BUCKET`
+ - `SUPABASE_STORAGE_PREFIX`
+
+ Web search:
+ - `WEB_SEARCH_PROVIDER=duckduckgo|tavily`
+ - `TAVILY_API_KEY` (if using Tavily)
  ## API Endpoints

  - `POST /logout`
  - `POST /upload`
  - `GET /documents`
+ - `DELETE /documents/{document_id}`
+ - `GET /documents/{document_id}/pdf`
  - `POST /ask`

+ ## Sample Documents
+
+ As requested in the assignment, sample PDFs are included in `test_documents/`.
+
+ ## Railway Deployment
+
+ ```bash
+ railway login
+ railway link
+ railway up
+ ```
+
+ Set the same env vars in the Railway service settings before deploying.
app/config.py CHANGED
@@ -21,8 +21,11 @@ class Settings(BaseSettings):
      model_name: str = "llama-3.1-8b-instant"
      embedding_model: str = "mixedbread-ai/mxbai-embed-large-v1"
      embedding_dimensions: int = 1024
+     jina_api_key: str | None = None
+     jina_api_base: str = "https://api.jina.ai/v1/embeddings"
+     jina_embedding_model: str = "jina-embeddings-v3"
      groq_api_key: str | None = None
-     web_search_provider: str = "duckduckgo"
+     web_search_provider: str = "tavily"
      tavily_api_key: str | None = None

      @property
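The new fields behave like the rest of `Settings`: environment variables override the coded defaults. A simplified sketch of that lookup (not the actual pydantic-settings machinery, just the precedence it implements):

```python
import os

def setting(name: str, default: str) -> str:
    # Simplified settings precedence: an env var wins over the coded default.
    return os.environ.get(name, default)

os.environ["WEB_SEARCH_PROVIDER"] = "duckduckgo"  # simulate a .env override
os.environ.pop("JINA_EMBEDDING_MODEL", None)      # no override -> default applies

provider = setting("WEB_SEARCH_PROVIDER", "tavily")
model = setting("JINA_EMBEDDING_MODEL", "jina-embeddings-v3")
```

With this commit the coded default for `web_search_provider` is `tavily`, so the `.env` value `duckduckgo` only takes effect when it is actually set.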
app/services/vector_store.py CHANGED
@@ -3,6 +3,7 @@ import math
  import re
  from typing import Any

+ import requests
  from sqlalchemy import delete, select
  from sqlalchemy.orm import Session

@@ -65,25 +66,70 @@ class LocalHashEmbeddings:
          return [value / norm for value in vector]


+ class JinaEmbeddings:
+     def __init__(self, *, api_key: str, base_url: str, model: str, dimensions: int) -> None:
+         self.api_key = api_key
+         self.base_url = base_url
+         self.model = model
+         self.dimensions = dimensions
+
+     def embed_documents(self, texts: list[str]) -> list[list[float]]:
+         return self._embed(texts=texts, task="retrieval.passage")
+
+     def embed_query(self, text: str) -> list[float]:
+         vectors = self._embed(texts=[text], task="retrieval.query")
+         return vectors[0] if vectors else [0.0] * self.dimensions
+
+     def _embed(self, *, texts: list[str], task: str) -> list[list[float]]:
+         if not texts:
+             return []
+
+         response = requests.post(
+             self.base_url,
+             headers={
+                 "Content-Type": "application/json",
+                 "Authorization": f"Bearer {self.api_key}",
+             },
+             json={
+                 "model": self.model,
+                 "task": task,
+                 "embedding_type": "float",
+                 "normalized": True,
+                 "input": texts,
+             },
+             timeout=60,
+         )
+         response.raise_for_status()
+         data = response.json().get("data", [])
+         vectors = [row.get("embedding", []) for row in data]
+
+         validated: list[list[float]] = []
+         for vector in vectors:
+             if len(vector) != self.dimensions:
+                 raise ValueError(
+                     f"Jina embedding dimension mismatch: got {len(vector)}, expected {self.dimensions}. "
+                     "Adjust EMBEDDING_DIMENSIONS or switch embedding model."
+                 )
+             validated.append(vector)
+         return validated
+
+
  class VectorStoreService:
      def __init__(self) -> None:
          self.splitter = SimpleTextSplitter(chunk_size=1200, chunk_overlap=200)
-         self.embeddings = None
+         settings = get_settings()
+         if settings.jina_api_key:
+             self.embeddings = JinaEmbeddings(
+                 api_key=settings.jina_api_key,
+                 base_url=settings.jina_api_base,
+                 model=settings.jina_embedding_model,
+                 dimensions=settings.embedding_dimensions,
+             )
+         else:
+             # Lightweight fallback when hosted embedding credentials are not configured.
+             self.embeddings = LocalHashEmbeddings(settings.embedding_dimensions)

      def _get_embeddings(self) -> Any:
-         settings = get_settings()
-         if self.embeddings is None:
-             try:
-                 from langchain_huggingface import HuggingFaceEmbeddings
-
-                 self.embeddings = HuggingFaceEmbeddings(
-                     model_name=settings.embedding_model,
-                     model_kwargs={"device": "cpu"},
-                     encode_kwargs={"normalize_embeddings": True},
-                 )
-             except Exception:
-                 # Keep the app usable when transformer/torch dependencies are unavailable.
-                 self.embeddings = LocalHashEmbeddings(settings.embedding_dimensions)
          return self.embeddings

      def add_document(self, *, db: Session, document_id: int, file_hash: str, filename: str, pages: list[tuple[int, str]]) -> None:
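The request body above sets `"normalized": True`, so returned vectors are unit-length; cosine similarity then reduces to a plain dot product, which is part of what keeps cosine search in `pgvector` cheap. A quick check of that identity on made-up vectors:

```python
import math

def normalize(vec: list[float]) -> list[float]:
    # Scale to unit length, matching the "normalized": True request option.
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

dot = sum(x * y for x, y in zip(a, b))
cosine = dot / (
    math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
)
```

For unit vectors the two values coincide, so the similarity ranking needs no per-row norm computation.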
pyproject.toml CHANGED
@@ -9,7 +9,6 @@ dependencies = [
      "jinja2>=3.1.4",
      "langchain-community>=0.3.0",
      "langchain-groq>=0.2.0",
-     "langchain-huggingface>=0.1.0",
      "langchain-text-splitters>=0.3.0",
      "langgraph>=0.2.35",
      "passlib[bcrypt]>=1.7.4",
@@ -19,8 +18,8 @@ dependencies = [
      "pypdf>=5.0.1",
      "python-jose[cryptography]>=3.3.0",
      "python-multipart>=0.0.9",
+     "requests>=2.32.0",
      "sqlalchemy>=2.0.35",
-     "sentence-transformers>=3.0.1",
      "uvicorn[standard]>=0.30.6",
      "email-validator>=2.2.0",
      "tavily-python==0.7.23",
  "tavily-python==0.7.23",