syildizz committed
Commit 1ae5927 · 0 Parent(s)

Initial commit.

Files changed (8)
  1. .gitignore +8 -0
  2. README.md +162 -0
  3. app.py +6 -0
  4. config.py +7 -0
  5. create_rag_agent.py +112 -0
  6. generate_vector_db.py +118 -0
  7. gradio_app.py +66 -0
  8. requirements.txt +12 -0
.gitignore ADDED
@@ -0,0 +1,8 @@
+ chroma_db/
+ dataset
+ **/__pycache__
+ .venv/*
+ .env
+ pyrightconfig.json
+
+ !**/.gitkeep
README.md ADDED
@@ -0,0 +1,162 @@
+ # User Manual Chatbot
+
+ ## Project Overview
+
+ This project is a chatbot developed as part of the **Akbank GenAI Bootcamp 2025**.
+ The chatbot leverages a database of user manuals for various products to provide accurate and contextually relevant answers to technical questions.
+ By utilizing **Retrieval-Augmented Generation (RAG)** technology, the chatbot retrieves relevant information from user manuals and combines it with the generative capabilities of the **Gemini-2.5-flash** model to deliver precise responses.
+ The project includes a user-friendly interface built with **Gradio**, which can be used to interact with the chatbot.
+
+ ### Purpose
+
+ The goal of this project is to create an intelligent chatbot capable of answering technical queries about electronic devices and products by referencing user manuals.
+ This enables users to quickly access accurate information without manually searching through lengthy documentation.
+
+ ---
+
+ ## Dataset
+
+ The dataset used in this project is sourced from the dataset described in the paper *[Question Answering over Electronic Devices: A New Benchmark Dataset and a Multi-Task Learning based QA Framework](https://arxiv.org/abs/2109.05897)*.
+ It can be accessed via this [Google Drive link](https://drive.google.com/drive/folders/1-gX1DlmVodP6OVRJC3WBRZoGgxPuJvvt).
+
+ ### Dataset Details
+ - **Format**: Text-based user manuals for various electronic devices.
+ - **Preprocessing**: The manuals are split into overlapping chunks to facilitate efficient retrieval (see the sketch after this list).
+ - **Embedding Generation**: The text chunks are converted into embeddings using the [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) model from HuggingFace.
+ - **Generation Script**: The [generate_vector_db.py](./generate_vector_db.py) script processes the dataset and generates the vector database. If the process is interrupted, the embeddings generated so far are kept, and on the next run the script generates only the missing ones.
+ - **Vector Database**: The embeddings are stored in a Chroma vector database and can be used locally. Alternatively, a pregenerated database is available as the HuggingFace dataset [syildizz/user-manuals-chromadb](https://huggingface.co/datasets/syildizz/user-manuals-chromadb).
+
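+ As a rough illustration, the chunk-and-embed step in [generate_vector_db.py](./generate_vector_db.py) boils down to the following (a minimal sketch using the same splitter parameters and embedding model as the script; the sample text is a made-up stand-in for a real manual):
+
+ ```python
+ from langchain_core.documents import Document
+ from langchain_text_splitters import RecursiveCharacterTextSplitter
+ from langchain_huggingface import HuggingFaceEmbeddings
+
+ # Stand-in manual excerpt; the real script loads .txt manuals from the dataset directory
+ doc = Document(page_content="To reset the device, hold the power button for ten seconds...")
+
+ # Overlapping chunks, matching generate_vector_db.py (chunk_size=500, chunk_overlap=100)
+ splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
+ chunks = splitter.split_documents([doc])
+
+ # Same embedding model as config.py
+ embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
+ vectors = embeddings.embed_documents([chunk.page_content for chunk in chunks])
+ print(len(chunks), len(vectors[0]))  # chunk count and embedding dimension
+ ```
+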
+ ### Usage
+
+ When generating the dataset, the folders used by the [generate_vector_db.py](./generate_vector_db.py) script for the input user-manual dataset and the output Chroma database are specified in the [config.py](config.py) file:
+
+ ```python
+ dataset_directory = "user_manual_dataset_folder_path"
+ chroma_persist_directory = "chroma_dataset_folder_path"
+ ```
+
+ ---
+
+ ## Methods and Technologies
+
+ ### Solution Architecture
+ The chatbot employs a **Retrieval-Augmented Generation (RAG)** pipeline to combine information retrieval with generative AI (a code sketch follows this list):
+ 1. **Vector Database**: The embeddings are retrieved from a **Chroma** vector database for efficient similarity-based retrieval.
+ 2. **Query Processing**: When a user submits a query, the system retrieves the most relevant manual chunks using similarity search.
+ 3. **Response Generation**: The retrieved chunks are passed to the **Gemini-2.5-flash** model to generate a coherent and contextually accurate response.
+ 4. **User Interface**: A **Gradio**-based interface allows users to interact with the chatbot seamlessly.
+
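+ Stripped of the agent wiring in [create_rag_agent.py](./create_rag_agent.py), the retrieve-then-generate step looks roughly like this (a minimal sketch that assumes the Chroma database already exists in `chroma_db/` and that a Gemini API key is available to the client):
+
+ ```python
+ from langchain_chroma import Chroma
+ from langchain_huggingface import HuggingFaceEmbeddings
+ from langchain_google_genai import ChatGoogleGenerativeAI
+
+ embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
+ store = Chroma(embedding_function=embeddings, persist_directory="chroma_db")
+
+ query = "How do I replace the batteries of a Sony remote?"
+ docs = store.similarity_search(query, k=5)  # same k as the agent's retrieval tool
+ context = "\n".join(doc.page_content for doc in docs)
+
+ llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0.3)
+ answer = llm.invoke(f"Answer using only this manual context:\n{context}\n\nQuestion: {query}")
+ print(answer.content)
+ ```
+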
+ ### Technologies Used
+ - **LLM**: Gemini-2.5-flash (`langchain-google-genai`)
+ - **Embedding Model**: sentence-transformers/all-mpnet-base-v2 (`langchain-huggingface`)
+ - **Vector Database**: Chroma (`langchain-chroma`, `chromadb`)
+ - **Text Splitting**: `langchain-text-splitters`
+ - **Interface**: Gradio (`gradio`)
+ - **Environment Management**: `python-dotenv`, `pydantic`
+ - **Other Libraries**: `langchain`, `langchain-core`, `langchain-community`
+
+ ### Key Features
+ - **RAG-based Retrieval**: Ensures answers are grounded in the user manual dataset.
+ - **Incremental Vector Database**: The `generate_vector_db.py` script supports resumable processing via deterministic chunk IDs (see the sketch after this list).
+ - **Configurability**: Dataset paths, model names, and temperature are adjustable in `config.py`.
+ - **Interactive UI**: Gradio interface for easy user interaction.
+
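+ The resumable processing works by giving every chunk a deterministic ID and skipping IDs that are already persisted. A minimal sketch of that idea, mirroring `generate_doc_id` and the dedup check in [generate_vector_db.py](./generate_vector_db.py) (`add_new_chunks` is an illustrative helper, not a function in the repo):
+
+ ```python
+ from langchain_core.documents import Document
+ from langchain_chroma import Chroma
+
+ def generate_doc_id(chunk: Document, postfix: str) -> str:
+     # Stable ID: source file path plus chunk index
+     return f"{chunk.metadata.get('source')}---{postfix}"
+
+ def add_new_chunks(store: Chroma, chunks: list[Document]) -> None:
+     for i, chunk in enumerate(chunks, 1):
+         chunk.id = generate_doc_id(chunk, str(i))
+     # Only add chunks whose IDs are not yet in the store, so interrupted runs can resume
+     existing = {doc.id for doc in store.get_by_ids([c.id for c in chunks if c.id is not None])}
+     new_chunks = [c for c in chunks if c.id not in existing]
+     if new_chunks:
+         store.add_documents(new_chunks)
+ ```
+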
+ ---
+
+ ## Results
+
+ The chatbot successfully answers technical questions about electronic devices by retrieving relevant information from user manuals.
+ Key outcomes include:
+
+ - **Accuracy**: The RAG pipeline ensures responses are highly relevant to the query, leveraging the structured manual dataset.
+ - **Scalability**: The incremental vector database generation supports large datasets and resumable processing.
+ - **Usability**: The Gradio interface provides a seamless experience for users to query the chatbot.
+ - **Deployment**: The project is live on HuggingFace Spaces at [Placeholder Link](https://huggingface.co/spaces/placeholder).
+
+ ---
+
+ ## Setup and Installation
+
+ ### Prerequisites
+ - Python
+ - Git
+ - Virtual environment (recommended)
+
+ ### Installation Steps
+
+ 1. **Clone the Repository**:
+    ```bash
+    git clone https://github.com/syildizz/[your-repo-name].git
+    cd [your-repo-name]
+    ```
+
+ 2. **Set Up a Virtual Environment**:
+    ```bash
+    python -m venv .venv
+    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
+    ```
+
+ 3. **Install Dependencies**:
+    Use the [requirements.txt](./requirements.txt) file to install dependencies by running:
+    ```bash
+    pip install -r requirements.txt
+    ```
+
+ 4. **Configuration**:
+    The public configuration is stored in the [config.py](./config.py) file. The global parameters specified in the config file can be changed if different values are desired.
+
+    Default values:
+    ```python
+    dataset_directory = "dataset"
+    chroma_persist_directory = "chroma_db"
+    huggingface_embedding_model_repo_path = "sentence-transformers/all-mpnet-base-v2"
+    huggingface_vector_embedding_database_repo_path = "syildizz/user-manuals-chromadb"
+    google_llm_model_name = "gemini-2.5-flash"
+    temperature = 0.3
+    ```
+
+ 5. **Configure Environment Variables**:
+    Create a `.env` file in the project root with the following:
+    ```text
+    GEMINI_API_KEY=[your-gemini-api-key]
+    HUGGINGFACE_TOKEN=[your-huggingface-token]
+    ```
+
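+    As a rough illustration of how these are consumed, [create_rag_agent.py](./create_rag_agent.py) loads them at startup with this pattern:
+    ```python
+    import os
+    from dotenv import load_dotenv
+
+    load_dotenv()  # reads .env from the project root
+    gemini_api_key = os.getenv("GEMINI_API_KEY")
+    if not gemini_api_key:
+        raise ValueError("Missing GEMINI_API_KEY in environment")
+    ```
+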
+ 6. **Generate Vector Database** (Optional):
+    If you want to generate a local vector database, run:
+    ```bash
+    python generate_vector_db.py
+    ```
+    NOTE: Do not generate a vector database if you want to pull the public pregenerated database.
+    If a local database does not exist in the next step, [app.py](./app.py) will pull the remote pregenerated database.
+
+ 7. **Run the Application**:
+    Launch the Gradio interface:
+    ```bash
+    python app.py
+    ```
+    The interface will be available at `http://localhost:7860`.
+
+ ---
+
+ ## Web Interface & Product Guide
+
+ The chatbot is deployed on HuggingFace Spaces at [Placeholder Link](https://huggingface.co/spaces/placeholder).
+ The Gradio-based interface allows users to:
+
+ - Enter technical questions about electronic devices.
+ - Receive responses grounded in user manual content.
+
+ ### Usage Instructions
+
+ 1. Visit the HuggingFace Spaces link: [Placeholder Link](https://huggingface.co/spaces/placeholder).
+ 2. Enter a question in the text input field (e.g., "How do I reset my [device name]?").
+ 3. The chatbot will use relevant manual sections to generate a response.
+
+ ### Screenshots
+
+ [Placeholder: Add screenshots or a short video demonstrating the interface]
+
+ ---
+
+ **Live Demo**: [Placeholder HuggingFace Spaces Link](https://huggingface.co/spaces/placeholder)
app.py ADDED
@@ -0,0 +1,6 @@
+
+ from gradio_app import gradio_main
+
+ if __name__ == "__main__":
+     gr_interface = gradio_main()
+     gr_interface.queue().launch()  # pyright: ignore[reportUnusedCallResult]
config.py ADDED
@@ -0,0 +1,7 @@
+
+ dataset_directory = "dataset"
+ chroma_persist_directory = "chroma_db"
+ huggingface_embedding_model_repo_path = "sentence-transformers/all-mpnet-base-v2"
+ huggingface_vector_embedding_database_repo_path = "syildizz/user-manuals-chromadb"
+ google_llm_model_name = "gemini-2.5-flash"
+ temperature = 0.3
create_rag_agent.py ADDED
@@ -0,0 +1,112 @@
+ # create_rag_agent.py
+ import os
+ from typing import Any
+ from dotenv import load_dotenv
+ from langchain.agents import create_agent
+ from langchain_huggingface import HuggingFaceEmbeddings
+ from langgraph.graph.state import CompiledStateGraph
+ from pydantic import SecretStr
+
+ from huggingface_hub import snapshot_download
+
+ # Core types
+ from langchain_core.documents import Document
+
+ # Vector store (ecosystem package)
+ from langchain_chroma import Chroma
+
+ # Google Gemini provider
+ from langchain_google_genai import ChatGoogleGenerativeAI
+
+ from langchain.tools import tool
+
+ import config
+
+
+ def get_chroma_store(
+     chroma_persist_directory: str = config.chroma_persist_directory,
+     huggingface_embedding_model_repo_path: str = config.huggingface_embedding_model_repo_path,
+     huggingface_vector_embedding_database_repo_path: str = config.huggingface_vector_embedding_database_repo_path,
+ ) -> Chroma:
+     """
+     Load an existing Chroma store if present, otherwise pull the pregenerated
+     database from the Hugging Face Hub and persist it locally.
+     """
+
+     embedding_model = HuggingFaceEmbeddings(model_name=huggingface_embedding_model_repo_path)
+
+     # Check for an existing Chroma DB and load it
+     if os.path.exists(chroma_persist_directory) and os.path.isdir(chroma_persist_directory):
+         print(f"✅ Loading existing Chroma DB from: {chroma_persist_directory}")
+     else:
+         print("📥 No local Chroma DB found. Pulling from Hugging Face dataset...")
+
+         # Create the local directory
+         os.makedirs(chroma_persist_directory, exist_ok=True)
+
+         # Download all files from the Hugging Face dataset
+         snapshot_download(  # pyright: ignore[reportUnusedCallResult]
+             repo_id=huggingface_vector_embedding_database_repo_path,
+             repo_type="dataset",
+             local_dir=chroma_persist_directory,
+             ignore_patterns=["*.md", "*.json"],  # Optional: skip non-DB files like the README
+         )
+
+         print(f"✅ Pulled and persisted Chroma DB to: {chroma_persist_directory}")
+
+     return Chroma(
+         embedding_function=embedding_model,
+         persist_directory=chroma_persist_directory
+     )
+
+
+ def create_rag_agent(
+     google_llm_model_name: str = config.google_llm_model_name,
+     temperature: float = 0.3
+ ) -> CompiledStateGraph[Any]:
+     load_dotenv()  # pyright: ignore[reportUnusedCallResult]
+
+     gemini_api_key = os.getenv("GEMINI_API_KEY")
+     if not gemini_api_key:
+         raise ValueError("Missing GEMINI_API_KEY in environment")
+
+     vector_store = get_chroma_store()
+
+     # Create the Gemini chat model (LLM)
+     llm = ChatGoogleGenerativeAI(model=google_llm_model_name, temperature=temperature, google_api_key=SecretStr(gemini_api_key))
+
+     # System prompt; the manual excerpts are supplied by the retrieval tool below
+     system_prompt = """
+     You are provided with a list of sample text that comes from various different user manuals.
+     Your task is to respond to the user using the samples provided to the best of your abilities.
+     The context text is in the following paragraph.
+
+     """
+
+     # Helper to format documents for the prompt
+     def format_docs(docs: list[Document]) -> str:
+         """Formats a list of documents into a single string."""
+         return "\n".join(doc.page_content for doc in docs)
+
+     @tool  # Exposes the function as a tool the agent can call during its reasoning loop
+     def retrieve_context(query: str) -> str:
+         '''Retrieve information that helps answer the query.'''
+         retrieved_docs = vector_store.similarity_search(query, k=5)
+         return format_docs(retrieved_docs)
+
+     rag_agent = create_agent(llm, [retrieve_context], system_prompt=system_prompt)
+
+     return rag_agent
+
+
+ if __name__ == "__main__":
+     rag_agent = create_rag_agent()
+     result: dict[str, Any] | Any = rag_agent.invoke(
+         {"messages": [{"role": "user", "content": "I want to replace the batteries of a sony brand remote. What can I do?"}]}
+     )
+     print(result["messages"][-1].content)
generate_vector_db.py ADDED
@@ -0,0 +1,118 @@
+ from langchain_core.documents import Document
+ from langchain_community.document_loaders import DirectoryLoader, TextLoader
+ from langchain_text_splitters import RecursiveCharacterTextSplitter
+
+ from langchain_huggingface import HuggingFaceEmbeddings
+ from langchain_chroma import Chroma
+ import chromadb.errors
+
+ import gc
+ import os
+
+ import config
+
+ batch_size = 5
+
+
+ def generate_doc_id(chunk: Document, postfix: str) -> str:
+     # Deterministic ID: source file path plus chunk index
+     unique_string = f"{chunk.metadata.get('source')}---{postfix}"
+     return unique_string
+
+
+ def create_chroma_store(
+     dataset_directory: str = config.dataset_directory,
+     chroma_persist_directory: str = config.chroma_persist_directory,
+     huggingface_embedding_model_repo_path: str = config.huggingface_embedding_model_repo_path
+ ) -> Chroma:
+
+     embedding_model = HuggingFaceEmbeddings(model_name=huggingface_embedding_model_repo_path)
+
+     store: Chroma
+
+     if os.path.exists(chroma_persist_directory) and os.path.isdir(chroma_persist_directory):
+         print(f"✅ Loading existing Chroma DB from: {chroma_persist_directory}")
+         store = Chroma(
+             embedding_function=embedding_model,
+             persist_directory=chroma_persist_directory
+         )
+     else:
+         print(f"📦 Creating new Chroma DB at: {chroma_persist_directory} using batch processing.")
+         store = Chroma(
+             embedding_function=embedding_model,
+             persist_directory=chroma_persist_directory
+         )
+
+     try:
+         # Use lazy_load() to get a generator instead of loading all documents into memory
+         loader = DirectoryLoader(
+             path=dataset_directory,
+             glob="**/*.txt",
+             loader_cls=TextLoader,
+             show_progress=True,
+             use_multithreading=False,
+             randomize_sample=True
+         )
+         # Use an iterator to avoid loading all documents at once
+         document_iterator = loader.lazy_load()
+     except FileNotFoundError:
+         raise FileNotFoundError(f"🚨 Error: '{dataset_directory}' directory not found.")
+
+     # Splitter for document chunks
+     splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
+
+     doc_batch: list[Document] = []
+
+     global batch_size
+
+     try:
+         for document in document_iterator:
+
+             doc_batch.append(document)
+
+             if len(doc_batch) >= batch_size:
+                 print(f"Processing batch of {len(doc_batch)} documents...")
+                 # Split the current batch of documents into chunks;
+                 # splitting a small batch at a time is memory-efficient
+                 chunks = splitter.split_documents(doc_batch)
+
+                 if len(chunks) == 0:
+                     doc_batch = []
+                     continue
+
+                 # Assign a deterministic ID to every chunk
+                 for i, chunk in enumerate(chunks, 1):
+                     chunk.id = generate_doc_id(chunk, str(i))
+
+                 # Skip chunks that are already persisted so an interrupted run can resume
+                 existing_ids = [doc.id for doc in store.get_by_ids([chunk.id for chunk in chunks if chunk.id is not None]) if doc.id is not None]
+                 unadded_chunks = [chunk for chunk in chunks if chunk.id is not None and chunk.id not in existing_ids]
+
+                 if len(unadded_chunks) != 0:
+                     try:
+                         store.add_documents(unadded_chunks)  # pyright: ignore[reportUnusedCallResult]
+                     except chromadb.errors.InternalError:
+                         # Halve the batch size; the skipped chunks are picked up on a later run
+                         batch_size //= 2
+
+                 # Reset the batch list
+                 doc_batch = []
+
+                 gc.collect()  # pyright: ignore[reportUnusedCallResult]
+
+                 # Optional: throttle between batches (e.g. sleep(61)) if the backend rate-limits
+
+         # Process the final batch (if any)
+         if doc_batch:
+             print(f"Processing final batch of {len(doc_batch)} documents...")
+             chunks = splitter.split_documents(doc_batch)
+
+             store.add_documents(chunks)  # pyright: ignore[reportUnusedCallResult]
+
+     except KeyboardInterrupt:
+         # Allow Ctrl-C to stop processing; already-persisted chunks are kept
+         print("Process interrupted")
+
+     return store
+
+
+ def main():
+     vectorstore = create_chroma_store()  # pyright: ignore[reportUnusedVariable]
+
+
+ if __name__ == "__main__":
+     main()
gradio_app.py ADDED
@@ -0,0 +1,66 @@
+ from typing import Any
+ import gradio as gr
+ from create_rag_agent import create_rag_agent
+
+
+ def gradio_main():
+
+     rag_agent = create_rag_agent()
+
+     def rag_agent_response(message: str, history: list[dict[str, Any]]):
+         """
+         The function integrated with Gradio, calling the LangChain rag_agent.
+         It passes the full conversation history for conversational context.
+         """
+
+         full_messages = history + [{"role": "user", "content": message}]
+
+         agent_input = {
+             "messages": full_messages
+         }
+
+         stream = rag_agent.stream(agent_input)
+
+         current_response = ""
+
+         # Iterate over the stream of chunks
+         for chunk in stream:
+
+             model_in_chunk = chunk.get("model", [])
+
+             if model_in_chunk:
+
+                 messages_in_chunk = model_in_chunk.get("messages", [])
+
+                 if messages_in_chunk:
+                     # The final item in the messages list contains the generated text chunk
+                     message_chunk = messages_in_chunk[-1]
+
+                     # Use getattr to safely get the content from a message object/chunk
+                     content_chunk = getattr(message_chunk, "text", None)
+
+                     if content_chunk:
+                         # Accumulate and yield the running response
+                         current_response += content_chunk
+                         yield current_response
+
+     gr_interface = gr.ChatInterface(
+         fn=rag_agent_response,
+         type="messages",
+         chatbot=gr.Chatbot(
+             height=500,
+             label="LangChain Conversational RAG Chatbot",
+             type="messages"
+         ),
+         textbox=gr.Textbox(placeholder="Enter your query here...", container=False, scale=7),
+         title="LangChain RAG Agent Integrated with Gradio (Conversational)",
+         description="This interface passes the full conversation history to the agent for context.",
+         theme="soft"
+     )
+
+     return gr_interface
+
+
+ if __name__ == "__main__":
+     gradio_main().queue().launch()  # pyright: ignore[reportUnusedCallResult]
requirements.txt ADDED
@@ -0,0 +1,12 @@
+ python-dotenv
+ pydantic
+ langchain==1.0.1
+ langchain-core
+ langchain-chroma
+ langchain-community
+ langchain-text-splitters
+ langchain-google-genai
+ langchain-huggingface
+ langgraph
+ huggingface_hub
+ sentence-transformers
+ chromadb
+ gradio