syildizz committed
Commit 1ae5927 · 0 Parent(s)

Initial commit.

Files changed (8)
  1. .gitignore +8 -0
  2. README.md +162 -0
  3. app.py +6 -0
  4. config.py +7 -0
  5. create_rag_agent.py +112 -0
  6. generate_vector_db.py +118 -0
  7. gradio_app.py +66 -0
  8. requirements.txt +12 -0
.gitignore ADDED
@@ -0,0 +1,8 @@
+ chroma_db/
+ dataset
+ **/__pycache__
+ .venv/*
+ .env
+ pyrightconfig.json
+
+ !**/.gitkeep
README.md ADDED
@@ -0,0 +1,162 @@
+ # User Manual Chatbot
+
+ ## Project Overview
+
+ This project is a chatbot developed as part of the **Akbank GenAI Bootcamp 2025**.
+ The chatbot leverages a database of user manuals for various products to provide accurate and contextually relevant answers to technical questions.
+ By utilizing **Retrieval-Augmented Generation (RAG)** technology, the chatbot retrieves relevant information from user manuals and combines it with the generative capabilities of the **Gemini-2.5-flash** model to deliver precise responses.
+ The project includes a user-friendly interface built with **Gradio**, which can be used to interact with the chatbot.
+
+ ### Purpose
+
+ The goal of this project is to create an intelligent chatbot capable of answering technical queries about electronic devices and products by referencing user manuals.
+ This enables users to quickly access accurate information without manually searching through lengthy documentation.
+
+ ---
+
+ ## Dataset
+
+ The dataset used in this project is sourced from the dataset described in the paper *[Question Answering over Electronic Devices: A New Benchmark Dataset and a Multi-Task Learning based QA Framework](https://arxiv.org/abs/2109.05897)*.
+ It can be accessed via this [Google Drive link](https://drive.google.com/drive/folders/1-gX1DlmVodP6OVRJC3WBRZoGgxPuJvvt).
+
+ ### Dataset Details
+ - **Format**: Text-based user manuals for various electronic devices.
+ - **Preprocessing**: The manuals are split into overlapping chunks to facilitate efficient retrieval (see the sketch after this list).
+ - **Embedding Generation**: The text chunks are converted into embeddings using the [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) model from HuggingFace.
+ - **Generation Script**: The [generate_vector_db.py](./generate_vector_db.py) script processes the dataset and generates the vector database. If the process is interrupted, the embeddings generated so far are kept, and on the next run the script generates only the missing ones.
+ - **Vector Database**: The embeddings are stored in a Chroma vector database and can be used locally. Alternatively, a pregenerated database is available as the HuggingFace dataset [syildizz/user-manuals-chromadb](https://huggingface.co/datasets/syildizz/user-manuals-chromadb).
+
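+ As a rough illustration, the chunk-and-embed step in [generate_vector_db.py](./generate_vector_db.py) boils down to the following (a minimal sketch using the same splitter parameters and embedding model as the script; the sample text is a made-up stand-in for a real manual):
+
+ ```python
+ from langchain_core.documents import Document
+ from langchain_text_splitters import RecursiveCharacterTextSplitter
+ from langchain_huggingface import HuggingFaceEmbeddings
+
+ # Stand-in manual excerpt; the real script loads .txt manuals from the dataset directory
+ doc = Document(page_content="To reset the device, hold the power button for ten seconds...")
+
+ # Overlapping chunks, matching generate_vector_db.py (chunk_size=500, chunk_overlap=100)
+ splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
+ chunks = splitter.split_documents([doc])
+
+ # Same embedding model as config.py
+ embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
+ vectors = embeddings.embed_documents([chunk.page_content for chunk in chunks])
+ print(len(chunks), len(vectors[0]))  # chunk count and embedding dimension
+ ```
+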
+ ### Usage
+
+ When generating the dataset, the folders used by the [generate_vector_db.py](./generate_vector_db.py) script for the input user-manual dataset and the output Chroma database are specified in the [config.py](config.py) file:
+
+ ```python
+ dataset_directory = "user_manual_dataset_folder_path"
+ chroma_persist_directory = "chroma_dataset_folder_path"
+ ```
+
+ ---
+
+ ## Methods and Technologies
+
+ ### Solution Architecture
+ The chatbot employs a **Retrieval-Augmented Generation (RAG)** pipeline to combine information retrieval with generative AI (a code sketch follows this list):
+ 1. **Vector Database**: The embeddings are retrieved from a **Chroma** vector database for efficient similarity-based retrieval.
+ 2. **Query Processing**: When a user submits a query, the system retrieves the most relevant manual chunks using similarity search.
+ 3. **Response Generation**: The retrieved chunks are passed to the **Gemini-2.5-flash** model to generate a coherent and contextually accurate response.
+ 4. **User Interface**: A **Gradio**-based interface allows users to interact with the chatbot seamlessly.
+
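+ Stripped of the agent wiring in [create_rag_agent.py](./create_rag_agent.py), the retrieve-then-generate step looks roughly like this (a minimal sketch that assumes the Chroma database already exists in `chroma_db/` and that a Gemini API key is available to the client):
+
+ ```python
+ from langchain_chroma import Chroma
+ from langchain_huggingface import HuggingFaceEmbeddings
+ from langchain_google_genai import ChatGoogleGenerativeAI
+
+ embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
+ store = Chroma(embedding_function=embeddings, persist_directory="chroma_db")
+
+ query = "How do I replace the batteries of a Sony remote?"
+ docs = store.similarity_search(query, k=5)  # same k as the agent's retrieval tool
+ context = "\n".join(doc.page_content for doc in docs)
+
+ llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0.3)
+ answer = llm.invoke(f"Answer using only this manual context:\n{context}\n\nQuestion: {query}")
+ print(answer.content)
+ ```
+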
+ ### Technologies Used
+ - **LLM**: Gemini-2.5-flash (`langchain-google-genai`)
+ - **Embedding Model**: sentence-transformers/all-mpnet-base-v2 (`langchain-huggingface`)
+ - **Vector Database**: Chroma (`langchain-chroma`, `chromadb`)
+ - **Text Splitting**: `langchain-text-splitters`
+ - **Interface**: Gradio (`gradio`)
+ - **Environment Management**: `python-dotenv`, `pydantic`
+ - **Other Libraries**: `langchain`, `langchain-core`, `langchain-community`
+
+ ### Key Features
+ - **RAG-based Retrieval**: Ensures answers are grounded in the user manual dataset.
+ - **Incremental Vector Database**: The `generate_vector_db.py` script supports resumable processing via deterministic chunk IDs (see the sketch after this list).
+ - **Configurability**: Dataset paths, model names, and temperature are adjustable in `config.py`.
+ - **Interactive UI**: Gradio interface for easy user interaction.
+
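+ The resumable processing works by giving every chunk a deterministic ID and skipping IDs that are already persisted. A minimal sketch of that idea, mirroring `generate_doc_id` and the dedup check in [generate_vector_db.py](./generate_vector_db.py) (`add_new_chunks` is an illustrative helper, not a function in the repo):
+
+ ```python
+ from langchain_core.documents import Document
+ from langchain_chroma import Chroma
+
+ def generate_doc_id(chunk: Document, postfix: str) -> str:
+     # Stable ID: source file path plus chunk index
+     return f"{chunk.metadata.get('source')}---{postfix}"
+
+ def add_new_chunks(store: Chroma, chunks: list[Document]) -> None:
+     for i, chunk in enumerate(chunks, 1):
+         chunk.id = generate_doc_id(chunk, str(i))
+     # Only add chunks whose IDs are not yet in the store, so interrupted runs can resume
+     existing = {doc.id for doc in store.get_by_ids([c.id for c in chunks if c.id is not None])}
+     new_chunks = [c for c in chunks if c.id not in existing]
+     if new_chunks:
+         store.add_documents(new_chunks)
+ ```
+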
+ ---
+
+ ## Results
+
+ The chatbot successfully answers technical questions about electronic devices by retrieving relevant information from user manuals.
+ Key outcomes include:
+
+ - **Accuracy**: The RAG pipeline ensures responses are highly relevant to the query, leveraging the structured manual dataset.
+ - **Scalability**: The incremental vector database generation supports large datasets and resumable processing.
+ - **Usability**: The Gradio interface provides a seamless experience for users to query the chatbot.
+ - **Deployment**: The project is live on HuggingFace Spaces at [Placeholder Link](https://huggingface.co/spaces/placeholder).
+
+ ---
+
+ ## Setup and Installation
+
+ ### Prerequisites
+ - Python
+ - Git
+ - Virtual environment (recommended)
+
+ ### Installation Steps
+
+ 1. **Clone the Repository**:
+    ```bash
+    git clone https://github.com/syildizz/[your-repo-name].git
+    cd [your-repo-name]
+    ```
+
+ 2. **Set Up a Virtual Environment**:
+    ```bash
+    python -m venv .venv
+    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
+    ```
+
+ 3. **Install Dependencies**:
+    Use the [requirements.txt](./requirements.txt) file to install dependencies by running:
+    ```bash
+    pip install -r requirements.txt
+    ```
+
+ 4. **Configuration**:
+    The public configuration is stored in the [config.py](./config.py) file. The global parameters specified in the config file can be changed if different values are desired.
+
+    Default values:
+    ```python
+    dataset_directory = "dataset"
+    chroma_persist_directory = "chroma_db"
+    huggingface_embedding_model_repo_path = "sentence-transformers/all-mpnet-base-v2"
+    huggingface_vector_embedding_database_repo_path = "syildizz/user-manuals-chromadb"
+    google_llm_model_name = "gemini-2.5-flash"
+    temperature = 0.3
+    ```
+
+ 5. **Configure Environment Variables**:
+    Create a `.env` file in the project root with the following:
+    ```text
+    GEMINI_API_KEY=[your-gemini-api-key]
+    HUGGINGFACE_TOKEN=[your-huggingface-token]
+    ```
+
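+    As a rough illustration of how these are consumed, [create_rag_agent.py](./create_rag_agent.py) loads them at startup with this pattern:
+    ```python
+    import os
+    from dotenv import load_dotenv
+
+    load_dotenv()  # reads .env from the project root
+    gemini_api_key = os.getenv("GEMINI_API_KEY")
+    if not gemini_api_key:
+        raise ValueError("Missing GEMINI_API_KEY in environment")
+    ```
+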
+ 6. **Generate Vector Database** (Optional):
+    If you want to generate a local vector database, run:
+    ```bash
+    python generate_vector_db.py
+    ```
+    NOTE: Do not generate a vector database if you want to pull the public pregenerated database.
+    If a local database does not exist in the next step, [app.py](./app.py) will pull the remote pregenerated database.
+
+ 7. **Run the Application**:
+    Launch the Gradio interface:
+    ```bash
+    python app.py
+    ```
+    The interface will be available at `http://localhost:7860`.
+
+ ---
+
+ ## Web Interface & Product Guide
+
+ The chatbot is deployed on HuggingFace Spaces at [Placeholder Link](https://huggingface.co/spaces/placeholder).
+ The Gradio-based interface allows users to:
+
+ - Enter technical questions about electronic devices.
+ - Receive responses grounded in user manual content.
+
+ ### Usage Instructions
+
+ 1. Visit the HuggingFace Spaces link: [Placeholder Link](https://huggingface.co/spaces/placeholder).
+ 2. Enter a question in the text input field (e.g., "How do I reset my [device name]?").
+ 3. The chatbot will use relevant manual sections to generate a response.
+
+ ### Screenshots
+
+ [Placeholder: Add screenshots or a short video demonstrating the interface]
+
+ ---
+
+ **Live Demo**: [Placeholder HuggingFace Spaces Link](https://huggingface.co/spaces/placeholder)
app.py ADDED
@@ -0,0 +1,6 @@
+
+ from gradio_app import gradio_main
+
+ if __name__ == "__main__":
+     gr_interface = gradio_main()
+     gr_interface.queue().launch()  # pyright: ignore[reportUnusedCallResult]
config.py ADDED
@@ -0,0 +1,7 @@
+
+ dataset_directory = "dataset"
+ chroma_persist_directory = "chroma_db"
+ huggingface_embedding_model_repo_path = "sentence-transformers/all-mpnet-base-v2"
+ huggingface_vector_embedding_database_repo_path = "syildizz/user-manuals-chromadb"
+ google_llm_model_name = "gemini-2.5-flash"
+ temperature = 0.3
create_rag_agent.py ADDED
@@ -0,0 +1,112 @@
+ # create_rag_agent.py
+ import os
+ from typing import Any
+ from dotenv import load_dotenv
+ from langchain.agents import create_agent
+ from langchain_huggingface import HuggingFaceEmbeddings
+ from langgraph.graph.state import CompiledStateGraph
+ from pydantic import SecretStr
+
+ from huggingface_hub import snapshot_download
+
+ # Core types
+ from langchain_core.documents import Document
+
+ # Vector store (ecosystem package)
+ from langchain_chroma import Chroma
+
+ # Google Gemini provider
+ from langchain_google_genai import ChatGoogleGenerativeAI
+
+ from langchain.tools import tool
+
+ import config
+
+
+ def get_chroma_store(
+     chroma_persist_directory: str = config.chroma_persist_directory,
+     huggingface_embedding_model_repo_path: str = config.huggingface_embedding_model_repo_path,
+     huggingface_vector_embedding_database_repo_path: str = config.huggingface_vector_embedding_database_repo_path,
+ ) -> Chroma:
+     """
+     Load an existing Chroma store if present, otherwise pull the pregenerated
+     database from the Hugging Face Hub and persist it locally.
+     """
+
+     embedding_model = HuggingFaceEmbeddings(model_name=huggingface_embedding_model_repo_path)
+
+     # Check for an existing Chroma DB and load it
+     if os.path.exists(chroma_persist_directory) and os.path.isdir(chroma_persist_directory):
+         print(f"✅ Loading existing Chroma DB from: {chroma_persist_directory}")
+     else:
+         print("📥 No local Chroma DB found. Pulling from Hugging Face dataset...")
+
+         # Create the local directory
+         os.makedirs(chroma_persist_directory, exist_ok=True)
+
+         # Download all files from the Hugging Face dataset
+         snapshot_download(  # pyright: ignore[reportUnusedCallResult]
+             repo_id=huggingface_vector_embedding_database_repo_path,
+             repo_type="dataset",
+             local_dir=chroma_persist_directory,
+             ignore_patterns=["*.md", "*.json"],  # Optional: skip non-DB files like the README
+         )
+
+         print(f"✅ Pulled and persisted Chroma DB to: {chroma_persist_directory}")
+
+     return Chroma(
+         embedding_function=embedding_model,
+         persist_directory=chroma_persist_directory
+     )
+
+
+ def create_rag_agent(
+     google_llm_model_name: str = config.google_llm_model_name,
+     temperature: float = 0.3
+ ) -> CompiledStateGraph[Any]:
+     load_dotenv()  # pyright: ignore[reportUnusedCallResult]
+
+     gemini_api_key = os.getenv("GEMINI_API_KEY")
+     if not gemini_api_key:
+         raise ValueError("Missing GEMINI_API_KEY in environment")
+
+     vector_store = get_chroma_store()
+
+     # Create the Gemini chat model (LLM)
+     llm = ChatGoogleGenerativeAI(model=google_llm_model_name, temperature=temperature, google_api_key=SecretStr(gemini_api_key))
+
+     # System prompt; the manual excerpts are supplied by the retrieval tool below
+     system_prompt = """
+     You are provided with a list of sample text that comes from various different user manuals.
+     Your task is to respond to the user using the samples provided to the best of your abilities.
+     The context text is in the following paragraph.
+
+     """
+
+     # Helper to format documents for the prompt
+     def format_docs(docs: list[Document]) -> str:
+         """Formats a list of documents into a single string."""
+         return "\n".join(doc.page_content for doc in docs)
+
+     @tool  # Exposes the function as a tool the agent can call during its reasoning loop
+     def retrieve_context(query: str) -> str:
+         '''Retrieve information that helps answer the query.'''
+         retrieved_docs = vector_store.similarity_search(query, k=5)
+         return format_docs(retrieved_docs)
+
+     rag_agent = create_agent(llm, [retrieve_context], system_prompt=system_prompt)
+
+     return rag_agent
+
+
+ if __name__ == "__main__":
+     rag_agent = create_rag_agent()
+     result: dict[str, Any] | Any = rag_agent.invoke(
+         {"messages": [{"role": "user", "content": "I want to replace the batteries of a sony brand remote. What can I do?"}]}
+     )
+     print(result["messages"][-1].content)
generate_vector_db.py ADDED
@@ -0,0 +1,118 @@
+ from langchain_core.documents import Document
+ from langchain_community.document_loaders import DirectoryLoader, TextLoader
+ from langchain_text_splitters import RecursiveCharacterTextSplitter
+
+ from langchain_huggingface import HuggingFaceEmbeddings
+ from langchain_chroma import Chroma
+ import chromadb.errors
+
+ import gc
+ import os
+
+ import config
+
+ batch_size = 5
+
+
+ def generate_doc_id(chunk: Document, postfix: str) -> str:
+     # Deterministic ID: source file path plus chunk index
+     unique_string = f"{chunk.metadata.get('source')}---{postfix}"
+     return unique_string
+
+
+ def create_chroma_store(
+     dataset_directory: str = config.dataset_directory,
+     chroma_persist_directory: str = config.chroma_persist_directory,
+     huggingface_embedding_model_repo_path: str = config.huggingface_embedding_model_repo_path
+ ) -> Chroma:
+
+     embedding_model = HuggingFaceEmbeddings(model_name=huggingface_embedding_model_repo_path)
+
+     store: Chroma
+
+     if os.path.exists(chroma_persist_directory) and os.path.isdir(chroma_persist_directory):
+         print(f"✅ Loading existing Chroma DB from: {chroma_persist_directory}")
+         store = Chroma(
+             embedding_function=embedding_model,
+             persist_directory=chroma_persist_directory
+         )
+     else:
+         print(f"📦 Creating new Chroma DB at: {chroma_persist_directory} using batch processing.")
+         store = Chroma(
+             embedding_function=embedding_model,
+             persist_directory=chroma_persist_directory
+         )
+
+     try:
+         # Use lazy_load() to get a generator instead of loading all documents into memory
+         loader = DirectoryLoader(
+             path=dataset_directory,
+             glob="**/*.txt",
+             loader_cls=TextLoader,
+             show_progress=True,
+             use_multithreading=False,
+             randomize_sample=True
+         )
+         # Use an iterator to avoid loading all documents at once
+         document_iterator = loader.lazy_load()
+     except FileNotFoundError:
+         raise FileNotFoundError(f"🚨 Error: '{dataset_directory}' directory not found.")
+
+     # Splitter for document chunks
+     splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
+
+     doc_batch: list[Document] = []
+
+     global batch_size
+
+     try:
+         for document in document_iterator:
+
+             doc_batch.append(document)
+
+             if len(doc_batch) >= batch_size:
+                 print(f"Processing batch of {len(doc_batch)} documents...")
+                 # Split the current batch of documents into chunks;
+                 # splitting a small batch at a time is memory-efficient
+                 chunks = splitter.split_documents(doc_batch)
+
+                 if len(chunks) == 0:
+                     doc_batch = []
+                     continue
+
+                 # Assign a deterministic ID to every chunk
+                 for i, chunk in enumerate(chunks, 1):
+                     chunk.id = generate_doc_id(chunk, str(i))
+
+                 # Skip chunks that are already persisted so an interrupted run can resume
+                 existing_ids = [doc.id for doc in store.get_by_ids([chunk.id for chunk in chunks if chunk.id is not None]) if doc.id is not None]
+                 unadded_chunks = [chunk for chunk in chunks if chunk.id is not None and chunk.id not in existing_ids]
+
+                 if len(unadded_chunks) != 0:
+                     try:
+                         store.add_documents(unadded_chunks)  # pyright: ignore[reportUnusedCallResult]
+                     except chromadb.errors.InternalError:
+                         # Halve the batch size; the skipped chunks are picked up on a later run
+                         batch_size //= 2
+
+                 # Reset the batch list
+                 doc_batch = []
+
+                 gc.collect()  # pyright: ignore[reportUnusedCallResult]
+
+                 # Optional: throttle between batches (e.g. sleep(61)) if the backend rate-limits
+
+         # Process the final batch (if any)
+         if doc_batch:
+             print(f"Processing final batch of {len(doc_batch)} documents...")
+             chunks = splitter.split_documents(doc_batch)
+
+             store.add_documents(chunks)  # pyright: ignore[reportUnusedCallResult]
+
+     except KeyboardInterrupt:
+         # Allow Ctrl-C to stop processing; already-persisted chunks are kept
+         print("Process interrupted")
+
+     return store
+
+
+ def main():
+     vectorstore = create_chroma_store()  # pyright: ignore[reportUnusedVariable]
+
+
+ if __name__ == "__main__":
+     main()
gradio_app.py ADDED
@@ -0,0 +1,66 @@
+ from typing import Any
+ import gradio as gr
+ from create_rag_agent import create_rag_agent
+
+
+ def gradio_main():
+
+     rag_agent = create_rag_agent()
+
+     def rag_agent_response(message: str, history: list[dict[str, Any]]):
+         """
+         The function integrated with Gradio, calling the LangChain rag_agent.
+         It passes the full conversation history for conversational context.
+         """
+
+         full_messages = history + [{"role": "user", "content": message}]
+
+         agent_input = {
+             "messages": full_messages
+         }
+
+         stream = rag_agent.stream(agent_input)
+
+         current_response = ""
+
+         # Iterate over the stream of chunks
+         for chunk in stream:
+
+             model_in_chunk = chunk.get("model", [])
+
+             if model_in_chunk:
+
+                 messages_in_chunk = model_in_chunk.get("messages", [])
+
+                 if messages_in_chunk:
+                     # The final item in the messages list contains the generated text chunk
+                     message_chunk = messages_in_chunk[-1]
+
+                     # Use getattr to safely get the content from a message object/chunk
+                     content_chunk = getattr(message_chunk, "text", None)
+
+                     if content_chunk:
+                         # Accumulate and yield the running response
+                         current_response += content_chunk
+                         yield current_response
+
+     gr_interface = gr.ChatInterface(
+         fn=rag_agent_response,
+         type="messages",
+         chatbot=gr.Chatbot(
+             height=500,
+             label="LangChain Conversational RAG Chatbot",
+             type="messages"
+         ),
+         textbox=gr.Textbox(placeholder="Enter your query here...", container=False, scale=7),
+         title="LangChain RAG Agent Integrated with Gradio (Conversational)",
+         description="This interface passes the full conversation history to the agent for context.",
+         theme="soft"
+     )
+
+     return gr_interface
+
+
+ if __name__ == "__main__":
+     gradio_main().queue().launch()  # pyright: ignore[reportUnusedCallResult]
requirements.txt ADDED
@@ -0,0 +1,12 @@
+ python-dotenv
+ pydantic
+ langchain==1.0.1
+ langchain-core
+ langchain-chroma
+ langchain-community
+ langchain-text-splitters
+ langchain-google-genai
+ langchain-huggingface
+ langgraph
+ huggingface_hub
+ sentence-transformers
+ chromadb
+ gradio