Spaces: rmayormartins/nlp-rag-langchain

rmayormartins committed on Nov 25, 2024

Commit

36b50a6

1 Parent(s): 2239ff4

go

Files changed (3)

README.md +95 -7
app.py +208 -0
requirements.txt +9 -0

README.md CHANGED

@@ -1,14 +1,102 @@
 ---
-title: Nlp Rag Langchain
-emoji: 👀
-colorFrom: purple
-colorTo: blue
 sdk: gradio
-sdk_version: 5.6.0
 app_file: app.py
 pinned: false
 license: ecl-2.0
-short_description: Retrieval-Augmented Generation (RAG) system for questions
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title:  Multilingual RAG Question-Answering System
+emoji: 🈁↔️🤖
+colorFrom: blue
+colorTo: green
 sdk: gradio
+sdk_version: 4.7.1
 app_file: app.py
 pinned: false
 license: ecl-2.0
 ---
+# Multilingual RAG Question-Answering System
+This project implements a Retrieval-Augmented Generation (RAG) system for question answering in multiple languages. It retrieves relevant passages with multilingual-e5-large embeddings and generates answers with Flan-T5, using only the text supplied by the user.
+## Developer
+Developed by Ramon Mayor Martins (2024)
+* Email: rmayormartins@gmail.com
+* Homepage: https://rmayormartins.github.io/
+* Twitter: @rmayormartins
+* GitHub: https://github.com/rmayormartins
+* Space: https://huggingface.co/rmayormartins
+## Technologies Used
+* **LangChain:** Framework for developing applications powered by language models, providing tools for document loading, text splitting, and creating chains of operations.
+* **Sentence Transformers:** Library for computing text embeddings; this project uses the multilingual-e5-large model to embed text in multiple languages.
+* **Flan-T5:** Instruction-tuned sequence-to-sequence language model from Google, used here to generate answers from the retrieved context.
+* **Chroma DB:** Lightweight vector database for storing and retrieving text embeddings efficiently, enabling semantic search capabilities.
+* **Gradio:** Framework for creating user-friendly web interfaces for machine learning models, providing an intuitive way to interact with the RAG system.
+* **HuggingFace Transformers:** Library providing access to state-of-the-art transformer models, tokenizers, and pipelines.
+* **PyTorch:** Deep learning framework that powers the underlying models and computations.
+## Key Features
+* **Multilingual Support:** Process and answer questions in multiple languages (English, Spanish, Portuguese, and more)
+* **Document Chunking:** Recursive character-based splitting (500-character chunks with a 50-character overlap) for handling long documents
+* **Semantic Search:** Retrieves the passages most similar to the question using multilingual-e5-large embeddings
+* **Source Attribution:** Provides references to the relevant text passages used for answers
+* **User-Friendly Interface:** Simple web interface for text input and question answering
+## How it Works
+1. **Text Processing:**
+   - User inputs a text document
+   - System splits text into manageable chunks
+   - Chunks are converted into embeddings using multilingual-e5-large
+2. **Knowledge Base Creation:**
+   - Embeddings are stored in Chroma vector database
+   - Document metadata is preserved for source attribution
+3. **Question Answering:**
+   - User asks a question in any supported language
+   - System retrieves relevant document chunks
+   - Flan-T5 generates a coherent answer based on retrieved context
+   - Sources are displayed for transparency
+## How to Use
+1. Open the application interface
+2. Paste your reference text in the "Base Text" field
+3. Enter your question in any supported language
+4. Receive an answer along with relevant source excerpts
+## Example Use Cases
+* Document analysis and comprehension
+* Educational Q&A systems
+* Multilingual information retrieval
+* Research assistance
+* Content summarization
+## Technical Architecture
+* **Embedding Model:** intfloat/multilingual-e5-large
+* **Language Model:** google/flan-t5-large
+* **Vector Store:** Chroma
+* **Chunk Size:** 500 characters (50-character overlap)
+* **Retrieved Chunks per Question:** 4
+## Local Development
+```bash
+pip install -r requirements.txt
+python app.py
+```
+## Deployment
+This application is deployed on Hugging Face Spaces. You can access it at https://huggingface.co/spaces/rmayormartins/nlp-rag-langchain
+## Note
+The system's responses are generated solely based on the provided text. The quality of answers depends on the content and clarity of the input text.
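
The "How it Works" steps in the README (split, embed, retrieve) can be illustrated without downloading any model. The following is a minimal sketch, not the code from this commit: `split_text` is a fixed-width sliding window that only approximates `RecursiveCharacterTextSplitter`, and `embed` is a toy bag-of-words vector standing in for multilingual-e5-large. All function names here are hypothetical; the constants mirror the values stated in the README and `app.py`.

```python
import math
import re
from collections import Counter

CHUNK_SIZE = 500     # characters, as stated in the README
CHUNK_OVERLAP = 50   # characters, as set in app.py
TOP_K = 4            # chunks retrieved per question, as set in app.py


def split_text(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Fixed-width sliding window (assumption: a simplification of the real splitter)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=TOP_K):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

In the actual application the ranking is done by Chroma over dense embeddings, so results will differ from this word-overlap approximation; the sketch only shows the shape of the pipeline.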

app.py ADDED

	@@ -0,0 +1,208 @@

+import os
+from typing import List, Tuple, Dict
+from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer
+from sentence_transformers import SentenceTransformer
+from langchain_community.vectorstores import Chroma
+from langchain.chains import RetrievalQA
+from langchain_community.embeddings import HuggingFaceEmbeddings
+from langchain_community.llms import HuggingFacePipeline
+from langchain.text_splitter import RecursiveCharacterTextSplitter
+from langchain.prompts import PromptTemplate
+import gradio as gr
+import torch
+class EnhancedRAGSystem:
+    def __init__(self):
+        self.chunk_size = 500
+        self.chunk_overlap = 50
+        self.k_documents = 4
+        self.text_splitter = RecursiveCharacterTextSplitter(
+            chunk_size=self.chunk_size,
+            chunk_overlap=self.chunk_overlap,
+            length_function=len
+        )
+        self.embedding_model_name = "intfloat/multilingual-e5-large"
+        self.llm_model_name = "google/flan-t5-large"
+        self.prompt_template = PromptTemplate(
+            template="""Use the context below to answer the question.
+            If the answer is not in the context, say "I don't have enough information in the context to answer this question."
+            Context: {context}
+            Question: {question}
+            Detailed answer:""",
+            input_variables=["context", "question"]
+        )
+        self.embeddings = HuggingFaceEmbeddings(
+            model_name=self.embedding_model_name,
+            model_kwargs={'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
+        )
+        self.tokenizer = AutoTokenizer.from_pretrained(self.llm_model_name)
+        self.model = AutoModelForSeq2SeqLM.from_pretrained(self.llm_model_name)
+        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
+        self.model.to(self.device)
+        self.pipe = pipeline(
+            "text2text-generation",
+            model=self.model,
+            tokenizer=self.tokenizer,
+            max_length=512,
+            device=0 if torch.cuda.is_available() else -1,
+            model_kwargs={"temperature": 0.7}
+        )
+        self.llm = HuggingFacePipeline(pipeline=self.pipe)
+    def process_documents(self, text: str) -> bool:
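+        try:
+            # Drop the collection left by a previous call so earlier texts do not leak into retrieval
+            if getattr(self, "vectorstore", None) is not None:
+                self.vectorstore.delete_collection()
+            texts = self.text_splitter.split_text(text)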
+            self.vectorstore = Chroma.from_texts(
+                texts,
+                self.embeddings,
+                metadatas=[{"source": f"chunk_{i}", "text": t} for i, t in enumerate(texts)],
+                collection_name="enhanced_rag_docs"
+            )
+            self.retriever = self.vectorstore.as_retriever(
+                search_kwargs={"k": self.k_documents}
+            )
+            self.qa_chain = RetrievalQA.from_chain_type(
+                llm=self.llm,
+                chain_type="stuff",
+                retriever=self.retriever,
+                return_source_documents=True,
+                chain_type_kwargs={"prompt": self.prompt_template}
+            )
+            return True
+        except Exception as e:
+            print(f"Processing error: {str(e)}")
+            return False
+    def answer_question(self, question: str) -> Tuple[str, str]:
+        try:
+            response = self.qa_chain.invoke({"query": question})
+            answer = response["result"]
+            sources = []
+            for i, doc in enumerate(response["source_documents"], 1):
+                text_preview = doc.page_content[:100] + "..."
+                sources.append(f"Excerpt {i}: {text_preview}")
+            sources_text = "\n".join(sources)
+            return answer, sources_text
+        except Exception as e:
+            return f"Error answering: {str(e)}", ""
+def create_enhanced_interface():
+    rag_system = EnhancedRAGSystem()
+    def process_and_answer(text: str, question: str) -> str:
+        if not text.strip() or not question.strip():
+            return "Please provide both text and question."
+        if not rag_system.process_documents(text):
+            return "Error processing the text."
+        answer, sources = rag_system.answer_question(question)
+        if sources:
+            return f"""Answer: {answer}
+Relevant excerpts consulted:
+{sources}"""
+        return answer
+    # CSS for the header block
+    custom_css = """
+        .custom-description {
+            margin-bottom: 20px;
+            text-align: center;
+        }
+        .custom-description a {
+            text-decoration: none;
+            color: #007bff;
+            margin: 0 5px;
+        }
+        .custom-description a:hover {
+            text-decoration: underline;
+        }
+    """
+    with gr.Blocks(css=custom_css) as interface:
+        gr.HTML("""
+            <div class="custom-description">
+                <h1>Advanced RAG with Multilingual Support</h1>
+                <p>Ramon Mayor Martins:
+                    <a href="https://rmayormartins.github.io/" target="_blank">Website</a> |
+                    <a href="https://huggingface.co/rmayormartins" target="_blank">Spaces</a> |
+                    <a href="https://github.com/rmayormartins" target="_blank">GitHub</a>
+                </p>
+                <p>This system uses Retrieval-Augmented Generation (RAG) to answer questions about your texts in multiple languages.
+                Simply paste your text and ask questions in any language!</p>
+            </div>
+        """)
+        with gr.Row():
+            with gr.Column():
+                text_input = gr.Textbox(
+                    label="Base Text",
+                    placeholder="Paste here the text that will serve as knowledge base...",
+                    lines=10
+                )
+                question_input = gr.Textbox(
+                    label="Your Question",
+                    placeholder="What would you like to know about the text?"
+                )
+                submit_btn = gr.Button("Submit")
+            with gr.Column():
+                output = gr.Textbox(label="Answer")
+        examples = [
+            ["The Earth is the third planet from the Sun. It has one natural satellite called the Moon. It is the only known planet to harbor life.",
+             "What is Earth's natural satellite?"],
+            ["La Tierra es el tercer planeta del Sistema Solar. Tiene un satélite natural llamado Luna. Es el único planeta conocido que alberga vida.",
+             "¿Cuál es el satélite natural de la Tierra?"],
+            ["A Terra é o terceiro planeta do Sistema Solar. Tem um satélite natural chamado Lua. É o único planeta conhecido que abriga vida.",
+             "Qual é o satélite natural da Terra?"],
+            ["The Sun is a medium-sized star at the center of our Solar System. It provides light and heat to all planets.",
+             "What is the Sun?"],
+            ["El Sol es una estrella de tamaño medio en el centro de nuestro Sistema Solar. Proporciona luz y calor a todos los planetas.",
+             "¿Qué es el Sol?"],
+            ["O Sol é uma estrela de tamanho médio no centro do nosso Sistema Solar. Ele fornece luz e calor para todos os planetas.",
+             "O que é o Sol?"]
+        ]
+        gr.Examples(
+            examples=examples,
+            inputs=[text_input, question_input],
+            outputs=output,
+            fn=process_and_answer,
+            cache_examples=True
+        )
+        submit_btn.click(
+            fn=process_and_answer,
+            inputs=[text_input, question_input],
+            outputs=output
+        )
+    return interface
+if __name__ == "__main__":
+    demo = create_enhanced_interface()
+    demo.launch()
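
To make the question-answering step in `app.py` concrete, here is a small sketch of what a "stuff" chain does with the retrieved chunks: it joins them into one context string and fills the prompt template. It also mirrors the 100-character source previews built in `answer_question`. The helper names `build_prompt` and `format_sources` are hypothetical, the template is re-typed without the original indentation, and the `"\n\n"` separator is assumed to match LangChain's default document separator.

```python
PROMPT = (
    "Use the context below to answer the question.\n"
    'If the answer is not in the context, say "I don\'t have enough information '
    'in the context to answer this question."\n'
    "Context: {context}\n"
    "Question: {question}\n"
    "Detailed answer:"
)


def build_prompt(chunks, question):
    """Join the retrieved chunks and fill the template (hypothetical helper)."""
    context = "\n\n".join(chunks)  # assumed default separator of the "stuff" chain
    return PROMPT.format(context=context, question=question)


def format_sources(chunks, preview_chars=100):
    """Numbered previews of each chunk, truncated like answer_question does."""
    return "\n".join(
        f"Excerpt {i}: {chunk[:preview_chars]}..." for i, chunk in enumerate(chunks, 1)
    )
```

Because all four retrieved chunks are placed in a single prompt, the combined context plus question must fit within the model's input limit; with 500-character chunks this is roughly 2,000 characters of context per question.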

requirements.txt ADDED

	@@ -0,0 +1,9 @@

+langchain==0.1.0
+langchain-community==0.0.10
+chromadb==0.4.22
+sentence-transformers==2.2.2
+gradio==4.8.0
+torch==2.1.2
+transformers==4.36.2
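
Since every dependency is pinned with `==`, and the README front matter declares `sdk_version: 4.7.1` while this file pins `gradio==4.8.0`, it can be useful to compare the pins against what is actually installed. The following is a sketch using only the standard library; `parse_pins` and `check_pins` are hypothetical helper names, and the parser handles only simple `name==version` lines.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_pins(lines):
    """Turn 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, pinned = line.split("==", 1)
        pins[name.strip()] = pinned.strip()
    return pins


def check_pins(pins):
    """Return {name: (pinned, installed_or_None)} for every package that differs."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

A possible use is `check_pins(parse_pins(open("requirements.txt")))`, which returns an empty dict when the environment matches the pins.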