rmayormartins committed on
Commit 36b50a6 · 1 Parent(s): 2239ff4
Files changed (3)
  1. README.md +95 -7
  2. app.py +208 -0
  3. requirements.txt +9 -0
README.md CHANGED
@@ -1,14 +1,102 @@
  ---
- title: Nlp Rag Langchain
- emoji: 👀
- colorFrom: purple
- colorTo: blue
+ title: Multilingual RAG Question-Answering System
+ emoji: 🈁↔️🤖
+ colorFrom: blue
+ colorTo: green
  sdk: gradio
- sdk_version: 5.6.0
+ sdk_version: 4.7.1
  app_file: app.py
  pinned: false
  license: ecl-2.0
- short_description: Retrieval-Augmented Generation (RAG) system for questions
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Multilingual RAG Question-Answering System
+
+ This project implements a Retrieval-Augmented Generation (RAG) system for question answering in multiple languages. It uses advanced language models and embeddings to provide accurate answers based on provided texts.
+
+ ## Developer
+ Developed by Ramon Mayor Martins (2024)
+ * Email: rmayormartins@gmail.com
+ * Homepage: https://rmayormartins.github.io/
+ * Twitter: @rmayormartins
+ * GitHub: https://github.com/rmayormartins
+ * Space: https://huggingface.co/rmayormartins
+
+ ## Technologies Used
+
+ * **LangChain:** Framework for developing applications powered by language models, providing tools for document loading, text splitting, and creating chains of operations.
+
+ * **Sentence Transformers:** Library for state-of-the-art text embeddings, using the multilingual-e5-large model for superior multilingual understanding.
+
+ * **Flan-T5:** Advanced language model from Google that excels at various NLP tasks, particularly strong in multilingual text generation and understanding.
+
+ * **Chroma DB:** Lightweight vector database for storing and retrieving text embeddings efficiently, enabling semantic search capabilities.
+
+ * **Gradio:** Framework for creating user-friendly web interfaces for machine learning models, providing an intuitive way to interact with the RAG system.
+
+ * **HuggingFace Transformers:** Library providing access to state-of-the-art transformer models, tokenizers, and pipelines.
+
+ * **PyTorch:** Deep learning framework that powers the underlying models and computations.
+
+ ## Key Features
+
+ * **Multilingual Support:** Process and answer questions in multiple languages (English, Spanish, Portuguese, and more)
+ * **Document Chunking:** Smart text splitting for handling long documents
+ * **Semantic Search:** Uses advanced embeddings for accurate information retrieval
+ * **Source Attribution:** Provides references to the relevant text passages used for answers
+ * **User-Friendly Interface:** Simple web interface for text input and question answering
+
+ ## How it Works
+
+ 1. **Text Processing:**
+    - User inputs a text document
+    - System splits text into manageable chunks
+    - Chunks are converted into embeddings using multilingual-e5-large
+
+ 2. **Knowledge Base Creation:**
+    - Embeddings are stored in Chroma vector database
+    - Document metadata is preserved for source attribution
+
+ 3. **Question Answering:**
+    - User asks a question in any supported language
+    - System retrieves relevant document chunks
+    - Flan-T5 generates a coherent answer based on retrieved context
+    - Sources are displayed for transparency
+
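The retrieval step in the "How it Works" list above can be illustrated with a minimal, dependency-free sketch. It is not the code from this commit: the `embed` function is a toy word-count stand-in for multilingual-e5-large, and the names `embed`, `cosine`, and `retrieve` are hypothetical. It only shows the idea of ranking chunks by similarity to the question and keeping the top `k`.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model (assumption): lowercase word counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(chunks: list[str], question: str, k: int = 4) -> list[str]:
    """Return the k chunks most similar to the question, best first."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)
    return ranked[:k]


chunks = [
    "The Earth has one natural satellite called the Moon.",
    "The Sun is a medium-sized star.",
    "Mars has two small moons.",
]
top = retrieve(chunks, "What is the natural satellite of the Earth?", k=1)
print(top)
```

In the actual application the ranking is delegated to the Chroma retriever, and the retrieved chunks are placed into the prompt as context for Flan-T5.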
+ ## How to Use
+
+ 1. Open the application interface
+ 2. Paste your reference text in the "Base Text" field
+ 3. Enter your question in any supported language
+ 4. Receive an answer along with relevant source excerpts
+
+ ## Example Use Cases
+
+ * Document analysis and comprehension
+ * Educational Q&A systems
+ * Multilingual information retrieval
+ * Research assistance
+ * Content summarization
+
+ ## Technical Architecture
+
+ * **Embedding Model:** intfloat/multilingual-e5-large
+ * **Language Model:** google/flan-t5-large
+ * **Vector Store:** Chroma
+ * **Chunk Size:** 500 characters
+ * **Context Window:** 4 documents
+
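To make the chunk-size setting above concrete, here is a simplified sketch of fixed-size character chunking with overlap. The 500-character size comes from the list above and the 50-character overlap from `chunk_overlap` in `app.py` in this commit. The function name `chunk_text` is hypothetical, and this is deliberately simpler than LangChain's `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraph and line boundaries.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]


sample = "".join(chr(97 + i % 26) for i in range(1200))
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

With these settings a 1200-character text yields three chunks, and each chunk starts with the final 50 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk in most cases.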
+ ## Local Development
+
+ ```bash
+ pip install -r requirements.txt
+ python app.py
+ ```
+
+ ## Deployment
+
+ This application is deployed on Hugging Face Spaces. You can access it at [https://huggingface.co/spaces/rmayormartins/nlp-rag-langchain].
+
+ ## Note
+
+ The system's responses are generated solely based on the provided text. The quality of answers depends on the content and clarity of the input text.
app.py ADDED
@@ -0,0 +1,208 @@
+ import os
+ from typing import List, Tuple, Dict
+ from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer
+ from sentence_transformers import SentenceTransformer
+ from langchain_community.vectorstores import Chroma
+ from langchain.chains import RetrievalQA
+ from langchain_community.embeddings import HuggingFaceEmbeddings
+ from langchain.llms import HuggingFacePipeline
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
+ from langchain.prompts import PromptTemplate
+ import gradio as gr
+ import torch
+
+ class EnhancedRAGSystem:
+     def __init__(self):
+         self.chunk_size = 500
+         self.chunk_overlap = 50
+         self.k_documents = 4
+
+         self.text_splitter = RecursiveCharacterTextSplitter(
+             chunk_size=self.chunk_size,
+             chunk_overlap=self.chunk_overlap,
+             length_function=len
+         )
+
+         self.embedding_model_name = "intfloat/multilingual-e5-large"
+         self.llm_model_name = "google/flan-t5-large"
+
+         self.prompt_template = PromptTemplate(
+             template="""Use the context below to answer the question.
+ If the answer is not in the context, say "I don't have enough information in the context to answer this question."
+
+ Context: {context}
+ Question: {question}
+
+ Detailed answer:""",
+             input_variables=["context", "question"]
+         )
+
+         self.embeddings = HuggingFaceEmbeddings(
+             model_name=self.embedding_model_name,
+             model_kwargs={'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
+         )
+
+         self.tokenizer = AutoTokenizer.from_pretrained(self.llm_model_name)
+         self.model = AutoModelForSeq2SeqLM.from_pretrained(self.llm_model_name)
+
+         self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
+         self.model.to(self.device)
+
+         self.pipe = pipeline(
+             "text2text-generation",
+             model=self.model,
+             tokenizer=self.tokenizer,
+             max_length=512,
+             device=0 if torch.cuda.is_available() else -1,
+             model_kwargs={"temperature": 0.7}
+         )
+
+         self.llm = HuggingFacePipeline(pipeline=self.pipe)
+
+     def process_documents(self, text: str) -> bool:
+         try:
+             texts = self.text_splitter.split_text(text)
+
+             self.vectorstore = Chroma.from_texts(
+                 texts,
+                 self.embeddings,
+                 metadatas=[{"source": f"chunk_{i}", "text": t} for i, t in enumerate(texts)],
+                 collection_name="enhanced_rag_docs"
+             )
+
+             self.retriever = self.vectorstore.as_retriever(
+                 search_kwargs={"k": self.k_documents}
+             )
+
+             self.qa_chain = RetrievalQA.from_chain_type(
+                 llm=self.llm,
+                 chain_type="stuff",
+                 retriever=self.retriever,
+                 return_source_documents=True,
+                 chain_type_kwargs={"prompt": self.prompt_template}
+             )
+             return True
+         except Exception as e:
+             print(f"Processing error: {str(e)}")
+             return False
+
+     def answer_question(self, question: str) -> Tuple[str, str]:
+         try:
+             response = self.qa_chain({"query": question})
+             answer = response["result"]
+
+             sources = []
+             for i, doc in enumerate(response["source_documents"], 1):
+                 text_preview = doc.page_content[:100] + "..."
+                 sources.append(f"Excerpt {i}: {text_preview}")
+
+             sources_text = "\n".join(sources)
+             return answer, sources_text
+         except Exception as e:
+             return f"Error answering: {str(e)}", ""
+
+ def create_enhanced_interface():
+     rag_system = EnhancedRAGSystem()
+
+     def process_and_answer(text: str, question: str) -> str:
+         if not text.strip() or not question.strip():
+             return "Please provide both text and question."
+
+         if not rag_system.process_documents(text):
+             return "Error processing the text."
+
+         answer, sources = rag_system.answer_question(question)
+
+         if sources:
+             return f"""Answer: {answer}
+
+ Relevant excerpts consulted:
+ {sources}"""
+         return answer
+
+     # CSS for the header
+     custom_css = """
+     .custom-description {
+         margin-bottom: 20px;
+         text-align: center;
+     }
+     .custom-description a {
+         text-decoration: none;
+         color: #007bff;
+         margin: 0 5px;
+     }
+     .custom-description a:hover {
+         text-decoration: underline;
+     }
+     """
+
+     with gr.Blocks(css=custom_css) as interface:
+         gr.HTML("""
+             <div class="custom-description">
+                 <h1>Advanced RAG with Multilingual Support</h1>
+                 <p>Ramon Mayor Martins:
+                     <a href="https://rmayormartins.github.io/" target="_blank">Website</a> |
+                     <a href="https://huggingface.co/rmayormartins" target="_blank">Spaces</a> |
+                     <a href="https://github.com/rmayormartins" target="_blank">GitHub</a>
+                 </p>
+                 <p>This system uses Retrieval-Augmented Generation (RAG) to answer questions about your texts in multiple languages.
+                 Simply paste your text and ask questions in any language!</p>
+
+             </div>
+         """)
+
+         with gr.Row():
+             with gr.Column():
+                 text_input = gr.Textbox(
+                     label="Base Text",
+                     placeholder="Paste here the text that will serve as knowledge base...",
+                     lines=10
+                 )
+                 question_input = gr.Textbox(
+                     label="Your Question",
+                     placeholder="What would you like to know about the text?"
+                 )
+                 submit_btn = gr.Button("Submit")
+
+             with gr.Column():
+                 output = gr.Textbox(label="Answer")
+
+         examples = [
+             ["The Earth is the third planet from the Sun. It has one natural satellite called the Moon. It is the only known planet to harbor life.",
+              "What is Earth's natural satellite?"],
+
+             ["La Tierra es el tercer planeta del Sistema Solar. Tiene un satélite natural llamado Luna. Es el único planeta conocido que alberga vida.",
+              "¿Cuál es el satélite natural de la Tierra?"],
+
+             ["A Terra é o terceiro planeta do Sistema Solar. Tem um satélite natural chamado Lua. É o único planeta conhecido que abriga vida.",
+              "Qual é o satélite natural da Terra?"],
+
+             ["The Sun is a medium-sized star at the center of our Solar System. It provides light and heat to all planets.",
+              "What is the Sun?"],
+
+             ["El Sol es una estrella de tamaño medio en el centro de nuestro Sistema Solar. Proporciona luz y calor a todos los planetas.",
+              "¿Qué es el Sol?"],
+
+             ["O Sol é uma estrela de tamanho médio no centro do nosso Sistema Solar. Ele fornece luz e calor para todos os planetas.",
+              "O que é o Sol?"]
+         ]
+
+         gr.Examples(
+             examples=examples,
+             inputs=[text_input, question_input],
+             outputs=output,
+             fn=process_and_answer,
+             cache_examples=True
+         )
+
+         submit_btn.click(
+             fn=process_and_answer,
+             inputs=[text_input, question_input],
+             outputs=output
+         )
+
+     return interface
+
+ if __name__ == "__main__":
+     demo = create_enhanced_interface()
+     demo.launch()
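
One point worth checking in `process_documents`: it passes the fixed `collection_name="enhanced_rag_docs"` on every call. If the underlying Chroma client keeps that collection alive between calls (an assumption, not verified here against the pinned `chromadb` version), chunks from an earlier text could be retrieved for a later, unrelated question. A possible mitigation, sketched below, is to derive the collection name from the text itself. The helper `collection_name_for` is hypothetical and not part of this commit.

```python
import hashlib


def collection_name_for(text: str, prefix: str = "rag") -> str:
    """Derive a short, stable collection name from the text content.

    Hypothetical helper: the same text always maps to the same name,
    and different texts map to different names with high probability.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}_{digest}"


name_en = collection_name_for("The Earth is the third planet from the Sun.")
name_pt = collection_name_for("A Terra é o terceiro planeta do Sistema Solar.")
print(name_en, name_pt)
```

The result could then be passed as `collection_name=collection_name_for(text)` in the `Chroma.from_texts` call. The trade-off is that collections for old texts are never cleaned up unless they are deleted explicitly.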
requirements.txt ADDED
@@ -0,0 +1,9 @@
+ langchain==0.1.0
+ langchain-community==0.0.10
+ chromadb==0.4.22
+ sentence-transformers==2.2.2
+ gradio==4.8.0
+ torch==2.1.2
+ transformers==4.36.2
+
+