Space: baobuiquang/nlqna-chatbot

baobuiquang committed on Mar 25

Commit 00c9a21 • 1 Parent(s): b83df1c

initial commit

Files changed (5)

.gitignore +2 -0
README.md +181 -13
app.py +74 -55
data/sample.xlsx +0 -0
requirements.txt +0 -0

.gitignore ADDED

@@ -0,0 +1,2 @@
+venv/
+.vscode/

README.md CHANGED

@@ -1,13 +1,181 @@
----
-title: Chatbot
-emoji: 🐠
-colorFrom: green
-colorTo: indigo
-sdk: gradio
-sdk_version: 4.22.0
-app_file: app.py
-pinned: false
-license: unknown
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# Natural Language Q&A Chatbot
+## Problem
+Input:
+* `data` - Example: `data/sample.xlsx`
+* `question` - Example: "Tổng số hồ sơ chứng thực chữ ký vào ngày 12 tháng 1 năm 2024 là bao nhiêu?" (English: "What is the total number of signature certification records on January 12, 2024?")
+Expected output:
+* `answer` - Example: "165"
+## Solution Approach
+### Preprocessing `data`:
+* Raw Data (`.XLSX`)
+* ↳ Raw Dataframe (`Pandas DF`)
+* ↳ Preprocessed Dataframe (`Pandas DF`)
+### Feature Extraction from `data` and `question`:
+* Preprocessed Dataframe Data / Question (`String`)
+* ↳ Embedding (`PyTorch Tensor`)
+#### Model:
+* Stable Model: [HF/XLM-ROBERTA-ME5-BASE](https://huggingface.co/baobuiquang/XLM-ROBERTA-ME5-BASE) (License: [MIT License](https://choosealicense.com/licenses/mit/))
+  * Forked from: [HF/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) (License: [MIT License](https://choosealicense.com/licenses/mit/))
+    * Initialized from [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) (License: [MIT License](https://choosealicense.com/licenses/mit/))
+### Feature Map Down Sampling Method: [Mean Pooling](https://paperswithcode.com/method/average-pooling)
+* Reduces computational cost -> faster chatbot (speed)
+* Helps prevent overfitting -> better answers (accuracy)
+### Measurement: [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity)
+* Input:
+  * Embedding `a` (`PyTorch Tensor`)
+  * Embedding `b` (`PyTorch Tensor`)
+* Output:
+  * Cosine Similarity: The cosine of the angle between the 2 non-zero vectors `a` and `b` in space.
+```
+cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
+```
+### Interactive UI
+The chatbot's web UI is currently built with [gradio](https://github.com/gradio-app/gradio) (License: [Apache-2.0 License](https://choosealicense.com/licenses/apache-2.0/)).
+## Example and Rough Explanation
+Sample data: [sample.xlsx](https://github.com/baobuiquang/nlqna-chatbot/blob/main/data/sample.xlsx)
+### Step 1. Input:
+* `question` = "Tổng số hồ sơ chứng thực chữ ký vào ngày 12 tháng 1 năm 2024 là bao nhiêu?"
+* `data` = `data/sample.xlsx`
+|                                                         |       |                |                |                |       |
+| :-----------------------------------------------------: | :---: | :------------: | :------------: | :------------: | :---: |
+|                                                         |  ...  | **11/01/2024** | **12/01/2024** | **13/01/2024** |  ...  |
+|                           ...                           |       |                |                |                |       |
+|      **Tổng số HS chứng thực hợp đồng, giao dịch**      |       |      156       |      161       |      177       |       |
+|            **Tổng số HS chứng thực chữ ký**             |       |      159       |      165       |      182       |       |
+| **Tổng số HS chứng thực việc sửa đổi, bổ sung, hủy bỏ** |       |      162       |      169       |      187       |       |
+|                           ...                           |       |                |                |                |       |
+### Step 2. Feature Extraction:
+* `question` -> `question_embedding` (`PyTorch Tensor`)
+* `data` -> `data_embeddings` (Map of `PyTorch Tensors`)
+|                     |       |                     |                     |                     |       |
+| :-----------------: | :---: | :-----------------: | :-----------------: | :-----------------: | :---: |
+|                     |  ...  | ***\<PT Tensor\>*** | ***\<PT Tensor\>*** | ***\<PT Tensor\>*** |  ...  |
+|         ...         |       |                     |                     |                     |       |
+| ***\<PT Tensor\>*** |       |         156         |         161         |         177         |       |
+| ***\<PT Tensor\>*** |       |         159         |         165         |         182         |       |
+| ***\<PT Tensor\>*** |       |         162         |         169         |         187         |       |
+|         ...         |       |                     |                     |                     |       |
+### Step 3. Measurement Calculation:
+Calculate the Cosine Similarity between `question_embedding` and `data_embeddings`.
+|                 |       |                 |                 |                 |       |
+| :-------------: | :---: | :-------------: | :-------------: | :-------------: | :---: |
+|                 |  ...  | ***{cos_sim}*** | ***{cos_sim}*** | ***{cos_sim}*** |  ...  |
+|       ...       |       |                 |                 |                 |       |
+| ***{cos_sim}*** |       |       156       |       161       |       177       |       |
+| ***{cos_sim}*** |       |       159       |       165       |       182       |       |
+| ***{cos_sim}*** |       |       162       |       169       |       187       |       |
+|       ...       |       |                 |                 |                 |       |
+### Step 4. Output:
+Find the highest cosine similarity along the horizontal axis and along the vertical axis; the cell where the two meet holds the final answer.
+|                                |       |             |                                |             |       |
+| :----------------------------: | :---: | :---------: | :----------------------------: | :---------: | :---: |
+|                                |  ...  | *{cos_sim}* | ***{highest_cos_sim_x_axis}*** | *{cos_sim}* |  ...  |
+|              ...               |       |             |                                |             |       |
+|          *{cos_sim}*           |       |     156     |              161               |     177     |       |
+| ***{highest_cos_sim_y_axis}*** |       |     159     |           ***165***            |     182     |       |
+|          *{cos_sim}*           |       |     162     |              169               |     187     |       |
+|              ...               |       |             |                                |             |       |
+Output the answer (cell value): "165"
+## Demo
+https://github.com/baobuiquang/nlqna-chatbot/assets/60503568/57621579-6a58-4638-9644-b4e482ac975e
+## Instructions (Recommended workflow)
+### Installation
+Prerequisites:
+* [Python 3](https://www.python.org/downloads/)
+* [Git](https://git-scm.com/downloads)
+Clone [this repository](https://github.com/baobuiquang/nlqna-chatbot):
+```
+git clone https://github.com/baobuiquang/nlqna-chatbot.git
+cd nlqna-chatbot
+```
+Create virtual environment:
+```
+python -m venv venv
+```
+Activate virtual environment (Windows; on macOS/Linux, run `source venv/bin/activate` instead):
+```
+venv\Scripts\activate
+```
+Upgrade `pip`:
+```
+python.exe -m pip install --upgrade pip
+```
+Install [required packages/libraries](https://github.com/baobuiquang/nlqna-chatbot/blob/main/requirements.txt):
+```
+pip install -r requirements.txt
+```
+Deactivate virtual environment:
+```
+deactivate
+```
+### Start chatbot
+Activate virtual environment:
+```
+venv\Scripts\activate
+```
+Run chatbot app:
+```
+python app.py
+```
+Wait until the terminal prints something like this:
+```
+...\nlqna-chatbot> python app.py
+Running on local URL:  http://127.0.0.1:7860
+To create a public link, set `share=True` in `launch()`.
+```
+The chatbot can now be accessed at [http://127.0.0.1:7860](http://127.0.0.1:7860).
+### Stop chatbot
+Press `Ctrl + C` in the terminal to close the chatbot server.
+Deactivate virtual environment:
+```
+deactivate
+```

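Note on the README's Steps 2 to 4: the lookup it describes can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the code in `app.py`: the vectors are hand-made toy values rather than model output, and the helper names `mean_pool` and `best_match` are hypothetical. It shows masked mean pooling, the cosine similarity formula from the README, and picking the cell where the best-matching row label and column label meet.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padded positions (mask == 0)."""
    mask = attention_mask[:, None].astype(float)
    return (token_embeddings * mask).sum(axis=0) / max(mask.sum(), 1e-9)

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(query, candidates):
    """Return (index, score) of the candidate most similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    best = int(np.argmax(scores))
    return best, scores[best]

# Mean pooling demo: the third token is padding and must not affect the mean.
tokens = np.array([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]])
pooled = mean_pool(tokens, np.array([1, 1, 0]))

# Toy 3-dimensional "embeddings" (hypothetical values, not model output).
question = np.array([0.9, 0.1, 0.8])
row_label_embeddings = [          # one vector per row label
    np.array([0.1, 0.9, 0.1]),
    np.array([0.8, 0.2, 0.9]),
    np.array([0.1, 0.1, 0.9]),
]
column_label_embeddings = [       # one vector per column label (dates)
    np.array([0.1, 0.8, 0.3]),
    np.array([0.9, 0.0, 0.7]),
    np.array([0.0, 0.2, 0.9]),
]
cells = np.array([                # values from the README's sample table
    [156, 161, 177],
    [159, 165, 182],
    [162, 169, 187],
])

row, row_score = best_match(question, row_label_embeddings)
col, col_score = best_match(question, column_label_embeddings)
answer = cells[row, col]
print(answer)
```

With these toy vectors, the second row label and the second column label score highest, so the selected cell is the one the README marks as the answer. In the real app the scores come from the embedding model, and a score below 0.85 on either axis triggers the low-similarity warning seen in `chatbot_mechanism`.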
app.py CHANGED

@@ -1,33 +1,40 @@
 # !wget -nc https://raw.githubusercontent.com/baobuiquang/datasets/main/sample.xlsx >& /dev/null
 # !pip install gradio==4.21.0 >& /dev/null
-import gradio as gr
 import pandas as pd
 import numpy as np
 import torch
 from transformers import AutoTokenizer, AutoModel
-from numpy.linalg import norm
 from datetime import datetime
-from numpy import dot
-pd.options.mode.chained_assignment = None  # default='warn'
-FILE_NAME = "sample.xlsx"
 df_map = pd.read_excel(FILE_NAME, header=None, sheet_name=None)
 df_map_sheet_names = pd.ExcelFile(FILE_NAME).sheet_names
 MODEL_NAME = "baobuiquang/XLM-ROBERTA-ME5-BASE"
-# MODEL_NAME = "intfloat/multilingual-e5-base"                                  # 10/10
-# MODEL_NAME = "keepitreal/vietnamese-sbert"                                    # 9/10
-# MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"                         # 9/10
-# MODEL_NAME = "BAAI/bge-m3"                                                    # 9/10
 tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
 model = AutoModel.from_pretrained(MODEL_NAME)
-#Mean Pooling - Take attention mask into account for correct averaging
-def mean_pooling(model_output, attention_mask):
-    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
-    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
-    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
 # List of Texts -> List of Embeddings
 def texts_to_embeddings(list_of_texts):
@@ -38,18 +45,16 @@ def texts_to_embeddings(list_of_texts):
     list_of_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
     return list_of_embeddings
-# Text -> Embedding
-def text_to_embedding(text):
-    lower_text = text.lower() # Lowercasing
-    encoded_input = tokenizer(lower_text, padding=True, truncation=True, return_tensors='pt')
-    with torch.no_grad():
-        model_output = model(**encoded_input)
-    embedding = mean_pooling(model_output, encoded_input['attention_mask'])
-    return embedding[0]
 # Cosine Similarity between 2 embeddings
 def cosine_similarity(a, b):
-    return dot(a, b)/(norm(a)*norm(b))
 # Find index of the max similarity when comparing an embedding to a list
 def similarity(my_embedding, list_of_embeddings):
@@ -64,7 +69,11 @@ def similarity(my_embedding, list_of_embeddings):
             max_sim_index = i
     return {"max_index": max_sim_index, "max": max_sim, "list": list_of_sim}
-# preprocessed_df_map ==========================================================
 preprocessed_df_map = []
@@ -78,23 +87,29 @@ for sheet_name in df_map_sheet_names:
     new_header = []
     for e in df.loc[header_position]:
         if isinstance(e, datetime):
-            new_header.append(e.strftime("ngày %-d tháng %-m năm %Y %-d/%-m/%Y %d/%m/%Y"))
         else:
             new_header.append(e)
     df = df.rename(columns = dict(zip(df.columns, new_header)))
     df = df.iloc[header_position+1:]
-    # Preprocess column "#" values
-    df['#'] = df['#'].replace(to_replace = r'^\d+(\.\d+)?$', value = np.nan, regex=True)
-    df['#'] = df['#'].fillna(method = 'ffill')
-    df = df.dropna(thresh = df.shape[1] * 0.25, axis = 0) # Keep rows that have at least 25% values are not NaN
-    df = df.dropna(thresh = df.shape[1] * 0.25, axis = 1) # Keep cols that have at least 25% values are not NaN
-    df = df.rename(columns={'#': 'Nhóm chỉ số'})
-    # Move column "#" to the end
-    columns = list(df.columns)
-    columns.append(columns.pop(0))
-    df = df.reindex(columns=columns)
     # General Preprocess
     df = df.reset_index(drop=True)
@@ -104,7 +119,11 @@ for sheet_name in df_map_sheet_names:
     # Return the preprocessed sheet
     preprocessed_df_map.append(df)
-# embeddings_map ===============================================================
 x_list_embeddings_map = []
 y_list_embeddings_map = []
@@ -113,6 +132,7 @@ for i in range(len(preprocessed_df_map)):
     df = preprocessed_df_map[i]
     x_list = list(df['Tên chỉ số'])
     y_list = list(df.columns)
@@ -124,13 +144,8 @@ for i in range(len(preprocessed_df_map)):
     x_list_embeddings_map.append(x_list_embeddings)
     y_list_embeddings_map.append(y_list_embeddings)
-# ==============================================================================
-# preprocessed_df_map:
-# - A list of dataframes (preprocessed), each dataframe contains data from 1 sheet from the XLSX file
-# x/y_list_embeddings_map:
-# - A list of pre-calculated embeddings (vectors) of x/y axis in the corresponding dataframe in the `preprocessed_df_map`
 def chatbot_mechanism(message, history, additional_input_1):
     # Clarify namings
@@ -153,8 +168,12 @@ def chatbot_mechanism(message, history, additional_input_1):
     if x_score < 0.85 or y_score < 0.85:
         eval_text = "\n⚠️ Low Cosine Similarity ⚠️"
     # Cell value
-    cell_value = df.loc[x_index][y_index]
-    return f"**{cell_value}**\n<div style='color: gray; font-size: 80%; font-family: courier, monospace;'>[x={str(round(x_score,2))}, y={str(round(y_score,2))}]{eval_text}</div>"
 textbox_input = gr.Textbox(
     label = "Câu hỏi",
@@ -211,11 +230,11 @@ with gr.Blocks(
                 label = 'Câu hỏi ví dụ (Dữ liệu "Tư pháp")',
                 examples_per_page = 100,
                 examples = [
-                    "Tổng số hồ sơ chứng thực bản sao từ bản chính tới ngày 10/1/2024 là bao nhiêu?", # 100
                     "15 tháng 1 năm 2024, hãy tìm dữ liệu tổng số hồ sơ chứng thực hợp đồng, giao dịch.", # 219
-                    "Tổng số hồ sơ chứng thực chữ ký vào ngày 12 tháng 1 năm 2024 là bao nhiêu?", # 165
-                    "Có bao nhiêu HS chứng thực việc sửa đổi, bổ sung, hủy bỏ ngày 14/01/2024?", # 194
-                    "Tính đến ngày 11 tháng 1, 2024, số hồ sơ đăng ký kết hôn là bao nhiêu?", # 177
                 ],
                 inputs = [textbox_input],
             )
@@ -229,10 +248,10 @@ with gr.Blocks(
                 examples_per_page = 100,
                 examples = [
                     "Số vụ phạm tội công nghệ cao ngày 19 tháng 3 năm 2024 là bao nhiêu?", # 121
-                    "Tới ngày 20/3/2024, có mấy vụ án đặc biệt nghiêm trọng?", # 208
-                    "Ngày 22 tháng 3 năm 2024, có bao nhiêu người chết do TNGT", # 273
-                    "Có bao nhiêu vụ cháy cho đến ngày 24/03/2024?", # 437
-                    "Tìm thông tin số vụ tai nạn giao thông tại ngày 18/3 năm 2024.", # 104
                 ],
                 inputs = [textbox_input],
             )
@@ -242,4 +261,4 @@ with gr.Blocks(
                 """
             )
-app.launch(debug = False)

 # !wget -nc https://raw.githubusercontent.com/baobuiquang/datasets/main/sample.xlsx >& /dev/null
 # !pip install gradio==4.21.0 >& /dev/null
+# ==============================
+# ========== PACKAGES ==========
+import gradio as gr # gradio==4.21.0
 import pandas as pd
 import numpy as np
 import torch
+import time
 from transformers import AutoTokenizer, AutoModel
 from datetime import datetime
+# pd.options.mode.chained_assignment = None  # default='warn'
+# ===========================
+# ========== FILES ==========
+FILE_NAME = "data/sample.xlsx"
 df_map = pd.read_excel(FILE_NAME, header=None, sheet_name=None)
 df_map_sheet_names = pd.ExcelFile(FILE_NAME).sheet_names
+# ============================
+# ========== MODELS ==========
 MODEL_NAME = "baobuiquang/XLM-ROBERTA-ME5-BASE"
 tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
 model = AutoModel.from_pretrained(MODEL_NAME)
+# ===============================
+# ========== FUNCTIONS ==========
+# Text -> Embedding
+def text_to_embedding(text):
+    lower_text = text.lower() # Lowercasing
+    encoded_input = tokenizer(lower_text, padding=True, truncation=True, return_tensors='pt')
+    with torch.no_grad():
+        model_output = model(**encoded_input)
+    embedding = mean_pooling(model_output, encoded_input['attention_mask'])
+    return embedding[0]
 # List of Texts -> List of Embeddings
 def texts_to_embeddings(list_of_texts):
     list_of_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
     return list_of_embeddings
+# Mean Pooling
+# - Take attention mask into account for correct averaging
+def mean_pooling(model_output, attention_mask):
+    token_embeddings = model_output[0] # First element of model_output contains all token embeddings
+    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
 # Cosine Similarity between 2 embeddings
 def cosine_similarity(a, b):
+    return np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))
 # Find index of the max similarity when comparing an embedding to a list
 def similarity(my_embedding, list_of_embeddings):
             max_sim_index = i
     return {"max_index": max_sim_index, "max": max_sim, "list": list_of_sim}
+# ===================================
+# ========== PREPROCESSING ==========
+# preprocessed_df_map ----------------------------------------------------------
+# - A list of dataframes (preprocessed), each dataframe contains data from 1 sheet from the XLSX file
 preprocessed_df_map = []
     new_header = []
     for e in df.loc[header_position]:
         if isinstance(e, datetime):
+            new_header.append(
+                f"\
+                ngày {e.strftime('%d').lstrip('0')} tháng {e.strftime('%m').lstrip('0')} năm {e.strftime('%Y')} \
+                {e.strftime('%d').lstrip('0')}/{e.strftime('%m').lstrip('0')}/{e.strftime('%Y')} \
+                {e.strftime('%d')}/{e.strftime('%m')}/{e.strftime('%Y')} \
+                "
+            )
         else:
             new_header.append(e)
     df = df.rename(columns = dict(zip(df.columns, new_header)))
     df = df.iloc[header_position+1:]
+    # # Preprocess column "#" values
+    # df['#'] = df['#'].replace(to_replace = r'^\d+(\.\d+)?$', value = np.nan, regex=True)
+    # df['#'] = df['#'].fillna(method = 'ffill')
+    # df = df.dropna(thresh = df.shape[1] * 0.25, axis = 0) # Keep rows that have at least 25% values are not NaN
+    # df = df.dropna(thresh = df.shape[1] * 0.25, axis = 1) # Keep cols that have at least 25% values are not NaN
+    # df = df.rename(columns={'#': 'Nhóm chỉ số'})
+    # # Move column "#" to the end
+    # columns = list(df.columns)
+    # columns.append(columns.pop(0))
+    # df = df.reindex(columns=columns)
     # General Preprocess
     df = df.reset_index(drop=True)
     # Return the preprocessed sheet
     preprocessed_df_map.append(df)
+# ========================================
+# ========== FEATURE EXTRACTION ==========
+# embeddings_map ---------------------------------------------------------------
+# - A list of pre-calculated embeddings (vectors) of x/y axis in the corresponding dataframe in the `preprocessed_df_map`
 x_list_embeddings_map = []
 y_list_embeddings_map = []
     df = preprocessed_df_map[i]
+    # HARDCODE
     x_list = list(df['Tên chỉ số'])
     y_list = list(df.columns)
     x_list_embeddings_map.append(x_list_embeddings)
     y_list_embeddings_map.append(y_list_embeddings)
+# ==========================
+# ========== MAIN ==========
 def chatbot_mechanism(message, history, additional_input_1):
     # Clarify namings
     if x_score < 0.85 or y_score < 0.85:
         eval_text = "\n⚠️ Low Cosine Similarity ⚠️"
     # Cell value
+    cell_value = df.iloc[x_index, y_index]
+    final_output_message = f"**{cell_value}**\n<div style='color: gray; font-size: 80%; font-family: courier, monospace;'>[x={str(round(x_score,2))}, y={str(round(y_score,2))}]{eval_text}</div>"
+    return final_output_message
+    # for i in range(len(final_output_message)):
+    #     time.sleep(0.1)
+    #     yield final_output_message[: i+1]
 textbox_input = gr.Textbox(
     label = "Câu hỏi",
                 label = 'Câu hỏi ví dụ (Dữ liệu "Tư pháp")',
                 examples_per_page = 100,
                 examples = [
+                    "Tổng số hồ sơ chứng thực bản sao từ bản chính tới ngày 10/1/2024 là bao nhiêu?",     # 100
                     "15 tháng 1 năm 2024, hãy tìm dữ liệu tổng số hồ sơ chứng thực hợp đồng, giao dịch.", # 219
+                    "Tổng số hồ sơ chứng thực chữ ký vào ngày 12 tháng 1 năm 2024 là bao nhiêu?",         # 165
+                    "Có bao nhiêu HS chứng thực việc sửa đổi, bổ sung, hủy bỏ ngày 14/01/2024?",          # 194
+                    "Tính đến ngày 11 tháng 1, 2024, số hồ sơ đăng ký kết hôn là bao nhiêu?",             # 177
                 ],
                 inputs = [textbox_input],
             )
                 examples_per_page = 100,
                 examples = [
                     "Số vụ phạm tội công nghệ cao ngày 19 tháng 3 năm 2024 là bao nhiêu?", # 121
+                    "Tới ngày 20/3/2024, có mấy vụ án đặc biệt nghiêm trọng?",             # 208
+                    "Ngày 22 tháng 3 năm 2024, có bao nhiêu người chết do TNGT",           # 273
+                    "Có bao nhiêu vụ cháy cho đến ngày 24/03/2024?",                       # 437
+                    "Tìm thông tin số vụ tai nạn giao thông tại ngày 18/3 năm 2024.",      # 104
                 ],
                 inputs = [textbox_input],
             )
                 """
             )
+app.launch(debug = False, share = False)

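Note on the date header change in `app.py`: the removed line relied on `%-d` and `%-m`, which are platform-specific `strftime` extensions and are not supported on Windows; that is presumably why the commit switched to `strftime(...).lstrip('0')`. A possible simplification, sketched below, reads the `day`, `month`, and `year` attributes directly. The function name `date_header_label` is hypothetical and not part of the commit.

```python
from datetime import datetime

def date_header_label(e: datetime) -> str:
    """Build one header label containing three spellings of the same date."""
    return (
        f"ngày {e.day} tháng {e.month} năm {e.year} "  # long Vietnamese form
        f"{e.day}/{e.month}/{e.year} "                 # no zero padding
        f"{e:%d/%m/%Y}"                                # zero-padded
    )

label = date_header_label(datetime(2024, 1, 12))
print(label)  # ngày 12 tháng 1 năm 2024 12/1/2024 12/01/2024
```

One difference to be aware of: the f-string in the commit is continued across lines with backslashes inside the quotes, so the indentation of each continued line becomes part of the label. The sketch produces single spaces instead, which may shift the resulting embeddings slightly.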
data/sample.xlsx ADDED

Binary file (56.6 kB).

requirements.txt CHANGED

Binary files a/requirements.txt and b/requirements.txt differ