baobuiquang committed on
Commit
73c471f
1 Parent(s): f249c8e

Upload README.md

Files changed (1)
  1. README.md +193 -0
README.md ADDED
@@ -0,0 +1,193 @@
1
+ ---
2
+ title: Chatbot
3
+ emoji: 💬
4
+ colorFrom: gray
5
+ colorTo: gray
6
+ sdk: gradio
7
+ sdk_version: 4.21.0
8
+ python_version: 3.12.0
9
+ app_file: app.py
10
+ pinned: false
11
+ ---
12
+
13
+ # Natural Language Q&A Chatbot
14
+
15
+ ## Problem
16
+
17
+ Input:
18
+ * `data` - Example: `data/sample.xlsx`
19
+ * `question` - Example: "Tổng số hồ sơ chứng thực chữ ký vào ngày 12 tháng 1 năm 2024 là bao nhiêu?" (English: "What is the total number of signature authentication records on 12 January 2024?")
20
+
21
+ Expected output:
22
+ * `answer` - Example: "165"
23
+
24
+ ## Solution Approach
25
+
26
+ ### Preprocessing `data`:
27
+
28
+ * Raw Data (`.XLSX`)
29
+ * ↳ Raw Dataframe (`Pandas DF`)
30
+ * ↳ Preprocessed Dataframe (`Pandas DF`)
31
+
32
+
33
+ ### Feature Extraction from `data` and `question`:
34
+
35
+ * Preprocessed Dataframe Data / Question (`String`)
36
+ * ↳ Embedding (`PyTorch Tensor`)
37
+
38
+ #### Model:
39
+ * Stable Model: [HF/XLM-ROBERTA-ME5-BASE](https://huggingface.co/baobuiquang/XLM-ROBERTA-ME5-BASE) (License: [MIT License](https://choosealicense.com/licenses/mit/))
40
+ * Forked from: [HF/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) (License: [MIT License](https://choosealicense.com/licenses/mit/))
41
+ * Initialized from [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) (License: [MIT License](https://choosealicense.com/licenses/mit/))
42
+
43
+
44
+ ### Feature Map Downsampling Method: [Mean Pooling](https://paperswithcode.com/method/average-pooling)
45
+
46
+ * Reduces computational cost -> Faster chatbot (Speed)
47
+ * Helps prevent overfitting -> Better answers (Accuracy)
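
To make the pooling step concrete, here is a minimal sketch of mean pooling over token embeddings. It is an illustration under assumptions, not the repository's code: the function name `mean_pool`, the plain-list inputs, and the use of an attention mask to skip padding tokens are all hypothetical choices made for this example. The README describes embeddings as PyTorch tensors; lists are used here only to keep the sketch dependency-free.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0).

    token_embeddings: list of equal-length float lists, one per token.
    attention_mask:   list of 0/1 flags, one per token (assumed input).
    """
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    # One fixed-size vector per sentence, regardless of token count.
    return [total / count for total in totals]


# Toy input: three tokens of dimension 2, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 3.0]
```

The result has the same size for any input length, which is what allows a question and a table header to be compared directly.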
48
+
49
+ ### Measurement: [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity)
50
+ * Input:
51
+ * Embedding `a` (`PyTorch Tensor`)
52
+ * Embedding `b` (`PyTorch Tensor`)
53
+ * Output:
54
+ * Cosine Similarity: The cosine of the angle between the 2 non-zero vectors `a` and `b` in space.
55
+ ```
56
+ import numpy as np
+ cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
57
+ ```
58
+
59
+ ### Interactive UI
60
+
61
+ The chatbot's web UI is currently built with [gradio](https://github.com/gradio-app/gradio) (License: [Apache-2.0 License](https://choosealicense.com/licenses/apache-2.0/)).
62
+
63
+ ## Example and Rough Explanation
64
+
65
+ Sample data: [sample.xlsx](https://github.com/baobuiquang/nlqna-chatbot/blob/main/data/sample.xlsx)
66
+
67
+ ### Step 1. Input:
68
+ * `question` = "Tổng số hồ sơ chứng thực chữ ký vào ngày 12 tháng 1 năm 2024 là bao nhiêu?" (English: "What is the total number of signature authentication records on 12 January 2024?")
69
+ * `data` = `data/sample.xlsx`
70
+
71
+ | | | | | | |
72
+ | :-----------------------------------------------------: | :---: | :------------: | :------------: | :------------: | :---: |
73
+ | | ... | **11/01/2024** | **12/01/2024** | **13/01/2024** | ... |
74
+ | ... | | | | | |
75
+ | **Tổng số HS chứng thực hợp đồng, giao dịch** | | 156 | 161 | 177 | |
76
+ | **Tổng số HS chứng thực chữ ký** | | 159 | 165 | 182 | |
77
+ | **Tổng số HS chứng thực việc sửa đổi, bổ sung, hủy bỏ** | | 162 | 169 | 187 | |
78
+ | ... | | | | | |
79
+
80
+ ### Step 2. Feature Extraction:
81
+
82
+ * `question` -> `question_embedding` (`PyTorch Tensor`)
83
+ * `data` -> `data_embeddings` (Map of `PyTorch Tensors`)
84
+
85
+ | | | | | | |
86
+ | :-----------------: | :---: | :-----------------: | :-----------------: | :-----------------: | :---: |
87
+ | | ... | ***\<PT Tensor\>*** | ***\<PT Tensor\>*** | ***\<PT Tensor\>*** | ... |
88
+ | ... | | | | | |
89
+ | ***\<PT Tensor\>*** | | 156 | 161 | 177 | |
90
+ | ***\<PT Tensor\>*** | | 159 | 165 | 182 | |
91
+ | ***\<PT Tensor\>*** | | 162 | 169 | 187 | |
92
+ | ... | | | | | |
93
+
94
+ ### Step 3. Measurement Calculation:
95
+
96
+ Calculate the cosine similarity between `question_embedding` and each embedding in `data_embeddings`.
97
+
98
+ | | | | | | |
99
+ | :-------------: | :---: | :-------------: | :-------------: | :-------------: | :---: |
100
+ | | ... | ***{cos_sim}*** | ***{cos_sim}*** | ***{cos_sim}*** | ... |
101
+ | ... | | | | | |
102
+ | ***{cos_sim}*** | | 156 | 161 | 177 | |
103
+ | ***{cos_sim}*** | | 159 | 165 | 182 | |
104
+ | ***{cos_sim}*** | | 162 | 169 | 187 | |
105
+ | ... | | | | | |
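
The measurement step can be sketched as scoring every header embedding against the question embedding. This is a rough illustration only: the three-dimensional vectors and the labels `header_a`, `header_b`, `header_c` are made up for the example, and `cosine_similarity` is a dependency-free stand-in for the NumPy formula shown earlier, not the repository's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings; real ones come from the embedding model.
question_embedding = [1.0, 0.0, 1.0]
data_embeddings = {
    "header_a": [1.0, 0.0, 1.0],
    "header_b": [0.0, 1.0, 0.0],
    "header_c": [1.0, 1.0, 0.0],
}

# One similarity score per header, keyed by the header label.
similarities = {
    label: cosine_similarity(question_embedding, embedding)
    for label, embedding in data_embeddings.items()
}
best_label = max(similarities, key=similarities.get)
print(best_label)  # header_a
```

Scores close to 1 mean the header points in nearly the same direction as the question; scores near 0 mean the two are unrelated.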
106
+
107
+ ### Step 4. Output:
108
+
109
+ Find the highest cosine similarity along the horizontal axis (column headers) and along the vertical axis (row headers). The cell where that column and that row intersect holds the final answer.
110
+
111
+ | | | | | | |
112
+ | :----------------------------: | :---: | :---------: | :----------------------------: | :---------: | :---: |
113
+ | | ... | *{cos_sim}* | ***{highest_cos_sim_x_axis}*** | *{cos_sim}* | ... |
114
+ | ... | | | | | |
115
+ | *{cos_sim}* | | 156 | 161 | 177 | |
116
+ | ***{highest_cos_sim_y_axis}*** | | 159 | ***165*** | 182 | |
117
+ | *{cos_sim}* | | 162 | 169 | 187 | |
118
+ | ... | | | | | |
119
+
120
+ Output the answer (cell value): "165"
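
The selection in Step 4 can be sketched as two independent arg-max lookups followed by a cell read. The cell values and labels below are taken from the sample table above; the similarity scores are invented for illustration, and the helper names `argmax` and `pick_cell` are hypothetical rather than names from the repository.

```python
# Labels and values copied from the sample table in this README.
column_labels = ["11/01/2024", "12/01/2024", "13/01/2024"]
row_labels = [
    "Tổng số HS chứng thực hợp đồng, giao dịch",
    "Tổng số HS chứng thực chữ ký",
    "Tổng số HS chứng thực việc sửa đổi, bổ sung, hủy bỏ",
]
cells = [
    [156, 161, 177],
    [159, 165, 182],
    [162, 169, 187],
]

# Made-up similarity scores between the question and each header.
column_scores = [0.81, 0.93, 0.79]
row_scores = [0.84, 0.95, 0.80]


def argmax(scores):
    """Index of the largest score (first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def pick_cell(cells, row_scores, column_scores):
    """Return the cell at the best-matching row and best-matching column."""
    return cells[argmax(row_scores)][argmax(column_scores)]


answer = str(pick_cell(cells, row_scores, column_scores))
print(answer)  # 165
```

Because the row and the column are chosen separately, a weak match on one axis still produces an answer; thresholding low scores would be one possible safeguard, though the README does not say whether the chatbot does this.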
121
+
122
+ ## Demo
123
+
124
+ https://github.com/baobuiquang/nlqna-chatbot/assets/60503568/57621579-6a58-4638-9644-b4e482ac975e
125
+
126
+ ## Instructions (Recommended workflow)
127
+
128
+ ### Installation
129
+
130
+ Prerequisites:
131
+ * [Python 3](https://www.python.org/downloads/)
132
+ * [Git](https://git-scm.com/downloads)
133
+
134
+ Clone [this repository](https://github.com/baobuiquang/nlqna-chatbot):
135
+ ```
136
+ git clone https://github.com/baobuiquang/nlqna-chatbot.git
137
+ cd nlqna-chatbot
138
+ ```
139
+
140
+ Create virtual environment:
141
+ ```
142
+ python -m venv venv
143
+ ```
144
+
145
+ Activate virtual environment (Windows; on macOS/Linux use `source venv/bin/activate`):
146
+ ```
147
+ venv\Scripts\activate
148
+ ```
149
+
150
+ Upgrade `pip`:
151
+ ```
152
+ python -m pip install --upgrade pip
153
+ ```
154
+
155
+ Install [required packages/libraries](https://github.com/baobuiquang/nlqna-chatbot/blob/main/requirements.txt):
156
+ ```
157
+ pip install -r requirements.txt
158
+ ```
159
+
160
+ Deactivate virtual environment:
161
+ ```
162
+ deactivate
163
+ ```
164
+
165
+ ### Start chatbot
166
+
167
+ Activate virtual environment (Windows; on macOS/Linux use `source venv/bin/activate`):
168
+ ```
169
+ venv\Scripts\activate
170
+ ```
171
+
172
+ Run chatbot app:
173
+ ```
174
+ python app.py
175
+ ```
176
+
177
+ Wait until the terminal prints something like this:
178
+ ```
179
+ ...\nlqna-chatbot> python app.py
180
+ Running on local URL: http://127.0.0.1:7860
181
+ To create a public link, set `share=True` in `launch()`.
182
+ ```
183
+
184
+ The chatbot can now be accessed at [http://127.0.0.1:7860](http://127.0.0.1:7860).
185
+
186
+ ### Stop chatbot
187
+
188
+ Press `Ctrl + C` in the terminal to close the chatbot server.
189
+
190
+ Deactivate virtual environment:
191
+ ```
192
+ deactivate
193
+ ```