Boardpac/theekshanas committed on
Commit 39de480
1 Parent(s): 467720e

upload files again

.env ADDED
@@ -0,0 +1,24 @@
+ #embeddings
+ EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
+ EMBEDDING_CHUNK_SIZE=1000
+ EMBEDDING_CHUNK_OVERLAP=150
+
+ #gpt4all
+ GPT4ALL_MODEL_PATH=models/ggml-gpt4all-j-v1.3-groovy.bin
+ MODEL_N_CTX=1000
+ MODEL_N_BATCH=8
+ TARGET_SOURCE_CHUNKS=4
+
+ #API token keys (use your own keys; never commit real secrets)
+ HUGGINGFACEHUB_API_TOKEN=<your-huggingfacehub-api-token>
+ OPENAI_API_KEY=<your-openai-api-key>
+
+ #api app
+ APP_HOST=127.0.0.1
+ APP_PORT=8000
+
+ #model verbose
+ VERBOSE=True
+
+ ENABLE_HUGGINGFACE_HUB_MODELS=True
+ ENABLE_OPENAI_API_MODELS=True
.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,5 @@
+ models/
+ *.ipynb
+
+ CBSL
+ faiss_index/
LICENSE ADDED
@@ -0,0 +1,201 @@
+                                  Apache License
+                            Version 2.0, January 2004
+                         http://www.apache.org/licenses/
+
+    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+    1. Definitions.
+
+       "License" shall mean the terms and conditions for use, reproduction,
+       and distribution as defined by Sections 1 through 9 of this document.
+
+       "Licensor" shall mean the copyright owner or entity authorized by
+       the copyright owner that is granting the License.
+
+       "Legal Entity" shall mean the union of the acting entity and all
+       other entities that control, are controlled by, or are under common
+       control with that entity. For the purposes of this definition,
+       "control" means (i) the power, direct or indirect, to cause the
+       direction or management of such entity, whether by contract or
+       otherwise, or (ii) ownership of fifty percent (50%) or more of the
+       outstanding shares, or (iii) beneficial ownership of such entity.
+
+       "You" (or "Your") shall mean an individual or Legal Entity
+       exercising permissions granted by this License.
+
+       "Source" form shall mean the preferred form for making modifications,
+       including but not limited to software source code, documentation
+       source, and configuration files.
+
+       "Object" form shall mean any form resulting from mechanical
+       transformation or translation of a Source form, including but
+       not limited to compiled object code, generated documentation,
+       and conversions to other media types.
+
+       "Work" shall mean the work of authorship, whether in Source or
+       Object form, made available under the License, as indicated by a
+       copyright notice that is included in or attached to the work
+       (an example is provided in the Appendix below).
+
+       "Derivative Works" shall mean any work, whether in Source or Object
+       form, that is based on (or derived from) the Work and for which the
+       editorial revisions, annotations, elaborations, or other modifications
+       represent, as a whole, an original work of authorship. For the purposes
+       of this License, Derivative Works shall not include works that remain
+       separable from, or merely link (or bind by name) to the interfaces of,
+       the Work and Derivative Works thereof.
+
+       "Contribution" shall mean any work of authorship, including
+       the original version of the Work and any modifications or additions
+       to that Work or Derivative Works thereof, that is intentionally
+       submitted to Licensor for inclusion in the Work by the copyright owner
+       or by an individual or Legal Entity authorized to submit on behalf of
+       the copyright owner. For the purposes of this definition, "submitted"
+       means any form of electronic, verbal, or written communication sent
+       to the Licensor or its representatives, including but not limited to
+       communication on electronic mailing lists, source code control systems,
+       and issue tracking systems that are managed by, or on behalf of, the
+       Licensor for the purpose of discussing and improving the Work, but
+       excluding communication that is conspicuously marked or otherwise
+       designated in writing by the copyright owner as "Not a Contribution."
+
+       "Contributor" shall mean Licensor and any individual or Legal Entity
+       on behalf of whom a Contribution has been received by Licensor and
+       subsequently incorporated within the Work.
+
+    2. Grant of Copyright License. Subject to the terms and conditions of
+       this License, each Contributor hereby grants to You a perpetual,
+       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+       copyright license to reproduce, prepare Derivative Works of,
+       publicly display, publicly perform, sublicense, and distribute the
+       Work and such Derivative Works in Source or Object form.
+
+    3. Grant of Patent License. Subject to the terms and conditions of
+       this License, each Contributor hereby grants to You a perpetual,
+       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+       (except as stated in this section) patent license to make, have made,
+       use, offer to sell, sell, import, and otherwise transfer the Work,
+       where such license applies only to those patent claims licensable
+       by such Contributor that are necessarily infringed by their
+       Contribution(s) alone or by combination of their Contribution(s)
+       with the Work to which such Contribution(s) was submitted. If You
+       institute patent litigation against any entity (including a
+       cross-claim or counterclaim in a lawsuit) alleging that the Work
+       or a Contribution incorporated within the Work constitutes direct
+       or contributory patent infringement, then any patent licenses
+       granted to You under this License for that Work shall terminate
+       as of the date such litigation is filed.
+
+    4. Redistribution. You may reproduce and distribute copies of the
+       Work or Derivative Works thereof in any medium, with or without
+       modifications, and in Source or Object form, provided that You
+       meet the following conditions:
+
+       (a) You must give any other recipients of the Work or
+           Derivative Works a copy of this License; and
+
+       (b) You must cause any modified files to carry prominent notices
+           stating that You changed the files; and
+
+       (c) You must retain, in the Source form of any Derivative Works
+           that You distribute, all copyright, patent, trademark, and
+           attribution notices from the Source form of the Work,
+           excluding those notices that do not pertain to any part of
+           the Derivative Works; and
+
+       (d) If the Work includes a "NOTICE" text file as part of its
+           distribution, then any Derivative Works that You distribute must
+           include a readable copy of the attribution notices contained
+           within such NOTICE file, excluding those notices that do not
+           pertain to any part of the Derivative Works, in at least one
+           of the following places: within a NOTICE text file distributed
+           as part of the Derivative Works; within the Source form or
+           documentation, if provided along with the Derivative Works; or,
+           within a display generated by the Derivative Works, if and
+           wherever such third-party notices normally appear. The contents
+           of the NOTICE file are for informational purposes only and
+           do not modify the License. You may add Your own attribution
+           notices within Derivative Works that You distribute, alongside
+           or as an addendum to the NOTICE text from the Work, provided
+           that such additional attribution notices cannot be construed
+           as modifying the License.
+
+       You may add Your own copyright statement to Your modifications and
+       may provide additional or different license terms and conditions
+       for use, reproduction, or distribution of Your modifications, or
+       for any such Derivative Works as a whole, provided Your use,
+       reproduction, and distribution of the Work otherwise complies with
+       the conditions stated in this License.
+
+    5. Submission of Contributions. Unless You explicitly state otherwise,
+       any Contribution intentionally submitted for inclusion in the Work
+       by You to the Licensor shall be under the terms and conditions of
+       this License, without any additional terms or conditions.
+       Notwithstanding the above, nothing herein shall supersede or modify
+       the terms of any separate license agreement you may have executed
+       with Licensor regarding such Contributions.
+
+    6. Trademarks. This License does not grant permission to use the trade
+       names, trademarks, service marks, or product names of the Licensor,
+       except as required for reasonable and customary use in describing the
+       origin of the Work and reproducing the content of the NOTICE file.
+
+    7. Disclaimer of Warranty. Unless required by applicable law or
+       agreed to in writing, Licensor provides the Work (and each
+       Contributor provides its Contributions) on an "AS IS" BASIS,
+       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+       implied, including, without limitation, any warranties or conditions
+       of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+       PARTICULAR PURPOSE. You are solely responsible for determining the
+       appropriateness of using or redistributing the Work and assume any
+       risks associated with Your exercise of permissions under this License.
+
+    8. Limitation of Liability. In no event and under no legal theory,
+       whether in tort (including negligence), contract, or otherwise,
+       unless required by applicable law (such as deliberate and grossly
+       negligent acts) or agreed to in writing, shall any Contributor be
+       liable to You for damages, including any direct, indirect, special,
+       incidental, or consequential damages of any character arising as a
+       result of this License or out of the use or inability to use the
+       Work (including but not limited to damages for loss of goodwill,
+       work stoppage, computer failure or malfunction, or any and all
+       other commercial damages or losses), even if such Contributor
+       has been advised of the possibility of such damages.
+
+    9. Accepting Warranty or Additional Liability. While redistributing
+       the Work or Derivative Works thereof, You may choose to offer,
+       and charge a fee for, acceptance of support, warranty, indemnity,
+       or other liability obligations and/or rights consistent with this
+       License. However, in accepting such obligations, You may act only
+       on Your own behalf and on Your sole responsibility, not on behalf
+       of any other Contributor, and only if You agree to indemnify,
+       defend, and hold each Contributor harmless for any liability
+       incurred by, or claims asserted against, such Contributor by reason
+       of your accepting any such warranty or additional liability.
+
+    END OF TERMS AND CONDITIONS
+
+    APPENDIX: How to apply the Apache License to your work.
+
+       To apply the Apache License to your work, attach the following
+       boilerplate notice, with the fields enclosed by brackets "[]"
+       replaced with your own identifying information. (Don't include
+       the brackets!) The text should be enclosed in the appropriate
+       comment syntax for the file format. We also recommend that a
+       file or class name and description of purpose be included on the
+       same "printed page" as the copyright notice for easier
+       identification within third-party archives.
+
+    Copyright [yyyy] [name of copyright owner]
+
+    Licensed under the Apache License, Version 2.0 (the "License");
+    you may not use this file except in compliance with the License.
+    You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
README.md ADDED
@@ -0,0 +1,157 @@
+ ---
+ title: Boardpac Chat App Test
+ emoji: 😻
+ colorFrom: gray
+ colorTo: purple
+ sdk: streamlit
+ sdk_version: 1.26.0
+ app_file: app.py
+ pinned: false
+ license: apache-2.0
+ ---
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+ # privateGPT
+ Ask questions to your documents without an internet connection, using the power of LLMs. 100% private: no data leaves your execution environment at any point, and you can ingest documents and ask questions completely offline!
+
+ Built with [LangChain](https://github.com/hwchase17/langchain), [GPT4All](https://github.com/nomic-ai/gpt4all), [LlamaCpp](https://github.com/ggerganov/llama.cpp), [Chroma](https://www.trychroma.com/) and [SentenceTransformers](https://www.sbert.net/).
+
+ <img width="902" alt="demo" src="https://user-images.githubusercontent.com/721666/236942256-985801c9-25b9-48ef-80be-3acbb4575164.png">
+
+ ### How to run
+ ```shell
+ python -m streamlit run app.py
+ ```
+
+ # Environment Setup
+ In order to set your environment up to run the code here, first install all requirements:
+
+ ```shell
+ pip3 install -r requirements.txt
+ ```
+
+ Then, download the LLM model and place it in a directory of your choice:
+ - LLM: defaults to [ggml-gpt4all-j-v1.3-groovy.bin](https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin). If you prefer a different GPT4All-J compatible model, just download it and reference it in your `.env` file.
+
+ Copy the `example.env` template into `.env`
+ ```shell
+ cp example.env .env
+ ```
+
+ and edit the variables appropriately in the `.env` file.
+ ```
+ MODEL_TYPE: supports LlamaCpp or GPT4All
+ PERSIST_DIRECTORY: is the folder you want your vectorstore in
+ MODEL_PATH: Path to your GPT4All or LlamaCpp supported LLM
+ MODEL_N_CTX: Maximum token limit for the LLM model
+ MODEL_N_BATCH: Number of tokens in the prompt that are fed into the model at a time. Optimal value differs a lot depending on the model (8 works well for GPT4All, and 1024 is better for LlamaCpp)
+ EMBEDDINGS_MODEL_NAME: SentenceTransformers embeddings model name (see https://www.sbert.net/docs/pretrained_models.html)
+ TARGET_SOURCE_CHUNKS: The amount of chunks (sources) that will be used to answer a question
+ ```
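+
+ For reference, a minimal sketch of a filled-in `.env` using the defaults above (paths and model names are examples; adjust them to your setup):
+ ```shell
+ MODEL_TYPE=GPT4All
+ PERSIST_DIRECTORY=db
+ MODEL_PATH=models/ggml-gpt4all-j-v1.3-groovy.bin
+ MODEL_N_CTX=1000
+ MODEL_N_BATCH=8
+ EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
+ TARGET_SOURCE_CHUNKS=4
+ ```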
+
+ Note: because of the way `langchain` loads the `SentenceTransformers` embeddings, the first time you run the script it will require an internet connection to download the embeddings model itself.
+
+ ## Test dataset
+ This repo uses a [state of the union transcript](https://github.com/imartinez/privateGPT/blob/main/source_documents/state_of_the_union.txt) as an example.
+
+ ## Instructions for ingesting your own dataset
+
+ Put any and all your files into the `source_documents` directory.
+
+ The supported extensions are:
+
+ - `.csv`: CSV,
+ - `.docx`: Word Document,
+ - `.doc`: Word Document,
+ - `.enex`: EverNote,
+ - `.eml`: Email,
+ - `.epub`: EPub,
+ - `.html`: HTML File,
+ - `.md`: Markdown,
+ - `.msg`: Outlook Message,
+ - `.odt`: Open Document Text,
+ - `.pdf`: Portable Document Format (PDF),
+ - `.pptx`: PowerPoint Document,
+ - `.ppt`: PowerPoint Document,
+ - `.txt`: Text file (UTF-8)
+
+ Run the following command to ingest all the data.
+
+ ```shell
+ python ingest.py
+ ```
+
+ Output should look like this:
+
+ ```shell
+ Creating new vectorstore
+ Loading documents from source_documents
+ Loading new documents: 100%|██████████████████████| 1/1 [00:01<00:00,  1.73s/it]
+ Loaded 1 new documents from source_documents
+ Split into 90 chunks of text (max. 500 tokens each)
+ Creating embeddings. May take some minutes...
+ Using embedded DuckDB with persistence: data will be stored in: db
+ Ingestion complete! You can now run privateGPT.py to query your documents
+ ```
+
+ It will create a `db` folder containing the local vectorstore, which will take 20-30 seconds per document, depending on the size of the document.
+ You can ingest as many documents as you want, and all will be accumulated in the local embeddings database.
+ If you want to start from an empty database, delete the `db` folder.
+
+ Note: during the ingest process no data leaves your local environment. You could ingest without an internet connection, except for the first time you run the ingest script, when the embeddings model is downloaded.
+
+ ## Ask questions to your documents, locally!
+ In order to ask a question, run a command like:
+
+ ```shell
+ python privateGPT.py
+ ```
+
+ And wait for the script to require your input.
+
+ ```plaintext
+ > Enter a query:
+ ```
+
+ Hit enter. You'll need to wait 20-30 seconds (depending on your machine) while the LLM model consumes the prompt and prepares the answer. Once done, it will print the answer and the 4 sources it used as context from your documents; you can then ask another question without re-running the script, just wait for the prompt again.
+
+ Note: you could turn off your internet connection, and inference would still work. No data gets out of your local environment.
+
+ Type `exit` to finish the script.
+
+
+ ### CLI
+ The script also supports optional command-line arguments to modify its behavior. You can see a full list of these arguments by running the command ```python privateGPT.py --help``` in your terminal.
+
+
+ # How does it work?
+ By selecting the right local models and leveraging the power of `LangChain`, you can run the entire pipeline locally, without any data leaving your environment, and with reasonable performance.
+
+ - `ingest.py` uses `LangChain` tools to parse the documents and create embeddings locally using `HuggingFaceEmbeddings` (`SentenceTransformers`). It then stores the result in a local vector database using the `Chroma` vector store.
+ - `privateGPT.py` uses a local LLM based on `GPT4All-J` or `LlamaCpp` to understand questions and create answers. The context for the answers is extracted from the local vector store using a similarity search to locate the right piece of context from the docs (see the sketch after this list).
+ - The `GPT4All-J` wrapper was introduced in LangChain 0.0.162.
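+
+ As a condensed, illustrative sketch (not the exact script), the query side boils down to the same LangChain calls used in `qaPipeline.py` in this repo; the model path, persist directory, and `k` below are assumptions taken from the defaults above:
+
+ ```python
+ from langchain.chains import RetrievalQA
+ from langchain.embeddings import HuggingFaceEmbeddings
+ from langchain.llms import GPT4All
+ from langchain.vectorstores import Chroma
+
+ # Load the same embeddings used at ingest time, then open the persisted vectorstore
+ embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
+ db = Chroma(persist_directory="db", embedding_function=embeddings)
+
+ # Local GPT4All-J model; no data leaves the machine
+ llm = GPT4All(model="models/ggml-gpt4all-j-v1.3-groovy.bin", backend="gptj")
+
+ # "stuff" chain: the retrieved chunks are stuffed into the prompt as context
+ qa = RetrievalQA.from_chain_type(
+     llm=llm,
+     chain_type="stuff",
+     retriever=db.as_retriever(search_kwargs={"k": 4}),  # TARGET_SOURCE_CHUNKS
+     return_source_documents=True,
+ )
+
+ res = qa("What did the president say about inflation?")
+ print(res["result"])
+ ```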
+
+ # System Requirements
+
+ ## Python Version
+ To use this software, you must have Python 3.10 or later installed. Earlier versions of Python are not supported.
+
+ ## C++ Compiler
+ If you encounter an error while building a wheel during the `pip install` process, you may need to install a C++ compiler on your computer.
+
+ ### For Windows 10/11
+ To install a C++ compiler on Windows 10/11, follow these steps:
+
+ 1. Install Visual Studio 2022.
+ 2. Make sure the following components are selected:
+    * Universal Windows Platform development
+    * C++ CMake tools for Windows
+ 3. Download the MinGW installer from the [MinGW website](https://sourceforge.net/projects/mingw/).
+ 4. Run the installer and select the `gcc` component.
+
+ ## Mac Running Intel
+ When running a Mac with Intel hardware (not M1), you may run into _clang: error: the clang compiler does not support '-march=native'_ during pip install.
+
+ If so, set your archflags during pip install, e.g.: _ARCHFLAGS="-arch x86_64" pip3 install -r requirements.txt_
+
+ # Disclaimer
+ This is a test project to validate the feasibility of a fully private solution for question answering using LLMs and Vector embeddings. It is not production ready, and it is not meant to be used in production. The model selection is not optimized for performance, but for privacy; however, it is possible to use different models and vectorstores to improve performance.
__pycache__/chroma.cpython-311.pyc ADDED
Binary file (5.25 kB).
 
__pycache__/chromaDb.cpython-311.pyc ADDED
Binary file (5.25 kB).
 
__pycache__/config.cpython-311.pyc ADDED
Binary file (436 Bytes).
 
__pycache__/faissDb.cpython-311.pyc ADDED
Binary file (1.94 kB).
 
__pycache__/qaPipeline.cpython-311.pyc ADDED
Binary file (5.06 kB).
 
app.py ADDED
@@ -0,0 +1,179 @@
+ """
+ Python Backend API to chat with private data
+
+ 08/16/2023
+ D.M. Theekshana Samaradiwakara
+ """
+
+ import os
+ import streamlit as st
+ from streamlit.logger import get_logger
+
+ logger = get_logger(__name__)
+
+ from ui.htmlTemplates import css, bot_template, user_template, source_template
+ from config import MODELS, DATASETS
+
+ from qaPipeline import QAPipeline
+ from faissDb import create_faiss
+
+ # loads environment variables
+ from dotenv import load_dotenv
+ load_dotenv()
+
+ isHuggingFaceHubEnabled = os.environ.get('ENABLE_HUGGINGFACE_HUB_MODELS')
+ isOpenAiApiEnabled = os.environ.get('ENABLE_OPENAI_API_MODELS')
+
+ qaPipeline = QAPipeline()
+
+ def initialize_session_state():
+     # Initialise all session state variables with defaults.
+     # Store the "DEFAULT" keys (not their values): handle_userinput
+     # looks these up in MODELS/DATASETS.
+     SESSION_DEFAULTS = {
+         "model": "DEFAULT",
+         "dataset": "DEFAULT",
+         "chat_history": None,
+         "is_parameters_changed": False,
+         "show_source_files": False
+     }
+
+     for k, v in SESSION_DEFAULTS.items():
+         if k not in st.session_state:
+             st.session_state[k] = v
+
+
+ def main():
+
+     st.set_page_config(page_title="Chat with data",
+                        page_icon=":books:")
+     st.write(css, unsafe_allow_html=True)
+
+     initialize_session_state()
+
+     st.header("Chat with your own data:")
+
+     user_question = st.text_input(
+         "Ask a question about your documents:",
+         placeholder="enter question",
+     )
+     # Interactive questions and answers
+     if user_question:
+         with st.spinner("Processing"):
+             handle_userinput(user_question)
+
+     with st.sidebar:
+         st.subheader("Chat parameters")
+
+         chat_model = st.selectbox(
+             "Chat model",
+             MODELS,
+             key="chat_model",
+             help="Select the LLM model for the chat",
+             on_change=update_parameters_change,
+         )
+
+         # data_source = st.selectbox(
+         #     "dataset",
+         #     DATASETS,
+         #     key="data_source",
+         #     help="Select the private data_source for the chat",
+         #     on_change=update_parameters_change,
+         # )
+
+         st.session_state.dataset = "DEFAULT"
+
+         show_source = st.checkbox(
+             label="show source files",
+             key="show_source",
+             help="Select this to show relevant source files for the query",
+             on_change=update_parameters_change,
+         )
+
+         if st.session_state.is_parameters_changed:
+             if st.button("Update"):
+                 st.session_state.model = chat_model
+                 st.session_state.dataset = "DEFAULT"
+                 st.session_state.show_source_files = show_source
+                 st.success("done")
+                 st.session_state.is_parameters_changed = False
+                 return
+
+         st.markdown("\n")
+
+         if st.button("Create FAISS db"):
+             with st.spinner('creating faiss vector store'):
+                 create_faiss()
+             st.success('faiss saved')
+
+         st.markdown(
+             "### How to use\n"
+             "1. Select the chat model\n"  # noqa: E501
+             "2. Select \"show source files\" to show the source files related to the answer.📄\n"
+             "3. Ask a question about the documents💬\n"
+         )
+
+
+ def update_parameters_change():
+     st.session_state.is_parameters_changed = True
+
+ def get_answer_from_backend(query, model, dataset):
+     response = qaPipeline.run(query=query, model=model, dataset=dataset)
+     return response
+
+ def show_query_response(query, response, show_source_files):
+     answer, docs = response['result'], response['source_documents']
+
+     st.write(user_template.replace(
+         "{{MSG}}", query), unsafe_allow_html=True)
+     st.write(bot_template.replace(
+         "{{MSG}}", answer), unsafe_allow_html=True)
+
+     if show_source_files:
+         # st.write(source_template.replace(
+         #     "{{MSG}}", "source files"), unsafe_allow_html=True)
+         st.markdown("#### source files : ")
+         for source in docs:
+             # st.info(source.metadata)
+             with st.expander(source.metadata["source"]):
+                 st.markdown(source.page_content)
+
+     # st.write(response)
+
+ def is_query_valid(query: str) -> bool:
+     if (not query) or (query.strip() == ''):
+         st.error("Please enter a question!")
+         return False
+     return True
+
+ def handle_userinput(query):
+     # Get the answer from the chain
+     try:
+         if not is_query_valid(query):
+             st.stop()
+
+         model = MODELS[st.session_state.model]
+         dataset = DATASETS[st.session_state.dataset]
+         show_source_files = st.session_state.show_source_files
+
+         print(f">\n model: {model} \n dataset : {dataset} \n show_source_files : {show_source_files}")
+
+         response = get_answer_from_backend(query, model, dataset)
+
+         show_query_response(query, response, show_source_files)
+
+     except Exception as e:
+         # logger.error(f"Answer retrieval failed with {e}")
+         st.error(f"Error : {e}")  # , icon=":books:"
+         return
+
+ if __name__ == "__main__":
+     main()
chromaDb.py ADDED
@@ -0,0 +1,102 @@
+ """
+ Python Backend API to chat with private data
+
+ 08/14/2023
+ D.M. Theekshana Samaradiwakara
+ """
+
+ import os
+ from dotenv import load_dotenv
+ import glob
+
+ import torch
+ import pickle
+ import io
+
+ from langchain.vectorstores import Chroma
+ from langchain.vectorstores import FAISS
+
+ from langchain.embeddings import HuggingFaceEmbeddings
+
+ from chromadb.config import Settings
+
+ load_dotenv()
+
+ embeddings_model_name = os.environ.get("EMBEDDINGS_MODEL_NAME")
+ embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
+
+ def does_chroma_vectorstore_exist(persist_directory: str) -> bool:
+     # Checks if vectorstore exists
+     if os.path.exists(os.path.join(persist_directory, 'index')):
+         if os.path.exists(os.path.join(persist_directory, 'chroma-collections.parquet')) and os.path.exists(os.path.join(persist_directory, 'chroma-embeddings.parquet')):
+             list_index_files = glob.glob(os.path.join(persist_directory, 'index/*.bin'))
+             list_index_files += glob.glob(os.path.join(persist_directory, 'index/*.pkl'))
+             # At least 3 documents are needed in a working vectorstore
+             if len(list_index_files) > 3:
+                 return True
+     return False
+
+ def load_store(directory: str) -> Chroma:
+     index_path = "data/{0}".format(directory)
+     # index_exists = os.path.exists(index_path)
+     index_exists = does_chroma_vectorstore_exist(index_path)
+
+     if index_exists:
+         try:
+             CHROMA_SETTINGS = Settings(
+                 chroma_db_impl='duckdb+parquet',
+                 persist_directory=index_path,
+                 anonymized_telemetry=False
+             )
+
+             # return Chroma.load(index_path)
+             vectorstore = Chroma(
+                 persist_directory=index_path,
+                 embedding_function=embeddings,
+                 client_settings=CHROMA_SETTINGS
+             )
+
+             # with open("vectorstore.pkl", "wb") as f:
+             #     pickle.dump(vectorstore, f)
+
+             return vectorstore
+         except Exception as e:
+             raise Exception(f"Error loading vector store: {e} ")
+
+     else:
+         # raise exception if the vector store has not been created
+         raise Exception(f"A vector store in directory {directory} is not created. Please choose a valid one")
+
+ class CPU_Unpickler(pickle.Unpickler):
+     def find_class(self, module, name):
+         if module == 'torch.storage' and name == '_load_from_bytes':
+             return lambda b: torch.load(io.BytesIO(b), map_location='cpu')
+         else:
+             return super().find_class(module, name)
+
+ def create_db(document_splits, persist_directory):
+     return Chroma.from_documents(
+         documents=document_splits,
+         embedding=embeddings,
+         persist_directory=persist_directory
+     )
+
+ def save_files(persist_directory, document_splits):
+     print("Saving document splits...")
+     index_path = "data/{0}".format(persist_directory)
+     if does_chroma_vectorstore_exist(index_path):
+         print("Updating existing vector store. May take some minutes...")
+         # update the existing store
+         db = Chroma(
+             persist_directory=index_path,
+             embedding_function=embeddings,
+         )
+         db.add_documents(document_splits)
+     else:
+         print("Creating new vector store. May take some minutes...")
+         db = create_db(document_splits, index_path)
+     db.persist()
config.py ADDED
@@ -0,0 +1,16 @@
+ MODELS = {
+     "DEFAULT": "tiiuae/falcon-7b-instruct",
+     "gpt4all": "gpt4all",
+     "flan-t5-xxl": "google/flan-t5-xxl",
+     "falcon-7b-instruct": "tiiuae/falcon-7b-instruct",
+     "openai gpt-3.5": "openai",
+ }
+
+ DATASETS = {
+     "DEFAULT": "chroma_txt",
+     "a": "A",
+     "b": "B",
+     "c": "C",
+ }
dataPipeline.py ADDED
@@ -0,0 +1,144 @@
+ """
+ Python Backend API to chat with private data
+
+ 08/15/2023
+ D.M. Theekshana Samaradiwakara
+ """
+
+ import os
+ import time
+ import glob
+ from multiprocessing import Pool
+ from tqdm import tqdm
+ from dotenv import load_dotenv
+
+ from chromaDb import save_files, load_store
+
+ from langchain.document_loaders import (
+     CSVLoader,
+     EverNoteLoader,
+     PyMuPDFLoader,
+     TextLoader,
+     UnstructuredEmailLoader,
+     UnstructuredEPubLoader,
+     UnstructuredHTMLLoader,
+     UnstructuredMarkdownLoader,
+     UnstructuredODTLoader,
+     UnstructuredPowerPointLoader,
+     UnstructuredWordDocumentLoader,
+ )
+
+ from langchain.document_loaders import DirectoryLoader
+ text_loader_kwargs = {'autodetect_encoding': True}
+
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
+ from langchain.embeddings import HuggingFaceEmbeddings
+ from langchain.docstore.document import Document
+
+ load_dotenv()
+
+ # chunk settings come from the environment as strings; cast them to int
+ chunk_size = int(os.environ.get('EMBEDDING_CHUNK_SIZE'))
+ chunk_overlap = int(os.environ.get('EMBEDDING_CHUNK_OVERLAP'))
+ embeddings_model_name = os.environ.get("EMBEDDINGS_MODEL_NAME")
+
+ # Map file extensions to document loaders and their arguments
+ LOADER_MAPPING = {
+     ".csv": (CSVLoader, {}),
+     # ".docx": (Docx2txtLoader, {}),
+     ".doc": (UnstructuredWordDocumentLoader, {}),
+     ".docx": (UnstructuredWordDocumentLoader, {}),
+     ".enex": (EverNoteLoader, {}),
+     ".eml": (UnstructuredEmailLoader, {}),
+     ".epub": (UnstructuredEPubLoader, {}),
+     ".html": (UnstructuredHTMLLoader, {}),
+     ".md": (UnstructuredMarkdownLoader, {}),
+     ".odt": (UnstructuredODTLoader, {}),
+     ".pdf": (PyMuPDFLoader, {}),
+     ".ppt": (UnstructuredPowerPointLoader, {}),
+     ".pptx": (UnstructuredPowerPointLoader, {}),
+     ".txt": (TextLoader, {"encoding": "utf8"}),
+     # Add more mappings for other file extensions and loaders as needed
+ }
+
+ class DataPipeline:
+
+     def __init__(self):
+         self.dataset_name = None
+         self.vectorstore = None
+
+     def load_documents_in_folder(self, folder):
+         print("loading documents...")
+         loader = DirectoryLoader(folder, glob="**/[!.]*", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
+         pages = loader.load()
+         return pages
+
+     def load_single_document(self, doc):
+         ext = "." + doc.name.rsplit(".", 1)[-1]
+         if ext in LOADER_MAPPING:
+             loader_class, loader_args = LOADER_MAPPING[ext]
+             loader = loader_class(doc, **loader_args)
+             return loader.load()
+
+         raise ValueError(f"Unsupported file extension '{ext}'")
+
+     def load_documents(self, uploaded_files):
+         with Pool(processes=os.cpu_count()) as pool:
+             results = []
+             with tqdm(total=len(uploaded_files), desc='Loading new documents', ncols=80) as pbar:
+                 for i, docs in enumerate(pool.imap_unordered(self.load_single_document, uploaded_files)):
+                     results.extend(docs)
+                     pbar.update()
+
+         return results
+
+     def load_streamlit_documents(self, uploaded_files, year):
+         documents = []
+         for uploaded_file in uploaded_files:
+             print("\n\n uploaded_file \n\n", uploaded_file, "\n")
+             source = uploaded_file.name
+             print("\n\n source \n\n", source, "\n")
+             content = uploaded_file.read().decode('latin-1')
+             print("\n\n content \n\n", content[:10], "\n")
+
+             doc = Document(
+                 page_content=content,
+                 metadata={
+                     "source": source,
+                     'year': year
+                 }
+             )
+             print("\n doc \n\n", doc, "\n\n\n\n")
+
+             documents.append(doc)
+
+         return documents
+
+     def process_documents(self, documents):
+         print("Creating embeddings. May take some minutes...")
+         text_splitter = RecursiveCharacterTextSplitter(
+             chunk_size=chunk_size,
+             chunk_overlap=chunk_overlap,
+             separators=["\n\n", "\n", "(?<=\. )", " ", ""]
+         )
+         texts = text_splitter.split_documents(documents)
+         return texts
+
+     def persist_documents(self, persist_directory, document_splits):
+         save_files(persist_directory, document_splits)
+
+     def add_metadata(self, documents, metadata, value):
+         for doc in documents:
+             doc.metadata[metadata] = value
+         return documents
faissDb.py ADDED
@@ -0,0 +1,34 @@
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
+ from langchain.embeddings import HuggingFaceEmbeddings
+ from langchain.docstore.document import Document
+ from langchain.document_loaders import PyPDFLoader
+ from langchain.document_loaders import TextLoader
+ from langchain.document_loaders import DirectoryLoader
+ from langchain.vectorstores.faiss import FAISS
+
+ EMBEDDINGS_MODEL_NAME = "all-MiniLM-L6-v2"
+ embeddings_model_name = EMBEDDINGS_MODEL_NAME
+ embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
+ persist_directory = "data/cbsl"
+ index_path = persist_directory
+
+ chunk_size = 1000
+ chunk_overlap = 50
+
+
+ def create_faiss():
+     # documents = DirectoryLoader(persist_directory, loader_cls=PyMuPDFLoader).load()
+     documents = DirectoryLoader("CBSL", loader_cls=PyPDFLoader).load()
+
+     text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
+     texts = text_splitter.split_documents(documents)
+
+     vectorstore = FAISS.from_documents(texts, embeddings)
+     vectorstore.save_local("faiss_index")
+
+
+ def load_FAISS_store():
+     return FAISS.load_local("faiss_index", embeddings)
fileUpload.py ADDED
@@ -0,0 +1,148 @@
+ """
+ Python Backend API to chat with private data
+
+ 08/17/2023
+ D.M. Theekshana Samaradiwakara
+ """
+
+ import os
+ import streamlit as st
+ from streamlit.logger import get_logger
+ from io import StringIO
+
+ logger = get_logger(__name__)
+
+ from dataPipeline import DataPipeline
+
+ def initialize_session_state():
+     # Initialise all session state variables with defaults
+     SESSION_DEFAULTS = {
+         "data_index": None,
+         "published_year": 2023,
+         "is_parameters_changed": False,
+         "is_input_validated": False,
+     }
+
+     for k, v in SESSION_DEFAULTS.items():
+         if k not in st.session_state:
+             st.session_state[k] = v
+
+ def update_parameters_change():
+     st.session_state.is_parameters_changed = True
+
+ def validate_index():
+     index = st.session_state.data_index
+     if (not index) or (not index.strip()):
+         st.error("Empty index directory name!")
+         st.stop()
+
+     st.info(f"file persist directory name: {index}")
+
+ def validate_files(uploaded_file):
+     if not uploaded_file:
+         st.error("No uploaded files to process!")
+         st.stop()
+
+     st.info(f"No of files uploaded : {len(uploaded_file)}")
+
+ def validate_published_year():
+     if not st.session_state.published_year:
+         st.error("Invalid year!")
+         st.stop()
+
+     st.info(f"file published year : {st.session_state.published_year}")
+
+ def validate_inputs(uploaded_file):
+     validate_index()
+     validate_published_year()
+     validate_files(uploaded_file)
+
+     return True
+
+
+ def process_files(uploaded_files, data_index):
+     try:
+         st.info(uploaded_files)
+         dataPipe = DataPipeline()
+
+         documents = dataPipe.load_streamlit_documents(uploaded_files, st.session_state.published_year)
+
+         # documents = dataPipe.add_metadata(documents, "year", st.session_state.published_year)
+         # process_docs = dataPipe.process_documents(documents)
+         # st.success("files successfully processed!")
+
+         # dataPipe.persist_documents(data_index, process_docs)
+         # st.success("files successfully stored!")
+
+     except Exception as e:
+         st.error(str(e))
+
+
+ # sidebar function
+ def sidebar():
+     with st.sidebar:
+         st.subheader("Data indexing parameters")
+
+         persist_index_name = st.text_input(
+             label="file persist directory name",
+             placeholder="enter index name",
+             key="persist_index_name",
+             help="name of the directory in which processed files will be persisted.",
+             on_change=update_parameters_change,
+         )
+
+         publish_year = st.number_input(
+             label="published year",
+             min_value=1950,
+             value=2023,
+             max_value=2025,
+             key="publish_year",
+             help="year the files were published.",
+             on_change=update_parameters_change,
+         )
+
+         if st.session_state.is_parameters_changed:
+             st.session_state.data_index = persist_index_name
+             st.session_state.published_year = publish_year
+             st.session_state.is_parameters_changed = False
+             st.info(f"file persist directory name: {st.session_state.data_index}")
+             st.info(f"file published year : {st.session_state.published_year}")
+
+
+ # main function
+ def main():
+     st.set_page_config(page_title="upload files to database", page_icon="📖")  # , layout="wide"
+     st.header("📖Boardpac chat App")
+
+     initialize_session_state()
+
+     sidebar()
+
+     uploaded_file = st.file_uploader(
+         "Upload your files here and click on 'Process'",
+         key="uploaded_file",
+         accept_multiple_files=True,
+         help="Upload files here!",
+     )
+
+     col1, col2 = st.columns(2)
+
+     with col1:
+         if st.button("validate"):
+             if validate_inputs(uploaded_file):
+                 st.session_state.is_input_validated = True
+
+     with col2:
+         if st.session_state.is_input_validated:
+             if st.button("process"):
+                 with st.spinner("Indexing document... This may take a while⏳"):
+                     process_files(uploaded_file, st.session_state.data_index)
+                 uploaded_file = None
+                 st.session_state.is_input_validated = False
+
+
+ if __name__ == "__main__":
+     main()
qaPipeline.py ADDED
@@ -0,0 +1,110 @@
+ """
+ Python Backend API to chat with private data
+
+ 08/14/2023
+ D.M. Theekshana Samaradiwakara
+ """
+
+ import os
+ import time
+
+ from dotenv import load_dotenv
+
+ from langchain.chains import RetrievalQA
+ from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
+
+ from langchain.llms import GPT4All
+ from langchain.llms import HuggingFaceHub
+ from langchain.chat_models import ChatOpenAI
+
+ # from langchain.retrievers.self_query.base import SelfQueryRetriever
+ # from langchain.chains.query_constructor.base import AttributeInfo
+
+ # from chromaDb import load_store
+ from faissDb import load_FAISS_store
+
+ load_dotenv()
+
+ # gpt4all model
+ gpt4all_model_path = os.environ.get('GPT4ALL_MODEL_PATH')
+ model_n_ctx = os.environ.get('MODEL_N_CTX')
+ model_n_batch = int(os.environ.get('MODEL_N_BATCH', 8))
+ target_source_chunks = int(os.environ.get('TARGET_SOURCE_CHUNKS', 4))
+
+ openai_api_key = os.environ.get('OPENAI_API_KEY')
+
+ # VERBOSE is read from the environment as a string; convert it to a real boolean
+ verbose = os.environ.get('VERBOSE', 'False').strip().lower() in ('true', '1')
+
+ # activate/deactivate the streaming StdOut callback for LLMs
+ callbacks = [StreamingStdOutCallbackHandler()]
+
+ class QAPipeline:
+
+     def __init__(self):
+         self.llm_name = None
+         self.llm = None
+
+         self.dataset_name = None
+         self.vectorstore = None
+
+         self.qa_chain = None
+
+     def run(self, query, model, dataset):
+         if (self.llm_name != model) or (self.dataset_name != dataset) or (self.qa_chain is None):
+             self.set_model(model)
+             self.set_vectorstore(dataset)
+             self.set_qa_chain()
+
+         # Get the answer from the chain
+         start = time.time()
+         res = self.qa_chain(query)
+         # answer, docs = res['result'], res['source_documents']
+         end = time.time()
+
+         # Print the result
+         print("\n\n> Question:")
+         print(query)
+         print(f"\n> Answer (took {round(end - start, 2)} s.):")
+         print(res)
+
+         return res
+
+     def set_model(self, model_type):
+         if model_type != self.llm_name:
+             match model_type:
+                 case "gpt4all":
+                     # self.llm = GPT4All(model=gpt4all_model_path, n_ctx=model_n_ctx, backend='gptj', n_batch=model_n_batch, callbacks=callbacks, verbose=verbose)
+                     self.llm = GPT4All(model=gpt4all_model_path, max_tokens=model_n_ctx, backend='gptj', n_batch=model_n_batch, callbacks=callbacks, verbose=verbose)
+                     # self.llm = HuggingFaceHub(repo_id="nomic-ai/gpt4all-j", model_kwargs={"temperature": 0.001, "max_length": 1024})
+                 case "google/flan-t5-xxl":
+                     self.llm = HuggingFaceHub(repo_id="google/flan-t5-xxl", model_kwargs={"temperature": 0.001, "max_length": 1024})
+                 case "tiiuae/falcon-7b-instruct":
+                     self.llm = HuggingFaceHub(repo_id=model_type, model_kwargs={"temperature": 0.001, "max_length": 1024})
+                 case "openai":
+                     self.llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
+                 case _:
+                     # raise exception if model_type is not supported
+                     raise Exception(f"Model type {model_type} is not supported. Please choose a valid one")
+
+             self.llm_name = model_type
+
+     def set_vectorstore(self, dataset):
+         if dataset != self.dataset_name:
+             # self.vectorstore = load_store(dataset)
+             self.vectorstore = load_FAISS_store()
+             print("\n\n> vectorstore loaded:")
+             self.dataset_name = dataset
+
+     def set_qa_chain(self):
+         self.qa_chain = RetrievalQA.from_chain_type(
+             llm=self.llm,
+             chain_type="stuff",
+             retriever=self.vectorstore.as_retriever(),
+             # retriever=self.vectorstore.as_retriever(search_kwargs={"k": target_source_chunks}),
+             return_source_documents=True
+         )
+
requirements.txt ADDED
@@ -0,0 +1,16 @@
+ langchain
+ openai
+ gpt4all
+
+ chromadb
+ duckdb
+
+ torch
+ faiss-cpu
+
+ streamlit
+ # huggingface llms
+ huggingface-hub
+ sentence_transformers
+
+ python-dotenv
schema/apiSchema.py ADDED
@@ -0,0 +1,28 @@
+ """
+ Python Backend API to chat with private data
+
+ 08/14/2023
+ D.M. Theekshana Samaradiwakara
+ """
+
+ from typing import Optional, List, Any, Dict
+ from pydantic import BaseModel
+
+
+ class Document(BaseModel):
+     name: Optional[str]
+     page_content: str
+     metadata: Dict[str, Any]
+
+
+ class QueryModel(BaseModel):
+     model: str
+     dataset: str
+     question: str
+     history: Optional[list] = None
+
+
+ class ResponseModel(BaseModel):
+     success: Optional[str] = None
+     error: Optional[str] = None
+     documents: List[Document]  # = None
ui/__pycache__/htmlTemplates.cpython-311.pyc ADDED
Binary file (1.42 kB).
 
ui/a.jpg ADDED
ui/bot1.jpg ADDED
ui/bot2.webp ADDED
ui/htmlTemplates.py ADDED
@@ -0,0 +1,51 @@
+ css = '''
+ <style>
+ .chat-message {
+     padding: 1.5rem; border-radius: 0.5rem; margin-bottom: 1rem; display: flex
+ }
+ .chat-message.user {
+     background-color: #2b313e
+ }
+ .chat-message.bot {
+     background-color: #475063
+ }
+ .chat-message .avatar {
+     width: 20%;
+ }
+ .chat-message .avatar img {
+     max-width: 78px;
+     max-height: 78px;
+     border-radius: 50%;
+     object-fit: cover;
+ }
+ .chat-message .message {
+     width: 80%;
+     padding: 0 1.5rem;
+     color: #fff;
+ }
+ </style>
+ '''
+
+ bot_template = '''
+ <div class="chat-message bot">
+     <div class="avatar">
+         <img src="https://as2.ftcdn.net/v2/jpg/05/56/09/81/1000_F_556098117_GdiFN9p9j89dpt3JhLJsegV76tt1NhfA.jpg">
+     </div>
+     <div class="message">{{MSG}}</div>
+ </div>
+ '''
+ user_template = '''
+ <div class="chat-message user">
+     <div class="avatar">
+         <img src="https://coursera-profile-photos.s3.amazonaws.com/2a/f80e20d0fe4e628036656d2ec2b85b/a.jpg">
+     </div>
+     <div class="message">{{MSG}}</div>
+ </div>
+ '''
+ source_template = '''
+ <div class="chat-message bot">
+     <div class="avatar">
+         <img src="https://st.depositphotos.com/1427101/4468/v/950/depositphotos_44680417-stock-illustration-pdf-paper-sheet-icons.jpg">
+     </div>
+     <div class="message">{{MSG}}</div>
+ </div>
+ '''
ui/pdf.jpg ADDED