WhiskeyCorridor committed on
Commit 5459be4
1 Parent(s): a5952d8

Upload 7 files

Files changed (7)
  1. .gitignore +163 -0
  2. README.md +6 -12
  3. app.py +21 -0
  4. fileingestor.py +94 -0
  5. loadllm.py +44 -0
  6. readme.txt +45 -0
  7. requirements.txt +12 -0
.gitignore ADDED
@@ -0,0 +1,163 @@
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # C extensions
+ *.so
+
+ # Distribution / packaging
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ share/python-wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+ *.Q4_K_M.gguf
+ *.gguf
+ *.Q4_K_M
+
+ # PyInstaller
+ # Usually these files are written by a python script from a template
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
+ *.manifest
+ *.spec
+
+ # Installer logs
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # Unit test / coverage reports
+ htmlcov/
+ .tox/
+ .nox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ *.py,cover
+ .hypothesis/
+ .pytest_cache/
+ cover/
+
+ # Translations
+ *.mo
+ *.pot
+
+ # Django stuff:
+ *.log
+ local_settings.py
+ db.sqlite3
+ db.sqlite3-journal
+
+ # Flask stuff:
+ instance/
+ .webassets-cache
+
+ # Scrapy stuff:
+ .scrapy
+
+ # Sphinx documentation
+ docs/_build/
+
+ # PyBuilder
+ .pybuilder/
+ target/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # IPython
+ profile_default/
+ ipython_config.py
+
+ # pyenv
+ # For a library or package, you might want to ignore these files since the code is
+ # intended to run in multiple environments; otherwise, check them in:
+ # .python-version
+
+ # pipenv
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
+ # install all needed dependencies.
+ #Pipfile.lock
+
+ # poetry
+ # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
+ # commonly ignored for libraries.
+ # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+ #poetry.lock
+
+ # pdm
+ # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+ #pdm.lock
+ # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+ # in version control.
+ # https://pdm.fming.dev/#use-with-ide
+ .pdm.toml
+
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+ __pypackages__/
+
+ # Celery stuff
+ celerybeat-schedule
+ celerybeat.pid
+
+ # SageMath parsed files
+ *.sage.py
+
+ # Environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
+
+ # Rope project settings
+ .ropeproject
+
+ # mkdocs documentation
+ /site
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+
+ # Pyre type checker
+ .pyre/
+
+ # pytype static type analyzer
+ .pytype/
+
+ # Cython debug symbols
+ cython_debug/
+
+ # PyCharm
+ # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+ # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+ # and can be added to the global gitignore or merged into this file. For a more nuclear
+ # option (not recommended) you can uncomment the following to ignore the entire idea folder.
+ #.idea/
README.md CHANGED
@@ -1,12 +1,6 @@
- ---
- title: PDF Chatbot
- emoji:
- colorFrom: red
- colorTo: blue
- sdk: streamlit
- sdk_version: 1.33.0
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ NLP Midterm Exam (UTS), Even Semester 2023 <br>
+ PDF chatbot built with the Streamlit framework and the Llama 2 LLM <br><br>
+ 1121018 - Friendly Sejati Bunardi<br>
+ 1121028 - David Kharis Elio m<br>
+ 1121030 - Juan Vincent Nugrahaputra<br>
+ 1121031 - Jonathan Senjaya<br>
app.py ADDED
@@ -0,0 +1,21 @@
+ # Import Streamlit, the framework this application is built on
+ import streamlit as st
+
+ from fileingestor import FileIngestor
+
+ # Set the title for the Streamlit app,
+ # along with the subtitle lines shown beneath it
+
+ st.title("PDF-Chatbot")
+ st.write("Chat with your PDF documents!")
+ st.write("Powered by Llama2")
+ st.write("Made by Team John Snow")
+
+ # Create a file uploader in the sidebar
+ # The PDF file that the chatbot will use is uploaded through this sidebar widget
+ uploaded_file = st.sidebar.file_uploader("Upload File", type="pdf")
+
+ # Once a file has been uploaded, hand it to the FileIngestor class, which processes the submitted PDF
+ if uploaded_file:
+     file_ingestor = FileIngestor(uploaded_file)
+     file_ingestor.handlefileandingest()
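Streamlit reruns the script from top to bottom on every interaction, so the code above builds a new `FileIngestor` and calls `handlefileandingest()` again each time the user sends a message. The following is a minimal sketch of one way to avoid repeating that work: cache the result under a hash of the uploaded bytes. It is not part of this commit. `build_index` and `get_index` are hypothetical names, and `build_index` only stands in for the expensive ingest step. In a real Streamlit app, `st.cache_resource` serves a similar purpose; the plain dictionary here only illustrates the idea.

```python
import hashlib

# Hypothetical in-memory cache: content hash -> expensive result.
_index_cache = {}


def build_index(pdf_bytes):
    """Placeholder for the expensive ingest step (hypothetical)."""
    return {"size": len(pdf_bytes)}


def get_index(pdf_bytes, builder=build_index):
    """Return the cached result for these bytes, building it only once."""
    key = hashlib.sha256(pdf_bytes).hexdigest()
    if key not in _index_cache:
        _index_cache[key] = builder(pdf_bytes)
    return _index_cache[key]
```

Keying on the file contents rather than the file name means that re-uploading the same PDF under a different name reuses the cached result.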
fileingestor.py ADDED
@@ -0,0 +1,94 @@
+ # Import Streamlit, LangChain, PyMuPDFLoader, and the local loadllm module
+ # PyMuPDFLoader uses PyMuPDF, a library for extracting, analyzing, and converting data from PDF documents
+ import streamlit as st
+ from langchain.document_loaders import PyMuPDFLoader
+ from loadllm import Loadllm
+ from streamlit_chat import message
+ import tempfile
+ from langchain.embeddings import HuggingFaceEmbeddings
+ from langchain.vectorstores import FAISS
+ from langchain.chains import ConversationalRetrievalChain
+
+ # Load model directly
+ #from transformers import AutoModel
+
+ # Path where the FAISS vector store is saved
+ # FAISS (Facebook AI Similarity Search) is a library for finding document embeddings that are similar to one another
+ # FAISS provides algorithms that search for similarity in sets of vectors of any size
+ # FAISS can search through large amounts of information quickly and pick out the relevant items
+ DB_FAISS_PATH = 'vectorstore/db_faiss'
+
+ class FileIngestor:
+     def __init__(self, uploaded_file):
+         self.uploaded_file = uploaded_file
+
+     def handlefileandingest(self):
+         with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
+             tmp_file.write(self.uploaded_file.getvalue())
+             tmp_file_path = tmp_file.name
+
+         loader = PyMuPDFLoader(file_path=tmp_file_path)
+         data = loader.load()
+
+         # Create embeddings using Sentence Transformers
+         # The document embeddings are created with a sentence-transformers model provided through HuggingFace
+         # This transformer is BERT-based and maps sentences and paragraphs into a
+         # 384-dimensional dense vector space
+         embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
+
+         # Create a FAISS vector store and save embeddings
+         db = FAISS.from_documents(data, embeddings)
+         db.save_local(DB_FAISS_PATH)
+
+         # Load the language model
+         # Load the Llama 2 model prepared in loadllm.py
+         llm = Loadllm.load_llm()
+         #llm = AutoModel.from_pretrained("TheBloke/Llama-2-7B-Chat-GGUF")
+
+         # Create a conversational chain
+         # Build the conversational retrieval chain on top of Llama 2
+         chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=db.as_retriever())
+
+         # Function for conversational chat
+         # Adds a new chat turn for Streamlit
+         # query is the question we ask, answer is the reply, and history gives Llama the
+         # context of our conversation with it
+         def conversational_chat(query):
+             result = chain({"question": query, "chat_history": st.session_state['history']})
+             st.session_state['history'].append((query, result["answer"]))
+             return result["answer"]
+
+         # Initialize chat history
+         if 'history' not in st.session_state:
+             st.session_state['history'] = []
+
+         # Initialize messages
+         if 'generated' not in st.session_state:
+             st.session_state['generated'] = ["Hello ! Ask me(LLAMA2) about " + self.uploaded_file.name + " 🤗"]
+
+         if 'past' not in st.session_state:
+             st.session_state['past'] = ["Hey ! 👋"]
+
+         # Create containers for chat history and user input
+         # These containers hold the UI that is displayed
+         response_container = st.container()
+         container = st.container()
+
+         # User input form
+         with container:
+             with st.form(key='my_form', clear_on_submit=True):
+                 user_input = st.text_input("Query:", placeholder="Talk to PDF data 🧮", key='input')
+                 submit_button = st.form_submit_button(label='Send')
+
+             # If the submit button is clicked (or Enter is pressed) and the input is not empty, run a conversation turn
+             if submit_button and user_input:
+                 output = conversational_chat(user_input)
+                 st.session_state['past'].append(user_input)
+                 st.session_state['generated'].append(output)
+
+         # Display chat history
+         if st.session_state['generated']:
+             with response_container:
+                 for i in range(len(st.session_state['generated'])):
+                     message(st.session_state["past"][i], is_user=True, key=str(i) + '_user', avatar_style="big-smile")
+                     message(st.session_state["generated"][i], key=str(i), avatar_style="thumbs")
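The upload handling above writes the PDF to a `NamedTemporaryFile(delete=False)` with no `.pdf` suffix and never removes it, so each upload leaves a file behind in the system temp directory. The following is a minimal sketch of one possible alternative, not the authors' code: a context manager that writes the bytes, yields the path, and deletes the file on exit. The name `temporary_pdf` is hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_pdf(pdf_bytes):
    """Write bytes to a temporary .pdf file and delete it on exit."""
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".pdf")
    try:
        tmp.write(pdf_bytes)
        # Close the handle so another library can open the path
        # (an open handle blocks reopening on Windows).
        tmp.close()
        yield tmp.name
    finally:
        tmp.close()  # harmless if already closed
        if os.path.exists(tmp.name):
            os.remove(tmp.name)
```

With `with temporary_pdf(data) as path:`, the path is valid only inside the block, so any loader that reads the file must finish there.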
loadllm.py ADDED
@@ -0,0 +1,44 @@
+ # Import the LangChain library
+ # LangChain is a framework that simplifies building applications on top of Large Language Models (LLMs) such as
+ # GPT, Claude, Llama, and many others
+ from langchain.llms import LlamaCpp
+ from langchain.callbacks.manager import CallbackManager
+ from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
+
+ # Path where the Llama model file used by the chatbot is stored
+ # The model we use is Llama 2 7B Chat GGUF, a conversion of the Llama 2 7B Chat model made by Meta
+ # The model was converted to the GGUF format, which offers several advantages over the older GGML format, such as
+ # better tokenization, support for special tokens, support for metadata, and an extensible design
+ model_path = 'model/llama-2-7b-chat.Q4_K_M.gguf'
+
+ class Loadllm:
+     # Function that loads the Llama 2 model and prepares it for use
+     @staticmethod
+     def load_llm():
+         callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
+         # Prepare the LLM
+
+         # LlamaCpp is LangChain's wrapper for llama.cpp, a library that aims to provide LLM inference with minimal setup and
+         # state-of-the-art performance on a wide variety of hardware, both locally and in the cloud
+         # model_path = Location of the Llama model file on disk
+         # n_gpu_layers = Number of layers to offload to the GPU
+         # n_batch = Maximum batch size for prompt processing
+         # n_ctx = Size of the context window, in tokens
+         # max_tokens = Maximum number of tokens the model generates in a response
+         # local_files_only = Whether to use only locally available model files or to download from an external source
+         # f16_kv = Use half precision for the key/value cache
+         # callback_manager = Callback manager whose handlers receive the output (here, streamed to stdout)
+         # verbose = Print verbose output
+         llm = LlamaCpp(
+             model_path=model_path,
+             n_gpu_layers=20,
+             n_batch=512,
+             n_ctx=4096,
+             max_tokens=4096,
+             local_files_only=True,
+             f16_kv=True, # MUST set to True, otherwise you will run into problem after a couple of calls
+             callback_manager=callback_manager,
+             verbose=True,
+         )
+         # Return the loaded Llama model
+         return llm
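The configuration above sets both `n_ctx` and `max_tokens` to 4096. The context window holds the prompt and the generated tokens together, so the number of tokens that can actually be generated is at most `n_ctx` minus the prompt length. The sketch below only illustrates that arithmetic. `generation_budget` is a hypothetical helper, the prompt length of 1500 tokens is an invented example value, and how the library itself reacts to an oversized request is not shown in this commit.

```python
def generation_budget(n_ctx, prompt_tokens, max_tokens):
    """Return how many tokens can actually be generated.

    The context window holds the prompt and the generated tokens together.
    """
    if prompt_tokens >= n_ctx:
        raise ValueError("prompt does not fit in the context window")
    return min(max_tokens, n_ctx - prompt_tokens)


# Example with an assumed prompt of 1500 tokens (retrieved chunks + question).
print(generation_budget(n_ctx=4096, prompt_tokens=1500, max_tokens=4096))  # 2596
```

In a retrieval chain the prompt includes the retrieved document chunks and the chat history, so the remaining budget shrinks as the conversation grows.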
readme.txt ADDED
@@ -0,0 +1,45 @@
+ How to use the chatbot
+
+ Our chatbot requires the following Python libraries:
+ langchain==0.1.11
+ numpy==1.25.2
+ Pillow==10.2.0
+ protobuf==4.25.3
+ streamlit==1.31.1
+ streamlit_chat==0.1.1
+ tornado==6.1
+ transformers==4.26.1
+ pymupdf
+ sentence-transformers
+ faiss-cpu
+ llama-cpp-python
+
+ These libraries must first be installed with pip install in the Python environment that will run our program.
+
+ Folder structure
+
+ PDF-Chatbot
+     .streamlit
+         config.toml
+     model
+         llama-2-7b-chat.Q4_K_M.gguf
+     vectorstore
+         db_faiss
+             index.faiss
+             index.pkl
+     app.py
+     fileingestor.py
+     loadllm.py
+     readme.txt
+     requirements.txt
+
+ Usage steps
+ 1. Download our model from the following Google Drive link: https://bit.ly/model-PDF-Chatbot
+ 2. Clone or download our source code from GitHub at the following link: https://github.com/FriendlySB/PDF-Chatbot
+ 3. Inside the PDF-Chatbot folder, create a folder named model
+ 4. Move the downloaded model into that folder
+ 5. To run the application, open a command prompt
+ 6. Use the cd (change directory) command to go to the path where the PDF-Chatbot folder is stored
+ 7. Run the command streamlit run app.py in the command prompt
+ 8. The program opens a new browser tab in which the chatbot application runs
+ 9. The chatbot is ready to use
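The setup steps above can be checked before launching the app. The following is a minimal sketch, assuming it is run from inside the PDF-Chatbot folder. The paths come from the folder structure listed in readme.txt, while the helper name `missing_files` and the script itself are hypothetical and not part of the repository.

```python
from pathlib import Path

# Paths taken from the folder structure in readme.txt, relative to PDF-Chatbot.
REQUIRED_FILES = [
    "app.py",
    "fileingestor.py",
    "loadllm.py",
    "model/llama-2-7b-chat.Q4_K_M.gguf",
]


def missing_files(project_dir):
    """Return the required paths that do not exist under project_dir."""
    root = Path(project_dir)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required files found; run: streamlit run app.py")
```

The `vectorstore` folder is left out of the check because the application writes it itself when a PDF is ingested.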
requirements.txt ADDED
@@ -0,0 +1,12 @@
+ langchain==0.1.11
+ numpy==1.25.2
+ Pillow==10.2.0
+ protobuf==4.25.3
+ streamlit==1.31.1
+ streamlit_chat==0.1.1
+ tornado==6.1
+ transformers==4.26.1
+ pymupdf
+ sentence-transformers
+ faiss-cpu
+ llama-cpp-python
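Eight of the requirements above are pinned to exact versions and four are not, so a fresh install may resolve different versions of the unpinned packages over time. The sketch below separates the two groups. `split_pinned` is a hypothetical helper that recognizes only the `==` operator, which is a simplification of the full requirement specifier syntax.

```python
def split_pinned(lines):
    """Split requirement lines into (pinned, unpinned) package names."""
    pinned, unpinned = [], []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "==" in line:
            pinned.append(line.split("==", 1)[0])
        else:
            unpinned.append(line)
    return pinned, unpinned


requirements = """langchain==0.1.11
numpy==1.25.2
Pillow==10.2.0
protobuf==4.25.3
streamlit==1.31.1
streamlit_chat==0.1.1
tornado==6.1
transformers==4.26.1
pymupdf
sentence-transformers
faiss-cpu
llama-cpp-python
""".splitlines()

pinned, unpinned = split_pinned(requirements)
print(unpinned)  # ['pymupdf', 'sentence-transformers', 'faiss-cpu', 'llama-cpp-python']
```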