dh-mc committed
Commit 910c4c8 (parent: 71ef2ac)

gradio app: support chatting with llama-2

.env.example CHANGED
@@ -25,7 +25,7 @@ HF_PIPELINE_DEVICE_TYPE=
 # LOAD_QUANTIZED_MODEL=4bit
 # LOAD_QUANTIZED_MODEL=8bit
 
-USE_LLAMA_2_PROMPT_TEMPLATE=true
+# USE_LLAMA_2_PROMPT_TEMPLATE=true
 DISABLE_MODEL_PRELOADING=true
 CHAT_HISTORY_ENABLED=true
 SHOW_PARAM_SETTINGS=false
@@ -84,7 +84,7 @@ TOKENIZERS_PARALLELISM=true
 
 # env variables for ingesting source PDF files
 SOURCE_PDFS_PATH="./data/pdfs/"
-SOURCE_URLS="./data/pci_dss_urls.txt"
+SOURCE_URLS=
 CHUNCK_SIZE=1024
 CHUNK_OVERLAP=512
 
Makefile CHANGED
@@ -1,7 +1,7 @@
 .PHONY: start
 start:
 	python app.py
-
+
 serve:
 ifeq ("$(PORT)", "")
 	JINA_HIDE_SURVEY=1 TRANSFORMERS_OFFLINE=1 python -m lcserve deploy local server
@@ -10,11 +10,14 @@ else
 endif
 
 test:
-	python test.py $(TEST)
+	python test.py
 
 chat:
 	python test.py chat
 
+unittest:
+	python unit_test.py $(TEST)
+
 tele:
 	python telegram_bot.py
 
README.md CHANGED
@@ -8,7 +8,126 @@ sdk_version: 3.36.1
 app_file: app.py
 pinned: false
 license: apache-2.0
-duplicated_from: inflaton/chat-with-pci-dss-v4
 ---
 
+# ChatPDF - Talk to Your PDF Files
+
+This project uses OpenAI and open-source large language models (LLMs) to enable you to talk to your own PDF files.
+
+## How it works
+
+We're using an AI design pattern known as "in-context learning", which uses LLMs off the shelf (i.e., without any fine-tuning) and controls their behavior through clever prompting and conditioning on private "contextual" data, e.g., text extracted from your PDF files.
+
+At a very high level, the workflow can be divided into three stages:
+
+1. Data preprocessing / embedding: This stage involves storing private data (your PDF files) to be retrieved later. Typically, the documents are broken into chunks, passed through an embedding model, and the resulting embeddings are stored in a vectorstore.
+
+2. Prompt construction / retrieval: When a user submits a query, the application constructs a series of prompts to submit to the language model. A compiled prompt typically combines a prompt template and a set of relevant documents retrieved from the vectorstore.
+
+3. Prompt execution / inference: Once the prompts have been compiled, they are submitted to a pre-trained LLM for inference; this includes both proprietary model APIs and open-source or self-trained models.
+
+![In-context Learning - Workflow Overview](./assets/Workflow-Overview.png)
+
+The tech stack includes LangChain, Gradio, Chroma and FAISS.
+- LangChain is an open-source framework that makes it easier to build scalable AI/LLM apps and chatbots.
+- Gradio is an open-source Python library used to build machine learning and data science demos and web applications.
+- Chroma and FAISS are open-source vectorstores for storing the embeddings of your files.
+
+## Running Locally
+
+1. Check pre-conditions:
+
+- [Git Large File Storage (LFS)](https://git-lfs.com/) must be installed.
+- Run `python --version` to make sure you're running Python version 3.10 or above.
+- The latest PyTorch with GPU support must be installed. Here is a sample `conda` command:
+```
+conda install -y pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
+```
+- [CMake](https://cmake.org/) must be installed. Here is a sample command to install `CMake` on Ubuntu:
+```
+sudo apt install cmake
+```
+
+2. Clone the repo
+
+```
+git lfs install
+git clone https://huggingface.co/spaces/inflaton/learn-ai
+```
+
+3. Install packages
+
+```
+pip install -U -r requirements.txt
+```
+
+4. Set up your environment variables
+
+- By default, environment variables are loaded from the `.env.example` file.
+- If you don't want to use the default settings, copy `.env.example` into `.env`. You can then update it for your local runs.
+
+5. Start the local server at `http://localhost:7860`:
+
+```
+python app.py
+```
+
+## Duplicate This Space
+
+Duplicate this HuggingFace Space from the UI or click the following link:
+
+- [Duplicate this space](https://huggingface.co/spaces/inflaton/learn-ai?duplicate=true)
+
+Once duplicated, you can set up environment variables from the space settings. The values there will take precedence over those in `.env.example`.
+
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+## Talk to Your Own PDF Files
+
+- The sample PDF books & documents are downloaded from the internet (for AI Books) and the [PCI DSS official website](https://www.pcisecuritystandards.org/document_library/?category=pcidss), and the corresponding embeddings are stored in the folders `data/ai_books` and `data/pci_dss_v4` respectively, which allows you to run locally without any additional effort.
+
+- You can also put your own PDF files into any folder specified by `SOURCE_PDFS_PATH` and run the command below to generate embeddings, which will be stored in the folder given by `FAISS_INDEX_PATH` or `CHROMADB_INDEX_PATH`. If both `*_INDEX_PATH` env vars are set, `FAISS_INDEX_PATH` takes precedence. Make sure the folder specified by `*_INDEX_PATH` doesn't exist; otherwise the command will simply try to load the index from that folder and do a simple similarity search, as a way to verify whether the embeddings were generated and stored properly. Please note that the HuggingFace embedding model specified by `HF_EMBEDDINGS_MODEL_NAME` will be used to generate the embeddings.
+
+```
+python ingest.py
+```
+
+- Once embeddings are generated, you can test them out locally, or check them into your duplicated space. Please note that the HF Spaces git server does not allow PDF files to be checked in.
+
+## Play with Different Large Language Models
+
+The source code supports different LLM types, as shown at the top of `.env.example`:
+
+```
+# LLM_MODEL_TYPE=openai
+# LLM_MODEL_TYPE=gpt4all-j
+# LLM_MODEL_TYPE=gpt4all
+# LLM_MODEL_TYPE=llamacpp
+LLM_MODEL_TYPE=huggingface
+# LLM_MODEL_TYPE=mosaicml
+# LLM_MODEL_TYPE=stablelm
+# LLM_MODEL_TYPE=openllm
+# LLM_MODEL_TYPE=hftgi
+```
+
+- By default, the app runs the `lmsys/fastchat-t5-3b-v1.0` model with HF Transformers, which works well on most PCs/laptops with 32GB or more RAM, without any GPU. It also works on the HF Spaces free tier (2 vCPU, 16GB RAM and 500GB disk), though inference is very slow.
+
+- Uncomment/comment the lines above to play with different LLM types. You may also want to update other related env vars. E.g., here's the list of HF models which have been tested with the code:
+
+```
+# HUGGINGFACE_MODEL_NAME_OR_PATH="databricks/dolly-v2-3b"
+# HUGGINGFACE_MODEL_NAME_OR_PATH="databricks/dolly-v2-7b"
+# HUGGINGFACE_MODEL_NAME_OR_PATH="databricks/dolly-v2-12b"
+# HUGGINGFACE_MODEL_NAME_OR_PATH="TheBloke/wizardLM-7B-HF"
+# HUGGINGFACE_MODEL_NAME_OR_PATH="TheBloke/vicuna-7B-1.1-HF"
+# HUGGINGFACE_MODEL_NAME_OR_PATH="nomic-ai/gpt4all-j"
+# HUGGINGFACE_MODEL_NAME_OR_PATH="nomic-ai/gpt4all-falcon"
+HUGGINGFACE_MODEL_NAME_OR_PATH="lmsys/fastchat-t5-3b-v1.0"
+# HUGGINGFACE_MODEL_NAME_OR_PATH="meta-llama/Llama-2-7b-chat-hf"
+# HUGGINGFACE_MODEL_NAME_OR_PATH="meta-llama/Llama-2-13b-chat-hf"
+# HUGGINGFACE_MODEL_NAME_OR_PATH="meta-llama/Llama-2-70b-chat-hf"
+```
+
+The script `test.sh` automates running different LLMs and records the outputs in the `data/logs` folder, which currently contains a few log files created by previous test runs on Nvidia GeForce RTX 4090, A40 and L40 GPUs.
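
The three-stage workflow described in the new README maps fairly directly onto a handful of LangChain calls. Below is a minimal sketch of that flow, not the repo's `ingest.py` or app code: the index path and embedding model name are illustrative placeholders, and in the repo the equivalent values come from `.env` settings such as `SOURCE_PDFS_PATH`, `CHUNCK_SIZE`, `CHUNK_OVERLAP` and `HF_EMBEDDINGS_MODEL_NAME`.

```python
# Minimal sketch of the three stages (not the repo's ingest.py):
# chunk + embed + index, retrieve, then build a prompt for the configured LLM.
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

# 1. Data preprocessing / embedding
docs = PyPDFDirectoryLoader("./data/pdfs/").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=512).split_documents(docs)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")  # placeholder model
index = FAISS.from_documents(chunks, embeddings)
index.save_local("./data/faiss_index")  # hypothetical path

# 2. Prompt construction / retrieval
question = "What's AI?"
relevant_docs = index.as_retriever(search_kwargs={"k": 4}).get_relevant_documents(question)

# 3. Prompt execution / inference: the compiled prompt goes to whichever LLM is configured
context = "\n\n".join(doc.page_content for doc in relevant_docs)
prompt = f"Answer the question using only this context:\n\n{context}\n\nQuestion: {question}"
```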
app.py CHANGED
@@ -8,15 +8,21 @@ import gradio as gr
 from anyio.from_thread import start_blocking_portal
 
 from app_modules.init import app_init
+from app_modules.llm_chat_chain import ChatChain
 from app_modules.utils import print_llm_response, remove_extra_spaces
 
 llm_loader, qa_chain = app_init()
 
-chat_history_enabled = os.environ.get("CHAT_HISTORY_ENABLED") == "true"
 show_param_settings = os.environ.get("SHOW_PARAM_SETTINGS") == "true"
 share_gradio_app = os.environ.get("SHARE_GRADIO_APP") == "true"
-
 using_openai = os.environ.get("LLM_MODEL_TYPE") == "openai"
+chat_with_llama_2 = (
+    not using_openai and os.environ.get("USE_LLAMA_2_PROMPT_TEMPLATE") == "true"
+)
+chat_history_enabled = (
+    not chat_with_llama_2 and os.environ.get("CHAT_HISTORY_ENABLED") == "true"
+)
+
 model = (
     "OpenAI GPT-3.5"
     if using_openai
@@ -28,7 +34,13 @@ href = (
     else f"https://huggingface.co/{model}"
 )
 
-title = """<h1 align="left" style="min-width:200px; margin-top:0;"> Chat with AI Books </h1>"""
+if chat_with_llama_2:
+    qa_chain = ChatChain(llm_loader)
+    name = "Llama-2"
+else:
+    name = "AI Books"
+
+title = f"""<h1 align="left" style="min-width:200px; margin-top:0;"> Chat with {name} </h1>"""
 
 description_top = f"""\
 <div align="left">
@@ -42,7 +54,7 @@ The demo is built on <a href="https://github.com/hwchase17/langchain">LangChain<
 </div>
 """
 
-CONCURRENT_COUNT = 100
+CONCURRENT_COUNT = 1
 
 
 def qa(chatbot):
@@ -53,9 +65,10 @@ def qa(chatbot):
 
     def task(question, chat_history):
         start = timer()
-        ret = qa_chain.call_chain(
-            {"question": question, "chat_history": chat_history}, None, q
-        )
+        inputs = {"question": question}
+        if not chat_with_llama_2:
+            inputs["chat_history"] = chat_history
+        ret = qa_chain.call_chain(inputs, None, q)
        end = timer()
 
        print(f"Completed in {end - start:.3f}s")
@@ -93,17 +106,18 @@ def qa(chatbot):
 
        count -= 1
 
-       chatbot[-1][1] += "\n\nSources:\n"
-       ret = result.get()
-       titles = []
-       for doc in ret["source_documents"]:
-           page = doc.metadata["page"] + 1
-           url = f"{doc.metadata['url']}#page={page}"
-           file_name = doc.metadata["source"].split("/")[-1]
-           title = f"{file_name} Page: {page}"
-           if title not in titles:
-               titles.append(title)
-               chatbot[-1][1] += f"1. [{title}]({url})\n"
+       if not chat_with_llama_2:
+           chatbot[-1][1] += "\n\nSources:\n"
+           ret = result.get()
+           titles = []
+           for doc in ret["source_documents"]:
+               page = doc.metadata["page"] + 1
+               url = f"{doc.metadata['url']}#page={page}"
+               file_name = doc.metadata["source"].split("/")[-1]
+               title = f"{file_name} Page: {page}"
+               if title not in titles:
+                   titles.append(title)
+                   chatbot[-1][1] += f"1. [{title}]({url})\n"
 
        yield chatbot
 
@@ -195,5 +209,5 @@ with gr.Blocks(css=customCSS) as demo:
        api_name="reset",
    )
 
-   demo.title = "Chat with AI Books"
+   demo.title = "Chat with Llama-2" if chat_with_llama_2 else "Chat with AI Books"
 demo.queue(concurrency_count=CONCURRENT_COUNT).launch(share=share_gradio_app)
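
The `ChatChain` used when `USE_LLAMA_2_PROMPT_TEMPLATE` is set is not shown in this commit. For reference, Llama-2 chat checkpoints expect the `[INST] ... [/INST]` wrapping with an optional `<<SYS>>` block; a hand-rolled illustration of that format (not the repo's `ChatChain` implementation) looks like this:

```python
# Illustration of the Llama-2 chat prompt format (not the repo's ChatChain code).
def build_llama2_prompt(system_prompt, history, question):
    # history is a list of (user_message, assistant_reply) pairs
    prompt = f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
    for user_message, assistant_reply in history:
        prompt += f"{user_message} [/INST] {assistant_reply} </s><s>[INST] "
    return prompt + f"{question} [/INST]"


print(build_llama2_prompt("You are a helpful assistant.", [], "What's AI?"))
```

Note that in this commit the Gradio app passes only the question (no `chat_history`) to the chain when `chat_with_llama_2` is set, and `CHAT_HISTORY_ENABLED` has no effect in that mode.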
app_modules/llm_loader.py CHANGED
@@ -421,20 +421,40 @@ class LLMLoader:
             else:
                 model = MODEL_NAME_OR_PATH
 
-            pipe = pipeline(
-                task,
-                model=model,
-                tokenizer=tokenizer,
-                streamer=self.streamer,
-                return_full_text=return_full_text,  # langchain expects the full text
-                device=hf_pipeline_device_type,
-                torch_dtype=torch_dtype,
-                max_new_tokens=2048,
-                trust_remote_code=True,
-                temperature=temperature,
-                top_p=0.95,
-                top_k=0,  # select from top 0 tokens (because zero, relies on top_p)
-                repetition_penalty=1.115,
+            pipe = (
+                pipeline(
+                    task,
+                    model=model,
+                    tokenizer=tokenizer,
+                    streamer=self.streamer,
+                    return_full_text=return_full_text,  # langchain expects the full text
+                    device=hf_pipeline_device_type,
+                    torch_dtype=torch_dtype,
+                    max_new_tokens=2048,
+                    trust_remote_code=True,
+                    temperature=temperature,
+                    top_p=0.95,
+                    top_k=0,  # select from top 0 tokens (because zero, relies on top_p)
+                    repetition_penalty=1.115,
+                )
+                if token is None
+                else pipeline(
+                    task,
+                    model=model,
+                    tokenizer=tokenizer,
+                    streamer=self.streamer,
+                    return_full_text=return_full_text,  # langchain expects the full text
+                    device=hf_pipeline_device_type,
+                    torch_dtype=torch_dtype,
+                    max_new_tokens=2048,
+                    trust_remote_code=True,
+                    temperature=temperature,
+                    top_p=0.95,
+                    top_k=0,  # select from top 0 tokens (because zero, relies on top_p)
+                    repetition_penalty=1.115,
+                    use_auth_token=token,
+                    token=token,
+                )
             )
 
             self.llm = HuggingFacePipeline(pipeline=pipe, callbacks=callbacks)
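
The two `pipeline(...)` calls above differ only in the auth-token arguments (`use_auth_token` is the legacy spelling of what newer transformers releases call `token`). A possible refactor, not part of this commit and reusing the same local names as the surrounding method, would build the keyword arguments once:

```python
# Sketch only: collapse the duplicated pipeline(...) call by sharing the kwargs
# and adding the Hugging Face auth token only when one is configured.
pipeline_kwargs = dict(
    model=model,
    tokenizer=tokenizer,
    streamer=self.streamer,
    return_full_text=return_full_text,  # langchain expects the full text
    device=hf_pipeline_device_type,
    torch_dtype=torch_dtype,
    max_new_tokens=2048,
    trust_remote_code=True,
    temperature=temperature,
    top_p=0.95,
    top_k=0,
    repetition_penalty=1.115,
)
if token is not None:
    pipeline_kwargs["use_auth_token"] = token  # or token=token on newer transformers
pipe = pipeline(task, **pipeline_kwargs)
```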
assets/Open Source LLMs.png ADDED
assets/Workflow-Overview.png ADDED
data/questions.txt CHANGED
@@ -2,3 +2,4 @@ What's AI?
 life in AI era
 machine learning
 generative model
+graph attention network
test.py CHANGED
@@ -1,183 +1,103 @@
-# project/test.py
-
-import os
-import sys
-import unittest
-from timeit import default_timer as timer
-
-from langchain.callbacks.base import BaseCallbackHandler
-from langchain.schema import HumanMessage
-
-from app_modules.init import app_init
-from app_modules.llm_chat_chain import ChatChain
-from app_modules.llm_loader import LLMLoader
-from app_modules.utils import get_device_types, print_llm_response
-
-
-class TestLLMLoader(unittest.TestCase):
-    question = os.environ.get("CHAT_QUESTION")
-
-    def run_test_case(self, llm_model_type, query):
-        n_threds = int(os.environ.get("NUMBER_OF_CPU_CORES") or "4")
-
-        hf_embeddings_device_type, hf_pipeline_device_type = get_device_types()
-        print(f"hf_embeddings_device_type: {hf_embeddings_device_type}")
-        print(f"hf_pipeline_device_type: {hf_pipeline_device_type}")
-
-        llm_loader = LLMLoader(llm_model_type)
-        start = timer()
-        llm_loader.init(
-            n_threds=n_threds, hf_pipeline_device_type=hf_pipeline_device_type
-        )
-        end = timer()
-        print(f"Model loaded in {end - start:.3f}s")
-
-        result = llm_loader.llm(
-            [HumanMessage(content=query)] if llm_model_type == "openai" else query
-        )
-        end2 = timer()
-        print(f"Inference completed in {end2 - end:.3f}s")
-        print(result)
-
-    def test_openai(self):
-        self.run_test_case("openai", self.question)
-
-    def test_llamacpp(self):
-        self.run_test_case("llamacpp", self.question)
-
-    def test_gpt4all_j(self):
-        self.run_test_case("gpt4all-j", self.question)
-
-    def test_huggingface(self):
-        self.run_test_case("huggingface", self.question)
-
-    def test_hftgi(self):
-        self.run_test_case("hftgi", self.question)
-
-
-class TestChatChain(unittest.TestCase):
-    question = os.environ.get("CHAT_QUESTION")
-
-    def run_test_case(self, llm_model_type, query):
-        n_threds = int(os.environ.get("NUMBER_OF_CPU_CORES") or "4")
-
-        hf_embeddings_device_type, hf_pipeline_device_type = get_device_types()
-        print(f"hf_embeddings_device_type: {hf_embeddings_device_type}")
-        print(f"hf_pipeline_device_type: {hf_pipeline_device_type}")
-
-        llm_loader = LLMLoader(llm_model_type)
-        start = timer()
-        llm_loader.init(
-            n_threds=n_threds, hf_pipeline_device_type=hf_pipeline_device_type
-        )
-        chat = ChatChain(llm_loader)
-        end = timer()
-        print(f"Model loaded in {end - start:.3f}s")
-
-        inputs = {"question": query}
-        result = chat.call_chain(inputs, None)
-        end2 = timer()
-        print(f"Inference completed in {end2 - end:.3f}s")
-        print(result)
-
-        inputs = {"question": "how many people?"}
-        result = chat.call_chain(inputs, None)
-        end3 = timer()
-        print(f"Inference completed in {end3 - end2:.3f}s")
-        print(result)
-
-    def test_openai(self):
-        self.run_test_case("openai", self.question)
-
-    def test_llamacpp(self):
-        self.run_test_case("llamacpp", self.question)
-
-    def test_gpt4all_j(self):
-        self.run_test_case("gpt4all-j", self.question)
-
-    def test_huggingface(self):
-        self.run_test_case("huggingface", self.question)
-
-    def test_hftgi(self):
-        self.run_test_case("hftgi", self.question)
-
-
-class TestQAChain(unittest.TestCase):
-    qa_chain: any
-    question = os.environ.get("QA_QUESTION")
-
-    def run_test_case(self, llm_model_type, query):
-        start = timer()
-        os.environ["LLM_MODEL_TYPE"] = llm_model_type
-        qa_chain = app_init()[1]
-        end = timer()
-        print(f"App initialized in {end - start:.3f}s")
-
-        chat_history = []
-        inputs = {"question": query, "chat_history": chat_history}
-        result = qa_chain.call_chain(inputs, None)
-        end2 = timer()
-        print(f"Inference completed in {end2 - end:.3f}s")
-        print_llm_response(result)
-
-        chat_history.append((query, result["answer"]))
-
-        inputs = {"question": "tell me more", "chat_history": chat_history}
-        result = qa_chain.call_chain(inputs, None)
-        end3 = timer()
-        print(f"Inference completed in {end3 - end2:.3f}s")
-        print_llm_response(result)
-
-    def test_openai(self):
-        self.run_test_case("openai", self.question)
-
-    def test_llamacpp(self):
-        self.run_test_case("llamacpp", self.question)
-
-    def test_gpt4all_j(self):
-        self.run_test_case("gpt4all-j", self.question)
-
-    def test_huggingface(self):
-        self.run_test_case("huggingface", self.question)
-
-    def test_hftgi(self):
-        self.run_test_case("hftgi", self.question)
-
-
-def chat():
-    start = timer()
-    llm_loader = app_init()[0]
-    end = timer()
-    print(f"Model loaded in {end - start:.3f}s")
-
-    chat_chain = ChatChain(llm_loader)
-    chat_history = []
-
-    chat_start = timer()
-
-    while True:
-        query = input("Please enter your question: ")
-        query = query.strip()
-        if query.lower() == "exit":
-            break
-
-        print("\nQuestion: " + query)
-
-        start = timer()
-        result = chat_chain.call_chain(
-            {"question": query, "chat_history": chat_history}, None
-        )
-        end = timer()
-        print(f"Completed in {end - start:.3f}s")
-
-        chat_history.append((query, result["text"]))
-
-    chat_end = timer()
-    print(f"Total time used: {chat_end - chat_start:.3f}s")
-
-
-if __name__ == "__main__":
-    if len(sys.argv) > 1 and sys.argv[1] == "chat":
-        chat()
-    else:
-        unittest.main()
+import os
+import sys
+from queue import Queue
+from timeit import default_timer as timer
+
+from langchain.callbacks.base import BaseCallbackHandler
+from langchain.schema import LLMResult
+
+from app_modules.init import app_init
+from app_modules.utils import print_llm_response
+
+llm_loader, qa_chain = app_init()
+
+
+class MyCustomHandler(BaseCallbackHandler):
+    def __init__(self):
+        self.reset()
+
+    def reset(self):
+        self.texts = []
+
+    def get_standalone_question(self) -> str:
+        return self.texts[0].strip() if len(self.texts) > 0 else None
+
+    def on_llm_end(self, response: LLMResult, **kwargs) -> None:
+        """Run when chain ends running."""
+        print("\non_llm_end - response:")
+        print(response)
+        self.texts.append(response.generations[0][0].text)
+
+
+chatting = len(sys.argv) > 1 and sys.argv[1] == "chat"
+questions_file_path = os.environ.get("QUESTIONS_FILE_PATH")
+chat_history_enabled = os.environ.get("CHAT_HISTORY_ENABLED") or "true"
+
+custom_handler = MyCustomHandler()
+
+# Chatbot loop
+chat_history = []
+print("Welcome to the ChatPDF! Type 'exit' to stop.")
+
+# Open the file for reading
+file = open(questions_file_path, "r")
+
+# Read the contents of the file into a list of strings
+queue = file.readlines()
+for i in range(len(queue)):
+    queue[i] = queue[i].strip()
+
+# Close the file
+file.close()
+
+queue.append("exit")
+
+chat_start = timer()
+
+while True:
+    if chatting:
+        query = input("Please enter your question: ")
+    else:
+        query = queue.pop(0)
+
+    query = query.strip()
+    if query.lower() == "exit":
+        break
+
+    print("\nQuestion: " + query)
+    custom_handler.reset()
+
+    start = timer()
+    result = qa_chain.call_chain(
+        {"question": query, "chat_history": chat_history}, custom_handler
+    )
+    end = timer()
+    print(f"Completed in {end - start:.3f}s")
+
+    print_llm_response(result)
+
+    if len(chat_history) == 0:
+        standalone_question = query
+    else:
+        standalone_question = custom_handler.get_standalone_question()
+
+    if standalone_question is not None:
+        print(f"Load relevant documents for standalone question: {standalone_question}")
+        start = timer()
+        qa = qa_chain.get_chain()
+        docs = qa.retriever.get_relevant_documents(standalone_question)
+        end = timer()
+
+        # print(docs)
+        print(f"Completed in {end - start:.3f}s")
+
+    if chat_history_enabled == "true":
+        chat_history.append((query, result["answer"]))
+
+chat_end = timer()
+total_time = chat_end - chat_start
+print(f"Total time used: {total_time:.3f} s")
+print(f"Number of tokens generated: {llm_loader.streamer.total_tokens}")
+print(
+    f"Average generation speed: {llm_loader.streamer.total_tokens / total_time:.3f} tokens/s"
+)
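
One small note on the rewritten `test.py`: it reads the questions file named by `QUESTIONS_FILE_PATH` (for example `data/questions.txt` above) with a bare `open()`/`close()` pair, and the `from queue import Queue` import is unused. A minimal equivalent of the file-reading step using a context manager, shown as a sketch rather than part of this commit, would be:

```python
# Equivalent file handling with a context manager: the file is closed
# automatically, even if an exception is raised while reading.
with open(questions_file_path, "r") as f:
    queue = [line.strip() for line in f]
queue.append("exit")
```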
unit_test.py ADDED
@@ -0,0 +1,183 @@
+# project/test.py
+
+import os
+import sys
+import unittest
+from timeit import default_timer as timer
+
+from langchain.callbacks.base import BaseCallbackHandler
+from langchain.schema import HumanMessage
+
+from app_modules.init import app_init
+from app_modules.llm_chat_chain import ChatChain
+from app_modules.llm_loader import LLMLoader
+from app_modules.utils import get_device_types, print_llm_response
+
+
+class TestLLMLoader(unittest.TestCase):
+    question = os.environ.get("CHAT_QUESTION")
+
+    def run_test_case(self, llm_model_type, query):
+        n_threds = int(os.environ.get("NUMBER_OF_CPU_CORES") or "4")
+
+        hf_embeddings_device_type, hf_pipeline_device_type = get_device_types()
+        print(f"hf_embeddings_device_type: {hf_embeddings_device_type}")
+        print(f"hf_pipeline_device_type: {hf_pipeline_device_type}")
+
+        llm_loader = LLMLoader(llm_model_type)
+        start = timer()
+        llm_loader.init(
+            n_threds=n_threds, hf_pipeline_device_type=hf_pipeline_device_type
+        )
+        end = timer()
+        print(f"Model loaded in {end - start:.3f}s")
+
+        result = llm_loader.llm(
+            [HumanMessage(content=query)] if llm_model_type == "openai" else query
+        )
+        end2 = timer()
+        print(f"Inference completed in {end2 - end:.3f}s")
+        print(result)
+
+    def test_openai(self):
+        self.run_test_case("openai", self.question)
+
+    def test_llamacpp(self):
+        self.run_test_case("llamacpp", self.question)
+
+    def test_gpt4all_j(self):
+        self.run_test_case("gpt4all-j", self.question)
+
+    def test_huggingface(self):
+        self.run_test_case("huggingface", self.question)
+
+    def test_hftgi(self):
+        self.run_test_case("hftgi", self.question)
+
+
+class TestChatChain(unittest.TestCase):
+    question = os.environ.get("CHAT_QUESTION")
+
+    def run_test_case(self, llm_model_type, query):
+        n_threds = int(os.environ.get("NUMBER_OF_CPU_CORES") or "4")
+
+        hf_embeddings_device_type, hf_pipeline_device_type = get_device_types()
+        print(f"hf_embeddings_device_type: {hf_embeddings_device_type}")
+        print(f"hf_pipeline_device_type: {hf_pipeline_device_type}")
+
+        llm_loader = LLMLoader(llm_model_type)
+        start = timer()
+        llm_loader.init(
+            n_threds=n_threds, hf_pipeline_device_type=hf_pipeline_device_type
+        )
+        chat = ChatChain(llm_loader)
+        end = timer()
+        print(f"Model loaded in {end - start:.3f}s")
+
+        inputs = {"question": query}
+        result = chat.call_chain(inputs, None)
+        end2 = timer()
+        print(f"Inference completed in {end2 - end:.3f}s")
+        print(result)
+
+        inputs = {"question": "how many people?"}
+        result = chat.call_chain(inputs, None)
+        end3 = timer()
+        print(f"Inference completed in {end3 - end2:.3f}s")
+        print(result)
+
+    def test_openai(self):
+        self.run_test_case("openai", self.question)
+
+    def test_llamacpp(self):
+        self.run_test_case("llamacpp", self.question)
+
+    def test_gpt4all_j(self):
+        self.run_test_case("gpt4all-j", self.question)
+
+    def test_huggingface(self):
+        self.run_test_case("huggingface", self.question)
+
+    def test_hftgi(self):
+        self.run_test_case("hftgi", self.question)
+
+
+class TestQAChain(unittest.TestCase):
+    qa_chain: any
+    question = os.environ.get("QA_QUESTION")
+
+    def run_test_case(self, llm_model_type, query):
+        start = timer()
+        os.environ["LLM_MODEL_TYPE"] = llm_model_type
+        qa_chain = app_init()[1]
+        end = timer()
+        print(f"App initialized in {end - start:.3f}s")
+
+        chat_history = []
+        inputs = {"question": query, "chat_history": chat_history}
+        result = qa_chain.call_chain(inputs, None)
+        end2 = timer()
+        print(f"Inference completed in {end2 - end:.3f}s")
+        print_llm_response(result)
+
+        chat_history.append((query, result["answer"]))
+
+        inputs = {"question": "tell me more", "chat_history": chat_history}
+        result = qa_chain.call_chain(inputs, None)
+        end3 = timer()
+        print(f"Inference completed in {end3 - end2:.3f}s")
+        print_llm_response(result)
+
+    def test_openai(self):
+        self.run_test_case("openai", self.question)
+
+    def test_llamacpp(self):
+        self.run_test_case("llamacpp", self.question)
+
+    def test_gpt4all_j(self):
+        self.run_test_case("gpt4all-j", self.question)
+
+    def test_huggingface(self):
+        self.run_test_case("huggingface", self.question)
+
+    def test_hftgi(self):
+        self.run_test_case("hftgi", self.question)
+
+
+def chat():
+    start = timer()
+    llm_loader = app_init()[0]
+    end = timer()
+    print(f"Model loaded in {end - start:.3f}s")
+
+    chat_chain = ChatChain(llm_loader)
+    chat_history = []
+
+    chat_start = timer()
+
+    while True:
+        query = input("Please enter your question: ")
+        query = query.strip()
+        if query.lower() == "exit":
+            break
+
+        print("\nQuestion: " + query)
+
+        start = timer()
+        result = chat_chain.call_chain(
+            {"question": query, "chat_history": chat_history}, None
+        )
+        end = timer()
+        print(f"Completed in {end - start:.3f}s")
+
+        chat_history.append((query, result["text"]))
+
+    chat_end = timer()
+    print(f"Total time used: {chat_end - chat_start:.3f}s")
+
+
+if __name__ == "__main__":
+    if len(sys.argv) > 1 and sys.argv[1] == "chat":
+        chat()
+    else:
+        unittest.main()
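
The new `unit_test.py` carries over the previous `test.py` contents, down to its `# project/test.py` header comment, so the unittest suites remain available through the Makefile's new `unittest` target. Because the file falls through to `unittest.main()`, the `TEST` variable is interpreted as a standard unittest test name: `make unittest TEST=TestChatChain.test_huggingface` runs a single case, while leaving `TEST` empty runs the whole suite. One nit carried over with it: the class attribute annotation `qa_chain: any` refers to the builtin `any` function rather than `typing.Any`.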