# llama2-wrapper

- Use [llama2-wrapper](https://pypi.org/project/llama2-wrapper/) as your local llama2 backend for Generative Agents/Apps, [colab example](https://github.com/liltom-eth/llama2-webui/blob/main/colab/Llama_2_7b_Chat_GPTQ.ipynb).
- [Run OpenAI Compatible API](https://github.com/liltom-eth/llama2-webui#start-openai-compatible-api) on Llama2 models.

## Features

- Supporting models: [Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)/[13b](https://huggingface.co/llamaste/Llama-2-13b-chat-hf)/[70b](https://huggingface.co/llamaste/Llama-2-70b-chat-hf), [Llama-2-GPTQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ), [Llama-2-GGML](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML), [CodeLlama](https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GPTQ)...
- Supporting model backends: [transformers](https://github.com/huggingface/transformers), [bitsandbytes(8-bit inference)](https://github.com/TimDettmers/bitsandbytes), [AutoGPTQ(4-bit inference)](https://github.com/PanQiWei/AutoGPTQ), [llama.cpp](https://github.com/ggerganov/llama.cpp)
- Demos: [Run Llama2 on MacBook Air](https://twitter.com/liltom_eth/status/1682791729207070720?s=20); [Run Llama2 on Colab T4 GPU](https://github.com/liltom-eth/llama2-webui/blob/main/colab/Llama_2_7b_Chat_GPTQ.ipynb)
- [News](https://github.com/liltom-eth/llama2-webui/blob/main/docs/news.md), [Benchmark](https://github.com/liltom-eth/llama2-webui/blob/main/docs/performance.md), [Issue Solutions](https://github.com/liltom-eth/llama2-webui/blob/main/docs/issues.md)

[llama2-wrapper](https://pypi.org/project/llama2-wrapper/) is the backend and part of [llama2-webui](https://github.com/liltom-eth/llama2-webui), which can run any Llama 2 model locally with a gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac).

## Install

```bash
pip install llama2-wrapper
```

## Start OpenAI Compatible API

```
python -m llama2_wrapper.server
```

By default, it uses `llama.cpp` as the backend to run the `llama-2-7b-chat.ggmlv3.q4_0.bin` model.

Start the FastAPI server with the `gptq` backend:

```
python -m llama2_wrapper.server --backend_type gptq
```

Navigate to http://localhost:8000/docs to see the OpenAPI documentation.

## API Usage

### `__call__`

`__call__()` generates text from a prompt.

For example, run a ggml llama2 model on CPU, [colab example](https://github.com/liltom-eth/llama2-webui/blob/main/colab/ggmlv3_q4_0.ipynb):

```python
from llama2_wrapper import LLAMA2_WRAPPER, get_prompt
llama2_wrapper = LLAMA2_WRAPPER()
# Default running on backend llama.cpp.
# Automatically downloading model to: ./models/llama-2-7b-chat.ggmlv3.q4_0.bin
prompt = "Do you know Pytorch"
# llama2_wrapper() will run __call__()
answer = llama2_wrapper(get_prompt(prompt), temperature=0.9)
```

Run a gptq llama2 model on an Nvidia GPU, [colab example](https://github.com/liltom-eth/llama2-webui/blob/main/colab/Llama_2_7b_Chat_GPTQ.ipynb):

```python
from llama2_wrapper import LLAMA2_WRAPPER
llama2_wrapper = LLAMA2_WRAPPER(backend_type="gptq")
# Automatically downloading model to: ./models/Llama-2-7b-Chat-GPTQ
```

Run llama2 7b with bitsandbytes 8-bit inference from a `model_path`:

```python
from llama2_wrapper import LLAMA2_WRAPPER
llama2_wrapper = LLAMA2_WRAPPER(
    model_path = "./models/Llama-2-7b-chat-hf",
    backend_type = "transformers",
    load_in_8bit = True
)
```

### completion

`completion()` generates text from a prompt for the OpenAI compatible API `/v1/completions`.

```python
from llama2_wrapper import LLAMA2_WRAPPER, get_prompt
llama2_wrapper = LLAMA2_WRAPPER()
prompt = get_prompt("Hi do you know Pytorch?")
print(llama2_wrapper.completion(prompt))
```

### chat_completion

`chat_completion()` generates text from a dialog (chat history) for the OpenAI compatible API `/v1/chat/completions`.

```python
from llama2_wrapper import LLAMA2_WRAPPER
llama2_wrapper = LLAMA2_WRAPPER()
dialog = [
    {
        "role":"system",
        "content":"You are a helpful, respectful and honest assistant. "
    },{
        "role":"user",
        "content":"Hi do you know Pytorch?",
    },
]
print(llama2_wrapper.chat_completion(dialog))
```

### generate

`generate()` creates a generator of responses from a prompt. This is useful when you want to stream the output, like typing in a chatbot.

```python
from llama2_wrapper import LLAMA2_WRAPPER, get_prompt
llama2_wrapper = LLAMA2_WRAPPER()
prompt = get_prompt("Hi do you know Pytorch?")
for response in llama2_wrapper.generate(prompt):
    print(response)
```

The response will look like:

```
Yes, 
Yes, I'm 
Yes, I'm familiar 
Yes, I'm familiar with 
Yes, I'm familiar with PyTorch! 
...
```

### run

`run()` is similar to `generate()`, but `run()` can also accept `chat_history` and `system_prompt` from the user.
It processes the input message into the llama2 prompt template with `chat_history` and `system_prompt` for a chatbot-like app.

### get_prompt

`get_prompt()` processes the input message into a llama2 prompt with `chat_history` and `system_prompt` for a chatbot.

By default, `chat_history` and `system_prompt` are empty and `get_prompt()` adds the llama2 prompt template to your message:

```python
prompt = get_prompt("Hi do you know Pytorch?")
```

The prompt will be:

```
[INST] <<SYS>>

<</SYS>>

Hi do you know Pytorch? [/INST]
```

If you use `get_prompt("Hi do you know Pytorch?", system_prompt="You are a helpful...")`:

```
[INST] <<SYS>>
You are a helpful, respectful and honest assistant. 
<</SYS>>

Hi do you know Pytorch? [/INST]
```

### get_prompt_for_dialog

`get_prompt_for_dialog()` processes a dialog (chat history) into a llama2 prompt for the OpenAI compatible API `/v1/chat/completions`.

```python
dialog = [
    {
        "role":"system",
        "content":"You are a helpful, respectful and honest assistant. "
    },{
        "role":"user",
        "content":"Hi do you know Pytorch?",
    },
]
prompt = get_prompt_for_dialog(dialog)
# [INST] <<SYS>>
# You are a helpful, respectful and honest assistant. 
# <</SYS>>
# 
# Hi do you know Pytorch? [/INST]
```
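
To make the prompt template used by `get_prompt()` more concrete, here is a minimal standalone sketch of how such a prompt could be assembled, including earlier turns from `chat_history`. It is not the library's implementation: the name `build_llama2_prompt` is hypothetical, and the ` </s><s> [INST] ` separator between turns is an assumption based on the commonly used Llama 2 chat format.

```python
from typing import Sequence, Tuple


def build_llama2_prompt(
    message: str,
    chat_history: Sequence[Tuple[str, str]] = (),
    system_prompt: str = "",
) -> str:
    """Hypothetical helper: assemble a Llama 2 chat prompt string."""
    # The system block always opens the first [INST] turn, even when it is empty.
    parts = [f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"]
    # Each earlier (user, assistant) pair closes its turn and opens the next one.
    # The turn separator below is an assumption, not taken from the README.
    for user_input, response in chat_history:
        parts.append(f"{user_input.strip()} [/INST] {response.strip()} </s><s> [INST] ")
    # The newest user message ends with [/INST] so that the model answers next.
    parts.append(f"{message.strip()} [/INST]")
    return "".join(parts)


single_turn = build_llama2_prompt("Hi do you know Pytorch?")
multi_turn = build_llama2_prompt(
    "Which one is easier to learn?",
    chat_history=[("Hi do you know Pytorch?", "Yes, I'm familiar with PyTorch!")],
    system_prompt="You are a helpful assistant.",
)
print(single_turn)
print(multi_turn)
```

With an empty history and system prompt, the sketch produces the same shape as the default `get_prompt()` output shown above. In practice, prefer the library's own `get_prompt()` so the template stays in sync with the backend.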