
llama2-wrapper

Features

llama2-wrapper is the backend of llama2-webui; it can run any Llama 2 model locally with a Gradio UI on GPU or CPU (Linux/Windows/Mac).

Install

pip install llama2-wrapper

Start OpenAI Compatible API

python -m llama2_wrapper.server

It uses llama.cpp as the default backend to run the llama-2-7b-chat.ggmlv3.q4_0.bin model.

Start the FastAPI server with the gptq backend:

python -m llama2_wrapper.server --backend_type gptq

Navigate to http://localhost:8000/docs to see the OpenAPI documentation.
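Once the server is running, any OpenAI-style client can call it. Below is a minimal client sketch using only the standard library; the endpoint path and payload fields follow the OpenAI /v1/completions schema, and the default values here are illustrative assumptions:

```python
import json
import urllib.request


def build_payload(prompt, max_tokens=128, temperature=0.9):
    # JSON body for POST /v1/completions (OpenAI-style fields).
    return {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}


def complete(prompt, url="http://localhost:8000/v1/completions"):
    # Send the prompt to the local server and return the generated text.
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]


# Example (with the server running):
# print(complete("Hi do you know Pytorch?"))
```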

API Usage

__call__

__call__() is the function to generate text from a prompt.

For example, to run a ggml llama2 model on CPU (colab example):

from llama2_wrapper import LLAMA2_WRAPPER, get_prompt 
llama2_wrapper = LLAMA2_WRAPPER()
# Default running on backend llama.cpp.
# Automatically downloading model to: ./models/llama-2-7b-chat.ggmlv3.q4_0.bin
prompt = "Do you know Pytorch"
# llama2_wrapper() will run __call__()
answer = llama2_wrapper(get_prompt(prompt), temperature=0.9)

Run a gptq llama2 model on an Nvidia GPU (colab example):

from llama2_wrapper import LLAMA2_WRAPPER 
llama2_wrapper = LLAMA2_WRAPPER(backend_type="gptq")
# Automatically downloading model to: ./models/Llama-2-7b-Chat-GPTQ

Run llama2 7b with bitsandbytes 8-bit via a model_path:

from llama2_wrapper import LLAMA2_WRAPPER 
llama2_wrapper = LLAMA2_WRAPPER(
    model_path="./models/Llama-2-7b-chat-hf",
    backend_type="transformers",
    load_in_8bit=True,
)

completion

completion() is the function to generate text from a prompt for OpenAI compatible API /v1/completions.

from llama2_wrapper import LLAMA2_WRAPPER, get_prompt

llama2_wrapper = LLAMA2_WRAPPER()
prompt = get_prompt("Hi do you know Pytorch?")
print(llama2_wrapper.completion(prompt))

chat_completion

chat_completion() is the function to generate text from a dialog (chat history) for OpenAI compatible API /v1/chat/completions.

from llama2_wrapper import LLAMA2_WRAPPER

llama2_wrapper = LLAMA2_WRAPPER()
dialog = [
    {
        "role": "system",
        "content": "You are a helpful, respectful and honest assistant.",
    },
    {
        "role": "user",
        "content": "Hi do you know Pytorch?",
    },
]
print(llama2_wrapper.chat_completion(dialog))

generate

generate() is the function to create a generator of responses from a prompt.

This is useful when you want to stream the output, like typing in a chatbot.

from llama2_wrapper import LLAMA2_WRAPPER, get_prompt

llama2_wrapper = LLAMA2_WRAPPER()
prompt = get_prompt("Hi do you know Pytorch?")
for response in llama2_wrapper.generate(prompt):
    print(response)

The response will be like:

Yes, 
Yes, I'm 
Yes, I'm familiar 
Yes, I'm familiar with 
Yes, I'm familiar with PyTorch! 
...

run

run() is similar to generate(), but run() can also accept chat_history and system_prompt from the user.

It will process the input message into the llama2 prompt template with chat_history and system_prompt for a chatbot-like app.
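A usage sketch, assuming llama2-wrapper is installed; the keyword names chat_history and system_prompt follow the description above, while the (user, assistant) pair format for chat_history is an assumption:

```python
def stream_chat(message, chat_history=(), system_prompt=""):
    # Deferred import so the sketch can be read without the package installed.
    from llama2_wrapper import LLAMA2_WRAPPER

    llama2_wrapper = LLAMA2_WRAPPER()
    # run() yields partial responses, so the reply streams like typing.
    for response in llama2_wrapper.run(
        message,
        chat_history=list(chat_history),
        system_prompt=system_prompt,
    ):
        print(response)


# Example:
# stream_chat(
#     "Can it run on CPU?",
#     chat_history=[("Hi do you know Pytorch?", "Yes, I'm familiar with PyTorch!")],
#     system_prompt="You are a helpful, respectful and honest assistant.",
# )
```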

get_prompt

get_prompt() will process the input message into a llama2 prompt with chat_history and system_prompt for a chatbot.

By default, chat_history and system_prompt are empty and get_prompt() will add llama2 prompt template to your message:

prompt = get_prompt("Hi do you know Pytorch?")

prompt will be:

[INST] <<SYS>>

<</SYS>>

Hi do you know Pytorch? [/INST]

If you use get_prompt("Hi do you know Pytorch?", system_prompt="You are a helpful..."), the prompt will be:

[INST] <<SYS>>
You are a helpful, respectful and honest assistant. 
<</SYS>>

Hi do you know Pytorch? [/INST]
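The template assembly can be sketched in plain Python. This is a simplified reconstruction based on the outputs above, not the library's exact internals:

```python
def build_llama2_prompt(message, chat_history=(), system_prompt=""):
    # Reconstructs the llama2 chat template shown above (simplified sketch).
    texts = [f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"]
    for user_input, response in chat_history:
        # Each past turn closes with </s> and opens a new [INST] block.
        texts.append(f"{user_input} [/INST] {response} </s><s>[INST] ")
    texts.append(f"{message} [/INST]")
    return "".join(texts)
```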

get_prompt_for_dialog

get_prompt_for_dialog() will process dialog (chat history) to llama2 prompt for OpenAI compatible API /v1/chat/completions.

dialog = [
    {
        "role": "system",
        "content": "You are a helpful, respectful and honest assistant.",
    },
    {
        "role": "user",
        "content": "Hi do you know Pytorch?",
    },
]
prompt = get_prompt_for_dialog(dialog)
# [INST] <<SYS>>
# You are a helpful, respectful and honest assistant. 
# <</SYS>>
# 
# Hi do you know Pytorch? [/INST]