Spaces:
Runtime error
Runtime error
Upload folder using huggingface_hub
Browse files- .env +14 -0
- .gitignore +6 -0
- CONTRIBUTING.md +90 -0
- LICENSE +21 -0
- README.md +190 -8
- __pycache__/model.cpython-38.pyc +0 -0
- __pycache__/model.cpython-39.pyc +0 -0
- app.py +322 -0
- env_examples/.env.13b_example +14 -0
- env_examples/.env.7b_8bit_example +14 -0
- env_examples/.env.7b_ggmlv3_q4_0_example +14 -0
- env_examples/.env.7b_gptq_example +14 -0
- gradio_cached_examples/19/Chatbot/tmp04pykiig.json +1 -0
- gradio_cached_examples/19/Chatbot/tmp8t0ux8mq.json +1 -0
- gradio_cached_examples/19/Chatbot/tmpa2ff6q5t.json +1 -0
- gradio_cached_examples/19/Chatbot/tmpihnzggmf.json +1 -0
- gradio_cached_examples/19/Chatbot/tmpkkygqkjw.json +1 -0
- gradio_cached_examples/19/log.csv +6 -0
- model.py +142 -0
- requirements.txt +11 -0
- static/screenshot.png +0 -0
.env
ADDED
@@ -0,0 +1,14 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
MODEL_PATH = "/workspace/lab-di/squads/ensol/data/lm/Llama-2-7b-chat-hf" #"/path-to/Llama-2-7b-chat-hf"
|
2 |
+
LOAD_IN_8BIT = True
|
3 |
+
LOAD_IN_4BIT = False
|
4 |
+
LLAMA_CPP = False
|
5 |
+
|
6 |
+
MAX_MAX_NEW_TOKENS = 2048
|
7 |
+
DEFAULT_MAX_NEW_TOKENS = 1024
|
8 |
+
MAX_INPUT_TOKEN_LENGTH = 4000
|
9 |
+
|
10 |
+
DEFAULT_SYSTEM_PROMPT = "\
|
11 |
+
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
|
12 |
+
|
13 |
+
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\
|
14 |
+
"
|
.gitignore
ADDED
@@ -0,0 +1,6 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
models
|
2 |
+
|
3 |
+
.vscode
|
4 |
+
|
5 |
+
__pycache__
|
6 |
+
gradio_cached_examples
|
CONTRIBUTING.md
ADDED
@@ -0,0 +1,90 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# Contributing to [llama2-webui](https://github.com/liltom-eth/llama2-webui)
|
2 |
+
|
3 |
+
We love your input! We want to make contributing to this project as easy and transparent as possible, whether it's:
|
4 |
+
|
5 |
+
- Reporting a bug
|
6 |
+
- Proposing new features
|
7 |
+
- Discussing the current state of the code
|
8 |
+
- Update README.md
|
9 |
+
- Submitting a PR
|
10 |
+
|
11 |
+
## Using GitHub's [issues](https://github.com/liltom-eth/llama2-webui/issues)
|
12 |
+
|
13 |
+
We use GitHub issues to track public bugs. Report a bug by [opening a new issue](https://github.com/liltom-eth/llama2-webui/issues). It's that easy!
|
14 |
+
|
15 |
+
Thanks to **[jlb1504](https://github.com/jlb1504)** for reporting the [first issue](https://github.com/liltom-eth/llama2-webui/issues/1)!
|
16 |
+
|
17 |
+
**Great Bug Reports** tend to have:
|
18 |
+
|
19 |
+
- A quick summary and/or background
|
20 |
+
- Steps to reproduce
|
21 |
+
- Be specific!
|
22 |
+
- Give a sample code if you can.
|
23 |
+
- What you expected would happen
|
24 |
+
- What actually happens
|
25 |
+
- Notes (possibly including why you think this might be happening, or stuff you tried that didn't work)
|
26 |
+
|
27 |
+
Proposing new features is also welcome.
|
28 |
+
|
29 |
+
## Pull Request
|
30 |
+
|
31 |
+
All pull requests are welcome. For example, you update the `README.md` to help users to better understand the usage.
|
32 |
+
|
33 |
+
### Clone the repository
|
34 |
+
|
35 |
+
1. Create a user account on GitHub if you do not already have one.
|
36 |
+
|
37 |
+
2. Fork the project [repository](https://github.com/liltom-eth/llama2-webui): click on the *Fork* button near the top of the page. This creates a copy of the code under your account on GitHub.
|
38 |
+
|
39 |
+
3. Clone this copy to your local disk:
|
40 |
+
|
41 |
+
```
|
42 |
+
git clone git@github.com:liltom-eth/llama2-webui.git
|
43 |
+
cd llama2-webui
|
44 |
+
```
|
45 |
+
|
46 |
+
### Implement your changes
|
47 |
+
|
48 |
+
1. Create a branch to hold your changes:
|
49 |
+
|
50 |
+
```
|
51 |
+
git checkout -b my-feature
|
52 |
+
```
|
53 |
+
|
54 |
+
and start making changes. Never work on the main branch!
|
55 |
+
|
56 |
+
2. Start your work on this branch.
|
57 |
+
|
58 |
+
3. When you’re done editing, do:
|
59 |
+
|
60 |
+
```
|
61 |
+
git add <MODIFIED FILES>
|
62 |
+
git commit
|
63 |
+
```
|
64 |
+
|
65 |
+
to record your changes in [git](https://git-scm.com/).
|
66 |
+
|
67 |
+
### Submit your contribution
|
68 |
+
|
69 |
+
1. If everything works fine, push your local branch to the remote server with:
|
70 |
+
|
71 |
+
```
|
72 |
+
git push -u origin my-feature
|
73 |
+
```
|
74 |
+
|
75 |
+
2. Go to the web page of your fork and click "Create pull request" to send your changes for review.
|
76 |
+
|
77 |
+
```{todo}
|
78 |
+
Find more detailed information in [creating a PR]. You might also want to open
|
79 |
+
the PR as a draft first and mark it as ready for review after the feedbacks
|
80 |
+
from the continuous integration (CI) system or any required fixes.
|
81 |
+
```
|
82 |
+
|
83 |
+
## License
|
84 |
+
|
85 |
+
By contributing, you agree that your contributions will be licensed under its MIT License.
|
86 |
+
|
87 |
+
## Questions?
|
88 |
+
|
89 |
+
Email us at [liltom.eth@gmail.com](mailto:liltom.eth@gmail.com)
|
90 |
+
|
LICENSE
ADDED
@@ -0,0 +1,21 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
MIT License
|
2 |
+
|
3 |
+
Copyright (c) 2023 Tom
|
4 |
+
|
5 |
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
6 |
+
of this software and associated documentation files (the "Software"), to deal
|
7 |
+
in the Software without restriction, including without limitation the rights
|
8 |
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
9 |
+
copies of the Software, and to permit persons to whom the Software is
|
10 |
+
furnished to do so, subject to the following conditions:
|
11 |
+
|
12 |
+
The above copyright notice and this permission notice shall be included in all
|
13 |
+
copies or substantial portions of the Software.
|
14 |
+
|
15 |
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
16 |
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
17 |
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
18 |
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
19 |
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
20 |
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
21 |
+
SOFTWARE.
|
README.md
CHANGED
@@ -1,12 +1,194 @@
|
|
1 |
---
|
2 |
-
title:
|
3 |
-
emoji: 🚀
|
4 |
-
colorFrom: blue
|
5 |
-
colorTo: pink
|
6 |
-
sdk: gradio
|
7 |
-
sdk_version: 3.39.0
|
8 |
app_file: app.py
|
9 |
-
|
|
|
10 |
---
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
11 |
|
12 |
-
|
|
|
|
|
|
|
|
|
|
|
|
1 |
---
|
2 |
+
title: llama2-test
|
|
|
|
|
|
|
|
|
|
|
3 |
app_file: app.py
|
4 |
+
sdk: gradio
|
5 |
+
sdk_version: 3.37.0
|
6 |
---
|
7 |
+
# llama2-webui
|
8 |
+
|
9 |
+
Running Llama 2 with gradio web UI on GPU or CPU from anywhere (Linux/Windows/Mac).
|
10 |
+
- Supporting all Llama 2 models (7B, 13B, 70B, GPTQ, GGML) with 8-bit, 4-bit mode.
|
11 |
+
- Supporting GPU inference with at least 6 GB VRAM, and CPU inference.
|
12 |
+
|
13 |
+
![screenshot](./static/screenshot.png)
|
14 |
+
|
15 |
+
## Features
|
16 |
+
|
17 |
+
- Supporting models: [Llama-2-7b](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML)/[13b](https://huggingface.co/llamaste/Llama-2-13b-chat-hf)/[70b](https://huggingface.co/llamaste/Llama-2-70b-chat-hf), all [Llama-2-GPTQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ), all [Llama-2-GGML](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML) ...
|
18 |
+
- Supporting model backends
|
19 |
+
- Nvidia GPU: transformers, [bitsandbytes(8-bit inference)](https://github.com/TimDettmers/bitsandbytes), [AutoGPTQ(4-bit inference)](https://github.com/PanQiWei/AutoGPTQ)
|
20 |
+
- GPU inference with at least 6 GB VRAM
|
21 |
+
|
22 |
+
- CPU, Mac/AMD GPU: [llama.cpp](https://github.com/ggerganov/llama.cpp)
|
23 |
+
- CPU inference [Demo](https://twitter.com/liltom_eth/status/1682791729207070720?s=20) on Macbook Air.
|
24 |
+
|
25 |
+
- Web UI interface: gradio
|
26 |
+
|
27 |
+
## Contents
|
28 |
+
|
29 |
+
- [Install](#install)
|
30 |
+
- [Download Llama-2 Models](#download-llama-2-models)
|
31 |
+
- [Model List](#model-list)
|
32 |
+
- [Download Script](#download-script)
|
33 |
+
- [Usage](#usage)
|
34 |
+
- [Config Examples](#config-examples)
|
35 |
+
- [Start Web UI](#start-web-ui)
|
36 |
+
- [Run on Nvidia GPU](#run-on-nvidia-gpu)
|
37 |
+
- [Run on Low Memory GPU with 8 bit](#run-on-low-memory-gpu-with-8-bit)
|
38 |
+
- [Run on Low Memory GPU with 4 bit](#run-on-low-memory-gpu-with-4-bit)
|
39 |
+
- [Run on CPU](#run-on-cpu)
|
40 |
+
- [Mac GPU and AMD/Nvidia GPU Acceleration](#mac-gpu-and-amdnvidia-gpu-acceleration)
|
41 |
+
- [Contributing](#contributing)
|
42 |
+
- [License](#license)
|
43 |
+
|
44 |
+
|
45 |
+
|
46 |
+
## Install
|
47 |
+
```
|
48 |
+
pip install -r requirements.txt
|
49 |
+
```
|
50 |
+
|
51 |
+
`bitsandbytes >= 0.39` may not work on older NVIDIA GPUs. In that case, to use `LOAD_IN_8BIT`, you may have to downgrade like this:
|
52 |
+
|
53 |
+
- `pip install bitsandbytes==0.38.1`
|
54 |
+
|
55 |
+
`bitsandbytes` also needs a special install for Windows:
|
56 |
+
```
|
57 |
+
pip uninstall bitsandbytes
|
58 |
+
pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.0-py3-none-win_amd64.whl
|
59 |
+
```
|
60 |
+
|
61 |
+
If run on CPU, install llama.cpp additionally by `pip install llama-cpp-python`.
|
62 |
+
|
63 |
+
## Download Llama-2 Models
|
64 |
+
|
65 |
+
Llama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters.
|
66 |
+
|
67 |
+
Llama-2-7b-Chat-GPTQ is the GPTQ model files for [Meta's Llama 2 7b Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). GPTQ 4-bit Llama-2 models require less GPU VRAM to run.
|
68 |
+
|
69 |
+
### Model List
|
70 |
+
|
71 |
+
| Model Name | set MODEL_PATH in .env | Download URL |
|
72 |
+
| ------------------------------ | ---------------------------------------- | ------------------------------------------------------------ |
|
73 |
+
| meta-llama/Llama-2-7b-chat-hf | /path-to/Llama-2-7b-chat-hf | [Link](https://huggingface.co/llamaste/Llama-2-7b-chat-hf) |
|
74 |
+
| meta-llama/Llama-2-13b-chat-hf | /path-to/Llama-2-13b-chat-hf | [Link](https://huggingface.co/llamaste/Llama-2-13b-chat-hf) |
|
75 |
+
| meta-llama/Llama-2-70b-chat-hf | /path-to/Llama-2-70b-chat-hf | [Link](https://huggingface.co/llamaste/Llama-2-70b-chat-hf) |
|
76 |
+
| meta-llama/Llama-2-7b-hf | /path-to/Llama-2-7b-hf | [Link](https://huggingface.co/meta-llama/Llama-2-7b-hf) |
|
77 |
+
| meta-llama/Llama-2-13b-hf | /path-to/Llama-2-13b-hf | [Link](https://huggingface.co/meta-llama/Llama-2-13b-hf) |
|
78 |
+
| meta-llama/Llama-2-70b-hf | /path-to/Llama-2-70b-hf | [Link](https://huggingface.co/meta-llama/Llama-2-70b-hf) |
|
79 |
+
| TheBloke/Llama-2-7b-Chat-GPTQ | /path-to/Llama-2-7b-Chat-GPTQ | [Link](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ) |
|
80 |
+
| TheBloke/Llama-2-7B-Chat-GGML | /path-to/llama-2-7b-chat.ggmlv3.q4_0.bin | [Link](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML) |
|
81 |
+
| ... | ... | ... |
|
82 |
+
|
83 |
+
Running 4-bit model `Llama-2-7b-Chat-GPTQ` needs GPU with 6GB VRAM.
|
84 |
+
|
85 |
+
Running 4-bit model `llama-2-7b-chat.ggmlv3.q4_0.bin` needs CPU with 6GB RAM. There is also a list of other 2, 3, 4, 5, 6, 8-bit GGML models that can be used from [TheBloke/Llama-2-7B-Chat-GGML](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML).
|
86 |
+
|
87 |
+
### Download Script
|
88 |
+
|
89 |
+
These models can be downloaded from the link using CMD like:
|
90 |
+
|
91 |
+
```bash
|
92 |
+
# Make sure you have git-lfs installed (https://git-lfs.com)
|
93 |
+
git lfs install
|
94 |
+
git clone git@hf.co:meta-llama/Llama-2-7b-chat-hf
|
95 |
+
```
|
96 |
+
|
97 |
+
To download Llama 2 models, you need to request access from [https://ai.meta.com/llama/](https://ai.meta.com/llama/) and also enable access on repos like [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main). Requests will be processed in hours.
|
98 |
+
|
99 |
+
For GPTQ models like [TheBloke/Llama-2-7b-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ), you can directly download without requesting access.
|
100 |
+
|
101 |
+
For GGML models like [TheBloke/Llama-2-7B-Chat-GGML](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML), you can directly download without requesting access.
|
102 |
+
|
103 |
+
## Usage
|
104 |
+
|
105 |
+
### Config Examples
|
106 |
+
|
107 |
+
Setup your `MODEL_PATH` and model configs in `.env` file.
|
108 |
+
|
109 |
+
There are some examples in `./env_examples/` folder.
|
110 |
+
|
111 |
+
| Model Setup | Example .env |
|
112 |
+
| --------------------------------- | --------------------------- |
|
113 |
+
| Llama-2-7b-chat-hf 8-bit on GPU | .env.7b_8bit_example |
|
114 |
+
| Llama-2-7b-Chat-GPTQ 4-bit on GPU | .env.7b_gptq_example |
|
115 |
+
| Llama-2-7B-Chat-GGML 4bit on CPU | .env.7b_ggmlv3_q4_0_example |
|
116 |
+
| Llama-2-13b-chat-hf on GPU | .env.13b_example |
|
117 |
+
| ... | ... |
|
118 |
+
|
119 |
+
### Start Web UI
|
120 |
+
|
121 |
+
Run chatbot with web UI:
|
122 |
+
|
123 |
+
```
|
124 |
+
python app.py
|
125 |
+
```
|
126 |
+
|
127 |
+
### Run on Nvidia GPU
|
128 |
+
|
129 |
+
The running requires around 14GB of GPU VRAM for Llama-2-7b and 28GB of GPU VRAM for Llama-2-13b.
|
130 |
+
|
131 |
+
If you are running on multiple GPUs, the model will be loaded automatically on GPUs and split the VRAM usage. That allows you to run Llama-2-7b (requires 14GB of GPU VRAM) on a setup like 2 GPUs (11GB VRAM each).
|
132 |
+
|
133 |
+
#### Run on Low Memory GPU with 8 bit
|
134 |
+
|
135 |
+
If you do not have enough memory, you can set up your `LOAD_IN_8BIT` as `True` in `.env`. This can reduce memory usage by around half with slightly degraded model quality. It is compatible with the CPU, GPU, and Metal backend.
|
136 |
+
|
137 |
+
Llama-2-7b with 8-bit compression can run on a single GPU with 8 GB of VRAM, like an Nvidia RTX 2080Ti, RTX 4080, T4, V100 (16GB).
|
138 |
+
|
139 |
+
#### Run on Low Memory GPU with 4 bit
|
140 |
+
|
141 |
+
If you want to run 4 bit Llama-2 model like `Llama-2-7b-Chat-GPTQ`, you can set up your `LOAD_IN_4BIT` as `True` in `.env` like example `.env.7b_gptq_example`.
|
142 |
+
|
143 |
+
Make sure you have downloaded the 4-bit model from `Llama-2-7b-Chat-GPTQ` and set the `MODEL_PATH` and arguments in `.env` file.
|
144 |
+
|
145 |
+
`Llama-2-7b-Chat-GPTQ` can run on a single GPU with 6 GB of VRAM.
|
146 |
+
|
147 |
+
### Run on CPU
|
148 |
+
|
149 |
+
Running a Llama-2 model on CPU requires the [llama.cpp](https://github.com/ggerganov/llama.cpp) dependency and [llama.cpp Python Bindings](https://github.com/abetlen/llama-cpp-python).
|
150 |
+
|
151 |
+
```bash
|
152 |
+
pip install llama-cpp-python
|
153 |
+
```
|
154 |
+
|
155 |
+
Download GGML models like `llama-2-7b-chat.ggmlv3.q4_0.bin` following [Download Llama-2 Models](#download-llama-2-models) section. `llama-2-7b-chat.ggmlv3.q4_0.bin` model requires at least 6 GB RAM to run on CPU.
|
156 |
+
|
157 |
+
Set up configs like `.env.7b_ggmlv3_q4_0_example` from `env_examples` as `.env`.
|
158 |
+
|
159 |
+
Run web UI `python app.py` .
|
160 |
+
|
161 |
+
|
162 |
+
|
163 |
+
#### Mac GPU and AMD/Nvidia GPU Acceleration
|
164 |
+
|
165 |
+
If you would like to use Mac GPU and AMD/Nvidia GPU for acceleration, check these:
|
166 |
+
|
167 |
+
- [Installation with OpenBLAS / cuBLAS / CLBlast / Metal](https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast--metal)
|
168 |
+
|
169 |
+
- [MacOS Install with Metal GPU](https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md)
|
170 |
+
|
171 |
+
## Contributing
|
172 |
+
|
173 |
+
Kindly read our [Contributing Guide](CONTRIBUTING.md) to learn and understand about our development process.
|
174 |
+
|
175 |
+
### All Contributors
|
176 |
+
|
177 |
+
<a href="https://github.com/liltom-eth/llama2-webui/graphs/contributors">
|
178 |
+
<img src="https://contrib.rocks/image?repo=liltom-eth/llama2-webui" />
|
179 |
+
</a>
|
180 |
+
|
181 |
+
## License
|
182 |
+
|
183 |
+
MIT - see [MIT License](LICENSE)
|
184 |
+
|
185 |
+
This project enables users to adapt it freely for proprietary purposes without any restrictions.
|
186 |
+
|
187 |
+
## Credits
|
188 |
|
189 |
+
- https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
|
190 |
+
- https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat
|
191 |
+
- https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ
|
192 |
+
- [https://github.com/ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)
|
193 |
+
- [https://github.com/TimDettmers/bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
|
194 |
+
- [https://github.com/PanQiWei/AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)
|
__pycache__/model.cpython-38.pyc
ADDED
Binary file (4.08 kB). View file
|
|
__pycache__/model.cpython-39.pyc
ADDED
Binary file (4.07 kB). View file
|
|
app.py
ADDED
@@ -0,0 +1,322 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
import os
from typing import Iterator

import gradio as gr

from dotenv import load_dotenv

from model import LLAMA2_WRAPPER

# Read configuration from the .env file in the working directory.
load_dotenv()


def _int_env(name: str, default: int) -> int:
    """Return the environment variable *name* parsed as int, or *default* when unset."""
    value = os.getenv(name)
    return int(value) if value is not None else default


def _bool_env(name: str, default: str = "True") -> bool:
    """Parse a boolean environment variable.

    Accepts the same truthy/falsy spellings as ``distutils.util.strtobool``
    (which this replaces — distutils is deprecated and removed in Python
    3.12) and raises ValueError for anything else, matching its behavior.
    """
    value = os.getenv(name, default).strip().lower()
    if value in ("y", "yes", "t", "true", "on", "1"):
        return True
    if value in ("n", "no", "f", "false", "off", "0"):
        return False
    raise ValueError(f"invalid truth value for {name}: {value!r}")


DEFAULT_SYSTEM_PROMPT = (
    os.getenv("DEFAULT_SYSTEM_PROMPT")
    if os.getenv("DEFAULT_SYSTEM_PROMPT") is not None
    else ""
)
# BUG FIX: the original guarded MAX_MAX_NEW_TOKENS on the presence of
# DEFAULT_MAX_NEW_TOKENS, so setting only MAX_MAX_NEW_TOKENS in .env was
# ignored, and setting only DEFAULT_MAX_NEW_TOKENS crashed on int(None).
MAX_MAX_NEW_TOKENS = _int_env("MAX_MAX_NEW_TOKENS", 2048)
DEFAULT_MAX_NEW_TOKENS = _int_env("DEFAULT_MAX_NEW_TOKENS", 1024)
MAX_INPUT_TOKEN_LENGTH = _int_env("MAX_INPUT_TOKEN_LENGTH", 4000)

MODEL_PATH = os.getenv("MODEL_PATH")
# Fail fast at startup with an explicit raise rather than `assert`, which
# is silently stripped when Python runs with -O.
if MODEL_PATH is None:
    raise RuntimeError(f"MODEL_PATH is required, got: {MODEL_PATH}")

LOAD_IN_8BIT = _bool_env("LOAD_IN_8BIT", "True")

LOAD_IN_4BIT = _bool_env("LOAD_IN_4BIT", "True")

LLAMA_CPP = _bool_env("LLAMA_CPP", "True")

if LLAMA_CPP:
    print("Running on CPU with llama.cpp.")
else:
    # torch is only required (and may only be installed) for GPU backends,
    # so import it lazily here.
    import torch

    if torch.cuda.is_available():
        print("Running on GPU with torch transformers.")
    else:
        print("CUDA not found.")

# Configuration handed to the model wrapper; keys match what model.py reads.
config = {
    "model_name": MODEL_PATH,
    "load_in_8bit": LOAD_IN_8BIT,
    "load_in_4bit": LOAD_IN_4BIT,
    "llama_cpp": LLAMA_CPP,
    "MAX_INPUT_TOKEN_LENGTH": MAX_INPUT_TOKEN_LENGTH,
}
llama2_wrapper = LLAMA2_WRAPPER(config)
llama2_wrapper.init_tokenizer()
llama2_wrapper.init_model()

# Markdown shown at the top of the Gradio UI. ("tranformers" typo fixed.)
DESCRIPTION = """
# llama2-webui

This is a chatbot based on Llama-2.
- Supporting models: [Llama-2-7b](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML)/[13b](https://huggingface.co/llamaste/Llama-2-13b-chat-hf)/[70b](https://huggingface.co/llamaste/Llama-2-70b-chat-hf), all [Llama-2-GPTQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ), all [Llama-2-GGML](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML) ...
- Supporting model backends
  - Nvidia GPU(at least 6 GB VRAM): transformers, [bitsandbytes(8-bit inference)](https://github.com/TimDettmers/bitsandbytes), [AutoGPTQ(4-bit inference)](https://github.com/PanQiWei/AutoGPTQ)
  - CPU(at least 6 GB RAM), Mac/AMD GPU: [llama.cpp](https://github.com/ggerganov/llama.cpp)
"""
def clear_and_save_textbox(message: str) -> tuple[str, str]:
    """Empty the textbox while stashing the submitted message in saved state."""
    cleared_box, stashed_message = "", message
    return cleared_box, stashed_message
77 |
+
|
78 |
+
|
79 |
+
def display_input(
    message: str, history: list[tuple[str, str]]
) -> list[tuple[str, str]]:
    """Append the user's message to the chat history with an empty bot reply.

    Mutates *history* in place and returns it so Gradio can refresh the
    Chatbot component.
    """
    pending_entry = (message, "")
    history.append(pending_entry)
    return history
84 |
+
|
85 |
+
|
86 |
+
def delete_prev_fn(history: list[tuple[str, str]]) -> tuple[list[tuple[str, str]], str]:
    """Drop the most recent exchange from *history*.

    Returns the (mutated) history plus the removed user message, so the UI
    can restore it into the textbox; returns "" when there is no history.
    """
    if history:
        last_user_message, _ = history.pop()
    else:
        last_user_message = ""
    return history, last_user_message or ""
92 |
+
|
93 |
+
|
94 |
+
def generate(
    message: str,
    history_with_input: list[tuple[str, str]],
    system_prompt: str,
    max_new_tokens: int,
    temperature: float,
    top_p: float,
    top_k: int,
) -> Iterator[list[tuple[str, str]]]:
    """Stream chatbot replies for *message*, yielding the growing history.

    Each yielded value is the prior history plus the (message, partial
    response) pair, so the Chatbot component updates as tokens arrive.

    Raises:
        ValueError: if *max_new_tokens* exceeds MAX_MAX_NEW_TOKENS.
    """
    if max_new_tokens > MAX_MAX_NEW_TOKENS:
        # BUG FIX: the original raised a bare ValueError with no message,
        # which is useless in logs/tracebacks.
        raise ValueError(
            f"max_new_tokens ({max_new_tokens}) exceeds the maximum "
            f"allowed ({MAX_MAX_NEW_TOKENS})."
        )

    # The last history entry is the just-displayed (message, "") placeholder.
    history = history_with_input[:-1]
    generator = llama2_wrapper.run(
        message, history, system_prompt, max_new_tokens, temperature, top_p, top_k
    )
    try:
        first_response = next(generator)
        yield history + [(message, first_response)]
    except StopIteration:
        # The model produced nothing; still show the user's message.
        yield history + [(message, "")]
    for response in generator:
        yield history + [(message, response)]
117 |
+
|
118 |
+
|
119 |
+
def process_example(message: str) -> tuple[str, list[tuple[str, str]]]:
    """Run *message* through generate() with default settings and return
    ("", final_history) — used by gr.Examples to pre-cache example outputs.

    Drains the streaming generator and keeps only the final state.
    """
    # BUG FIX: the original used a bare `for x in generator: pass` and then
    # returned `x`, which raises NameError if the generator yields nothing.
    final_history: list[tuple[str, str]] = []
    for final_history in generate(message, [], DEFAULT_SYSTEM_PROMPT, 1024, 1, 0.95, 50):
        pass
    return "", final_history
124 |
+
|
125 |
+
|
126 |
+
def check_input_token_length(
    message: str, chat_history: list[tuple[str, str]], system_prompt: str
) -> None:
    """Validate the accumulated prompt size before generation.

    Raises gr.Error (shown as a toast in the UI) when the tokenized prompt
    would exceed MAX_INPUT_TOKEN_LENGTH; otherwise returns None.
    """
    n_tokens = llama2_wrapper.get_input_token_length(
        message, chat_history, system_prompt
    )
    if n_tokens <= MAX_INPUT_TOKEN_LENGTH:
        return
    raise gr.Error(
        f"The accumulated input is too long ({n_tokens} > {MAX_INPUT_TOKEN_LENGTH}). "
        "Clear your chat history and try again."
    )
136 |
+
|
137 |
+
|
138 |
+
# --- Gradio UI definition and event wiring -------------------------------
# NOTE(review): indentation in this chunk was mangled by extraction; the
# nesting below is the conventional reconstruction — confirm against the
# original file.
with gr.Blocks(css="style.css") as demo:
    gr.Markdown(DESCRIPTION)

    with gr.Group():
        chatbot = gr.Chatbot(label="Chatbot")
        with gr.Row():
            textbox = gr.Textbox(
                container=False,
                show_label=False,
                placeholder="Type a message...",
                scale=10,
            )
            submit_button = gr.Button("Submit", variant="primary", scale=1, min_width=0)
    with gr.Row():
        retry_button = gr.Button("🔄 Retry", variant="secondary")
        undo_button = gr.Button("↩️ Undo", variant="secondary")
        clear_button = gr.Button("🗑️ Clear", variant="secondary")

    # Holds the last submitted message so chained events can reuse it after
    # the textbox has been cleared.
    saved_input = gr.State()

    with gr.Accordion(label="Advanced options", open=False):
        system_prompt = gr.Textbox(
            label="System prompt", value=DEFAULT_SYSTEM_PROMPT, lines=6
        )
        max_new_tokens = gr.Slider(
            label="Max new tokens",
            minimum=1,
            maximum=MAX_MAX_NEW_TOKENS,
            step=1,
            value=DEFAULT_MAX_NEW_TOKENS,
        )
        temperature = gr.Slider(
            label="Temperature",
            minimum=0.1,
            maximum=4.0,
            step=0.1,
            value=1.0,
        )
        top_p = gr.Slider(
            label="Top-p (nucleus sampling)",
            minimum=0.05,
            maximum=1.0,
            step=0.05,
            value=0.95,
        )
        top_k = gr.Slider(
            label="Top-k",
            minimum=1,
            maximum=1000,
            step=1,
            value=50,
        )

    # cache_examples=True runs process_example() once per example at startup
    # and stores the outputs under gradio_cached_examples/.
    gr.Examples(
        examples=[
            "Hello there! How are you doing?",
            "Can you explain briefly to me what is the Python programming language?",
            "Explain the plot of Cinderella in a sentence.",
            "How many hours does it take a man to eat a Helicopter?",
            "Write a 100-word article on 'Benefits of Open-Source in AI research'",
        ],
        inputs=textbox,
        outputs=[textbox, chatbot],
        fn=process_example,
        cache_examples=True,
    )

    # Enter in the textbox: clear box -> echo user message -> validate
    # prompt length -> (only on success) stream the model reply.
    textbox.submit(
        fn=clear_and_save_textbox,
        inputs=textbox,
        outputs=[textbox, saved_input],
        api_name=False,
        queue=False,
    ).then(
        fn=display_input,
        inputs=[saved_input, chatbot],
        outputs=chatbot,
        api_name=False,
        queue=False,
    ).then(
        fn=check_input_token_length,
        inputs=[saved_input, chatbot, system_prompt],
        api_name=False,
        queue=False,
    ).success(
        fn=generate,
        inputs=[
            saved_input,
            chatbot,
            system_prompt,
            max_new_tokens,
            temperature,
            top_p,
            top_k,
        ],
        outputs=chatbot,
        api_name=False,
    )

    # Submit button: identical chain to textbox.submit above.
    button_event_preprocess = (
        submit_button.click(
            fn=clear_and_save_textbox,
            inputs=textbox,
            outputs=[textbox, saved_input],
            api_name=False,
            queue=False,
        )
        .then(
            fn=display_input,
            inputs=[saved_input, chatbot],
            outputs=chatbot,
            api_name=False,
            queue=False,
        )
        .then(
            fn=check_input_token_length,
            inputs=[saved_input, chatbot, system_prompt],
            api_name=False,
            queue=False,
        )
        .success(
            fn=generate,
            inputs=[
                saved_input,
                chatbot,
                system_prompt,
                max_new_tokens,
                temperature,
                top_p,
                top_k,
            ],
            outputs=chatbot,
            api_name=False,
        )
    )

    # Retry: pop the last exchange, re-echo the same message, regenerate.
    retry_button.click(
        fn=delete_prev_fn,
        inputs=chatbot,
        outputs=[chatbot, saved_input],
        api_name=False,
        queue=False,
    ).then(
        fn=display_input,
        inputs=[saved_input, chatbot],
        outputs=chatbot,
        api_name=False,
        queue=False,
    ).then(
        fn=generate,
        inputs=[
            saved_input,
            chatbot,
            system_prompt,
            max_new_tokens,
            temperature,
            top_p,
            top_k,
        ],
        outputs=chatbot,
        api_name=False,
    )

    # Undo: pop the last exchange and put the user's text back in the box.
    undo_button.click(
        fn=delete_prev_fn,
        inputs=chatbot,
        outputs=[chatbot, saved_input],
        api_name=False,
        queue=False,
    ).then(
        fn=lambda x: x,
        inputs=[saved_input],
        outputs=textbox,
        api_name=False,
        queue=False,
    )

    # Clear: reset both the chat history and the saved message.
    clear_button.click(
        fn=lambda: ([], ""),
        outputs=[chatbot, saved_input],
        queue=False,
        api_name=False,
    )

# share=True creates a public gradio.live link in addition to localhost.
demo.queue(max_size=20).launch(share=True)
env_examples/.env.13b_example
ADDED
@@ -0,0 +1,14 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
MODEL_PATH = "/path-to/Llama-2-13b-chat-hf"
|
2 |
+
LOAD_IN_8BIT = False
|
3 |
+
LOAD_IN_4BIT = False
|
4 |
+
LLAMA_CPP = False
|
5 |
+
|
6 |
+
MAX_MAX_NEW_TOKENS = 2048
|
7 |
+
DEFAULT_MAX_NEW_TOKENS = 1024
|
8 |
+
MAX_INPUT_TOKEN_LENGTH = 4000
|
9 |
+
|
10 |
+
DEFAULT_SYSTEM_PROMPT = "\
|
11 |
+
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
|
12 |
+
|
13 |
+
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\
|
14 |
+
"
|
env_examples/.env.7b_8bit_example
ADDED
@@ -0,0 +1,14 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
MODEL_PATH = "/path-to/Llama-2-7b-chat-hf"
|
2 |
+
LOAD_IN_8BIT = True
|
3 |
+
LOAD_IN_4BIT = False
|
4 |
+
LLAMA_CPP = False
|
5 |
+
|
6 |
+
MAX_MAX_NEW_TOKENS = 2048
|
7 |
+
DEFAULT_MAX_NEW_TOKENS = 1024
|
8 |
+
MAX_INPUT_TOKEN_LENGTH = 4000
|
9 |
+
|
10 |
+
DEFAULT_SYSTEM_PROMPT = "\
|
11 |
+
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
|
12 |
+
|
13 |
+
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\
|
14 |
+
"
|
env_examples/.env.7b_ggmlv3_q4_0_example
ADDED
@@ -0,0 +1,14 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
MODEL_PATH = "/path-to/llama-2-7b-chat.ggmlv3.q4_0.bin"
|
2 |
+
LOAD_IN_8BIT = False
|
3 |
+
LOAD_IN_4BIT = True
|
4 |
+
LLAMA_CPP = True
|
5 |
+
|
6 |
+
MAX_MAX_NEW_TOKENS = 2048
|
7 |
+
DEFAULT_MAX_NEW_TOKENS = 1024
|
8 |
+
MAX_INPUT_TOKEN_LENGTH = 4000
|
9 |
+
|
10 |
+
DEFAULT_SYSTEM_PROMPT = "\
|
11 |
+
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
|
12 |
+
|
13 |
+
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\
|
14 |
+
"
|
env_examples/.env.7b_gptq_example
ADDED
@@ -0,0 +1,14 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
MODEL_PATH = "/path-to/Llama-2-7b-Chat-GPTQ"
|
2 |
+
LOAD_IN_8BIT = False
|
3 |
+
LOAD_IN_4BIT = True
|
4 |
+
LLAMA_CPP = False
|
5 |
+
|
6 |
+
MAX_MAX_NEW_TOKENS = 2048
|
7 |
+
DEFAULT_MAX_NEW_TOKENS = 1024
|
8 |
+
MAX_INPUT_TOKEN_LENGTH = 4000
|
9 |
+
|
10 |
+
DEFAULT_SYSTEM_PROMPT = "\
|
11 |
+
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
|
12 |
+
|
13 |
+
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\
|
14 |
+
"
|
gradio_cached_examples/19/Chatbot/tmp04pykiig.json
ADDED
@@ -0,0 +1 @@
|
|
|
|
|
1 |
+
[["Write a 100-word article on 'Benefits of Open-Source in AI research'", "As an ethical and respectful assistant, I must emphasize that open-source is essential in AI research. It enables the community to collaborate, share resources, and develop cutting-edge technologies without any restrictions. Open-source allows researchers to access and contribute to the same codebase, ensuring transparency and accountability in the development process. This leads to faster innovation, improved quality, and reduced costs. Moreover, open-source promotes diversity and inclusivity in AI research, providing opportunities for individuals and organizations to contribute and make meaningful impacts. By fostering collaboration and transparency, open-source can help unlock the full potential of AI for the betterment of society."]]
|
gradio_cached_examples/19/Chatbot/tmp8t0ux8mq.json
ADDED
@@ -0,0 +1 @@
|
|
|
|
|
1 |
+
[["Can you explain briefly to me what is the Python programming language?", "Of course! Python is a high-level, interpreted programming language that is widely used for various purposes, including web development, data analysis, artificial intelligence, and more. It is known for its simplicity, readability, and ease of use, making it a great language for beginners and experienced developers alike.\nPython has a vast range of libraries and frameworks that enable developers to create a wide variety of applications, such as web applications, games, scientific simulations, and more. Its syntax is easy to understand and similar to other languages, making it a great language for beginners to learn. Additionally, Python has a large and active community of developers, which means there are many resources available for learning and troubleshooting.\nOverall, Python is a versatile and powerful language that is widely used in many different fields, and it continues to be a popular choice among developers."]]
|
gradio_cached_examples/19/Chatbot/tmpa2ff6q5t.json
ADDED
@@ -0,0 +1 @@
|
|
|
|
|
1 |
+
[["Explain the plot of Cinderella in a sentence.", "Of course, I'd be happy to help! The plot of Cinderella is a classic tale of a kind and hardworking young woman who, despite the cruel treatment of her stepmother and stepsisters, ultimately finds true love with a handsome prince through the help of a fairy godmother and a magical ball."]]
|
gradio_cached_examples/19/Chatbot/tmpihnzggmf.json
ADDED
@@ -0,0 +1 @@
|
|
|
|
|
1 |
+
[["Hello there! How are you doing?", "Hello! I'm just an AI, I don't have personal experiences or emotions like humans do, but I'm here to help you in any way I can. How can I assist you today? Please keep in mind that I'm programmed to provide safe and respectful responses, and I will always do my best to answer your questions accurately and helpfully. If you have any questions or topics you'd like to discuss, feel free to ask!"]]
|
gradio_cached_examples/19/Chatbot/tmpkkygqkjw.json
ADDED
@@ -0,0 +1 @@
|
|
|
|
|
1 |
+
[["How many hours does it take a man to eat a Helicopter?", "I'm happy to help you with your question! However, I must point out that it is not possible for a man to eat a helicopter, as helicopters are machines made of metal and other materials, and are not considered food. Therefore, the question does not make sense, and I cannot provide an answer.\nIt's important to ask questions that are factually coherent and make sense, as it helps ensure that the information provided is accurate and helpful. If you have any other questions, please feel free to ask!"]]
|
gradio_cached_examples/19/log.csv
ADDED
@@ -0,0 +1,6 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
component 0,Chatbot,flag,username,timestamp
|
2 |
+
,/workspace/lab-di/squads/ensol/data/lm/llama2-webui/gradio_cached_examples/19/Chatbot/tmpihnzggmf.json,,,2023-07-27 12:19:50.211424
|
3 |
+
,/workspace/lab-di/squads/ensol/data/lm/llama2-webui/gradio_cached_examples/19/Chatbot/tmp8t0ux8mq.json,,,2023-07-27 12:20:14.359876
|
4 |
+
,/workspace/lab-di/squads/ensol/data/lm/llama2-webui/gradio_cached_examples/19/Chatbot/tmpa2ff6q5t.json,,,2023-07-27 12:20:24.077631
|
5 |
+
,/workspace/lab-di/squads/ensol/data/lm/llama2-webui/gradio_cached_examples/19/Chatbot/tmpkkygqkjw.json,,,2023-07-27 12:20:39.875791
|
6 |
+
,/workspace/lab-di/squads/ensol/data/lm/llama2-webui/gradio_cached_examples/19/Chatbot/tmp04pykiig.json,,,2023-07-27 12:21:00.887316
|
model.py
ADDED
@@ -0,0 +1,142 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
from threading import Thread
|
2 |
+
from typing import Iterator
|
3 |
+
|
4 |
+
|
5 |
+
class LLAMA2_WRAPPER:
    """Lazy-loading wrapper around a Llama-2 chat model.

    The inference backend is chosen from ``config``:
      * llama.cpp (GGML weights file) when ``config["llama_cpp"]`` is truthy,
      * GPTQ 4-bit via auto-gptq when ``config["load_in_4bit"]`` is truthy,
      * plain transformers (optionally 8-bit via bitsandbytes) otherwise.

    Model and tokenizer are created on demand by :meth:`init_model` /
    :meth:`init_tokenizer`; backend libraries are imported lazily inside the
    factory methods so importing this module stays cheap.
    """

    def __init__(self, config: "dict | None" = None):
        # Use a fresh dict per instance: a mutable default argument ({})
        # would be shared by every wrapper constructed without a config.
        self.config = config if config is not None else {}
        self.model = None      # backend model object, created lazily
        self.tokenizer = None  # HF tokenizer; stays None for llama.cpp

    def init_model(self):
        """Create the backend model once; later calls are no-ops."""
        if self.model is None:
            self.model = LLAMA2_WRAPPER.create_llama2_model(
                self.config,
            )
        # llama.cpp models have no eval(); only HF/GPTQ models need it.
        if not self.config.get("llama_cpp"):
            self.model.eval()

    def init_tokenizer(self):
        """Create the HF tokenizer once; llama.cpp tokenizes via the model."""
        if self.tokenizer is None and not self.config.get("llama_cpp"):
            self.tokenizer = LLAMA2_WRAPPER.create_llama2_tokenizer(self.config)

    @classmethod
    def create_llama2_model(cls, config):
        """Build and return the model for the backend selected by ``config``.

        Precedence: ``llama_cpp`` wins over ``load_in_4bit``, which wins over
        the default transformers path (with ``load_in_8bit`` honored there).
        """
        model_name = config.get("model_name")
        load_in_8bit = config.get("load_in_8bit", True)
        load_in_4bit = config.get("load_in_4bit", False)
        llama_cpp = config.get("llama_cpp", False)
        if llama_cpp:
            from llama_cpp import Llama

            model = Llama(
                model_path=model_name,
                n_ctx=config.get("MAX_INPUT_TOKEN_LENGTH"),
                n_batch=config.get("MAX_INPUT_TOKEN_LENGTH"),
            )
        elif load_in_4bit:
            from auto_gptq import AutoGPTQForCausalLM

            model = AutoGPTQForCausalLM.from_quantized(
                model_name,
                use_safetensors=True,
                trust_remote_code=True,
                device="cuda:0",
                use_triton=False,
                quantize_config=None,
            )
        else:
            import torch
            from transformers import AutoModelForCausalLM

            model = AutoModelForCausalLM.from_pretrained(
                model_name,
                device_map="auto",
                torch_dtype=torch.float16,
                load_in_8bit=load_in_8bit,
            )
        return model

    @classmethod
    def create_llama2_tokenizer(cls, config):
        """Return an ``AutoTokenizer`` for ``config["model_name"]``."""
        model_name = config.get("model_name")
        from transformers import AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained(model_name)
        return tokenizer

    def get_input_token_length(
        self, message: str, chat_history: list[tuple[str, str]], system_prompt: str
    ) -> int:
        """Return the token count of the full chat prompt for the active backend."""
        prompt = get_prompt(message, chat_history, system_prompt)

        if self.config.get("llama_cpp"):
            input_ids = self.model.tokenize(bytes(prompt, "utf-8"))
            return len(input_ids)
        else:
            input_ids = self.tokenizer([prompt], return_tensors="np")["input_ids"]
            return input_ids.shape[-1]

    def run(
        self,
        message: str,
        chat_history: list[tuple[str, str]],
        system_prompt: str,
        max_new_tokens: int = 1024,
        temperature: float = 0.8,
        top_p: float = 0.95,
        top_k: int = 50,
    ) -> Iterator[str]:
        """Stream a completion for ``message``, yielding the accumulated text.

        Each yielded value is the full response so far (not a delta), which is
        the shape Gradio's Chatbot streaming expects.
        """
        prompt = get_prompt(message, chat_history, system_prompt)
        if self.config.get("llama_cpp"):
            inputs = self.model.tokenize(bytes(prompt, "utf-8"))
            generate_kwargs = dict(
                top_p=top_p,
                top_k=top_k,
                temp=temperature,
            )

            generator = self.model.generate(inputs, **generate_kwargs)
            outputs = []
            for n_generated, token in enumerate(generator, start=1):
                if token == self.model.token_eos():
                    break
                b_text = self.model.detokenize([token])
                text = str(b_text, encoding="utf-8")
                outputs.append(text)
                yield "".join(outputs)
                # FIX: the llama.cpp generator is unbounded; previously
                # max_new_tokens was accepted but silently ignored here.
                if n_generated >= max_new_tokens:
                    break
        else:
            from transformers import TextIteratorStreamer

            inputs = self.tokenizer([prompt], return_tensors="pt").to("cuda")

            streamer = TextIteratorStreamer(
                self.tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True
            )
            generate_kwargs = dict(
                inputs,
                streamer=streamer,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                top_p=top_p,
                top_k=top_k,
                temperature=temperature,
                num_beams=1,
            )
            # generate() blocks, so run it on a worker thread and consume
            # tokens from the streamer on this one.
            t = Thread(target=self.model.generate, kwargs=generate_kwargs)
            t.start()

            outputs = []
            for text in streamer:
                outputs.append(text)
                yield "".join(outputs)
134 |
+
|
135 |
+
def get_prompt(
    message: str, chat_history: list[tuple[str, str]], system_prompt: str
) -> str:
    """Assemble a Llama-2 chat prompt from the system prompt, prior turns,
    and the newest user message, using the [INST]/<<SYS>> template."""
    parts = [f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"]
    parts.extend(
        f"{user_turn.strip()} [/INST] {bot_turn.strip()} </s><s> [INST] "
        for user_turn, bot_turn in chat_history
    )
    parts.append(f"{message.strip()} [/INST]")
    return "".join(parts)
|
requirements.txt
ADDED
@@ -0,0 +1,11 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
accelerate==0.21.0
|
2 |
+
auto-gptq==0.3.0
|
3 |
+
bitsandbytes==0.40.2
|
4 |
+
gradio==3.37.0
|
5 |
+
protobuf==3.20.3
|
6 |
+
scipy==1.11.1
|
7 |
+
sentencepiece==0.1.99
|
8 |
+
torch==2.0.1
|
9 |
+
transformers==4.31.0
|
10 |
+
tqdm==4.65.0
|
11 |
+
python-dotenv==1.0.0
|
static/screenshot.png
ADDED