Commit e5ac6f5 by rodrigomasini (parent: d612691): "Upload 8 files"

- docs/Docker.md +203 -0
- docs/ExLlama.md +22 -0
- docs/Extensions.md +244 -0
- docs/GPTQ-models-(4-bit-mode).md +228 -0
- docs/LLaMA-model.md +56 -0
- docs/README.md +21 -0
- docs/System-requirements.md +42 -0
- docs/llama.cpp.md +42 -0

docs/Docker.md (added)
Docker Compose is a way of installing and launching the web UI in an isolated Ubuntu image using only a few commands.

In order to create the image as described in the main README, you must have docker compose 2.17 or higher:

```
~$ docker compose version
Docker Compose version v2.17.2
```

Make sure to also create the necessary symbolic links:

```
cd text-generation-webui
ln -s docker/{Dockerfile,docker-compose.yml,.dockerignore} .
cp docker/.env.example .env
# Edit .env and set TORCH_CUDA_ARCH_LIST based on your GPU model
docker compose up --build
```
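
The `TORCH_CUDA_ARCH_LIST` value in `.env` depends on your GPU's CUDA compute capability. As an illustration only (the values below are examples, not the file's defaults), an edited `.env` might contain:

```shell
# .env (example values -- adjust for your hardware)
# Ampere GPUs such as the RTX 30xx series use compute capability 8.6:
TORCH_CUDA_ARCH_LIST=8.6
# A Pascal-era GTX 10xx card would use 6.1 instead:
# TORCH_CUDA_ARCH_LIST=6.1
```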

# Table of contents

* [Docker Compose installation instructions](#docker-compose-installation-instructions)
* [Repository with additional Docker files](#dedicated-docker-repository)

# Docker Compose installation instructions

By [@loeken](https://github.com/loeken).

- [Ubuntu 22.04](#ubuntu-2204)
  - [0. youtube video](#0-youtube-video)
  - [1. update the drivers](#1-update-the-drivers)
  - [2. reboot](#2-reboot)
  - [3. install docker](#3-install-docker)
  - [4. docker \& container toolkit](#4-docker--container-toolkit)
  - [5. clone the repo](#5-clone-the-repo)
  - [6. prepare models](#6-prepare-models)
  - [7. prepare .env file](#7-prepare-env-file)
  - [8. startup docker container](#8-startup-docker-container)
- [Manjaro](#manjaro)
  - [update the drivers](#update-the-drivers)
  - [reboot](#reboot)
  - [docker \& container toolkit](#docker--container-toolkit)
  - [continue with ubuntu task](#continue-with-ubuntu-task)
- [Windows](#windows)
  - [0. youtube video](#0-youtube-video-1)
  - [1. choco package manager](#1-choco-package-manager)
  - [2. install drivers/dependencies](#2-install-driversdependencies)
  - [3. install wsl](#3-install-wsl)
  - [4. reboot](#4-reboot)
  - [5. git clone \&\& startup](#5-git-clone--startup)
  - [6. prepare models](#6-prepare-models-1)
  - [7. startup](#7-startup)
- [notes](#notes)

## Ubuntu 22.04

### 0. youtube video
A video walking you through the setup can be found here:

[![oobabooga text-generation-webui setup in docker on ubuntu 22.04](https://img.youtube.com/vi/ELkKWYh8qOk/0.jpg)](https://www.youtube.com/watch?v=ELkKWYh8qOk)

### 1. update the drivers
In the "Software Updater", update the drivers to the latest version of the proprietary driver.

### 2. reboot
Reboot to switch to the new driver.

### 3. install docker
```bash
sudo apt update
sudo apt-get install curl
sudo mkdir -m 0755 -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
"deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
"$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin docker-compose -y
sudo usermod -aG docker $USER
newgrp docker
```

### 4. docker & container toolkit
```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/ubuntu22.04/amd64 /" | \
sudo tee /etc/apt/sources.list.d/nvidia.list > /dev/null
sudo apt update
sudo apt install nvidia-docker2 nvidia-container-runtime -y
sudo systemctl restart docker
```

### 5. clone the repo
```
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
```

### 6. prepare models
Download and place the models inside the models folder. Tested with:

4bit
https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483891617
https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483941105

8bit:
https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1484235789

### 7. prepare .env file
Edit the .env values to your needs.
```bash
cp .env.example .env
nano .env
```

### 8. startup docker container
```bash
docker compose up --build
```

## Manjaro
Manjaro/Arch is similar to Ubuntu; only the dependency installation is more convenient.

### update the drivers
```bash
sudo mhwd -a pci nonfree 0300
```
### reboot
```bash
reboot
```
### docker & container toolkit
```bash
yay -S docker docker-compose buildkit gcc nvidia-docker
sudo usermod -aG docker $USER
newgrp docker
sudo systemctl restart docker # required by nvidia-container-runtime
```

### continue with ubuntu task
Continue at [5. clone the repo](#5-clone-the-repo).

## Windows
### 0. youtube video
A video walking you through the setup can be found here:
[![oobabooga text-generation-webui setup in docker on windows 11](https://img.youtube.com/vi/ejH4w5b5kFQ/0.jpg)](https://www.youtube.com/watch?v=ejH4w5b5kFQ)

### 1. choco package manager
Install the package manager (https://chocolatey.org/):
```
Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))
```

### 2. install drivers/dependencies
```
choco install nvidia-display-driver cuda git docker-desktop
```

### 3. install wsl
```
wsl --install
```

### 4. reboot
After the reboot, enter a username/password in WSL.

### 5. git clone && startup
Clone the repo and edit the .env values to your needs.
```
cd Desktop
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
COPY .env.example .env
notepad .env
```

### 6. prepare models
Download and place the models inside the models folder. Tested with:

4bit https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483891617 https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483941105

8bit: https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1484235789

### 7. startup
```
docker compose up
```

## notes

On older Ubuntu versions, you can manually install the docker compose plugin like this:
```
DOCKER_CONFIG=${DOCKER_CONFIG:-$HOME/.docker}
mkdir -p $DOCKER_CONFIG/cli-plugins
curl -SL https://github.com/docker/compose/releases/download/v2.17.2/docker-compose-linux-x86_64 -o $DOCKER_CONFIG/cli-plugins/docker-compose
chmod +x $DOCKER_CONFIG/cli-plugins/docker-compose
export PATH="$HOME/.docker/cli-plugins:$PATH"
```

# Dedicated docker repository

An external repository maintains a docker wrapper for this project as well as several pre-configured 'one-click' `docker compose` variants (e.g., updated branches of GPTQ). It can be found at: [Atinoda/text-generation-webui-docker](https://github.com/Atinoda/text-generation-webui-docker).

docs/ExLlama.md (added)

# ExLlama

### About

ExLlama is an extremely optimized GPTQ backend for LLaMA models. It features much lower VRAM usage and much higher speeds due to not relying on unoptimized transformers code.

### Usage

Configure text-generation-webui to use exllama via the UI or command line:

- In the "Model" tab, set "Loader" to "exllama"
- Specify `--loader exllama` on the command line

### Manual setup

No additional installation steps are necessary since an exllama package is already included in the requirements.txt. If this package fails to install for some reason, you can install it manually by cloning the original repository into your `repositories/` folder:

```
mkdir repositories
cd repositories
git clone https://github.com/turboderp/exllama
```

docs/Extensions.md (added)

# Extensions

Extensions are defined by files named `script.py` inside subfolders of `text-generation-webui/extensions`. They are loaded at startup if the folder name is specified after the `--extensions` flag.

For instance, `extensions/silero_tts/script.py` gets loaded with `python server.py --extensions silero_tts`.

## [text-generation-webui-extensions](https://github.com/oobabooga/text-generation-webui-extensions)

The repository above contains a directory of user extensions.

If you create an extension, you are welcome to host it in a GitHub repository and submit a PR adding it to the list.

## Built-in extensions

|Extension|Description|
|---------|-----------|
|[api](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/api)| Creates an API with two endpoints, one for streaming at `/api/v1/stream` port 5005 and another for blocking at `/api/v1/generate` port 5000. This is the main API for the webui. |
|[openai](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/openai)| Creates an API that mimics the OpenAI API and can be used as a drop-in replacement. |
|[multimodal](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal) | Adds multimodality support (text+images). For a detailed description see [README.md](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal/README.md) in the extension directory. |
|[google_translate](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/google_translate)| Automatically translates inputs and outputs using Google Translate.|
|[silero_tts](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/silero_tts)| Text-to-speech extension using [Silero](https://github.com/snakers4/silero-models). When used in chat mode, responses are replaced with an audio widget. |
|[elevenlabs_tts](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/elevenlabs_tts)| Text-to-speech extension using the [ElevenLabs](https://beta.elevenlabs.io/) API. You need an API key to use it. |
|[whisper_stt](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/whisper_stt)| Allows you to enter your inputs in chat mode using your microphone. |
|[sd_api_pictures](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/sd_api_pictures)| Allows you to request pictures from the bot in chat mode, which will be generated using the AUTOMATIC1111 Stable Diffusion API. See examples [here](https://github.com/oobabooga/text-generation-webui/pull/309). |
|[character_bias](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/character_bias)| Just a very simple example that adds a hidden string at the beginning of the bot's reply in chat mode. |
|[send_pictures](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/send_pictures/)| Creates an image upload field that can be used to send images to the bot in chat mode. Captions are automatically generated using BLIP. |
|[gallery](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/gallery/)| Creates a gallery with the chat characters and their pictures. |
|[superbooga](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/superbooga)| An extension that uses ChromaDB to create an arbitrarily large pseudocontext, taking as input text files, URLs, or pasted text. Based on https://github.com/kaiokendev/superbig. |
|[ngrok](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/ngrok)| Allows you to access the web UI remotely using the ngrok reverse tunnel service (free). It's an alternative to the built-in Gradio `--share` feature. |
|[perplexity_colors](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/perplexity_colors)| Colors each token in the output text by its associated probability, as derived from the model logits. |

## How to write an extension

The extensions framework is based on special functions and variables that you can define in `script.py`. The functions are the following:

| Function | Description |
|-------------|-------------|
| `def setup()` | Is executed when the extension gets imported. |
| `def ui()` | Creates custom gradio elements when the UI is launched. |
| `def custom_css()` | Returns custom CSS as a string. It is applied whenever the web UI is loaded. |
| `def custom_js()` | Same as above but for javascript. |
| `def input_modifier(string, state)` | Modifies the input string before it enters the model. In chat mode, it is applied to the user message. Otherwise, it is applied to the entire prompt. |
| `def output_modifier(string, state)` | Modifies the output string before it is presented in the UI. In chat mode, it is applied to the bot's reply. Otherwise, it is applied to the entire output. |
| `def chat_input_modifier(text, visible_text, state)` | Modifies both the visible and internal inputs in chat mode. Can be used to hijack the chat input with custom content. |
| `def bot_prefix_modifier(string, state)` | Applied in chat mode to the prefix for the bot's reply. |
| `def state_modifier(state)` | Modifies the dictionary containing the UI input parameters before it is used by the text generation functions. |
| `def history_modifier(history)` | Modifies the chat history before the text generation in chat mode begins. |
| `def custom_generate_reply(...)` | Overrides the main text generation function. |
| `def custom_generate_chat_prompt(...)` | Overrides the prompt generator in chat mode. |
| `def tokenizer_modifier(state, prompt, input_ids, input_embeds)` | Modifies the `input_ids`/`input_embeds` fed to the model. Should return `prompt`, `input_ids`, `input_embeds`. See the `multimodal` extension for an example. |
| `def custom_tokenized_length(prompt)` | Used in conjunction with `tokenizer_modifier`, returns the length in tokens of `prompt`. See the `multimodal` extension for an example. |

Additionally, you can define a special `params` dictionary. In it, the `display_name` key is used to define the displayed name of the extension in the UI, and the `is_tab` key is used to define whether the extension should appear in a new tab. By default, extensions appear at the bottom of the "Text generation" tab.

Example:

```python
params = {
    "display_name": "Google Translate",
    "is_tab": True,
}
```

The `params` dict may also contain variables that you want to be customizable through a `settings.yaml` file. For instance, assuming the extension is in `extensions/google_translate`, the variable `language string` in

```python
params = {
    "display_name": "Google Translate",
    "is_tab": True,
    "language string": "jp"
}
```

can be customized by adding a key called `google_translate-language string` to `settings.yaml`:

```yaml
google_translate-language string: 'fr'
```

That is, the syntax for the key is `extension_name-variable_name`.
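
A minimal sketch of how such keys could be resolved (a hypothetical helper for illustration, not the webui's actual loading code): split each `settings.yaml` key on the first `-` and, if the prefix matches an extension name, override that extension's `params` entry.

```python
def apply_settings(settings, extensions_params):
    """Override extension params from settings keys of the form
    'extension_name-variable_name' (illustrative helper)."""
    for key, value in settings.items():
        if '-' not in key:
            continue  # not an extension override
        ext_name, var_name = key.split('-', 1)
        if ext_name in extensions_params and var_name in extensions_params[ext_name]:
            extensions_params[ext_name][var_name] = value
    return extensions_params

# Example: the "language string" of google_translate gets overridden
params_by_extension = {
    "google_translate": {"display_name": "Google Translate", "language string": "jp"},
}
settings = {"google_translate-language string": "fr"}
apply_settings(settings, params_by_extension)
print(params_by_extension["google_translate"]["language string"])  # fr
```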

## Using multiple extensions at the same time

You can activate more than one extension at a time by providing their names separated by spaces after `--extensions`. The input, output, and bot prefix modifiers will be applied in the specified order.

Example:

```
python server.py --extensions enthusiasm translate # First apply enthusiasm, then translate
python server.py --extensions translate enthusiasm # First apply translate, then enthusiasm
```
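
The ordering rule above amounts to a left-to-right fold of each modifier over the string. A self-contained sketch (the `enthusiasm`/`translate` modifiers are placeholders, not real extensions):

```python
# Each extension contributes an input_modifier; they are applied in the
# order the extensions were listed after --extensions.
def enthusiasm_input_modifier(string, state):
    return string + " so excited"

def translate_input_modifier(string, state):
    return string.upper()  # stand-in for a real translation step

def apply_input_modifiers(string, state, modifiers):
    for modifier in modifiers:  # left-to-right, matching the CLI order
        string = modifier(string, state)
    return string

state = {}
# --extensions enthusiasm translate
print(apply_input_modifiers("hello", state,
                            [enthusiasm_input_modifier, translate_input_modifier]))  # HELLO SO EXCITED
# --extensions translate enthusiasm
print(apply_input_modifiers("hello", state,
                            [translate_input_modifier, enthusiasm_input_modifier]))  # HELLO so excited
```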

Note that for:

- `custom_generate_chat_prompt`
- `custom_generate_reply`
- `custom_tokenized_length`

only the first declaration encountered will be used; the rest will be ignored.
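
The "first declaration wins" behavior can be sketched as follows (hypothetical resolution logic, using plain namespaces in place of imported `script.py` modules):

```python
from types import SimpleNamespace

# Two loaded extensions, in --extensions order; both define custom_generate_reply.
ext_a = SimpleNamespace(custom_generate_reply=lambda *args: "reply from ext_a")
ext_b = SimpleNamespace(custom_generate_reply=lambda *args: "reply from ext_b")

def resolve_override(extensions, name):
    """Return the first extension's implementation of `name`; later ones are ignored."""
    for ext in extensions:
        func = getattr(ext, name, None)
        if func is not None:
            return func
    return None

generate = resolve_override([ext_a, ext_b], "custom_generate_reply")
print(generate())  # reply from ext_a
```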

## A full example

The source code below can be found at [extensions/example/script.py](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/example/script.py).

```python
"""
An example of extension. It does nothing, but you can add transformations
before the return statements to customize the webui behavior.

Starting from history_modifier and ending in output_modifier, the
functions are declared in the same order that they are called at
generation time.
"""

import gradio as gr
import torch
from transformers import LogitsProcessor

from modules import chat, shared
from modules.text_generation import (
    decode,
    encode,
    generate_reply,
)

params = {
    "display_name": "Example Extension",
    "is_tab": False,
}

class MyLogits(LogitsProcessor):
    """
    Manipulates the probabilities for the next token before it gets sampled.
    Used in the logits_processor_modifier function below.
    """
    def __init__(self):
        pass

    def __call__(self, input_ids, scores):
        # probs = torch.softmax(scores, dim=-1, dtype=torch.float)
        # probs[0] /= probs[0].sum()
        # scores = torch.log(probs / (1 - probs))
        return scores

def history_modifier(history):
    """
    Modifies the chat history.
    Only used in chat mode.
    """
    return history

def state_modifier(state):
    """
    Modifies the state variable, which is a dictionary containing the input
    values in the UI like sliders and checkboxes.
    """
    return state

def chat_input_modifier(text, visible_text, state):
    """
    Modifies the user input string in chat mode (visible_text).
    You can also modify the internal representation of the user
    input (text) to change how it will appear in the prompt.
    """
    return text, visible_text

def input_modifier(string, state):
    """
    In default/notebook modes, modifies the whole prompt.

    In chat mode, it is the same as chat_input_modifier but only applied
    to "text", here called "string", and not to "visible_text".
    """
    return string

def bot_prefix_modifier(string, state):
    """
    Modifies the prefix for the next bot reply in chat mode.
    By default, the prefix will be something like "Bot Name:".
    """
    return string

def tokenizer_modifier(state, prompt, input_ids, input_embeds):
    """
    Modifies the input ids and embeds.
    Used by the multimodal extension to put image embeddings in the prompt.
    Only used by loaders that use the transformers library for sampling.
    """
    return prompt, input_ids, input_embeds

def logits_processor_modifier(processor_list, input_ids):
    """
    Adds logits processors to the list, allowing you to access and modify
    the next token probabilities.
    Only used by loaders that use the transformers library for sampling.
    """
    processor_list.append(MyLogits())
    return processor_list

def output_modifier(string, state):
    """
    Modifies the LLM output before it gets presented.

    In chat mode, the modified version goes into history['visible'],
    and the original version goes into history['internal'].
    """
    return string

def custom_generate_chat_prompt(user_input, state, **kwargs):
    """
    Replaces the function that generates the prompt from the chat history.
    Only used in chat mode.
    """
    result = chat.generate_chat_prompt(user_input, state, **kwargs)
    return result

def custom_css():
    """
    Returns a CSS string that gets appended to the CSS for the webui.
    """
    return ''

def custom_js():
    """
    Returns a javascript string that gets appended to the javascript
    for the webui.
    """
    return ''

def setup():
    """
    Gets executed only once, when the extension is imported.
    """
    pass

def ui():
    """
    Gets executed when the UI is drawn. Custom gradio elements and
    their corresponding event handlers should be defined here.

    To learn about gradio components, check out the docs:
    https://gradio.app/docs/
    """
    pass
```

docs/GPTQ-models-(4-bit-mode).md (added)

GPTQ is a clever quantization algorithm that lightly reoptimizes the weights during quantization so that the accuracy loss is compensated relative to a round-to-nearest quantization. See the paper for more details: https://arxiv.org/abs/2210.17323

4-bit GPTQ models reduce VRAM usage by about 75%. So LLaMA-7B fits into a 6GB GPU, and LLaMA-30B fits into a 24GB GPU.
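
As a back-of-the-envelope check (weights only, ignoring activations, group scales/zeros, and CUDA overhead; LLaMA-30B actually has about 33B parameters):

```python
def weight_size_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

for name, n in [("LLaMA-7B", 7e9), ("LLaMA-30B", 33e9)]:
    fp16 = weight_size_gb(n, 16)   # unquantized half-precision baseline
    int4 = weight_size_gb(n, 4)    # 4-bit GPTQ
    print(f"{name}: fp16 ~ {fp16:.1f} GB, 4-bit ~ {int4:.1f} GB ({1 - int4 / fp16:.0%} smaller)")
```

This matches the figures above: about 3.5 GB of weights for LLaMA-7B (fits in 6GB) and about 16.5 GB for LLaMA-30B (fits in 24GB), a 75% reduction relative to fp16.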

## Overview

There are two ways of loading GPTQ models in the web UI at the moment:

* Using AutoGPTQ:
  * supports more models
  * standardized (no need to guess any parameter)
  * is a proper Python library
  * ~no wheels are presently available so it requires manual compilation~
  * supports loading both triton and cuda models

* Using GPTQ-for-LLaMa directly:
  * faster CPU offloading
  * faster multi-GPU inference
  * supports loading LoRAs using a monkey patch
  * requires you to manually figure out the wbits/groupsize/model_type parameters for the model to be able to load it
  * supports either only cuda or only triton depending on the branch
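
In practice, model folder names often encode these parameters (e.g. `llama-13b-4bit-128g`), so they can usually be read off by eye. A hypothetical helper illustrating that heuristic (GPTQ-for-LLaMa itself takes the values as explicit flags):

```python
import re

def guess_gptq_params(model_name):
    """Guess wbits/groupsize from a model name like 'llama-13b-4bit-128g'.
    Illustrative heuristic only; returns None for anything it cannot infer."""
    name = model_name.lower()
    wbits_match = re.search(r'(\d+)[\s\-_]?bit', name)
    group_match = re.search(r'(\d+)g(?:roupsize)?\b', name)
    wbits = int(wbits_match.group(1)) if wbits_match else None
    groupsize = int(group_match.group(1)) if group_match else None
    return wbits, groupsize

print(guess_gptq_params("llama-13b-4bit-128g"))  # (4, 128)
print(guess_gptq_params("llama-7b-4bit"))        # (4, None)
```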

For creating new quantizations, I recommend using AutoGPTQ: https://github.com/PanQiWei/AutoGPTQ

## AutoGPTQ

### Installation

No additional steps are necessary as AutoGPTQ is already in the `requirements.txt` for the webui. If you still want or need to install it manually for whatever reason, these are the commands:

```
conda activate textgen
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
pip install .
```

The last command requires `nvcc` to be installed (see the [instructions below](https://github.com/oobabooga/text-generation-webui/blob/main/docs/GPTQ-models-(4-bit-mode).md#step-1-install-nvcc)).

### Usage

When you quantize a model using AutoGPTQ, a folder containing a file called `quantize_config.json` will be generated. Place that folder inside your `models/` folder and load it with the `--autogptq` flag:

```
python server.py --autogptq --model model_name
```

Alternatively, check the `autogptq` box in the "Model" tab of the UI before loading the model.

### Offloading

In order to do CPU offloading or multi-gpu inference with AutoGPTQ, use the `--gpu-memory` flag. It is currently somewhat slower than offloading with the `--pre_layer` option in GPTQ-for-LLaMa.

For CPU offloading:

```
python server.py --autogptq --gpu-memory 3000MiB --model model_name
```

For multi-GPU inference:

```
python server.py --autogptq --gpu-memory 3000MiB 6000MiB --model model_name
```

### Using LoRAs with AutoGPTQ

Not supported yet.

## GPTQ-for-LLaMa

GPTQ-for-LLaMa is the original adaptation of GPTQ for the LLaMA model. It was made possible by [@qwopqwop200](https://github.com/qwopqwop200/GPTQ-for-LLaMa): https://github.com/qwopqwop200/GPTQ-for-LLaMa

Different branches of GPTQ-for-LLaMa are currently available, including:

| Branch | Comment |
|----|----|
| [Old CUDA branch (recommended)](https://github.com/oobabooga/GPTQ-for-LLaMa/) | The fastest branch, works on Windows and Linux. |
| [Up-to-date triton branch](https://github.com/qwopqwop200/GPTQ-for-LLaMa) | Slightly more precise than the old CUDA branch from 13b upwards, significantly more precise for 7b. 2x slower for small context size and only works on Linux. |
| [Up-to-date CUDA branch](https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda) | As precise as the up-to-date triton branch, 10x slower than the old cuda branch for small context size. |

Overall, I recommend using the old CUDA branch. It is included by default in the one-click-installer for this web UI.

### Installation

Start by cloning GPTQ-for-LLaMa into your `text-generation-webui/repositories` folder:

```
mkdir repositories
cd repositories
git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
```

If you want to use the up-to-date CUDA or triton branches instead of the old CUDA branch, use these commands:

```
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda
```

```
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b triton
```

Next you need to install the CUDA extensions. You can do that either by installing the precompiled wheels, or by compiling the wheels yourself.

### Precompiled wheels

Kindly provided by our friend jllllll: https://github.com/jllllll/GPTQ-for-LLaMa-Wheels

Windows:

```
pip install https://github.com/jllllll/GPTQ-for-LLaMa-Wheels/raw/main/quant_cuda-0.0.0-cp310-cp310-win_amd64.whl
```

Linux:

```
pip install https://github.com/jllllll/GPTQ-for-LLaMa-Wheels/raw/Linux-x64/quant_cuda-0.0.0-cp310-cp310-linux_x86_64.whl
```

### Manual installation

#### Step 1: install nvcc

```
conda activate textgen
conda install -c conda-forge cudatoolkit-dev
```

The command above takes some 10 minutes to run and shows no progress bar or updates along the way.

You are also going to need to have a C++ compiler installed. On Linux, `sudo apt install build-essential` or equivalent is enough.

If you're using an older version of CUDA toolkit (e.g. 11.7) but the latest version of `gcc` and `g++` (12.0+), you should downgrade with: `conda install -c conda-forge gxx==11.3.0`. Kernel compilation will fail otherwise.

#### Step 2: compile the CUDA extensions

```
cd repositories/GPTQ-for-LLaMa
python setup_cuda.py install
```

### Getting pre-converted LLaMA weights

* Direct download (recommended):

https://huggingface.co/Neko-Institute-of-Science/LLaMA-7B-4bit-128g

https://huggingface.co/Neko-Institute-of-Science/LLaMA-13B-4bit-128g

https://huggingface.co/Neko-Institute-of-Science/LLaMA-30B-4bit-128g

https://huggingface.co/Neko-Institute-of-Science/LLaMA-65B-4bit-128g

These models were converted with `desc_act=True`. They work just fine with ExLlama. For AutoGPTQ, they will only work on Linux with the `triton` option checked.

* Torrent:

https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483891617

https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483941105

These models were converted with `desc_act=False`. As such, they are less accurate, but they work with AutoGPTQ on Windows. The `128g` versions are better from 13b upwards, and worse for 7b. The tokenizer files in the torrents are outdated, in particular the files called `tokenizer_config.json` and `special_tokens_map.json`. Here you can find those files: https://huggingface.co/oobabooga/llama-tokenizer
|
164 |
+
|
165 |
+
### Starting the web UI:
|
166 |
+
|
167 |
+
Use the `--gptq-for-llama` flag.
|
168 |
+
|
169 |
+
For the models converted without `group-size`:
|
170 |
+
|
171 |
+
```
|
172 |
+
python server.py --model llama-7b-4bit --gptq-for-llama
|
173 |
+
```
|
174 |
+
|
175 |
+
For the models converted with `group-size`:
|
176 |
+
|
177 |
+
```
|
178 |
+
python server.py --model llama-13b-4bit-128g --gptq-for-llama --wbits 4 --groupsize 128
|
179 |
+
```
|
180 |
+
|
181 |
+
The command-line flags `--wbits` and `--groupsize` are automatically detected based on the folder names in many cases.
|
182 |
+
|
183 |
+
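As a rough illustration, folder-name detection of this kind can be done with a couple of regular expressions. This is a hedged sketch, not the web UI's actual detection code; the function name `detect_quant_params` is made up for this example:

```python
import re


def detect_quant_params(model_name: str):
    """Guess --wbits and --groupsize from a model folder name.

    Illustrative only: the real web UI may use different patterns.
    Returns (wbits, groupsize), with None for anything not found.
    """
    wbits = groupsize = None

    # e.g. "llama-7b-4bit" -> wbits = 4
    m = re.search(r"(\d+)bit", model_name)
    if m:
        wbits = int(m.group(1))

    # e.g. "llama-13b-4bit-128g" -> groupsize = 128
    m = re.search(r"(\d+)g(?:-|$)", model_name)
    if m:
        groupsize = int(m.group(1))

    return wbits, groupsize


print(detect_quant_params("llama-13b-4bit-128g"))  # (4, 128)
print(detect_quant_params("llama-7b-4bit"))        # (4, None)
```

When detection fails (for example, an unusual folder name), you still need to pass `--wbits` and `--groupsize` explicitly as shown above.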
### CPU offloading

It is possible to offload part of the layers of the 4-bit model to the CPU with the `--pre_layer` flag. The higher the number after `--pre_layer`, the more layers will be allocated to the GPU.

With this command, I can run llama-7b with 4GB VRAM:

```
python server.py --model llama-7b-4bit --pre_layer 20
```

This is the performance:

```
Output generated in 123.79 seconds (1.61 tokens/s, 199 tokens)
```

You can also use multiple GPUs with `--pre_layer` if you are using the oobabooga fork of GPTQ. For example, `--pre_layer 30 60` will load a LLaMA-30B model half onto your first GPU and half onto your second, while `--pre_layer 20 40` will load 20 layers onto GPU-0, 20 layers onto GPU-1, and leave 20 layers offloaded to the CPU.
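The multi-GPU values above read as cumulative layer boundaries. A small sketch of that arithmetic (assuming LLaMA-30B's 60 transformer layers; this is an illustration of the flag's semantics, not the web UI's code):

```python
def split_layers(pre_layer, total_layers):
    """Turn cumulative --pre_layer boundaries into per-device layer counts.

    Illustrative only. Returns (gpu_layer_counts, cpu_layer_count).
    """
    counts = []
    prev = 0
    for boundary in pre_layer:
        # Each value marks where the next device's slice ends
        counts.append(boundary - prev)
        prev = boundary
    # Whatever remains past the last boundary stays on the CPU
    cpu_layers = max(total_layers - prev, 0)
    return counts, cpu_layers


print(split_layers([30, 60], 60))  # ([30, 30], 0)  -> half and half, no CPU
print(split_layers([20, 40], 60))  # ([20, 20], 20) -> 20 layers left on CPU
```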
### Using LoRAs with GPTQ-for-LLaMa

This requires using a monkey patch that is supported by this web UI: https://github.com/johnsmith0031/alpaca_lora_4bit

To use it:

1. Clone `johnsmith0031/alpaca_lora_4bit` into the repositories folder:

```
cd text-generation-webui/repositories
git clone https://github.com/johnsmith0031/alpaca_lora_4bit
```

⚠️ I have tested it with the following commit specifically: `2f704b93c961bf202937b10aac9322b092afdce0`

2. Install https://github.com/sterlind/GPTQ-for-LLaMa with this command:

```
pip install git+https://github.com/sterlind/GPTQ-for-LLaMa.git@lora_4bit
```

3. Start the UI with the `--monkey-patch` flag:

```
python server.py --model llama-7b-4bit-128g --listen --lora tloen_alpaca-lora-7b --monkey-patch
```
docs/LLaMA-model.md
LLaMA is a Large Language Model developed by Meta AI.

It was trained on more tokens than previous models. The result is that the smallest version, with 7 billion parameters, has performance comparable to GPT-3 with 175 billion parameters.

This guide covers usage through the official `transformers` implementation. For 4-bit mode, head over to [GPTQ models (4 bit mode)](GPTQ-models-(4-bit-mode).md).

## Getting the weights

### Option 1: pre-converted weights

* Direct download (recommended):

https://huggingface.co/Neko-Institute-of-Science/LLaMA-7B-HF

https://huggingface.co/Neko-Institute-of-Science/LLaMA-13B-HF

https://huggingface.co/Neko-Institute-of-Science/LLaMA-30B-HF

https://huggingface.co/Neko-Institute-of-Science/LLaMA-65B-HF

* Torrent:

https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1484235789

The tokenizer files in the torrent above are outdated, in particular `tokenizer_config.json` and `special_tokens_map.json`. You can find up-to-date versions of those files here: https://huggingface.co/oobabooga/llama-tokenizer

### Option 2: convert the weights yourself

1. Install the `protobuf` library:

```
pip install protobuf==3.20.1
```

2. Use the script below to convert the model in `.pth` format that you, a fellow academic, downloaded using Meta's official link.

If you have `transformers` installed in place:

```
python -m transformers.models.llama.convert_llama_weights_to_hf --input_dir /path/to/LLaMA --model_size 7B --output_dir /tmp/outputs/llama-7b
```

Otherwise, download [convert_llama_weights_to_hf.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py) first and run:

```
python convert_llama_weights_to_hf.py --input_dir /path/to/LLaMA --model_size 7B --output_dir /tmp/outputs/llama-7b
```

3. Move the `llama-7b` folder inside your `text-generation-webui/models` folder.

## Starting the web UI

```
python server.py --model llama-7b
```
docs/README.md
# text-generation-webui documentation

## Table of contents

* [Audio Notification](Audio-Notification.md)
* [Chat mode](Chat-mode.md)
* [DeepSpeed](DeepSpeed.md)
* [Docker](Docker.md)
* [ExLlama](ExLlama.md)
* [Extensions](Extensions.md)
* [GPTQ models (4 bit mode)](GPTQ-models-(4-bit-mode).md)
* [LLaMA model](LLaMA-model.md)
* [llama.cpp](llama.cpp.md)
* [LoRA](LoRA.md)
* [Low VRAM guide](Low-VRAM-guide.md)
* [RWKV model](RWKV-model.md)
* [Spell book](Spell-book.md)
* [System requirements](System-requirements.md)
* [Training LoRAs](Training-LoRAs.md)
* [Windows installation guide](Windows-installation-guide.md)
* [WSL installation guide](WSL-installation-guide.md)
docs/System-requirements.md
These are the VRAM and RAM requirements (in MiB) to run some example models **in 16-bit (default) precision**:

| model | VRAM (GPU) | RAM |
|:-----------------------|-------------:|--------:|
| arxiv_ai_gpt2 | 1512.37 | 5824.2 |
| blenderbot-1B-distill | 2441.75 | 4425.91 |
| opt-1.3b | 2509.61 | 4427.79 |
| gpt-neo-1.3b | 2605.27 | 5851.58 |
| opt-2.7b | 5058.05 | 4863.95 |
| gpt4chan_model_float16 | 11653.7 | 4437.71 |
| gpt-j-6B | 11653.7 | 5633.79 |
| galactica-6.7b | 12697.9 | 4429.89 |
| opt-6.7b | 12700 | 4368.66 |
| bloomz-7b1-p3 | 13483.1 | 4470.34 |
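As a back-of-the-envelope check on these numbers: 16-bit precision stores 2 bytes per parameter, so the weights alone for a model with N parameters need roughly N × 2 bytes. A hedged sketch (real usage adds CUDA context, activations, and framework overhead on top, which is why the measured figures differ):

```python
def estimate_vram_mib(n_params: float, bits: int = 16) -> float:
    """Rough VRAM needed just for the model weights, in MiB.

    Back-of-the-envelope only: measured requirements also include
    CUDA context, activations, and framework overhead.
    """
    bytes_per_param = bits / 8
    return n_params * bytes_per_param / 1024**2


# 6.7B parameters in 16-bit: close to the measured ~12700 MiB for opt-6.7b
print(round(estimate_vram_mib(6.7e9, 16)))  # 12779
# The same model in 8-bit needs roughly half
print(round(estimate_vram_mib(6.7e9, 8)))   # 6390
```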
#### GPU mode with 8-bit precision

Allows you to load models that would not normally fit into your GPU. Enabled by default for 13b and 20b models in this web UI.

| model | VRAM (GPU) | RAM |
|:---------------|-------------:|--------:|
| opt-13b | 12528.1 | 1152.39 |
| gpt-neox-20b | 20384 | 2291.7 |

#### CPU mode (32-bit precision)

A lot slower, but does not require a GPU.

On my i5-12400F, 6B models take around 10-20 seconds to respond in chat mode, and around 5 minutes to generate a 200-token completion.

| model | RAM |
|:-----------------------|---------:|
| arxiv_ai_gpt2 | 4430.82 |
| gpt-neo-1.3b | 6089.31 |
| opt-1.3b | 8411.12 |
| blenderbot-1B-distill | 8508.16 |
| opt-2.7b | 14969.3 |
| bloomz-7b1-p3 | 21371.2 |
| gpt-j-6B | 24200.3 |
| gpt4chan_model | 24246.3 |
| galactica-6.7b | 26561.4 |
| opt-6.7b | 29596.6 |
docs/llama.cpp.md
# llama.cpp

llama.cpp is the best backend in two important scenarios:

1) You don't have a GPU.
2) You want to run a model that doesn't fit into your GPU.

## Setting up the models

#### Pre-converted

Download the ggml model directly into your `text-generation-webui/models` folder, making sure that its name contains `ggml` somewhere and ends in `.bin`. It's a single file.

`q4_K_M` quantization is recommended.
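The naming rule above ("ggml" somewhere in the name, ".bin" extension) can be sketched as a quick check over the models folder. This is only an illustration of the convention, not the web UI's actual model loader, and the function name `find_ggml_models` is made up:

```python
from pathlib import Path


def find_ggml_models(models_dir: str):
    """List files that look like pre-converted ggml models.

    Mirrors the naming rule described above; illustrative only.
    """
    return sorted(
        p.name
        for p in Path(models_dir).glob("*.bin")
        if "ggml" in p.name.lower()
    )


# Hypothetical usage: a file like "llama-7b.ggmlv3.q4_K_M.bin" in the
# models folder would be listed; "pytorch_model.bin" would not.
print(find_ggml_models("models"))
```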
#### Convert Llama yourself

Follow the instructions in the llama.cpp README to generate a ggml file: https://github.com/ggerganov/llama.cpp#prepare-data--run

## GPU acceleration

Enabled with the `--n-gpu-layers` parameter.

* If you have enough VRAM, use a high number like `--n-gpu-layers 1000` to offload all layers to the GPU.
* Otherwise, start with a low number like `--n-gpu-layers 10` and gradually increase it until you run out of memory.

This feature works out of the box for NVIDIA GPUs on Linux (amd64) and Windows. For other GPUs, you need to uninstall `llama-cpp-python` with

```
pip uninstall -y llama-cpp-python
```

and then recompile it using the commands here: https://pypi.org/project/llama-cpp-python/

#### macOS

For macOS, these are the commands:

```
pip uninstall -y llama-cpp-python
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
```