rodrigomasini committed on
Commit
e5ac6f5
1 Parent(s): d612691

Upload 8 files

docs/Docker.md ADDED
@@ -0,0 +1,203 @@
+ Docker Compose is a way of installing and launching the web UI in an isolated Ubuntu image using only a few commands.
+
+ In order to create the image as described in the main README, you must have Docker Compose 2.17 or higher:
+
+ ```
+ ~$ docker compose version
+ Docker Compose version v2.17.2
+ ```
+
+ Make sure to also create the necessary symbolic links:
+
+ ```
+ cd text-generation-webui
+ ln -s docker/{Dockerfile,docker-compose.yml,.dockerignore} .
+ cp docker/.env.example .env
+ # Edit .env and set TORCH_CUDA_ARCH_LIST based on your GPU model
+ docker compose up --build
+ ```
+
+ # Table of contents
+
+ * [Docker Compose installation instructions](#docker-compose-installation-instructions)
+ * [Repository with additional Docker files](#dedicated-docker-repository)
+
+ # Docker Compose installation instructions
+
+ By [@loeken](https://github.com/loeken).
+
+ - [Ubuntu 22.04](#ubuntu-2204)
+   - [0. youtube video](#0-youtube-video)
+   - [1. update the drivers](#1-update-the-drivers)
+   - [2. reboot](#2-reboot)
+   - [3. install docker](#3-install-docker)
+   - [4. docker \& container toolkit](#4-docker--container-toolkit)
+   - [5. clone the repo](#5-clone-the-repo)
+   - [6. prepare models](#6-prepare-models)
+   - [7. prepare .env file](#7-prepare-env-file)
+   - [8. startup docker container](#8-startup-docker-container)
+ - [Manjaro](#manjaro)
+   - [update the drivers](#update-the-drivers)
+   - [reboot](#reboot)
+   - [docker \& container toolkit](#docker--container-toolkit)
+   - [continue with ubuntu task](#continue-with-ubuntu-task)
+ - [Windows](#windows)
+   - [0. youtube video](#0-youtube-video-1)
+   - [1. choco package manager](#1-choco-package-manager)
+   - [2. install drivers/dependencies](#2-install-driversdependencies)
+   - [3. install wsl](#3-install-wsl)
+   - [4. reboot](#4-reboot)
+   - [5. git clone \&\& startup](#5-git-clone--startup)
+   - [6. prepare models](#6-prepare-models-1)
+   - [7. startup](#7-startup)
+   - [notes](#notes)
+
+ ## Ubuntu 22.04
+
+ ### 0. youtube video
+ A video walking you through the setup can be found here:
+
+ [![oobabooga text-generation-webui setup in docker on ubuntu 22.04](https://img.youtube.com/vi/ELkKWYh8qOk/0.jpg)](https://www.youtube.com/watch?v=ELkKWYh8qOk)
+
+ ### 1. update the drivers
+ In the “Software Updater”, update the drivers to the latest version of the proprietary driver.
+
+ ### 2. reboot
+ Reboot to switch to the new driver.
+
+ ### 3. install docker
+ ```bash
+ sudo apt update
+ sudo apt-get install curl
+ sudo mkdir -m 0755 -p /etc/apt/keyrings
+ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
+ echo \
+   "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
+   "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
+   sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
+ sudo apt update
+ sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin docker-compose -y
+ sudo usermod -aG docker $USER
+ newgrp docker
+ ```
+
+ ### 4. docker & container toolkit
+ ```bash
+ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
+ echo "deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/ubuntu22.04/amd64 /" | \
+   sudo tee /etc/apt/sources.list.d/nvidia.list > /dev/null
+ sudo apt update
+ sudo apt install nvidia-docker2 nvidia-container-runtime -y
+ sudo systemctl restart docker
+ ```
+
+ ### 5. clone the repo
+ ```
+ git clone https://github.com/oobabooga/text-generation-webui
+ cd text-generation-webui
+ ```
+
+ ### 6. prepare models
+ Download and place the models inside the models folder. Tested with:
+
+ 4bit:
+ https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483891617
+ https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483941105
+
+ 8bit:
+ https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1484235789
+
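+ If you would rather pull a model from Hugging Face than use the torrents above, one option is the repository's `download-model.py` script, run from the host so that the files end up in the mounted `models/` folder. A minimal sketch, assuming the script accepts a `user/model` path as described in the main README (the model name is only a placeholder):
+
+ ```
+ python download-model.py facebook/opt-1.3b
+ ```
+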
+ ### 7. prepare .env file
+ Edit the `.env` values to your needs:
+ ```bash
+ cp .env.example .env
+ nano .env
+ ```
+
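+ As a rough sketch of what the file contains (the authoritative list of keys is in `.env.example`; the values below, and any key other than `TORCH_CUDA_ARCH_LIST`, are assumptions you should adjust for your setup):
+
+ ```bash
+ # compute capability of your GPU, e.g. 7.5 for RTX 20xx or 8.6 for RTX 30xx
+ TORCH_CUDA_ARCH_LIST=8.6
+ # extra command-line flags passed to server.py inside the container (assumed key)
+ CLI_ARGS=--listen
+ ```
+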
+ ### 8. startup docker container
+ ```bash
+ docker compose up --build
+ ```
+
+ ## Manjaro
+ Manjaro/Arch is similar to Ubuntu; only the dependency installation is more convenient.
+
+ ### update the drivers
+ ```bash
+ sudo mhwd -a pci nonfree 0300
+ ```
+ ### reboot
+ ```bash
+ reboot
+ ```
+ ### docker & container toolkit
+ ```bash
+ yay -S docker docker-compose buildkit gcc nvidia-docker
+ sudo usermod -aG docker $USER
+ newgrp docker
+ sudo systemctl restart docker # required by nvidia-container-runtime
+ ```
+
+ ### continue with ubuntu task
+ Continue at [5. clone the repo](#5-clone-the-repo).
+
+ ## Windows
+ ### 0. youtube video
+ A video walking you through the setup can be found here:
+ [![oobabooga text-generation-webui setup in docker on windows 11](https://img.youtube.com/vi/ejH4w5b5kFQ/0.jpg)](https://www.youtube.com/watch?v=ejH4w5b5kFQ)
+
+ ### 1. choco package manager
+ Install the package manager (https://chocolatey.org/):
+ ```
+ Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))
+ ```
+
+ ### 2. install drivers/dependencies
+ ```
+ choco install nvidia-display-driver cuda git docker-desktop
+ ```
+
+ ### 3. install wsl
+ ```
+ wsl --install
+ ```
+
+ ### 4. reboot
+ After the reboot, enter a username and password in WSL.
+
+ ### 5. git clone && startup
+ Clone the repo and edit the `.env` values to your needs:
+ ```
+ cd Desktop
+ git clone https://github.com/oobabooga/text-generation-webui
+ cd text-generation-webui
+ COPY .env.example .env
+ notepad .env
+ ```
+
+ ### 6. prepare models
+ Download and place the models inside the models folder. Tested with:
+
+ 4bit: https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483891617 https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483941105
+
+ 8bit: https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1484235789
+
+ ### 7. startup
+ ```
+ docker compose up
+ ```
+
+ ## notes
+
+ On older Ubuntu versions, you can manually install the docker compose plugin like this:
+ ```
+ DOCKER_CONFIG=${DOCKER_CONFIG:-$HOME/.docker}
+ mkdir -p $DOCKER_CONFIG/cli-plugins
+ curl -SL https://github.com/docker/compose/releases/download/v2.17.2/docker-compose-linux-x86_64 -o $DOCKER_CONFIG/cli-plugins/docker-compose
+ chmod +x $DOCKER_CONFIG/cli-plugins/docker-compose
+ export PATH="$HOME/.docker/cli-plugins:$PATH"
+ ```
+
+ # Dedicated docker repository
+
+ An external repository maintains a docker wrapper for this project as well as several pre-configured 'one-click' `docker compose` variants (e.g., updated branches of GPTQ). It can be found at: [Atinoda/text-generation-webui-docker](https://github.com/Atinoda/text-generation-webui-docker).
+
docs/ExLlama.md ADDED
@@ -0,0 +1,22 @@
+ # ExLlama
+
+ ### About
+
+ ExLlama is an extremely optimized GPTQ backend for LLaMA models. It features much lower VRAM usage and much higher speeds due to not relying on unoptimized transformers code.
+
+ ### Usage
+
+ Configure text-generation-webui to use exllama via the UI or command line:
+ - In the "Model" tab, set "Loader" to "exllama"
+ - Specify `--loader exllama` on the command line, as in the example below
+
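+ For example, to launch the server with an ExLlama-compatible GPTQ model that is already in your `models/` folder (the model name below is just a placeholder):
+
+ ```
+ python server.py --loader exllama --model llama-7b-4bit-128g
+ ```
+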
+ ### Manual setup
+
+ No additional installation steps are necessary since an exllama package is already included in `requirements.txt`. If this package fails to install for some reason, you can install it manually by cloning the original repository into your `repositories/` folder:
+
+ ```
+ mkdir repositories
+ cd repositories
+ git clone https://github.com/turboderp/exllama
+ ```
+
docs/Extensions.md ADDED
@@ -0,0 +1,244 @@
+ # Extensions
+
+ Extensions are defined by files named `script.py` inside subfolders of `text-generation-webui/extensions`. They are loaded at startup if the folder name is specified after the `--extensions` flag.
+
+ For instance, `extensions/silero_tts/script.py` gets loaded with `python server.py --extensions silero_tts`.
+
+ ## [text-generation-webui-extensions](https://github.com/oobabooga/text-generation-webui-extensions)
+
+ The repository above contains a directory of user extensions.
+
+ If you create an extension, you are welcome to host it in a GitHub repository and submit a PR adding it to the list.
+
+ ## Built-in extensions
+
+ |Extension|Description|
+ |---------|-----------|
+ |[api](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/api)| Creates an API with two endpoints, one for streaming at `/api/v1/stream` (port 5005) and another for blocking at `/api/v1/generate` (port 5000). This is the main API for the webui. |
+ |[openai](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/openai)| Creates an API that mimics the OpenAI API and can be used as a drop-in replacement. |
+ |[multimodal](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal) | Adds multimodality support (text+images). For a detailed description see [README.md](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal/README.md) in the extension directory. |
+ |[google_translate](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/google_translate)| Automatically translates inputs and outputs using Google Translate.|
+ |[silero_tts](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/silero_tts)| Text-to-speech extension using [Silero](https://github.com/snakers4/silero-models). When used in chat mode, responses are replaced with an audio widget. |
+ |[elevenlabs_tts](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/elevenlabs_tts)| Text-to-speech extension using the [ElevenLabs](https://beta.elevenlabs.io/) API. You need an API key to use it. |
+ |[whisper_stt](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/whisper_stt)| Allows you to enter your inputs in chat mode using your microphone. |
+ |[sd_api_pictures](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/sd_api_pictures)| Allows you to request pictures from the bot in chat mode, which will be generated using the AUTOMATIC1111 Stable Diffusion API. See examples [here](https://github.com/oobabooga/text-generation-webui/pull/309). |
+ |[character_bias](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/character_bias)| Just a very simple example that adds a hidden string at the beginning of the bot's reply in chat mode. |
+ |[send_pictures](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/send_pictures/)| Creates an image upload field that can be used to send images to the bot in chat mode. Captions are automatically generated using BLIP. |
+ |[gallery](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/gallery/)| Creates a gallery with the chat characters and their pictures. |
+ |[superbooga](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/superbooga)| An extension that uses ChromaDB to create an arbitrarily large pseudocontext, taking as input text files, URLs, or pasted text. Based on https://github.com/kaiokendev/superbig. |
+ |[ngrok](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/ngrok)| Allows you to access the web UI remotely using the ngrok reverse tunnel service (free). It's an alternative to the built-in Gradio `--share` feature. |
+ |[perplexity_colors](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/perplexity_colors)| Colors each token in the output text by its associated probability, as derived from the model logits. |
+
+ ## How to write an extension
+
+ The extensions framework is based on special functions and variables that you can define in `script.py`. The functions are the following:
+
+ | Function | Description |
+ |-------------|-------------|
+ | `def setup()` | Is executed when the extension gets imported. |
+ | `def ui()` | Creates custom gradio elements when the UI is launched. |
+ | `def custom_css()` | Returns custom CSS as a string. It is applied whenever the web UI is loaded. |
+ | `def custom_js()` | Same as above but for JavaScript. |
+ | `def input_modifier(string, state)` | Modifies the input string before it enters the model. In chat mode, it is applied to the user message. Otherwise, it is applied to the entire prompt. |
+ | `def output_modifier(string, state)` | Modifies the output string before it is presented in the UI. In chat mode, it is applied to the bot's reply. Otherwise, it is applied to the entire output. |
+ | `def chat_input_modifier(text, visible_text, state)` | Modifies both the visible and internal inputs in chat mode. Can be used to hijack the chat input with custom content. |
+ | `def bot_prefix_modifier(string, state)` | Applied in chat mode to the prefix for the bot's reply. |
+ | `def state_modifier(state)` | Modifies the dictionary containing the UI input parameters before it is used by the text generation functions. |
+ | `def history_modifier(history)` | Modifies the chat history before the text generation in chat mode begins. |
+ | `def custom_generate_reply(...)` | Overrides the main text generation function. |
+ | `def custom_generate_chat_prompt(...)` | Overrides the prompt generator in chat mode. |
+ | `def tokenizer_modifier(state, prompt, input_ids, input_embeds)` | Modifies the `input_ids`/`input_embeds` fed to the model. Should return `prompt`, `input_ids`, `input_embeds`. See the `multimodal` extension for an example. |
+ | `def custom_tokenized_length(prompt)` | Used in conjunction with `tokenizer_modifier`, returns the length in tokens of `prompt`. See the `multimodal` extension for an example. |
+ | `def logits_processor_modifier(processor_list, input_ids)` | Adds logits processors to the list, allowing you to access and modify the next token probabilities. See the full example below. Only used by loaders that use the transformers library for sampling. |
+
+ Additionally, you can define a special `params` dictionary. In it, the `display_name` key is used to define the displayed name of the extension in the UI, and the `is_tab` key is used to define whether the extension should appear in a new tab. By default, extensions appear at the bottom of the "Text generation" tab.
+
+ Example:
+
+ ```python
+ params = {
+     "display_name": "Google Translate",
+     "is_tab": True,
+ }
+ ```
+
+ The `params` dict may also contain variables that you want to be customizable through a `settings.yaml` file. For instance, assuming the extension is in `extensions/google_translate`, the variable `language string` in
+
+ ```python
+ params = {
+     "display_name": "Google Translate",
+     "is_tab": True,
+     "language string": "jp"
+ }
+ ```
+
+ can be customized by adding a key called `google_translate-language string` to `settings.yaml`:
+
+ ```yaml
+ google_translate-language string: 'fr'
+ ```
+
+ That is, the syntax for the key is `extension_name-variable_name`.
+
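+ As a minimal sketch of how these pieces fit together (the folder name `uppercase` and its behavior are made up for illustration), an `extensions/uppercase/script.py` that upper-cases every reply could look like this and would be loaded with `python server.py --extensions uppercase`:
+
+ ```python
+ params = {
+     "display_name": "Uppercase",  # name shown in the UI
+     "is_tab": False,              # stay at the bottom of the "Text generation" tab
+ }
+
+ def output_modifier(string, state):
+     # Upper-case the reply before it is displayed in the UI.
+     return string.upper()
+ ```
+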
+ ## Using multiple extensions at the same time
+
+ You can activate more than one extension at a time by providing their names separated by spaces after `--extensions`. The input, output, and bot prefix modifiers will be applied in the specified order.
+
+ Example:
+
+ ```
+ python server.py --extensions enthusiasm translate # First apply enthusiasm, then translate
+ python server.py --extensions translate enthusiasm # First apply translate, then enthusiasm
+ ```
+
+ Note that for:
+ - `custom_generate_chat_prompt`
+ - `custom_generate_reply`
+ - `custom_tokenized_length`
+
+ only the first declaration encountered will be used and the rest will be ignored.
+
+ ## A full example
+
+ The source code below can be found at [extensions/example/script.py](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/example/script.py).
+
+ ```python
+ """
+ An example of an extension. It does nothing, but you can add transformations
+ before the return statements to customize the webui behavior.
+
+ Starting from history_modifier and ending in output_modifier, the
+ functions are declared in the same order that they are called at
+ generation time.
+ """
+
+ import gradio as gr
+ import torch
+ from transformers import LogitsProcessor
+
+ from modules import chat, shared
+ from modules.text_generation import (
+     decode,
+     encode,
+     generate_reply,
+ )
+
+ params = {
+     "display_name": "Example Extension",
+     "is_tab": False,
+ }
+
+ class MyLogits(LogitsProcessor):
+     """
+     Manipulates the probabilities for the next token before it gets sampled.
+     Used in the logits_processor_modifier function below.
+     """
+     def __init__(self):
+         pass
+
+     def __call__(self, input_ids, scores):
+         # probs = torch.softmax(scores, dim=-1, dtype=torch.float)
+         # probs[0] /= probs[0].sum()
+         # scores = torch.log(probs / (1 - probs))
+         return scores
+
+ def history_modifier(history):
+     """
+     Modifies the chat history.
+     Only used in chat mode.
+     """
+     return history
+
+ def state_modifier(state):
+     """
+     Modifies the state variable, which is a dictionary containing the input
+     values in the UI like sliders and checkboxes.
+     """
+     return state
+
+ def chat_input_modifier(text, visible_text, state):
+     """
+     Modifies the user input string in chat mode (visible_text).
+     You can also modify the internal representation of the user
+     input (text) to change how it will appear in the prompt.
+     """
+     return text, visible_text
+
+ def input_modifier(string, state):
+     """
+     In default/notebook modes, modifies the whole prompt.
+
+     In chat mode, it is the same as chat_input_modifier but only applied
+     to "text", here called "string", and not to "visible_text".
+     """
+     return string
+
+ def bot_prefix_modifier(string, state):
+     """
+     Modifies the prefix for the next bot reply in chat mode.
+     By default, the prefix will be something like "Bot Name:".
+     """
+     return string
+
+ def tokenizer_modifier(state, prompt, input_ids, input_embeds):
+     """
+     Modifies the input ids and embeds.
+     Used by the multimodal extension to put image embeddings in the prompt.
+     Only used by loaders that use the transformers library for sampling.
+     """
+     return prompt, input_ids, input_embeds
+
+ def logits_processor_modifier(processor_list, input_ids):
+     """
+     Adds logits processors to the list, allowing you to access and modify
+     the next token probabilities.
+     Only used by loaders that use the transformers library for sampling.
+     """
+     processor_list.append(MyLogits())
+     return processor_list
+
+ def output_modifier(string, state):
+     """
+     Modifies the LLM output before it gets presented.
+
+     In chat mode, the modified version goes into history['visible'],
+     and the original version goes into history['internal'].
+     """
+     return string
+
+ def custom_generate_chat_prompt(user_input, state, **kwargs):
+     """
+     Replaces the function that generates the prompt from the chat history.
+     Only used in chat mode.
+     """
+     result = chat.generate_chat_prompt(user_input, state, **kwargs)
+     return result
+
+ def custom_css():
+     """
+     Returns a CSS string that gets appended to the CSS for the webui.
+     """
+     return ''
+
+ def custom_js():
+     """
+     Returns a javascript string that gets appended to the javascript
+     for the webui.
+     """
+     return ''
+
+ def setup():
+     """
+     Gets executed only once, when the extension is imported.
+     """
+     pass
+
+ def ui():
+     """
+     Gets executed when the UI is drawn. Custom gradio elements and
+     their corresponding event handlers should be defined here.
+
+     To learn about gradio components, check out the docs:
+     https://gradio.app/docs/
+     """
+     pass
+ ```
docs/GPTQ-models-(4-bit-mode).md ADDED
@@ -0,0 +1,228 @@
+ GPTQ is a clever quantization algorithm that lightly reoptimizes the weights during quantization so that the accuracy loss is compensated relative to a round-to-nearest quantization. See the paper for more details: https://arxiv.org/abs/2210.17323
+
+ 4-bit GPTQ models reduce VRAM usage by about 75%. So LLaMA-7B fits into a 6GB GPU, and LLaMA-30B fits into a 24GB GPU.
+
+ ## Overview
+
+ There are two ways of loading GPTQ models in the web UI at the moment:
+
+ * Using AutoGPTQ:
+   * supports more models
+   * standardized (no need to guess any parameter)
+   * is a proper Python library
+   * ~no wheels are presently available so it requires manual compilation~
+   * supports loading both triton and cuda models
+
+ * Using GPTQ-for-LLaMa directly:
+   * faster CPU offloading
+   * faster multi-GPU inference
+   * supports loading LoRAs using a monkey patch
+   * requires you to manually figure out the wbits/groupsize/model_type parameters for the model to be able to load it
+   * supports either only cuda or only triton depending on the branch
+
+ For creating new quantizations, I recommend using AutoGPTQ: https://github.com/PanQiWei/AutoGPTQ
+
+ ## AutoGPTQ
+
+ ### Installation
+
+ No additional steps are necessary as AutoGPTQ is already in the `requirements.txt` for the webui. If you still want or need to install it manually for whatever reason, these are the commands:
+
+ ```
+ conda activate textgen
+ git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
+ pip install .
+ ```
+
+ The last command requires `nvcc` to be installed (see the [instructions below](https://github.com/oobabooga/text-generation-webui/blob/main/docs/GPTQ-models-(4-bit-mode).md#step-1-install-nvcc)).
+
+ ### Usage
+
+ When you quantize a model using AutoGPTQ, a folder containing a file called `quantize_config.json` will be generated. Place that folder inside your `models/` folder and load it with the `--autogptq` flag:
+
+ ```
+ python server.py --autogptq --model model_name
+ ```
+
+ Alternatively, check the `autogptq` box in the "Model" tab of the UI before loading the model.
+
+ ### Offloading
+
+ In order to do CPU offloading or multi-GPU inference with AutoGPTQ, use the `--gpu-memory` flag. It is currently somewhat slower than offloading with the `--pre_layer` option in GPTQ-for-LLaMa.
+
+ For CPU offloading:
+
+ ```
+ python server.py --autogptq --gpu-memory 3000MiB --model model_name
+ ```
+
+ For multi-GPU inference:
+
+ ```
+ python server.py --autogptq --gpu-memory 3000MiB 6000MiB --model model_name
+ ```
+
+ ### Using LoRAs with AutoGPTQ
+
+ Not supported yet.
+
+ ## GPTQ-for-LLaMa
+
+ GPTQ-for-LLaMa is the original adaptation of GPTQ for the LLaMA model. It was made possible by [@qwopqwop200](https://github.com/qwopqwop200/GPTQ-for-LLaMa): https://github.com/qwopqwop200/GPTQ-for-LLaMa
+
+ Different branches of GPTQ-for-LLaMa are currently available, including:
+
+ | Branch | Comment |
+ |----|----|
+ | [Old CUDA branch (recommended)](https://github.com/oobabooga/GPTQ-for-LLaMa/) | The fastest branch, works on Windows and Linux. |
+ | [Up-to-date triton branch](https://github.com/qwopqwop200/GPTQ-for-LLaMa) | Slightly more precise than the old CUDA branch from 13b upwards, significantly more precise for 7b. 2x slower for small context size and only works on Linux. |
+ | [Up-to-date CUDA branch](https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda) | As precise as the up-to-date triton branch, 10x slower than the old CUDA branch for small context size. |
+
+ Overall, I recommend using the old CUDA branch. It is included by default in the one-click-installer for this web UI.
+
+ ### Installation
+
+ Start by cloning GPTQ-for-LLaMa into your `text-generation-webui/repositories` folder:
+
+ ```
+ mkdir repositories
+ cd repositories
+ git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
+ ```
+
+ If you want to use the up-to-date CUDA or triton branches instead of the old CUDA branch, use these commands:
+
+ ```
+ git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda
+ ```
+
+ ```
+ git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b triton
+ ```
+
+ Next you need to install the CUDA extensions. You can do that either by installing the precompiled wheels, or by compiling the wheels yourself.
+
+ ### Precompiled wheels
+
+ Kindly provided by our friend jllllll: https://github.com/jllllll/GPTQ-for-LLaMa-Wheels
+
+ Windows:
+
+ ```
+ pip install https://github.com/jllllll/GPTQ-for-LLaMa-Wheels/raw/main/quant_cuda-0.0.0-cp310-cp310-win_amd64.whl
+ ```
+
+ Linux:
+
+ ```
+ pip install https://github.com/jllllll/GPTQ-for-LLaMa-Wheels/raw/Linux-x64/quant_cuda-0.0.0-cp310-cp310-linux_x86_64.whl
+ ```
+
+ ### Manual installation
+
+ #### Step 1: install nvcc
+
+ ```
+ conda activate textgen
+ conda install -c conda-forge cudatoolkit-dev
+ ```
+
+ The command above takes some 10 minutes to run and shows no progress bar or updates along the way.
+
+ You will also need a C++ compiler installed. On Linux, `sudo apt install build-essential` or equivalent is enough.
+
+ If you're using an older version of the CUDA toolkit (e.g. 11.7) but the latest version of `gcc` and `g++` (12.0+), you should downgrade with: `conda install -c conda-forge gxx==11.3.0`. Kernel compilation will fail otherwise.
+
+ #### Step 2: compile the CUDA extensions
+
+ ```
+ cd repositories/GPTQ-for-LLaMa
+ python setup_cuda.py install
+ ```
+
+ ### Getting pre-converted LLaMA weights
+
+ * Direct download (recommended):
+
+ https://huggingface.co/Neko-Institute-of-Science/LLaMA-7B-4bit-128g
+
+ https://huggingface.co/Neko-Institute-of-Science/LLaMA-13B-4bit-128g
+
+ https://huggingface.co/Neko-Institute-of-Science/LLaMA-30B-4bit-128g
+
+ https://huggingface.co/Neko-Institute-of-Science/LLaMA-65B-4bit-128g
+
+ These models were converted with `desc_act=True`. They work just fine with ExLlama. For AutoGPTQ, they will only work on Linux with the `triton` option checked.
+
+ * Torrent:
+
+ https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483891617
+
+ https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483941105
+
+ These models were converted with `desc_act=False`. As such, they are less accurate, but they work with AutoGPTQ on Windows. The `128g` versions are better from 13b upwards, and worse for 7b. The tokenizer files in the torrents are outdated, in particular the files called `tokenizer_config.json` and `special_tokens_map.json`. Here you can find those files: https://huggingface.co/oobabooga/llama-tokenizer
+
+ ### Starting the web UI
+
+ Use the `--gptq-for-llama` flag.
+
+ For the models converted without `group-size`:
+
+ ```
+ python server.py --model llama-7b-4bit --gptq-for-llama
+ ```
+
+ For the models converted with `group-size`:
+
+ ```
+ python server.py --model llama-13b-4bit-128g --gptq-for-llama --wbits 4 --groupsize 128
+ ```
+
+ The command-line flags `--wbits` and `--groupsize` are automatically detected based on the folder names in many cases.
+
+ ### CPU offloading
+
+ It is possible to offload part of the layers of the 4-bit model to the CPU with the `--pre_layer` flag. The higher the number after `--pre_layer`, the more layers will be allocated to the GPU.
+
+ With this command, I can run llama-7b with 4GB VRAM:
+
+ ```
+ python server.py --model llama-7b-4bit --pre_layer 20
+ ```
+
+ This is the performance:
+
+ ```
+ Output generated in 123.79 seconds (1.61 tokens/s, 199 tokens)
+ ```
+
+ You can also use multiple GPUs with `pre_layer` if using the oobabooga fork of GPTQ, e.g. `--pre_layer 30 60` will load a LLaMA-30B model half onto your first GPU and half onto your second, while `--pre_layer 20 40` will load 20 layers onto GPU-0, 20 layers onto GPU-1, and 20 layers offloaded to CPU.
+
+ ### Using LoRAs with GPTQ-for-LLaMa
+
+ This requires using a monkey patch that is supported by this web UI: https://github.com/johnsmith0031/alpaca_lora_4bit
+
+ To use it:
+
+ 1. Clone `johnsmith0031/alpaca_lora_4bit` into the repositories folder:
+
+ ```
+ cd text-generation-webui/repositories
+ git clone https://github.com/johnsmith0031/alpaca_lora_4bit
+ ```
+
+ ⚠️ I have tested it with the following commit specifically: `2f704b93c961bf202937b10aac9322b092afdce0`
+
+ 2. Install https://github.com/sterlind/GPTQ-for-LLaMa with this command:
+
+ ```
+ pip install git+https://github.com/sterlind/GPTQ-for-LLaMa.git@lora_4bit
+ ```
+
+ 3. Start the UI with the `--monkey-patch` flag:
+
+ ```
+ python server.py --model llama-7b-4bit-128g --listen --lora tloen_alpaca-lora-7b --monkey-patch
+ ```
+
docs/LLaMA-model.md ADDED
@@ -0,0 +1,56 @@
+ LLaMA is a Large Language Model developed by Meta AI.
+
+ It was trained on more tokens than previous models. The result is that the smallest version with 7 billion parameters has similar performance to GPT-3 with 175 billion parameters.
+
+ This guide will cover usage through the official `transformers` implementation. For 4-bit mode, head over to [GPTQ models (4 bit mode)](GPTQ-models-(4-bit-mode).md).
+
+ ## Getting the weights
+
+ ### Option 1: pre-converted weights
+
+ * Direct download (recommended):
+
+ https://huggingface.co/Neko-Institute-of-Science/LLaMA-7B-HF
+
+ https://huggingface.co/Neko-Institute-of-Science/LLaMA-13B-HF
+
+ https://huggingface.co/Neko-Institute-of-Science/LLaMA-30B-HF
+
+ https://huggingface.co/Neko-Institute-of-Science/LLaMA-65B-HF
+
+ * Torrent:
+
+ https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1484235789
+
+ The tokenizer files in the torrent above are outdated, in particular the files called `tokenizer_config.json` and `special_tokens_map.json`. Here you can find those files: https://huggingface.co/oobabooga/llama-tokenizer
+
+ ### Option 2: convert the weights yourself
+
+ 1. Install the `protobuf` library:
+
+ ```
+ pip install protobuf==3.20.1
+ ```
+
+ 2. Use the script below to convert the model in `.pth` format that you, a fellow academic, downloaded using Meta's official link.
+
+ If you have `transformers` installed:
+
+ ```
+ python -m transformers.models.llama.convert_llama_weights_to_hf --input_dir /path/to/LLaMA --model_size 7B --output_dir /tmp/outputs/llama-7b
+ ```
+
+ Otherwise download [convert_llama_weights_to_hf.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py) first and run:
+
+ ```
+ python convert_llama_weights_to_hf.py --input_dir /path/to/LLaMA --model_size 7B --output_dir /tmp/outputs/llama-7b
+ ```
+
+ 3. Move the `llama-7b` folder inside your `text-generation-webui/models` folder.
+
+ ## Starting the web UI
+
+ ```
+ python server.py --model llama-7b
+ ```
docs/README.md ADDED
@@ -0,0 +1,21 @@
+ # text-generation-webui documentation
+
+ ## Table of contents
+
+ * [Audio Notification](Audio-Notification.md)
+ * [Chat mode](Chat-mode.md)
+ * [DeepSpeed](DeepSpeed.md)
+ * [Docker](Docker.md)
+ * [ExLlama](ExLlama.md)
+ * [Extensions](Extensions.md)
+ * [GPTQ models (4 bit mode)](GPTQ-models-(4-bit-mode).md)
+ * [LLaMA model](LLaMA-model.md)
+ * [llama.cpp](llama.cpp.md)
+ * [LoRA](LoRA.md)
+ * [Low VRAM guide](Low-VRAM-guide.md)
+ * [RWKV model](RWKV-model.md)
+ * [Spell book](Spell-book.md)
+ * [System requirements](System-requirements.md)
+ * [Training LoRAs](Training-LoRAs.md)
+ * [Windows installation guide](Windows-installation-guide.md)
+ * [WSL installation guide](WSL-installation-guide.md)
docs/System-requirements.md ADDED
@@ -0,0 +1,42 @@
+ These are the VRAM and RAM requirements (in MiB) to run some examples of models **in 16-bit (default) precision**:
+
+ | model | VRAM (GPU) | RAM |
+ |:-----------------------|-------------:|--------:|
+ | arxiv_ai_gpt2 | 1512.37 | 5824.2 |
+ | blenderbot-1B-distill | 2441.75 | 4425.91 |
+ | opt-1.3b | 2509.61 | 4427.79 |
+ | gpt-neo-1.3b | 2605.27 | 5851.58 |
+ | opt-2.7b | 5058.05 | 4863.95 |
+ | gpt4chan_model_float16 | 11653.7 | 4437.71 |
+ | gpt-j-6B | 11653.7 | 5633.79 |
+ | galactica-6.7b | 12697.9 | 4429.89 |
+ | opt-6.7b | 12700 | 4368.66 |
+ | bloomz-7b1-p3 | 13483.1 | 4470.34 |
+
+ #### GPU mode with 8-bit precision
+
+ Allows you to load models that would not normally fit into your GPU. Enabled by default for 13b and 20b models in this web UI.
+
+ | model | VRAM (GPU) | RAM |
+ |:---------------|-------------:|--------:|
+ | opt-13b | 12528.1 | 1152.39 |
+ | gpt-neox-20b | 20384 | 2291.7 |
+
+ #### CPU mode (32-bit precision)
+
+ A lot slower, but does not require a GPU.
+
+ On my i5-12400F, 6B models take around 10-20 seconds to respond in chat mode, and around 5 minutes to generate a 200-token completion.
+
+ | model | RAM |
+ |:-----------------------|---------:|
+ | arxiv_ai_gpt2 | 4430.82 |
+ | gpt-neo-1.3b | 6089.31 |
+ | opt-1.3b | 8411.12 |
+ | blenderbot-1B-distill | 8508.16 |
+ | opt-2.7b | 14969.3 |
+ | bloomz-7b1-p3 | 21371.2 |
+ | gpt-j-6B | 24200.3 |
+ | gpt4chan_model | 24246.3 |
+ | galactica-6.7b | 26561.4 |
+ | opt-6.7b | 29596.6 |
docs/llama.cpp.md ADDED
@@ -0,0 +1,42 @@
+ # llama.cpp
+
+ llama.cpp is the best backend in two important scenarios:
+
+ 1) You don't have a GPU.
+ 2) You want to run a model that doesn't fit into your GPU.
+
+ ## Setting up the models
+
+ #### Pre-converted
+
+ Download the ggml model directly into your `text-generation-webui/models` folder, making sure that its name contains `ggml` somewhere and ends in `.bin`. It's a single file.
+
+ `q4_K_M` quantization is recommended.
+
+ #### Convert Llama yourself
+
+ Follow the instructions in the llama.cpp README to generate a ggml: https://github.com/ggerganov/llama.cpp#prepare-data--run
+
+ ## GPU acceleration
+
+ Enabled with the `--n-gpu-layers` parameter; an example launch command follows the list below.
+
+ * If you have enough VRAM, use a high number like `--n-gpu-layers 1000` to offload all layers to the GPU.
+ * Otherwise, start with a low number like `--n-gpu-layers 10` and then gradually increase it until you run out of memory.
+
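+ A sketch of a full launch command (the file name is a placeholder for whatever ggml model you placed in `models/`):
+
+ ```
+ python server.py --model llama-13b.ggmlv3.q4_K_M.bin --n-gpu-layers 35
+ ```
+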
+ This feature works out of the box for NVIDIA GPUs on Linux (amd64) or Windows. For other GPUs, you need to uninstall `llama-cpp-python` with
+
+ ```
+ pip uninstall -y llama-cpp-python
+ ```
+
+ and then recompile it using the commands here: https://pypi.org/project/llama-cpp-python/
+
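+ As one illustration of that pattern (the `LLAMA_CLBLAST` flag is an assumption on my part; check the llama-cpp-python documentation for the CMake flag that matches your hardware), a CLBlast build for AMD/Intel GPUs on Linux would look like:
+
+ ```
+ CMAKE_ARGS="-DLLAMA_CLBLAST=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
+ ```
+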
+ #### macOS
+
+ For macOS, these are the commands:
+
+ ```
+ pip uninstall -y llama-cpp-python
+ CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
+ ```