TheBloke committed
Commit
ad17c0a
1 Parent(s): b4393b0

Initial GPTQ model commit

Files changed (1)
  1. README.md +38 -13
README.md CHANGED
@@ -35,9 +35,20 @@ Multiple GPTQ parameter permutations are provided; see Provided Files below for
 
  Many thanks to William Beauchamp from [Chai](https://chai-research.com/) for providing the hardware for these quantisations!
 
- ## Required: latest version of Transformers
 
- Before trying these GPTQs, please update Transformers to the latest Github code:
 
  ```
  pip3 install git+https://github.com/huggingface/transformers
@@ -45,13 +56,11 @@ pip3 install git+https://github.com/huggingface/transformers
 
  If using a UI like text-generation-webui, make sure to do this in the Python environment of text-generation-webui.
 
- Note that at the time of writing, ExLlama is not yet compatible with the Llama 2 70B models, but support is coming soon.
 
  ## Repositories available
 
  * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ)
- * [Original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)
- * [My fp16 conversion of the unquantised PTH model files](https://huggingface.co/TheBloke/Llama-2-70B-chat-fp16)
 
  ## Prompt template: Llama-2-Chat
 
@@ -69,7 +78,7 @@ Each separate quant is in a different branch. See below for instructions on fet
 
  | Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description |
  | ------ | ---- | ---------- | -------------------- | --------- | ------------------- | --------- | ----------- |
- | main | 4 | 128 | False | 35.33 GB | False | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
  | gptq-4bit-32g-actorder_True | 4 | 32 | True | 40.66 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
  | gptq-4bit-64g-actorder_True | 4 | 64 | True | 37.99 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
  | gptq-4bit-128g-actorder_True | 4 | 128 | True | 36.65 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
@@ -87,19 +96,33 @@ git clone --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/L
  ```
  - In Python Transformers code, the branch is the `revision` parameter; see below.
 
- ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
 
- Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
 
  It is strongly recommended to use the text-generation-webui one-click-installers unless you know how to make a manual install.
 
- Before trying the model, first update Transformers to the latest Github code:
 
  ```
- pip3 install git+https://github.com/huggingface/transformers
  ```
 
- ExLlama is not currently compatible with Llama 2 70B but support is expected soon.
 
  1. Click the **Model tab**.
  2. Under **Download custom model or LoRA**, enter `TheBloke/Llama-2-70B-chat-GPTQ`.
@@ -107,7 +130,7 @@ ExLlama is not currently compatible with Llama 2 70B but support is expected soo
  - see Provided Files above for the list of branches for each option.
  3. Click **Download**.
  4. The model will start downloading. Once it's finished it will say "Done"
- 5. Set Loader to AutoGPTQ or GPTQ-for-LLaMA
  - If you use AutoGPTQ, make sure "No inject fused attention" is ticked
  6. In the top left, click the refresh icon next to **Model**.
  7. In the **Model** dropdown, choose the model you just downloaded: `TheBloke/Llama-2-70B-chat-GPTQ`
@@ -201,7 +224,9 @@ print(pipe(prompt_template)[0]['generated_text'])
 
  The files provided will work with AutoGPTQ (CUDA and Triton modes), GPTQ-for-LLaMa (only CUDA has been tested), and Occ4m's GPTQ-for-LLaMa fork.
 
- ExLlama is not currently compatible with Llama 2 70B models, but support is coming soon. Please see the Provided Files table above for per-file compatibility.
 
  <!-- footer start -->
  ## Discord
 
 
  Many thanks to William Beauchamp from [Chai](https://chai-research.com/) for providing the hardware for these quantisations!
 
+ ## ExLlama support for 70B is here!
 
+ As of [this commit](https://github.com/turboderp/exllama/commit/b3aea521859b83cfd889c4c00c05a323313b7fee), ExLlama has support for Llama 2 70B models.
+
+ Please make sure you update ExLlama to the latest version. If you are a text-generation-webui one-click user, you must first uninstall the ExLlama wheel, then clone ExLlama into `text-generation-webui/repositories`; full instructions are below.
+
+ Now that ExLlama supports these models, it is the recommended loader to use: performance should be better than with AutoGPTQ or GPTQ-for-LLaMa, and you will be able to use the higher-accuracy quants, e.g. 128g + Act Order.
+
+ Reminder: ExLlama does not support 3-bit models, so if you wish to try those quants you will need to use AutoGPTQ or GPTQ-for-LLaMa.
+
+ ## AutoGPTQ and GPTQ-for-LLaMa require the latest version of Transformers
+
+ If you plan to use any of these quants with AutoGPTQ or GPTQ-for-LLaMa, you will need to update Transformers to the latest GitHub code:
 
  ```
  pip3 install git+https://github.com/huggingface/transformers
  ```
 
  If using a UI like text-generation-webui, make sure to do this in the Python environment of text-generation-webui.
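
A quick, optional way to confirm that the updated Transformers build is the one active in that environment is to import it and check that the Llama classes are present (a minimal sketch):

```
# Optional sanity check: a Transformers build new enough for Llama 2
# exposes the Llama model classes; the import fails on older builds.
import transformers
from transformers import LlamaForCausalLM

print("Transformers version:", transformers.__version__)
```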
 
  ## Repositories available
 
  * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ)
+ * [Original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/TheBloke/Llama-2-70B-chat-fp16)
 
  ## Prompt template: Llama-2-Chat
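
The template block itself follows this heading in the full README, outside this hunk's context. For orientation, a Llama-2-Chat style prompt is usually assembled like this (a sketch; the system message below is a generic placeholder, not necessarily the README's wording):

```
# Sketch of the Llama-2-Chat prompt format; the system message is a placeholder.
system_message = "You are a helpful assistant."
prompt = "Tell me about AI"

prompt_template = f"""[INST] <<SYS>>
{system_message}
<</SYS>>
{prompt} [/INST]"""

print(prompt_template)
```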
 
 
  | Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description |
  | ------ | ---- | ---------- | -------------------- | --------- | ------------------- | --------- | ----------- |
+ | main | 4 | 128 | False | 35.33 GB | False | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
  | gptq-4bit-32g-actorder_True | 4 | 32 | True | 40.66 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
  | gptq-4bit-64g-actorder_True | 4 | 64 | True | 37.99 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
  | gptq-4bit-128g-actorder_True | 4 | 128 | True | 36.65 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
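
For readers wondering how the Bits, Group Size and Act Order columns map onto GPTQ settings, they correspond to AutoGPTQ's quantisation config (an illustrative sketch only; these models are already quantised, so there is no need to run this):

```
# Illustrative mapping of the table's columns onto AutoGPTQ's quantisation config.
# These values correspond to the gptq-4bit-128g-actorder_True branch.
from auto_gptq import BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,          # "Bits" column
    group_size=128,  # "Group Size" column
    desc_act=True,   # "Act Order (desc_act)" column
)
```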
 
  ```
  - In Python Transformers code, the branch is the `revision` parameter; see below.
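
As an illustration of that `revision` parameter, here is a minimal sketch of fetching a specific branch from Python with `huggingface_hub` (the branch name is just an example; any branch from the Provided Files table works):

```
# Minimal sketch: download one quant branch; the branch name doubles as the revision.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="TheBloke/Llama-2-70B-chat-GPTQ",
    revision="gptq-4bit-128g-actorder_True",
)
print("Model files downloaded to:", local_dir)
```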
 
+ ### How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
 
+ Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui), which includes support for Llama 2 models.
 
  It is strongly recommended to use the text-generation-webui one-click-installers unless you know how to make a manual install.
 
+ ### Use ExLlama (4-bit models only) - recommended option if you have enough VRAM for 4-bit
 
+ ExLlama has now been updated to support Llama 2 70B, but you will need to update your ExLlama installation to the latest version.
+
+ By default text-generation-webui installs a pre-compiled wheel for ExLlama. Until text-generation-webui updates to reflect the ExLlama changes (which hopefully won't be long), you must uninstall that wheel and then clone ExLlama into the `text-generation-webui/repositories` directory. ExLlama will then compile its kernel on model load.
+
+ Note that this requires a system capable of compiling CUDA extensions, which may be an issue on Windows.
+
+ Instructions for the Linux one-click installer:
+
+ 1. Change directory into the text-generation-webui main folder: `cd /path/to/text-generation-webui`
+ 2. Activate the conda env of text-generation-webui:
  ```
+ source "installer_files/conda/etc/profile.d/conda.sh"
+ conda activate installer_files/env
  ```
+ 3. Run: `pip3 uninstall exllama`
+ 4. Run: `cd repositories/exllama` followed by `git pull` to update ExLlama. If `repositories/exllama` does not exist, clone it there first: `git clone https://github.com/turboderp/exllama repositories/exllama`
+ 5. Now launch text-generation-webui and follow the instructions below for downloading and running the model. ExLlama should build its kernel when the model first loads.
 
+ ### Downloading and running the model in text-generation-webui
 
  1. Click the **Model tab**.
  2. Under **Download custom model or LoRA**, enter `TheBloke/Llama-2-70B-chat-GPTQ`.
  - see Provided Files above for the list of branches for each option.
  3. Click **Download**.
  4. The model will start downloading. Once it's finished it will say "Done"
+ 5. Set Loader to ExLlama if you plan to use a 4-bit file, or else choose AutoGPTQ or GPTQ-for-LLaMa.
  - If you use AutoGPTQ, make sure "No inject fused attention" is ticked
  6. In the top left, click the refresh icon next to **Model**.
  7. In the **Model** dropdown, choose the model you just downloaded: `TheBloke/Llama-2-70B-chat-GPTQ`
 
 
  The files provided will work with AutoGPTQ (CUDA and Triton modes), GPTQ-for-LLaMa (only CUDA has been tested), and Occ4m's GPTQ-for-LLaMa fork.
 
+ ExLlama is now compatible with Llama 2 70B models, as of [this commit](https://github.com/turboderp/exllama/commit/b3aea521859b83cfd889c4c00c05a323313b7fee).
+
+ Please see the Provided Files table above for per-file compatibility.
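
For loading these files from your own Python code with AutoGPTQ rather than a UI, a minimal sketch looks like the following; it is illustrative only and separate from the full Python example earlier in the README. Adjust the revision and device to your setup, and keep fused attention disabled for the 70B models, matching the "No inject fused attention" tip above:

```
# Minimal AutoGPTQ loading sketch (illustrative, not the README's full example).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name = "TheBloke/Llama-2-70B-chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    revision="main",               # or any branch from the Provided Files table
    use_safetensors=True,
    device="cuda:0",
    inject_fused_attention=False,  # required for 70B (grouped-query attention)
)

input_ids = tokenizer("Tell me about AI", return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids=input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0]))
```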
 
  <!-- footer start -->
  ## Discord