TheBloke committed
Commit
ad17c0a
1 Parent(s): b4393b0

Initial GPTQ model commit

Files changed (1)
  1. README.md +38 -13
README.md CHANGED
@@ -35,9 +35,20 @@ Multiple GPTQ parameter permutations are provided; see Provided Files below for
 
  Many thanks to William Beauchamp from [Chai](https://chai-research.com/) for providing the hardware for these quantisations!
 
- ## Required: latest version of Transformers
 
- Before trying these GPTQs, please update Transformers to the latest Github code:
 
  ```
  pip3 install git+https://github.com/huggingface/transformers
@@ -45,13 +56,11 @@ pip3 install git+https://github.com/huggingface/transformers
 
  If using a UI like text-generation-webui, make sure to do this in the Python environment of text-generation-webui.
 
- Note that at the time of writing, ExLlama is not yet compatible with the Llama 2 70B models, but support is coming soon.
 
  ## Repositories available
 
  * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ)
- * [Original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)
- * [My fp16 conversion of the unquantised PTH model files](https://huggingface.co/TheBloke/Llama-2-70B-chat-fp16)
 
  ## Prompt template: Llama-2-Chat
 
@@ -69,7 +78,7 @@ Each separate quant is in a different branch. See below for instructions on fet
 
  | Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description |
  | ------ | ---- | ---------- | -------------------- | --------- | ------------------- | --------- | ----------- |
- | main | 4 | 128 | False | 35.33 GB | False | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
  | gptq-4bit-32g-actorder_True | 4 | 32 | True | 40.66 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
  | gptq-4bit-64g-actorder_True | 4 | 64 | True | 37.99 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
  | gptq-4bit-128g-actorder_True | 4 | 128 | True | 36.65 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
@@ -87,19 +96,33 @@ git clone --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/L
  ```
  - In Python Transformers code, the branch is the `revision` parameter; see below.
 
- ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
 
- Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
 
  It is strongly recommended to use the text-generation-webui one-click-installers unless you know how to make a manual install.
 
- Before trying the model, first update Transformers to the latest Github code:
 
  ```
- pip3 install git+https://github.com/huggingface/transformers
  ```
 
- ExLlama is not currently compatible with Llama 2 70B but support is expected soon.
 
  1. Click the **Model tab**.
  2. Under **Download custom model or LoRA**, enter `TheBloke/Llama-2-70B-chat-GPTQ`.
@@ -107,7 +130,7 @@ ExLlama is not currently compatible with Llama 2 70B but support is expected soo
  - see Provided Files above for the list of branches for each option.
  3. Click **Download**.
  4. The model will start downloading. Once it's finished it will say "Done"
- 5. Set Loader to AutoGPTQ or GPTQ-for-LLaMA
  - If you use AutoGPTQ, make sure "No inject fused attention" is ticked
  6. In the top left, click the refresh icon next to **Model**.
  7. In the **Model** dropdown, choose the model you just downloaded: `TheBloke/Llama-2-70B-chat-GPTQ`
@@ -201,7 +224,9 @@ print(pipe(prompt_template)[0]['generated_text'])
 
  The files provided will work with AutoGPTQ (CUDA and Triton modes), GPTQ-for-LLaMa (only CUDA has been tested), and Occ4m's GPTQ-for-LLaMa fork.
 
- ExLlama is not currently compatible with Llama 2 70B models, but support is coming soon. Please see the Provided Files table above for per-file compatibility.
 
  <!-- footer start -->
  ## Discord
 
 
  Many thanks to William Beauchamp from [Chai](https://chai-research.com/) for providing the hardware for these quantisations!
 
+ ## ExLlama support for 70B is here!
 
+ As of [this commit](https://github.com/turboderp/exllama/commit/b3aea521859b83cfd889c4c00c05a323313b7fee), ExLlama has support for Llama 2 70B models.
+
+ Please make sure you update ExLlama to the latest version. If you are a text-generation-webui one-click user, you must first uninstall the ExLlama wheel, then clone ExLlama into `text-generation-webui/repositories`; full instructions are below.
+
+ Now that ExLlama supports these models, it is the recommended loader to use: performance should be better than with AutoGPTQ or GPTQ-for-LLaMa, and you will be able to use the higher-accuracy quants, e.g. 128g + Act Order.
+
+ Reminder: ExLlama does not support 3-bit models, so if you wish to try those quants you will need to use AutoGPTQ or GPTQ-for-LLaMa.
+
+ ## AutoGPTQ and GPTQ-for-LLaMa require the latest version of Transformers
+
+ If you plan to use any of these quants with AutoGPTQ or GPTQ-for-LLaMa, you will need to update Transformers to the latest GitHub code:
 
  ```
  pip3 install git+https://github.com/huggingface/transformers
  ```
 
  If using a UI like text-generation-webui, make sure to do this in the Python environment of text-generation-webui.
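
A quick, optional way to confirm that the updated Transformers build is the one active in that environment is to import it and check that the Llama classes are present (a minimal sketch):

```
# Optional sanity check: a Transformers build new enough for Llama 2
# exposes the Llama model classes; the import fails on older builds.
import transformers
from transformers import LlamaForCausalLM

print("Transformers version:", transformers.__version__)
```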
 
  ## Repositories available
 
  * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ)
+ * [Original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/TheBloke/Llama-2-70B-chat-fp16)
 
  ## Prompt template: Llama-2-Chat
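
The template block itself follows this heading in the full README, outside this hunk's context. For orientation, a Llama-2-Chat style prompt is usually assembled like this (a sketch; the system message below is a generic placeholder, not necessarily the README's wording):

```
# Sketch of the Llama-2-Chat prompt format; the system message is a placeholder.
system_message = "You are a helpful assistant."
prompt = "Tell me about AI"

prompt_template = f"""[INST] <<SYS>>
{system_message}
<</SYS>>
{prompt} [/INST]"""

print(prompt_template)
```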
 
 
  | Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description |
  | ------ | ---- | ---------- | -------------------- | --------- | ------------------- | --------- | ----------- |
+ | main | 4 | 128 | False | 35.33 GB | False | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
  | gptq-4bit-32g-actorder_True | 4 | 32 | True | 40.66 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
  | gptq-4bit-64g-actorder_True | 4 | 64 | True | 37.99 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
  | gptq-4bit-128g-actorder_True | 4 | 128 | True | 36.65 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
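
For readers wondering how the Bits, Group Size and Act Order columns map onto GPTQ settings, they correspond to AutoGPTQ's quantisation config (an illustrative sketch only; these models are already quantised, so there is no need to run this):

```
# Illustrative mapping of the table's columns onto AutoGPTQ's quantisation config.
# These values correspond to the gptq-4bit-128g-actorder_True branch.
from auto_gptq import BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,          # "Bits" column
    group_size=128,  # "Group Size" column
    desc_act=True,   # "Act Order (desc_act)" column
)
```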
 
  ```
  - In Python Transformers code, the branch is the `revision` parameter; see below.
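
As an illustration of that `revision` parameter, here is a minimal sketch of fetching a specific branch from Python with `huggingface_hub` (the branch name is just an example; any branch from the Provided Files table works):

```
# Minimal sketch: download one quant branch; the branch name doubles as the revision.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="TheBloke/Llama-2-70B-chat-GPTQ",
    revision="gptq-4bit-128g-actorder_True",
)
print("Model files downloaded to:", local_dir)
```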
 
+ ### How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
 
+ Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui), which includes support for Llama 2 models.
 
  It is strongly recommended to use the text-generation-webui one-click-installers unless you know how to make a manual install.
 
+ ### Use ExLlama (4-bit models only) - recommended option if you have enough VRAM for 4-bit
 
+ ExLlama has now been updated to support Llama 2 70B, but you will need to update your ExLlama installation to the latest version.
+
+ By default text-generation-webui installs a pre-compiled wheel for ExLlama. Until text-generation-webui updates to reflect the ExLlama changes (which hopefully won't be long), you must uninstall that wheel and then clone ExLlama into the `text-generation-webui/repositories` directory. ExLlama will then compile its kernel on model load.
+
+ Note that this requires a system capable of compiling CUDA extensions, which may be an issue on Windows.
+
+ Instructions for the Linux one-click installer:
+
+ 1. Change directory into the text-generation-webui main folder: `cd /path/to/text-generation-webui`
+ 2. Activate the conda env of text-generation-webui:
  ```
+ source "installer_files/conda/etc/profile.d/conda.sh"
+ conda activate installer_files/env
  ```
+ 3. Run: `pip3 uninstall exllama`
+ 4. Run: `cd repositories/exllama` followed by `git pull` to update ExLlama. If `repositories/exllama` does not exist, clone it there first: `git clone https://github.com/turboderp/exllama repositories/exllama`
+ 5. Now launch text-generation-webui and follow the instructions below for downloading and running the model. ExLlama should build its kernel when the model first loads.
 
+ ### Downloading and running the model in text-generation-webui
 
  1. Click the **Model tab**.
  2. Under **Download custom model or LoRA**, enter `TheBloke/Llama-2-70B-chat-GPTQ`.
  - see Provided Files above for the list of branches for each option.
  3. Click **Download**.
  4. The model will start downloading. Once it's finished it will say "Done"
+ 5. Set Loader to ExLlama if you plan to use a 4-bit file, or else choose AutoGPTQ or GPTQ-for-LLaMa.
  - If you use AutoGPTQ, make sure "No inject fused attention" is ticked
  6. In the top left, click the refresh icon next to **Model**.
  7. In the **Model** dropdown, choose the model you just downloaded: `TheBloke/Llama-2-70B-chat-GPTQ`
 
 
  The files provided will work with AutoGPTQ (CUDA and Triton modes), GPTQ-for-LLaMa (only CUDA has been tested), and Occ4m's GPTQ-for-LLaMa fork.
 
+ ExLlama is now compatible with Llama 2 70B models, as of [this commit](https://github.com/turboderp/exllama/commit/b3aea521859b83cfd889c4c00c05a323313b7fee).
+
+ Please see the Provided Files table above for per-file compatibility.
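
For loading these files from your own Python code with AutoGPTQ rather than a UI, a minimal sketch looks like the following; it is illustrative only and separate from the full Python example earlier in the README. Adjust the revision and device to your setup, and keep fused attention disabled for the 70B models, matching the "No inject fused attention" tip above:

```
# Minimal AutoGPTQ loading sketch (illustrative, not the README's full example).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name = "TheBloke/Llama-2-70B-chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    revision="main",               # or any branch from the Provided Files table
    use_safetensors=True,
    device="cuda:0",
    inject_fused_attention=False,  # required for 70B (grouped-query attention)
)

input_ids = tokenizer("Tell me about AI", return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids=input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0]))
```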
 
  <!-- footer start -->
  ## Discord