Update README.md
README.md CHANGED
@@ -68,7 +68,7 @@ Each separate quant is in a different branch. See below for instructions on fet

| Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description |
| ------ | ---- | ---------- | -------------------- | --------- | ------------------- | --------- | ----------- |
-| main | 4 | 128 | False |
+| main | 4 | 128 | False | 35.33 GB | False | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
| gptq-4bit-32g-actorder_True | 4 | 32 | True | 40.66 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
| gptq-4bit-64g-actorder_True | 4 | 64 | True | 37.99 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
| gptq-4bit-128g-actorder_True | 4 | 128 | True | 36.65 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
@@ -101,7 +101,7 @@ pip3 install git+https://github.com/huggingface/transformers
ExLlama is not currently compatible with Llama 2 70B but support is expected soon.

1. Click the **Model tab**.
-2. Under **Download custom model or LoRA**, enter
+2. Under **Download custom model or LoRA**, enter `TheBloke/Llama-2-70B-chat-GPTQ`.
  - To download from a specific branch, enter for example `TheBloke/Llama-2-70B-chat-GPTQ:gptq-4bit-32g-actorder_True`
  - see Provided Files above for the list of branches for each option.
3. Click **Download**.
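Note (not part of the diff above): the branch names in the Provided Files table double as git revisions, so a specific quant can also be fetched outside the web UI. A minimal sketch using `huggingface_hub`, with the repo id and branch taken from the table:

```python
# Sketch only: download one quant branch of the GPTQ repo into the local HF cache.
# The branch name from the Provided Files table is passed as `revision`;
# omit it (or use "main") for the default 4-bit / 128g files.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="TheBloke/Llama-2-70B-chat-GPTQ",
    revision="gptq-4bit-32g-actorder_True",
)
print(local_dir)  # path to the downloaded model files
```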