Initial GPTQ model commit
Browse files
README.md
CHANGED
@@ -47,7 +47,7 @@ Multiple GPTQ parameter permutations are provided; see Provided Files below for
|
|
47 |
|
48 |
* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-7B-GPTQ)
|
49 |
* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/Llama-2-7B-GGML)
|
50 |
-
* [
|
51 |
|
52 |
## Prompt template: None
|
53 |
|
@@ -66,7 +66,7 @@ Each separate quant is in a different branch. See below for instructions on fet
|
|
66 |
| main | 4 | 128 | False | 3.90 GB | True | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
|
67 |
| gptq-4bit-32g-actorder_True | 4 | 32 | True | 4.28 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
|
68 |
| gptq-4bit-64g-actorder_True | 4 | 64 | True | 4.02 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
|
69 |
-
| gptq-4bit-128g-actorder_True | 4 | 128 | True |
|
70 |
|
71 |
## How to download from branches
|
72 |
|
|
|
47 |
|
48 |
* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-7B-GPTQ)
|
49 |
* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/Llama-2-7B-GGML)
|
50 |
+
* [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/meta-llama/Llama-2-7b-hf)
|
51 |
|
52 |
## Prompt template: None
|
53 |
|
|
|
66 |
| main | 4 | 128 | False | 3.90 GB | True | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
|
67 |
| gptq-4bit-32g-actorder_True | 4 | 32 | True | 4.28 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
|
68 |
| gptq-4bit-64g-actorder_True | 4 | 64 | True | 4.02 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
|
69 |
+
| gptq-4bit-128g-actorder_True | 4 | 128 | True | 3.90 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
|
70 |
|
71 |
## How to download from branches
|
72 |
|