Upload new GPTQs with varied parameters
README.md (CHANGED)
@@ -1,6 +1,7 @@
---
inference: false
license: other
+ model_type: llama
---

<!-- header start -->
@@ -19,13 +20,15 @@ license: other

# Henk717's Airochronos 33B GPTQ

- These files are GPTQ
+ These files are GPTQ model files for [Henk717's Airochronos 33B](https://huggingface.co/Henk717/airochronos-33B).

+ Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
+
+ These models were quantised using hardware kindly provided by [Latitude.sh](https://www.latitude.sh/accelerate).

## Repositories available

- * [
+ * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/airochronos-33B-GPTQ)
* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/airochronos-33B-GGML)
* [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/Henk717/airochronos-33B)

@@ -39,6 +42,32 @@ Below is an instruction that describes a task. Write a response that appropriate
### Response:
```

+ ## Provided files
+
+ Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.
+
+ Each separate quant is in a different branch. See below for instructions on fetching from different branches.
+
+ | Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description |
+ | ------ | ---- | ---------- | -------------------- | --------- | ------------------- | --------- | ----------- |
+ | main | 4 | None | True | 16.94 GB | True | GPTQ-for-LLaMa | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
+ | gptq-4bit-32g-actorder_True | 4 | 32 | True | 19.44 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
+ | gptq-4bit-64g-actorder_True | 4 | 64 | True | 18.18 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
+ | gptq-4bit-128g-actorder_True | 4 | 128 | True | 17.55 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
+ | gptq-8bit--1g-actorder_True | 8 | None | True | 32.99 GB | False | AutoGPTQ | 8-bit, with Act Order. No group size, to lower VRAM requirements and to improve AutoGPTQ speed. |
+ | gptq-8bit-128g-actorder_False | 8 | 128 | False | 33.73 GB | False | AutoGPTQ | 8-bit, with group size 128g for higher inference quality and without Act Order to improve AutoGPTQ speed. |
+ | gptq-3bit--1g-actorder_True | 3 | None | True | 12.92 GB | False | AutoGPTQ | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
+ | gptq-3bit-128g-actorder_False | 3 | 128 | False | 13.51 GB | False | AutoGPTQ | 3-bit, with group size 128g but no act-order. Slightly higher VRAM requirements than 3-bit None. |
+
+ ## How to download from branches
+
+ - In text-generation-webui, you can add `:branch` to the end of the download name, eg `TheBloke/airochronos-33B-GPTQ:gptq-4bit-32g-actorder_True`
+ - With Git, you can clone a branch with:
+ ```
+ git clone --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/airochronos-33B-GPTQ
+ ```
+ - In Python Transformers code, the branch is the `revision` parameter; see below.
+
## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
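The branches described above can also be fetched programmatically. Below is a minimal sketch using the `huggingface_hub` library, which is not mentioned in the README itself and is assumed to be installed separately (`pip install huggingface_hub`); the branch name and target directory are only examples.

```python
# Minimal sketch (not from the README): download one quantisation branch
# of the repo with huggingface_hub instead of git or text-generation-webui.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="TheBloke/airochronos-33B-GPTQ",
    revision="gptq-4bit-32g-actorder_True",    # any branch from the Provided Files table
    local_dir="airochronos-33B-GPTQ-4bit-32g", # example target directory
)
print("Model files downloaded to:", local_path)
```

Because each quant lives in its own branch, only the files for the chosen variant are downloaded.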
@@ -47,6 +76,8 @@ It is strongly recommended to use the text-generation-webui one-click-installers

1. Click the **Model tab**.
2. Under **Download custom model or LoRA**, enter `TheBloke/airochronos-33B-GPTQ`.
+   - To download from a specific branch, enter for example `TheBloke/airochronos-33B-GPTQ:gptq-4bit-32g-actorder_True`
+   - see Provided Files above for the list of branches for each option.
3. Click **Download**.
4. The model will start downloading. Once it's finished it will say "Done"
5. In the top left, click the refresh icon next to **Model**.
@@ -76,20 +107,31 @@ use_triton = False
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
-         model_basename=model_basename,
+         model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=False,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

+ """
+ To download from a specific branch, use the revision parameter, as in this example:
+
+ model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
+         revision="gptq-4bit-32g-actorder_True",
+         model_basename=model_basename,
+         use_safetensors=True,
+         trust_remote_code=False,
+         device="cuda:0",
+         quantize_config=None)
+ """
+
prompt = "Tell me about AI"
prompt_template=f'''Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction: {prompt}

### Response:
'''

print("\n\n*** Generate:")
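The README's Python example appears in this diff only in fragments. For reference, here is a minimal, self-contained sketch of loading one of the branch quants with AutoGPTQ and running a single generation; the branch name, `model_basename` value, and generation settings are illustrative placeholders rather than values confirmed by the repository.

```python
# Minimal sketch (illustrative, not verbatim from the README): load a specific
# quantisation branch with AutoGPTQ and generate one completion.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/airochronos-33B-GPTQ"
# Placeholder: must match the .safetensors filename (without extension)
# present in the branch you choose.
model_basename = "gptq_model-4bit-32g"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    revision="gptq-4bit-32g-actorder_True",  # branch from the Provided Files table
    model_basename=model_basename,
    use_safetensors=True,
    trust_remote_code=False,
    device="cuda:0",
    use_triton=False,
    quantize_config=None,
)

prompt = "Tell me about AI"
prompt_template = f'''Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction: {prompt}

### Response:
'''

input_ids = tokenizer(prompt_template, return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids=input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```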
@@ -117,21 +159,11 @@ pipe = pipeline(
print(pipe(prompt_template)[0]['generated_text'])
```

- ##
- **airochronos-33b-GPTQ-4bit--1g.act.order.safetensors**
- This will work with ExLlama, AutoGPTQ, Occ4m's fork of GPTQ-for-LLaMa, and GPTQ-for-LLaMa. There are reports of issues with Triton mode of recent GPTQ-for-LLaMa but this is untested.
+ ## Compatibility

+ The files provided will work with AutoGPTQ (CUDA and Triton modes), GPTQ-for-LLaMa (only CUDA has been tested), and Occ4m's GPTQ-for-LLaMa fork.

- * Works with [ExLlama](https://github.com/turboderp/exllama), providing the best performance and lowest VRAM usage. Recommended.
- * Works with AutoGPTQ in CUDA or Triton modes.
- * Works with [Occ4m's GPTQ-for-LLaMa fork](https://github.com/0cc4m/GPTQ-for-LLaMa).
- * Works with GPTQ-for-LLaMa in CUDA mode. May have issues with GPTQ-for-LLaMa Triton mode.
- * Works with text-generation-webui, including one-click-installers.
- * Parameters: Groupsize = -1. Act Order / desc_act = True.
+ ExLlama works with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.

<!-- footer start -->
## Discord