TheBloke committed on
Commit cf55091
1 Parent(s): 253614e

Upload new GPTQs with varied parameters

Files changed (1):
  1. README.md +50 -18
README.md CHANGED
@@ -1,6 +1,7 @@
  ---
  inference: false
  license: other
+ model_type: llama
  ---
 
  <!-- header start -->
@@ -19,13 +20,15 @@ license: other
 
  # Henk717's Airochronos 33B GPTQ
 
- These files are GPTQ 4bit model files for [Henk717's Airochronos 33B](https://huggingface.co/Henk717/airochronos-33B).
+ These files are GPTQ model files for [Henk717's Airochronos 33B](https://huggingface.co/Henk717/airochronos-33B).
 
- It is the result of quantising to 4bit using [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa).
+ Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
+
+ These models were quantised using hardware kindly provided by [Latitude.sh](https://www.latitude.sh/accelerate).
 
  ## Repositories available
 
- * [4-bit GPTQ models for GPU inference](https://huggingface.co/TheBloke/airochronos-33B-GPTQ)
+ * [GPTQ models for GPU inference, with multiple quantisation parameter options](https://huggingface.co/TheBloke/airochronos-33B-GPTQ)
  * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/airochronos-33B-GGML)
  * [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/Henk717/airochronos-33B)
 
@@ -39,6 +42,32 @@ Below is an instruction that describes a task. Write a response that appropriate
  ### Response:
  ```
 
+ ## Provided files
+
+ Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.
+
+ Each separate quant is in a different branch. See below for instructions on fetching from different branches.
+
+ | Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description |
+ | ------ | ---- | ---------- | -------------------- | --------- | ------------------- | --------- | ----------- |
+ | main | 4 | None | True | 16.94 GB | True | GPTQ-for-LLaMa | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
+ | gptq-4bit-32g-actorder_True | 4 | 32 | True | 19.44 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
+ | gptq-4bit-64g-actorder_True | 4 | 64 | True | 18.18 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
+ | gptq-4bit-128g-actorder_True | 4 | 128 | True | 17.55 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
+ | gptq-8bit--1g-actorder_True | 8 | None | True | 32.99 GB | False | AutoGPTQ | 8-bit, with Act Order. No group size, to lower VRAM requirements and to improve AutoGPTQ speed. |
+ | gptq-8bit-128g-actorder_False | 8 | 128 | False | 33.73 GB | False | AutoGPTQ | 8-bit, with group size 128g for higher inference quality and without Act Order to improve AutoGPTQ speed. |
+ | gptq-3bit--1g-actorder_True | 3 | None | True | 12.92 GB | False | AutoGPTQ | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
+ | gptq-3bit-128g-actorder_False | 3 | 128 | False | 13.51 GB | False | AutoGPTQ | 3-bit, with group size 128g but no act-order. Slightly higher VRAM requirements than 3-bit None. |
+
+ ## How to download from branches
+
+ - In text-generation-webui, you can add `:branch` to the end of the download name, eg `TheBloke/airochronos-33B-GPTQ:gptq-4bit-32g-actorder_True`
+ - With Git, you can clone a branch with:
+ ```
+ git clone --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/airochronos-33B-GPTQ
+ ```
+ - In Python Transformers code, the branch is the `revision` parameter; see below.
+
  ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
 
  Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
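The new "How to download from branches" section above covers text-generation-webui and `git clone`, and points at the `revision` parameter for Python further down. As a minimal sketch of fetching a single branch up front (an editor's illustration, not part of the README: the `huggingface_hub` call, branch name and target directory are assumptions):

```python
# Sketch only: fetch one quantisation branch with huggingface_hub rather than git.
# The branch and local_dir values are arbitrary examples.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TheBloke/airochronos-33B-GPTQ",
    revision="gptq-4bit-32g-actorder_True",  # any branch from the Provided Files table
    local_dir="airochronos-33B-GPTQ",
)
```

If `local_dir` is omitted, the files land in the local Hugging Face cache and the function returns that cache path.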
@@ -47,6 +76,8 @@ It is strongly recommended to use the text-generation-webui one-click-installers
 
  1. Click the **Model tab**.
  2. Under **Download custom model or LoRA**, enter `TheBloke/airochronos-33B-GPTQ`.
+   - To download from a specific branch, enter for example `TheBloke/airochronos-33B-GPTQ:gptq-4bit-32g-actorder_True`
+   - see Provided Files above for the list of branches for each option.
  3. Click **Download**.
  4. The model will start downloading. Once it's finished it will say "Done"
  5. In the top left, click the refresh icon next to **Model**.
@@ -76,20 +107,31 @@ use_triton = False
  tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
 
  model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
-         model_basename=model_basename,
+         model_basename=model_basename,
          use_safetensors=True,
          trust_remote_code=False,
          device="cuda:0",
          use_triton=use_triton,
          quantize_config=None)
 
+ """
+ To download from a specific branch, use the revision parameter, as in this example:
+
+ model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
+         revision="gptq-4bit-32g-actorder_True",
+         model_basename=model_basename,
+         use_safetensors=True,
+         trust_remote_code=False,
+         device="cuda:0",
+         quantize_config=None)
+ """
+
  prompt = "Tell me about AI"
  prompt_template=f'''Below is an instruction that describes a task. Write a response that appropriately completes the request.
 
  ### Instruction: {prompt}
 
  ### Response:
-
  '''
 
  print("\n\n*** Generate:")
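A side note on the `revision` example added in this hunk (again an editor's sketch, not text from the README): `from_pretrained` in `transformers` accepts the same `revision` keyword, so the tokenizer can be pinned to the chosen branch as well; the branch name below is only an example.

```python
# Sketch only: load the tokenizer from the same branch as the quantised model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "TheBloke/airochronos-33B-GPTQ",
    use_fast=True,
    revision="gptq-4bit-32g-actorder_True",  # example branch; "main" is the default
)
```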
@@ -117,21 +159,11 @@ pipe = pipeline(
  print(pipe(prompt_template)[0]['generated_text'])
  ```
 
- ## Provided files
-
- **airochronos-33b-GPTQ-4bit--1g.act.order.safetensors**
-
- This will work with ExLlama, AutoGPTQ, Occ4m's fork of GPTQ-for-LLaMa, and GPTQ-for-LLaMa. There are reports of issues with Triton mode of recent GPTQ-for-LLaMa but this is untested.
+ ## Compatibility
 
- It was created without group_size to lower VRAM requirements, and with --act-order (desc_act) to boost inference accuracy as much as possible.
+ The files provided will work with AutoGPTQ (CUDA and Triton modes), GPTQ-for-LLaMa (only CUDA has been tested), and Occ4m's GPTQ-for-LLaMa fork.
 
- * `airochronos-33b-GPTQ-4bit--1g.act.order.safetensors`
-   * Works with [ExLlama](https://github.com/turboderp/exllama), providing the best performance and lowest VRAM usage. Recommended.
-   * Works with AutoGPTQ in CUDA or Triton modes.
-   * Works with [Occ4m's GPTQ-for-LLaMa fork](https://github.com/0cc4m/GPTQ-for-LLaMa).
-   * Works with GPTQ-for-LLaMa in CUDA mode. May have issues with GPTQ-for-LLaMa Triton mode.
-   * Works with text-generation-webui, including one-click-installers.
-   * Parameters: Groupsize = -1. Act Order / desc_act = True.
+ ExLlama works with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.
 
  <!-- footer start -->
  ## Discord
 