TheBloke committed
Commit 2070d45
1 Parent(s): aed0aff

Update for Transformers GPTQ support

README.md CHANGED
@@ -14,22 +14,25 @@ tags:
  ---
  
  <!-- header start -->
- <div style="width: 100%;">
- <img src="https://i.imgur.com/EBdldam.jpg" alt="TheBlokeAI" style="width: 100%; min-width: 400px; display: block; margin: auto;">
+ <!-- 200823 -->
+ <div style="width: auto; margin-left: auto; margin-right: auto">
+ <img src="https://i.imgur.com/EBdldam.jpg" alt="TheBlokeAI" style="width: 100%; min-width: 400px; display: block; margin: auto;">
  </div>
  <div style="display: flex; justify-content: space-between; width: 100%;">
  <div style="display: flex; flex-direction: column; align-items: flex-start;">
- <p><a href="https://discord.gg/theblokeai">Chat & support: my new Discord server</a></p>
+ <p style="margin-top: 0.5em; margin-bottom: 0em;"><a href="https://discord.gg/theblokeai">Chat & support: TheBloke's Discord server</a></p>
  </div>
  <div style="display: flex; flex-direction: column; align-items: flex-end;">
- <p><a href="https://www.patreon.com/TheBlokeAI">Want to contribute? TheBloke's Patreon page</a></p>
+ <p style="margin-top: 0.5em; margin-bottom: 0em;"><a href="https://www.patreon.com/TheBlokeAI">Want to contribute? TheBloke's Patreon page</a></p>
  </div>
  </div>
+ <div style="text-align:center; margin-top: 0em; margin-bottom: 0em"><p style="margin-top: 0.25em; margin-bottom: 0em;">TheBloke's LLM work is generously supported by a grant from <a href="https://a16z.com">andreessen horowitz (a16z)</a></p></div>
+ <hr style="margin-top: 1.0em; margin-bottom: 1.0em;">
  <!-- header end -->
  
- # Meta's Llama 2 70B Chat GPTQ
+ # Meta's Llama 2 70B GPTQ
  
- These files are GPTQ model files for [Meta's Llama 2 70B Chat](https://huggingface.co/meta-llama/Llama-2-70b-hf).
+ These files are GPTQ model files for [Meta's Llama 2 70B](https://huggingface.co/meta-llama/Llama-2-70b-hf).
  
  Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
  
@@ -45,18 +48,18 @@ Now that we have ExLlama, that is the recommended loader to use for these models
  
  Reminder: ExLlama does not support 3-bit models, so if you wish to try those quants, you will need to use AutoGPTQ or GPTQ-for-LLaMa.
  
- 
  ## AutoGPTQ and GPTQ-for-LLaMa requires latest version of Transformers
  
- If you plan to use any of these quants with AutoGPTQ or GPTQ-for-LLaMa, you will need to update Transformers to the latest Github code:
+ If you plan to use any of these quants with AutoGPTQ or GPTQ-for-LLaMa, your Transformers needs to be using the latest Github code.
+ 
+ If you're using text-generation-webui and have updated to the latest version, this is done for you automatically.
+ 
+ If not, you can update it manually with:
  
  ```
  pip3 install git+https://github.com/huggingface/transformers
  ```
  
- If using a UI like text-generation-webui, make sure to do this in the Python environment of text-generation-webui.
- 
- 
  ## Repositories available
  
  * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-70B-GPTQ)
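
The commit's purpose is Transformers' native GPTQ support. As an illustrative aside, once a recent enough Transformers build is installed, a GPTQ branch of this repo can in principle be loaded directly with it; this is a minimal sketch, assuming the `optimum` and `auto-gptq` packages are also installed, not text from the README itself:

```python
# Illustrative sketch only: loading a GPTQ branch of this repo via Transformers'
# built-in GPTQ support. Assumes a recent Transformers (Git main / >= 4.32)
# plus the optimum and auto-gptq packages are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-70B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # shard the quantised weights across available GPUs
    revision="main",     # or any branch name from the Provided Files table
)

prompt = "Tell me about AI"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```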
@@ -76,10 +79,10 @@ Each separate quant is in a different branch. See below for instructions on fet
  
  | Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description |
  | ------ | ---- | ---------- | -------------------- | --------- | ------------------- | --------- | ----------- |
- | main | 4 | 128 | False | 35332232264.00 GB | False | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
- | gptq-4bit-32g-actorder_True | 4 | 32 | True | 40.66 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
- | gptq-4bit-64g-actorder_True | 4 | 64 | True | 37.99 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
- | gptq-4bit-128g-actorder_True | 4 | 128 | True | 36.65 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
+ | main | 4 | 128 | False | 35.33 GB | True | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
+ | gptq-4bit-32g-actorder_True | 4 | 32 | True | 40.66 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
+ | gptq-4bit-64g-actorder_True | 4 | 64 | True | 37.99 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
+ | gptq-4bit-128g-actorder_True | 4 | 128 | True | 36.65 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
  | gptq-3bit--1g-actorder_True | 3 | None | True | 26.78 GB | False | AutoGPTQ | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
  | gptq-3bit-128g-actorder_False | 3 | 128 | False | 28.03 GB | False | AutoGPTQ | 3-bit, with group size 128g but no act-order. Slightly higher VRAM requirements than 3-bit None. |
  | gptq-3bit-128g-actorder_True | 3 | 128 | True | 28.03 GB | False | AutoGPTQ | 3-bit, with group size 128g and act-order. Higher quality than 128g-False but poor AutoGPTQ CUDA speed. |
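
Each row above corresponds to a branch of the same repo. As an illustrative aside (not taken from the diff), one way to fetch a specific branch is `huggingface_hub.snapshot_download` with the branch name as `revision`:

```python
# Illustrative sketch: downloading one quantisation branch from the table above.
# Assumes the huggingface_hub package is installed.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="TheBloke/Llama-2-70B-GPTQ",
    revision="gptq-4bit-128g-actorder_True",  # any branch name from the table
)
print(local_dir)  # path to the downloaded files
```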
@@ -159,7 +162,7 @@ from transformers import AutoTokenizer, pipeline, logging
  from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
  
  model_name_or_path = "TheBloke/Llama-2-70B-GPTQ"
- model_basename = "gptq_model-4bit-128g"
+ model_basename = "model"
  
  use_triton = False
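
The updated `model_basename` above matches the renamed `model.safetensors` in this commit. A hedged sketch of how it is typically passed to AutoGPTQ's loader (the full README example sits outside this hunk, so the surrounding arguments are assumptions based on common AutoGPTQ usage):

```python
# Hedged sketch: how model_basename is typically consumed by AutoGPTQ.
# Assumes auto-gptq is installed; arguments beyond this hunk are illustrative.
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-GPTQ",
    model_basename="model",   # matches model.safetensors in this commit
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,
)
```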
 
@@ -225,6 +228,7 @@ ExLlama is now compatible with Llama 2 70B models, as of [this commit](https://g
  Please see the Provided Files table above for per-file compatibility.
  
  <!-- footer start -->
+ <!-- 200823 -->
  ## Discord
  
  For further support, and discussions on these models and AI in general, join us at:
@@ -244,87 +248,15 @@ Donaters will get priority support on any and all AI/LLM/model questions and req
  * Patreon: https://patreon.com/TheBlokeAI
  * Ko-Fi: https://ko-fi.com/TheBlokeAI
  
- **Special thanks to**: Luke from CarbonQuill, Aemon Algiz.
- 
- **Patreon special mentions**: Space Cruiser, Nikolai Manek, Sam, Chris McCloskey, Rishabh Srivastava, Kalila, Spiking Neurons AB, Khalefa Al-Ahmad, WelcomeToTheClub, Chadd, Lone Striker, Viktor Bowallius, Edmond Seymore, Ai Maven, Chris Smitley, Dave, Alexandros Triantafyllidis, Luke @flexchar, Elle, ya boyyy, Talal Aujan, Alex , Jonathan Leane, Deep Realms, Randy H, subjectnull, Preetika Verma, Joseph William Delisle, Michael Levine, chris gileta, K, Oscar Rangel, LangChain4j, Trenton Dambrowitz, Eugene Pentland, Johann-Peter Hartmann, Femi Adebogun, Illia Dulskyi, senxiiz, Daniel P. Andersen, Sean Connelly, Artur Olbinski, RoA, Mano Prime, Derek Yates, Raven Klaugh, David Flickinger, Willem Michiel, Pieter, Willian Hasse, vamX, Luke Pendergrass, webtim, Ghost , Rainer Wilmers, Nathan LeClaire, Will Dee, Cory Kujawski, John Detwiler, Fred von Graf, biorpg, Iucharbius , Imad Khwaja, Pierre Kircher, terasurfer , Asp the Wyvern, John Villwock, theTransient, zynix , Gabriel Tamborski, Fen Risland, Gabriel Puliatti, Matthew Berman, Pyrater, SuperWojo, Stephen Murray, Karl Bernard, Ajan Kanaga, Greatston Gnanesh, Junyu Yang.
- 
- Thank you to all my generous patrons and donaters!
- 
- <!-- footer end -->
- 
- # Original model card: Meta's Llama 2 70B Chat
- 
- 
- <!-- header start -->
- <div style="width: 100%;">
- <img src="https://i.imgur.com/EBdldam.jpg" alt="TheBlokeAI" style="width: 100%; min-width: 400px; display: block; margin: auto;">
- </div>
- <div style="display: flex; justify-content: space-between; width: 100%;">
- <div style="display: flex; flex-direction: column; align-items: flex-start;">
- <p><a href="https://discord.gg/theblokeai">Chat & support: my new Discord server</a></p>
- </div>
- <div style="display: flex; flex-direction: column; align-items: flex-end;">
- <p><a href="https://www.patreon.com/TheBlokeAI">Want to contribute? TheBloke's Patreon page</a></p>
- </div>
- </div>
- <!-- header end -->
- 
- # Meta's Llama 2 70B fp16
- 
- These files are fp16 format model files for [Meta's Llama 2 70B](https://huggingface.co/meta-llama/Llama-2-70b-hf).
+ **Special thanks to**: Aemon Algiz.
  
- They were produced by downloading the PTH files from Meta, and then converting to HF format using the latest Transformers 4.32.0.dev0, from Git, with the Llama 2 PR included: https://github.com/huggingface/transformers/pull/24891.
- 
- Command to convert was:
- ```
- python3 /workspace/venv/pytorch2/lib/python3.10/site-packages/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir /workspace/git/llama/download --model_size 70B --output_dir /workspace/process/llama-2-70b-chat/source --safe_serialization true
- ```
- 
- The files were saved in Safetensors format.
- 
- I am uploading this repo because I initially tried to create GPTQs using the [MetaLlama 2 70B HF repo](https://huggingface.co/meta-llama/Llama-2-70b-hf), but got strange errors that suggested the weights were not correct. But converting from the PTH files using the latest `convert_llama_weights_to_hf.py` script worked fine.
- 
- 
- Many thanks to William Beauchamp from [Chai](https://chai-research.com/) for providing the hardware for merging and uploading these files!
- 
- ## Repositories available
- 
- * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-70B-GPTQ)
- * [Original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/meta-llama/Llama-2-70b-hf)
- * [My fp16 conversion of the unquantised PTH model files](https://huggingface.co/TheBloke/Llama-2-70B-fp16)
- 
- ## Prompt template: None
- 
- ```
- {prompt}
- ```
- 
- <!-- footer start -->
- ## Discord
- 
- For further support, and discussions on these models and AI in general, join us at:
- 
- [TheBloke AI's Discord server](https://discord.gg/theblokeai)
- 
- ## Thanks, and how to contribute.
- 
- Thanks to the [chirper.ai](https://chirper.ai) team!
- 
- I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.
- 
- If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.
- 
- Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.
- 
- * Patreon: https://patreon.com/TheBlokeAI
- * Ko-Fi: https://ko-fi.com/TheBlokeAI
- 
- **Special thanks to**: Luke from CarbonQuill, Aemon Algiz.
- 
- **Patreon special mentions**: Space Cruiser, Nikolai Manek, Sam, Chris McCloskey, Rishabh Srivastava, Kalila, Spiking Neurons AB, Khalefa Al-Ahmad, WelcomeToTheClub, Chadd, Lone Striker, Viktor Bowallius, Edmond Seymore, Ai Maven, Chris Smitley, Dave, Alexandros Triantafyllidis, Luke @flexchar, Elle, ya boyyy, Talal Aujan, Alex , Jonathan Leane, Deep Realms, Randy H, subjectnull, Preetika Verma, Joseph William Delisle, Michael Levine, chris gileta, K, Oscar Rangel, LangChain4j, Trenton Dambrowitz, Eugene Pentland, Johann-Peter Hartmann, Femi Adebogun, Illia Dulskyi, senxiiz, Daniel P. Andersen, Sean Connelly, Artur Olbinski, RoA, Mano Prime, Derek Yates, Raven Klaugh, David Flickinger, Willem Michiel, Pieter, Willian Hasse, vamX, Luke Pendergrass, webtim, Ghost , Rainer Wilmers, Nathan LeClaire, Will Dee, Cory Kujawski, John Detwiler, Fred von Graf, biorpg, Iucharbius , Imad Khwaja, Pierre Kircher, terasurfer , Asp the Wyvern, John Villwock, theTransient, zynix , Gabriel Tamborski, Fen Risland, Gabriel Puliatti, Matthew Berman, Pyrater, SuperWojo, Stephen Murray, Karl Bernard, Ajan Kanaga, Greatston Gnanesh, Junyu Yang.
+ **Patreon special mentions**: Sam, theTransient, Jonathan Leane, Steven Wood, webtim, Johann-Peter Hartmann, Geoffrey Montalvo, Gabriel Tamborski, Willem Michiel, John Villwock, Derek Yates, Mesiah Bishop, Eugene Pentland, Pieter, Chadd, Stephen Murray, Daniel P. Andersen, terasurfer, Brandon Frisco, Thomas Belote, Sid, Nathan LeClaire, Magnesian, Alps Aficionado, Stanislav Ovsiannikov, Alex, Joseph William Delisle, Nikolai Manek, Michael Davis, Junyu Yang, K, J, Spencer Kim, Stefan Sabev, Olusegun Samson, transmissions 11, Michael Levine, Cory Kujawski, Rainer Wilmers, zynix, Kalila, Luke @flexchar, Ajan Kanaga, Mandus, vamX, Ai Maven, Mano Prime, Matthew Berman, subjectnull, Vitor Caleffi, Clay Pascal, biorpg, alfie_i, 阿明, Jeffrey Morgan, ya boyyy, Raymond Fosdick, knownsqashed, Olakabola, Leonard Tan, ReadyPlayerEmma, Enrico Ros, Dave, Talal Aujan, Illia Dulskyi, Sean Connelly, senxiiz, Artur Olbinski, Elle, Raven Klaugh, Fen Risland, Deep Realms, Imad Khwaja, Fred von Graf, Will Dee, usrbinkat, SuperWojo, Alexandros Triantafyllidis, Swaroop Kallakuri, Dan Guido, John Detwiler, Pedro Madruga, Iucharbius, Viktor Bowallius, Asp the Wyvern, Edmond Seymore, Trenton Dambrowitz, Space Cruiser, Spiking Neurons AB, Pyrater, LangChain4j, Tony Hughes, Kacper Wikieł, Rishabh Srivastava, David Ziegler, Luke Pendergrass, Andrey, Gabriel Puliatti, Lone Striker, Sebastain Graf, Pierre Kircher, Randy H, NimbleBox.ai, Vadim, danny, Deo Leter
  
+ 
  Thank you to all my generous patrons and donaters!
  
+ And thank you again to a16z for their generous grant.
+ 
  <!-- footer end -->
  
  # Original model card: Meta's Llama 2 70B
  
config.json CHANGED
@@ -1,25 +1,36 @@
  {
- "architectures": [
- "LlamaForCausalLM"
- ],
- "bos_token_id": 1,
- "eos_token_id": 2,
- "hidden_act": "silu",
- "hidden_size": 8192,
- "initializer_range": 0.02,
- "intermediate_size": 28672,
- "max_position_embeddings": 2048,
- "model_type": "llama",
- "num_attention_heads": 64,
- "num_hidden_layers": 80,
- "num_key_value_heads": 8,
- "pad_token_id": 0,
- "pretraining_tp": 1,
- "rms_norm_eps": 1e-05,
- "rope_scaling": null,
- "tie_word_embeddings": false,
- "torch_dtype": "float16",
- "transformers_version": "4.32.0.dev0",
- "use_cache": true,
- "vocab_size": 32000
+ "architectures": [
+ "LlamaForCausalLM"
+ ],
+ "bos_token_id": 1,
+ "eos_token_id": 2,
+ "hidden_act": "silu",
+ "hidden_size": 8192,
+ "initializer_range": 0.02,
+ "intermediate_size": 28672,
+ "max_position_embeddings": 2048,
+ "model_type": "llama",
+ "num_attention_heads": 64,
+ "num_hidden_layers": 80,
+ "num_key_value_heads": 8,
+ "pad_token_id": 0,
+ "pretraining_tp": 1,
+ "rms_norm_eps": 1e-05,
+ "rope_scaling": null,
+ "tie_word_embeddings": false,
+ "torch_dtype": "float16",
+ "transformers_version": "4.32.0.dev0",
+ "use_cache": true,
+ "vocab_size": 32000,
+ "quantization_config": {
+ "bits": 3,
+ "group_size": 128,
+ "damp_percent": 0.01,
+ "desc_act": false,
+ "sym": true,
+ "true_sequential": true,
+ "model_name_or_path": null,
+ "model_file_base_name": "model",
+ "quant_method": "gptq"
+ }
  }
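
The `quantization_config` block added above is what enables the Transformers-side GPTQ support referenced in the commit message: with it embedded in `config.json`, Transformers (via optimum and AutoGPTQ) can pick up the GPTQ parameters without consulting `quantize_config.json`. A small illustrative sketch of reading the block back, assuming `huggingface_hub` is installed (not part of the diff):

```python
# Illustrative sketch: inspecting the embedded GPTQ settings that Transformers
# now reads from config.json. Assumes the huggingface_hub package is installed.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download("TheBloke/Llama-2-70B-GPTQ", "config.json")
with open(path) as f:
    quant = json.load(f)["quantization_config"]

print(quant["quant_method"], quant["bits"], quant["group_size"], quant["desc_act"])
```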
gptq_model-3bit-128g.safetensors → model.safetensors RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c0ed4333b200e042cd03e45f7422109e6042c7dcbaa2ac00ac2e87a0ef3c3a35
- size 28029146232
+ oid sha256:d8bf9a7aa6e896736b1a0d0772f5112f4ee18e64f9f75c048733c0d6ee72c619
+ size 28029146288
quantize_config.json CHANGED
@@ -6,5 +6,5 @@
  "sym": true,
  "true_sequential": true,
  "model_name_or_path": null,
- "model_file_base_name": null
+ "model_file_base_name": "model"
  }
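
For AutoGPTQ's standalone loader, `quantize_config.json` carries the equivalent settings. As a hedged illustration (not from the diff), the same parameters expressed in code, with values mirroring the quantisation settings shown in `config.json` above for this 3-bit branch, might look like:

```python
# Hedged sketch: the quantisation settings of this branch expressed via
# auto-gptq's BaseQuantizeConfig; values mirror the configs shown above.
from auto_gptq import BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=3,
    group_size=128,
    damp_percent=0.01,
    desc_act=False,
    sym=True,
    true_sequential=True,
)
```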