TheBloke committed
Commit: aba044c
Parent: e731e62

Update for Transformers GPTQ support
README.md CHANGED
@@ -12,17 +12,20 @@ tags:
 ---
 
 <!-- header start -->
-<div style="width: 100%;">
-<img src="https://i.imgur.com/EBdldam.jpg" alt="TheBlokeAI" style="width: 100%; min-width: 400px; display: block; margin: auto;">
+<!-- 200823 -->
+<div style="width: auto; margin-left: auto; margin-right: auto">
+<img src="https://i.imgur.com/EBdldam.jpg" alt="TheBlokeAI" style="width: 100%; min-width: 400px; display: block; margin: auto;">
 </div>
 <div style="display: flex; justify-content: space-between; width: 100%;">
 <div style="display: flex; flex-direction: column; align-items: flex-start;">
-<p><a href="https://discord.gg/theblokeai">Chat & support: my new Discord server</a></p>
+<p style="margin-top: 0.5em; margin-bottom: 0em;"><a href="https://discord.gg/theblokeai">Chat & support: TheBloke's Discord server</a></p>
 </div>
 <div style="display: flex; flex-direction: column; align-items: flex-end;">
-<p><a href="https://www.patreon.com/TheBlokeAI">Want to contribute? TheBloke's Patreon page</a></p>
+<p style="margin-top: 0.5em; margin-bottom: 0em;"><a href="https://www.patreon.com/TheBlokeAI">Want to contribute? TheBloke's Patreon page</a></p>
 </div>
 </div>
+<div style="text-align:center; margin-top: 0em; margin-bottom: 0em"><p style="margin-top: 0.25em; margin-bottom: 0em;">TheBloke's LLM work is generously supported by a grant from <a href="https://a16z.com">andreessen horowitz (a16z)</a></p></div>
+<hr style="margin-top: 1.0em; margin-bottom: 1.0em;">
 <!-- header end -->
 
 # Nous Research's Nous Hermes Llama 2 13B GPTQ
@@ -56,13 +59,13 @@ Each separate quant is in a different branch. See below for instructions on fet
 
 | Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description |
 | ------ | ---- | ---------- | -------------------- | --------- | ------------------- | --------- | ----------- |
-| main | 4 | 128 | False | 7.26 GB | True | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
-| gptq-4bit-32g-actorder_True | 4 | 32 | True | 8.00 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
-| gptq-4bit-64g-actorder_True | 4 | 64 | True | 7.51 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
-| gptq-4bit-128g-actorder_True | 4 | 128 | True | 7.26 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
-| gptq-8bit-128g-actorder_True | 8 | 128 | True | 13.65 GB | False | AutoGPTQ | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. Poor AutoGPTQ CUDA speed. |
-| gptq-8bit-64g-actorder_True | 8 | 64 | True | 13.95 GB | False | AutoGPTQ | 8-bit, with group size 64g and Act Order for maximum inference quality. Poor AutoGPTQ CUDA speed. |
-| gptq-8bit-128g-actorder_False | 8 | 128 | False | 13.65 GB | False | AutoGPTQ | 8-bit, with group size 128g for higher inference quality and without Act Order to improve AutoGPTQ speed. |
+| main | 4 | 128 | False | 7.26 GB | True | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
+| gptq-4bit-32g-actorder_True | 4 | 32 | True | 8.00 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
+| gptq-4bit-64g-actorder_True | 4 | 64 | True | 7.51 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
+| gptq-4bit-128g-actorder_True | 4 | 128 | True | 7.26 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
+| gptq-8bit-128g-actorder_True | 8 | 128 | True | 13.65 GB | False | AutoGPTQ | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. Poor AutoGPTQ CUDA speed. |
+| gptq-8bit-64g-actorder_True | 8 | 64 | True | 13.95 GB | False | AutoGPTQ | 8-bit, with group size 64g and Act Order for maximum inference quality. Poor AutoGPTQ CUDA speed. |
+| gptq-8bit-128g-actorder_False | 8 | 128 | False | 13.65 GB | False | AutoGPTQ | 8-bit, with group size 128g for higher inference quality and without Act Order to improve AutoGPTQ speed. |
 | gptq-8bit--1g-actorder_True | 8 | None | True | 13.36 GB | False | AutoGPTQ | 8-bit, with Act Order. No group size, to lower VRAM requirements and to improve AutoGPTQ speed. |
 
 ## How to download from branches
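
For reference, a minimal sketch of fetching one of the quant branches listed above with `huggingface_hub` — the `revision` value is any branch name from the table; the local directory name is an example:

```python
# Minimal sketch: download one quant branch from the table above.
# The revision is a branch name from the table; local_dir is an example path.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TheBloke/Nous-Hermes-Llama2-GPTQ",
    revision="gptq-4bit-32g-actorder_True",
    local_dir="Nous-Hermes-Llama2-GPTQ",
)
```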
@@ -106,7 +109,7 @@ from transformers import AutoTokenizer, pipeline, logging
 from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
 
 model_name_or_path = "TheBloke/Nous-Hermes-Llama2-GPTQ"
-model_basename = "gptq_model-4bit-128g"
+model_basename = "model"
 
 use_triton = False
 
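The new `model_basename` matches the renamed `model.safetensors` below; a minimal sketch of the surrounding AutoGPTQ load call, with the remaining arguments assumed from the README's Python example:

```python
# Sketch: loading with the updated basename via AutoGPTQ.
# "model" matches the renamed model.safetensors file in this commit;
# the other arguments are assumptions based on the README example.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/Nous-Hermes-Llama2-GPTQ"
model_basename = "model"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    model_basename=model_basename,
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,
    quantize_config=None,  # read quantize_config.json from the repo
)
```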
@@ -172,6 +175,7 @@ The files provided will work with AutoGPTQ (CUDA and Triton modes), GPTQ-for-LLa
 ExLlama works with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.
 
 <!-- footer start -->
+<!-- 200823 -->
 ## Discord
 
 For further support, and discussions on these models and AI in general, join us at:
@@ -191,13 +195,15 @@ Donaters will get priority support on any and all AI/LLM/model questions and req
 * Patreon: https://patreon.com/TheBlokeAI
 * Ko-Fi: https://ko-fi.com/TheBlokeAI
 
-**Special thanks to**: Luke from CarbonQuill, Aemon Algiz.
+**Special thanks to**: Aemon Algiz.
 
-**Patreon special mentions**: Slarti, Chadd, John Detwiler, Pieter, zynix, K, Mano Prime, ReadyPlayerEmma, Ai Maven, Leonard Tan, Edmond Seymore, Joseph William Delisle, Luke @flexchar, Fred von Graf, Viktor Bowallius, Rishabh Srivastava, Nikolai Manek, Matthew Berman, Johann-Peter Hartmann, ya boyyy, Greatston Gnanesh, Femi Adebogun, Talal Aujan, Jonathan Leane, terasurfer, David Flickinger, William Sang, Ajan Kanaga, Vadim, Artur Olbinski, Raven Klaugh, Michael Levine, Oscar Rangel, Randy H, Cory Kujawski, RoA, Dave, Alex, Alexandros Triantafyllidis, Fen Risland, Eugene Pentland, vamX, Elle, Nathan LeClaire, Khalefa Al-Ahmad, Rainer Wilmers, subjectnull, Junyu Yang, Daniel P. Andersen, SuperWojo, LangChain4j, Mandus, Kalila, Illia Dulskyi, Trenton Dambrowitz, Asp the Wyvern, Derek Yates, Jeffrey Morgan, Deep Realms, Imad Khwaja, Pyrater, Preetika Verma, biorpg, Gabriel Tamborski, Stephen Murray, Spiking Neurons AB, Iucharbius, Chris Smitley, Willem Michiel, Luke Pendergrass, Sebastain Graf, senxiiz, Will Dee, Space Cruiser, Karl Bernard, Clay Pascal, Lone Striker, transmissions 11, webtim, WelcomeToTheClub, Sam, theTransient, Pierre Kircher, chris gileta, John Villwock, Sean Connelly, Willian Hasse
+**Patreon special mentions**: Sam, theTransient, Jonathan Leane, Steven Wood, webtim, Johann-Peter Hartmann, Geoffrey Montalvo, Gabriel Tamborski, Willem Michiel, John Villwock, Derek Yates, Mesiah Bishop, Eugene Pentland, Pieter, Chadd, Stephen Murray, Daniel P. Andersen, terasurfer, Brandon Frisco, Thomas Belote, Sid, Nathan LeClaire, Magnesian, Alps Aficionado, Stanislav Ovsiannikov, Alex, Joseph William Delisle, Nikolai Manek, Michael Davis, Junyu Yang, K, J, Spencer Kim, Stefan Sabev, Olusegun Samson, transmissions 11, Michael Levine, Cory Kujawski, Rainer Wilmers, zynix, Kalila, Luke @flexchar, Ajan Kanaga, Mandus, vamX, Ai Maven, Mano Prime, Matthew Berman, subjectnull, Vitor Caleffi, Clay Pascal, biorpg, alfie_i, 阿明, Jeffrey Morgan, ya boyyy, Raymond Fosdick, knownsqashed, Olakabola, Leonard Tan, ReadyPlayerEmma, Enrico Ros, Dave, Talal Aujan, Illia Dulskyi, Sean Connelly, senxiiz, Artur Olbinski, Elle, Raven Klaugh, Fen Risland, Deep Realms, Imad Khwaja, Fred von Graf, Will Dee, usrbinkat, SuperWojo, Alexandros Triantafyllidis, Swaroop Kallakuri, Dan Guido, John Detwiler, Pedro Madruga, Iucharbius, Viktor Bowallius, Asp the Wyvern, Edmond Seymore, Trenton Dambrowitz, Space Cruiser, Spiking Neurons AB, Pyrater, LangChain4j, Tony Hughes, Kacper Wikieł, Rishabh Srivastava, David Ziegler, Luke Pendergrass, Andrey, Gabriel Puliatti, Lone Striker, Sebastain Graf, Pierre Kircher, Randy H, NimbleBox.ai, Vadim, danny, Deo Leter
 
 
 Thank you to all my generous patrons and donaters!
 
+And thank you again to a16z for their generous grant.
+
 <!-- footer end -->
 
 # Original model card: Nous Research's Nous Hermes Llama 2 13B
@@ -228,16 +234,16 @@ The model was trained almost entirely on synthetic GPT-4 outputs. Curating high
 This includes data from diverse sources such as GPTeacher, the general, roleplay v1&2, code instruct datasets, Nous Instruct & PDACTL (unpublished), and several others, detailed further below
 
 ## Collaborators
-The model fine-tuning and the datasets were a collaboration of efforts and resources between Teknium, Karan4D, Emozilla, Huemin Art, and Redmond AI.
-
+The model fine-tuning and the datasets were a collaboration of efforts and resources between Teknium, Karan4D, Emozilla, Huemin Art, and Redmond AI.
+
 Special mention goes to @winglian for assisting in some of the training issues.
 
-Huge shoutout and acknowledgement is deserved for all the dataset creators who generously share their datasets openly.
+Huge shoutout and acknowledgement is deserved for all the dataset creators who generously share their datasets openly.
 
 Among the contributors of datasets:
 - GPTeacher was made available by Teknium
 - Wizard LM by nlpxucan
-- Nous Research Instruct Dataset was provided by Karan4D and HueminArt.
+- Nous Research Instruct Dataset was provided by Karan4D and HueminArt.
 - GPT4-LLM and Unnatural Instructions were provided by Microsoft
 - Airoboros dataset by jondurbin
 - Camel-AI's domain expert datasets are from Camel-AI
@@ -257,7 +263,7 @@ The model follows the Alpaca prompt format:
 
 ```
 
-or
+or
 
 ```
 ### Instruction:
@@ -269,7 +275,7 @@ or
 ### Response:
 <leave a newline blank for model to respond>
 
-```
+```
 
 ## Benchmark Results
 AGI-Eval
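
A minimal sketch of assembling the Alpaca-style prompt shown in this hunk (the instruction string is an invented example):

```python
# Sketch: build the Alpaca-style prompt described in the README hunk above.
instruction = "Explain the difference between 4-bit and 8-bit GPTQ."  # example

prompt = (
    "### Instruction:\n"
    f"{instruction}\n\n"
    "### Response:\n"
)
```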
@@ -338,15 +344,15 @@ These are the highest benchmarks Hermes has seen on every metric, achieving the
 - 0.3657 on BigBench, up from 0.328 on hermes-llama1
 - 0.372 on AGIEval, up from 0.354 on Hermes-llama1
 
-These benchmarks currently have us at #1 on ARC-c, ARC-e, Hellaswag, and OpenBookQA, and 2nd place on Winogrande, comparing to GPT4all's benchmarking list, supplanting Hermes 1 for the new top position.
+These benchmarks currently have us at #1 on ARC-c, ARC-e, Hellaswag, and OpenBookQA, and 2nd place on Winogrande, comparing to GPT4all's benchmarking list, supplanting Hermes 1 for the new top position.
 
 ## Resources for Applied Use Cases:
-For an example of a back and forth chatbot using huggingface transformers and discord, check out: https://github.com/teknium1/alpaca-discord
-For an example of a roleplaying discord chatbot, check out this: https://github.com/teknium1/alpaca-roleplay-discordbot
+For an example of a back and forth chatbot using huggingface transformers and discord, check out: https://github.com/teknium1/alpaca-discord
+For an example of a roleplaying discord chatbot, check out this: https://github.com/teknium1/alpaca-roleplay-discordbot
 
 ## Future Plans
-We plan to continue to iterate on both more high quality data, and new data filtering techniques to eliminate lower quality data going forward.
+We plan to continue to iterate on both more high quality data, and new data filtering techniques to eliminate lower quality data going forward.
 
 ## Model Usage
 The model is available for download on Hugging Face. It is suitable for a wide range of language tasks, from generating creative text to understanding and following complex instructions.
-
+

config.json CHANGED
@@ -1,26 +1,37 @@
 {
-  "_name_or_path": "output/hermes-llama2-4k/checkpoint-2259",
-  "architectures": [
-    "LlamaForCausalLM"
-  ],
-  "bos_token_id": 1,
-  "eos_token_id": 2,
-  "hidden_act": "silu",
-  "hidden_size": 5120,
-  "initializer_range": 0.02,
-  "intermediate_size": 13824,
-  "max_position_embeddings": 4096,
-  "model_type": "llama",
-  "num_attention_heads": 40,
-  "num_hidden_layers": 40,
-  "num_key_value_heads": 40,
-  "pad_token_id": 0,
-  "pretraining_tp": 1,
-  "rms_norm_eps": 1e-05,
-  "rope_scaling": null,
-  "tie_word_embeddings": false,
-  "torch_dtype": "bfloat16",
-  "transformers_version": "4.32.0.dev0",
-  "use_cache": true,
-  "vocab_size": 32032
+  "_name_or_path": "output/hermes-llama2-4k/checkpoint-2259",
+  "architectures": [
+    "LlamaForCausalLM"
+  ],
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "hidden_act": "silu",
+  "hidden_size": 5120,
+  "initializer_range": 0.02,
+  "intermediate_size": 13824,
+  "max_position_embeddings": 4096,
+  "model_type": "llama",
+  "num_attention_heads": 40,
+  "num_hidden_layers": 40,
+  "num_key_value_heads": 40,
+  "pad_token_id": 0,
+  "pretraining_tp": 1,
+  "rms_norm_eps": 1e-05,
+  "rope_scaling": null,
+  "tie_word_embeddings": false,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.32.0.dev0",
+  "use_cache": true,
+  "vocab_size": 32032,
+  "quantization_config": {
+    "bits": 8,
+    "group_size": 128,
+    "damp_percent": 0.01,
+    "desc_act": true,
+    "sym": true,
+    "true_sequential": true,
+    "model_name_or_path": null,
+    "model_file_base_name": "model",
+    "quant_method": "gptq"
+  }
 }
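
The embedded `quantization_config` is what enables the Transformers-native GPTQ loading named in the commit message; a minimal sketch, assuming transformers >= 4.32.0 with the `optimum` and `auto-gptq` packages installed:

```python
# Sketch: Transformers-native GPTQ loading, enabled by the new
# quantization_config block (assumes transformers>=4.32.0 plus the
# optimum and auto-gptq packages).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "TheBloke/Nous-Hermes-Llama2-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="auto",  # place the quantized layers on available GPUs
    # revision="gptq-8bit-128g-actorder_True",  # pick a branch if not main
)
```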
gptq_model-8bit-128g.safetensors → model.safetensors RENAMED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:8cf1aaf60f335209fe8ce0508e08c122fdae937900896ba79a223075657c40ed
-size 13653553504
+oid sha256:2252bc7a8a0adb2e724831a802087066e2fb37821b690224f2d9648bdf411a33
+size 13653553560
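
To check a downloaded `model.safetensors` against the LFS pointer above, a sketch (the local path is hypothetical):

```python
# Sketch: verify a downloaded model.safetensors against the new LFS pointer.
import hashlib

expected = "2252bc7a8a0adb2e724831a802087066e2fb37821b690224f2d9648bdf411a33"

h = hashlib.sha256()
with open("model.safetensors", "rb") as f:  # hypothetical local path
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)

assert h.hexdigest() == expected, "hash mismatch"
```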
quantize_config.json CHANGED
@@ -6,5 +6,5 @@
   "sym": true,
   "true_sequential": true,
   "model_name_or_path": null,
-  "model_file_base_name": null
+  "model_file_base_name": "model"
 }
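
For reference, the full quantize_config.json after this commit; lines 1–5 fall outside the hunk, so the leading fields are inferred from the matching quantization_config block added to config.json above:

```json
{
  "bits": 8,
  "group_size": 128,
  "damp_percent": 0.01,
  "desc_act": true,
  "sym": true,
  "true_sequential": true,
  "model_name_or_path": null,
  "model_file_base_name": "model"
}
```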