TheBloke committed on
Commit
5937923
1 Parent(s): ec9eb4f

Upload README.md

Files changed (1)
  1. README.md +151 -103
README.md CHANGED
@@ -1,16 +1,19 @@
1
  ---
 
2
  datasets:
3
  - Open-Orca/OpenOrca
4
  inference: false
5
  language:
6
  - en
7
  library_name: transformers
8
- license: other
9
  model_creator: Open-Orca
10
- model_link: https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B
11
  model_name: OpenOrca x OpenChat - Preview2 - 13B
12
  model_type: llama
13
  pipeline_tag: text-generation
14
  quantized_by: TheBloke
15
  ---
16
 
@@ -35,163 +38,181 @@ quantized_by: TheBloke
35
  - Model creator: [Open-Orca](https://huggingface.co/Open-Orca)
36
  - Original model: [OpenOrca x OpenChat - Preview2 - 13B](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B)
37
 
 
38
  ## Description
39
 
40
  This repo contains GPTQ model files for [Open-Orca's OpenOrca x OpenChat - Preview2 - 13B](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B).
41
 
42
  Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
43
 
 
 
44
  ## Repositories available
45
 
 
46
  * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ)
47
- * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/OpenOrcaxOpenChat-Preview2-13B-GGML)
48
  * [Open-Orca's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B)
 
49
 
50
- ## Prompt template: OpenChat Llama2 V1
 
51
 
52
  ```
53
- User: {prompt}<|end_of_turn|>Assistant:
 
54
  ```
55

56
  ## Provided files and GPTQ parameters
57
 
58
  Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.
59
 
60
  Each separate quant is in a different branch. See below for instructions on fetching from different branches.
61
 
62
- All GPTQ files are made with AutoGPTQ.
63
 
64
  <details>
65
  <summary>Explanation of GPTQ parameters</summary>
66
 
67
  - Bits: The bit size of the quantised model.
68
  - GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
69
- - Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have issues with models that use Act Order plus Group Size.
70
  - Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
71
- - GPTQ dataset: The dataset used for quantisation. The dataset used for quantisation can affect the quantisation accuracy. The dataset used for quantisation is not the same as the dataset used to train the model.
72
- - Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only affects the quantisation accuracy on longer inference sequences.
73
  - ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama models in 4-bit.
74
 
75
  </details>
76
 
77
  | Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
78
  | ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
79
- | main | 4 | 128 | No | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 7.26 GB | Yes | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
80
- | gptq-4bit-32g-actorder_True | 4 | 32 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 8.00 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
81
- | gptq-4bit-64g-actorder_True | 4 | 64 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 7.51 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
82
- | gptq-4bit-128g-actorder_True | 4 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 7.26 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
83
- | gptq-8bit--1g-actorder_True | 8 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 13.36 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements and to improve AutoGPTQ speed. |
84
- | gptq-8bit-128g-actorder_True | 8 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 13.65 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. Poor AutoGPTQ CUDA speed. |
 
 
85
 
 
86
  ## How to download from branches
87
 
88
- - In text-generation-webui, you can add `:branch` to the end of the download name, eg `TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ:gptq-4bit-32g-actorder_True`
89
  - With Git, you can clone a branch with:
90
  ```
91
- git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ
92
  ```
93
  - In Python Transformers code, the branch is the `revision` parameter; see below.
94
-
 
95
  ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
96
 
97
  Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
98
 
99
- It is strongly recommended to use the text-generation-webui one-click-installers unless you know how to make a manual install.
100
 
101
  1. Click the **Model tab**.
102
  2. Under **Download custom model or LoRA**, enter `TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ`.
103
- - To download from a specific branch, enter for example `TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ:gptq-4bit-32g-actorder_True`
104
  - see Provided Files above for the list of branches for each option.
105
  3. Click **Download**.
106
- 4. The model will start downloading. Once it's finished it will say "Done"
107
  5. In the top left, click the refresh icon next to **Model**.
108
  6. In the **Model** dropdown, choose the model you just downloaded: `OpenOrcaxOpenChat-Preview2-13B-GPTQ`
109
  7. The model will automatically load, and is now ready for use!
110
  8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
111
- * Note that you do not need to set GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
112
  9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
 
113
 
 
114
  ## How to use this GPTQ model from Python code
115
 
116
- First make sure you have [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) 0.3.1 or later installed:
117
 
118
- ```
119
- pip3 install auto-gptq
120
- ```
121
 
122
- If you have problems installing AutoGPTQ, please build from source instead:
 
 
123
  ```
124
  pip3 uninstall -y auto-gptq
125
  git clone https://github.com/PanQiWei/AutoGPTQ
126
  cd AutoGPTQ
127
  pip3 install .
128
  ```
129
 
130
- Then try the following example code:
131
 
132
  ```python
133
- from transformers import AutoTokenizer, pipeline, logging
134
- from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
135
 
136
  model_name_or_path = "TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ"
137
-
138
- use_triton = False
139
 
140
  tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
141
 
142
- model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
143
- use_safetensors=True,
144
- trust_remote_code=False,
145
- device="cuda:0",
146
- use_triton=use_triton,
147
- quantize_config=None)
148
-
149
- """
150
- # To download from a specific branch, use the revision parameter, as in this example:
151
- # Note that `revision` requires AutoGPTQ 0.3.1 or later!
152
-
153
- model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
154
- revision="gptq-4bit-32g-actorder_True",
155
- use_safetensors=True,
156
- trust_remote_code=False,
157
- device="cuda:0",
158
- quantize_config=None)
159
- """
160
-
161
  prompt = "Tell me about AI"
162
- prompt_template=f'''User: {prompt}<|end_of_turn|>Assistant:
 
163
  '''
164
 
165
  print("\n\n*** Generate:")
166
 
167
  input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
168
- output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
169
  print(tokenizer.decode(output[0]))
170
 
171
  # Inference can also be done using transformers' pipeline
172
 
173
- # Prevent printing spurious transformers error when using pipeline with AutoGPTQ
174
- logging.set_verbosity(logging.CRITICAL)
175
-
176
  print("*** Pipeline:")
177
  pipe = pipeline(
178
  "text-generation",
179
  model=model,
180
  tokenizer=tokenizer,
181
  max_new_tokens=512,
 
182
  temperature=0.7,
183
  top_p=0.95,
184
- repetition_penalty=1.15
 
185
  )
186
 
187
  print(pipe(prompt_template)[0]['generated_text'])
188
  ```
 
189
 
 
190
  ## Compatibility
191
 
192
- The files provided will work with AutoGPTQ (CUDA and Triton modes), GPTQ-for-LLaMa (only CUDA has been tested), and Occ4m's GPTQ-for-LLaMa fork.
 
 
193
 
194
- ExLlama works with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.
 
195
 
196
  <!-- footer start -->
197
  <!-- 200823 -->
@@ -201,10 +222,12 @@ For further support, and discussions on these models and AI in general, join us
201
 
202
  [TheBloke AI's Discord server](https://discord.gg/theblokeai)
203
 
204
- ## Thanks, and how to contribute.
205
 
206
  Thanks to the [chirper.ai](https://chirper.ai) team!
207
 
 
 
208
  I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.
209
 
210
  If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.
@@ -216,7 +239,7 @@ Donaters will get priority support on any and all AI/LLM/model questions and req
216
 
217
  **Special thanks to**: Aemon Algiz.
218
 
219
- **Patreon special mentions**: Sam, theTransient, Jonathan Leane, Steven Wood, webtim, Johann-Peter Hartmann, Geoffrey Montalvo, Gabriel Tamborski, Willem Michiel, John Villwock, Derek Yates, Mesiah Bishop, Eugene Pentland, Pieter, Chadd, Stephen Murray, Daniel P. Andersen, terasurfer, Brandon Frisco, Thomas Belote, Sid, Nathan LeClaire, Magnesian, Alps Aficionado, Stanislav Ovsiannikov, Alex, Joseph William Delisle, Nikolai Manek, Michael Davis, Junyu Yang, K, J, Spencer Kim, Stefan Sabev, Olusegun Samson, transmissions 11, Michael Levine, Cory Kujawski, Rainer Wilmers, zynix, Kalila, Luke @flexchar, Ajan Kanaga, Mandus, vamX, Ai Maven, Mano Prime, Matthew Berman, subjectnull, Vitor Caleffi, Clay Pascal, biorpg, alfie_i, 阿明, Jeffrey Morgan, ya boyyy, Raymond Fosdick, knownsqashed, Olakabola, Leonard Tan, ReadyPlayerEmma, Enrico Ros, Dave, Talal Aujan, Illia Dulskyi, Sean Connelly, senxiiz, Artur Olbinski, Elle, Raven Klaugh, Fen Risland, Deep Realms, Imad Khwaja, Fred von Graf, Will Dee, usrbinkat, SuperWojo, Alexandros Triantafyllidis, Swaroop Kallakuri, Dan Guido, John Detwiler, Pedro Madruga, Iucharbius, Viktor Bowallius, Asp the Wyvern, Edmond Seymore, Trenton Dambrowitz, Space Cruiser, Spiking Neurons AB, Pyrater, LangChain4j, Tony Hughes, Kacper Wikieł, Rishabh Srivastava, David Ziegler, Luke Pendergrass, Andrey, Gabriel Puliatti, Lone Striker, Sebastain Graf, Pierre Kircher, Randy H, NimbleBox.ai, Vadim, danny, Deo Leter
220
 
221
 
222
  Thank you to all my generous patrons and donaters!
@@ -267,6 +290,38 @@ We will also give sneak-peak announcements on our Discord, which you can find he
267
 
268
  https://AlignmentLab.ai
269

270
 
271
  # Evaluation
272
 
@@ -276,8 +331,7 @@ Our average performance for BigBench-Hard: 0.488
276
 
277
  Average for AGIEval: 0.447
278
 
279
- In the Orca paper, they measured their score relative to Vicuna on these evals.
280
- We have done the same and have found our score averages to **~103%** of the total performance that was shown in the Orca paper, using the same evaluation methods as outlined in the paper.
281
 
282
  So we are surpassing Orca performance with <20% of the dataset size and <1/10th the training budget!
283
 
@@ -305,7 +359,19 @@ We have run our own tests using parameters matching the [HuggingFaceH4 Open LLM
305
 
306
  We place #1 for all 13B models at release time!
307
 
308
- ![OpenOrca Preview2 HuggingFace Leaderboard Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2HuggingFaceLeaderboard.png "GPT4ALL Performance")
309
 
310
  ## GPT4ALL Leaderboard Performance
311
 
@@ -332,41 +398,21 @@ Commodity cost was ~$600.
332
  Please await our full releases for further training details.
333
 
334
 
335
- # Prompt Template
336
-
337
- We use our own prompt template which we call "`OpenChat Llama2 V1`"
338
-
339
-
340
- Examples:
341
- ```
342
- # Single-turn V1 Llama 2
343
- tokenize("User: Hello<|end_of_turn|>Assistant:")
344
- # Result: [1, 4911, 29901, 15043, 32000, 4007, 22137, 29901]
345
-
346
- # Multi-turn V1 Llama 2
347
- tokenize("User: Hello<|end_of_turn|>Assistant: Hi<|end_of_turn|>User: How are you today?<|end_of_turn|>Assistant:")
348
- # Result: [1, 4911, 29901, 15043, 32000, 4007, 22137, 29901, 6324, 32000, 4911, 29901, 1128, 526, 366, 9826, 29973, 32000, 4007, 22137, 29901]
349
- ```
350
-
351
- For UIs with Prefix and Suffix fields, these will likely work:
352
-
353
- Prefix (include a space after colon):
354
- ```
355
- User:
356
- ```
357
-
358
- Suffix (space after colon):
359
- ```
360
- <|end_of_turn|>\nAssistant:
361
- ```
362
-
363
-
364
  # Serving
365
 
366
  This model is most easily served with [OpenChat's](https://github.com/imoneoi/openchat) customized vLLM OpenAI-compatible API server.
367
  This is highly recommended as it is by far the fastest in terms of inference speed and is a quick and easy option for setup.
368
  We also illustrate setup of Oobabooga/text-generation-webui below. The settings outlined there will also apply to other uses of `Transformers`.
369

370
 
371
  ## Serving with OpenChat
372
 
@@ -387,7 +433,7 @@ You may then connect to the OpenAI-compatible API endpoint with tools such as [B
387
  ## Serving with Oobabooga / text-generation-webui
388
 
389
  The model may also be loaded via [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui/) in a similar manner to other models.
390
- See the requirements below. Note that inference with Transformers is significantly slower than using the recommended OpenChat vLLM server.
391
 
392
  ### Oobabooga Key Requirements
393
 
@@ -413,7 +459,9 @@ For "`Bot string`" ...
413
  ```
414
  Assistant:
415
  ```
416
- For "`Context`", it is not necessary but we have found good results with ...
 
 
417
  ```
418
  You are a helpful assistant. Please answer truthfully and write out your thinking step by step to be sure you get the right answer. If you make a mistake or encounter an error in your thinking, say so out loud and attempt to correct it. If you don't know or aren't sure about something, say so clearly. You will act as a professional logician, mathematician, and physicist. You will also act as the most appropriate type of expert to answer any particular question or solve the relevant problem; state which expert type you are, if so. Also think of any particular named expert that would be ideal to answer the relevant question or solve the relevant problem; name and act as them, if appropriate.
419
  ```
@@ -437,7 +485,6 @@ It should look as below:
437
 
438
  Then you should be ready to generate!
439
 
440
-
441
  # Citation
442
 
443
  ```bibtex
@@ -459,7 +506,7 @@ Then you should be ready to generate!
459
  month = {7},
460
  }
461
  @misc{mukherjee2023orca,
462
- title={Orca: Progressive Learning from Complex Explanation Traces of GPT-4},
463
  author={Subhabrata Mukherjee and Arindam Mitra and Ganesh Jawahar and Sahaj Agarwal and Hamid Palangi and Ahmed Awadallah},
464
  year={2023},
465
  eprint={2306.02707},
@@ -467,17 +514,18 @@ Then you should be ready to generate!
467
  primaryClass={cs.CL}
468
  }
469
  @misc{longpre2023flan,
470
- title={The Flan Collection: Designing Data and Methods for Effective Instruction Tuning},
471
  author={Shayne Longpre and Le Hou and Tu Vu and Albert Webson and Hyung Won Chung and Yi Tay and Denny Zhou and Quoc V. Le and Barret Zoph and Jason Wei and Adam Roberts},
472
  year={2023},
473
  eprint={2301.13688},
474
  archivePrefix={arXiv},
475
  primaryClass={cs.AI}
476
  }
477
- @software{touvron2023llama,
478
- title={LLaMA: Open and Efficient Foundation Language Models},
479
- author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume},
480
- journal={arXiv preprint arXiv:2302.13971},
481
- year={2023}
 
482
  }
483
  ```
 
1
  ---
2
+ base_model: https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B
3
  datasets:
4
  - Open-Orca/OpenOrca
5
  inference: false
6
  language:
7
  - en
8
  library_name: transformers
9
+ license: llama2
10
  model_creator: Open-Orca
 
11
  model_name: OpenOrca x OpenChat - Preview2 - 13B
12
  model_type: llama
13
  pipeline_tag: text-generation
14
+ prompt_template: 'user: {prompt}<|end_of_turn|>assistant:
15
+
16
+ '
17
  quantized_by: TheBloke
18
  ---
19
 
 
38
  - Model creator: [Open-Orca](https://huggingface.co/Open-Orca)
39
  - Original model: [OpenOrca x OpenChat - Preview2 - 13B](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B)
40
 
41
+ <!-- description start -->
42
  ## Description
43
 
44
  This repo contains GPTQ model files for [Open-Orca's OpenOrca x OpenChat - Preview2 - 13B](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B).
45
 
46
  Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
47
 
48
+ <!-- description end -->
49
+ <!-- repositories-available start -->
50
  ## Repositories available
51
 
52
+ * [AWQ model(s) for GPU inference.](https://huggingface.co/TheBloke/OpenOrcaxOpenChat-Preview2-13B-AWQ)
53
  * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ)
54
+ * [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/OpenOrcaxOpenChat-Preview2-13B-GGUF)
55
  * [Open-Orca's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B)
56
+ <!-- repositories-available end -->
57
 
58
+ <!-- prompt-template start -->
59
+ ## Prompt template: openchat llama2 v1
60
 
61
  ```
62
+ user: {prompt}<|end_of_turn|>assistant:
63
+
64
  ```
65
 
66
+ <!-- prompt-template end -->
67
+
68
+
69
+ <!-- README_GPTQ.md-provided-files start -->
70
  ## Provided files and GPTQ parameters
71
 
72
  Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.
73
 
74
  Each separate quant is in a different branch. See below for instructions on fetching from different branches.
75
 
76
+ All recent GPTQ files, and all files in non-main branches, are made with AutoGPTQ. Files in the `main` branch that were uploaded before August 2023 were made with GPTQ-for-LLaMa.
77
 
78
  <details>
79
  <summary>Explanation of GPTQ parameters</summary>
80
 
81
  - Bits: The bit size of the quantised model.
82
  - GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
83
+ - Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
84
  - Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
85
+ - GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
86
+ - Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
87
  - ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama models in 4-bit.
88
 
89
  </details>
90
 
91
  | Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
92
  | ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
93
+ | [main](https://huggingface.co/TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ/tree/main) | 4 | 128 | No | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 7.26 GB | Yes | 4-bit, without Act Order and group size 128g. |
94
+ | [gptq-4bit-32g-actorder_True](https://huggingface.co/TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ/tree/gptq-4bit-32g-actorder_True) | 4 | 32 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 8.00 GB | Yes | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. |
95
+ | [gptq-4bit-64g-actorder_True](https://huggingface.co/TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ/tree/gptq-4bit-64g-actorder_True) | 4 | 64 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 7.51 GB | Yes | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. |
96
+ | [gptq-4bit-128g-actorder_True](https://huggingface.co/TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ/tree/gptq-4bit-128g-actorder_True) | 4 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 7.26 GB | Yes | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
97
+ | [gptq-8bit--1g-actorder_True](https://huggingface.co/TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ/tree/gptq-8bit--1g-actorder_True) | 8 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 13.36 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements. |
98
+ | [gptq-8bit-128g-actorder_True](https://huggingface.co/TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ/tree/gptq-8bit-128g-actorder_True) | 8 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 13.65 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. |
99
+
100
+ <!-- README_GPTQ.md-provided-files end -->
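The GPTQ parameters in the table above are also recorded in each branch's `quantize_config.json`. As an illustrative sketch (an editorial addition, not part of the original README), you could pull just that small file and inspect a branch's settings before downloading the full weights; the field names mentioned in the comment are the usual AutoGPTQ ones and are assumptions here:

```python
import json

from huggingface_hub import hf_hub_download

# Download only quantize_config.json from the chosen branch (revision)
config_path = hf_hub_download(
    repo_id="TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ",
    filename="quantize_config.json",
    revision="gptq-4bit-32g-actorder_True",
)

with open(config_path) as f:
    quantize_config = json.load(f)

# Typically includes fields such as bits, group_size, desc_act (Act Order) and damp_percent
print(quantize_config)
```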
101
 
102
+ <!-- README_GPTQ.md-download-from-branches start -->
103
  ## How to download from branches
104
 
105
+ - In text-generation-webui, you can add `:branch` to the end of the download name, eg `TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ:main`
106
  - With Git, you can clone a branch with:
107
  ```
108
+ git clone --single-branch --branch main https://huggingface.co/TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ
109
  ```
110
  - In Python Transformers code, the branch is the `revision` parameter; see below.
111
+ <!-- README_GPTQ.md-download-from-branches end -->
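If you prefer to script the download rather than use text-generation-webui or Git, here is a minimal sketch with `huggingface_hub` (an editorial addition; the branch name and local directory are just examples):

```python
from huggingface_hub import snapshot_download

# Download every file from the gptq-4bit-32g-actorder_True branch into a local folder
snapshot_download(
    repo_id="TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ",
    revision="gptq-4bit-32g-actorder_True",
    local_dir="OpenOrcaxOpenChat-Preview2-13B-GPTQ-4bit-32g",
)
```

The downloaded folder can then be passed as the model path in the Python example further down.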
112
+ <!-- README_GPTQ.md-text-generation-webui start -->
113
  ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
114
 
115
  Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
116
 
117
+ It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.
118
 
119
  1. Click the **Model tab**.
120
  2. Under **Download custom model or LoRA**, enter `TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ`.
121
+ - To download from a specific branch, enter for example `TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ:main`
122
  - see Provided Files above for the list of branches for each option.
123
  3. Click **Download**.
124
+ 4. The model will start downloading. Once it's finished it will say "Done".
125
  5. In the top left, click the refresh icon next to **Model**.
126
  6. In the **Model** dropdown, choose the model you just downloaded: `OpenOrcaxOpenChat-Preview2-13B-GPTQ`
127
  7. The model will automatically load, and is now ready for use!
128
  8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
129
+ * Note that you no longer need to, and should not, set GPTQ parameters manually. These are set automatically from the file `quantize_config.json`.
130
  9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
131
+ <!-- README_GPTQ.md-text-generation-webui end -->
132
 
133
+ <!-- README_GPTQ.md-use-from-python start -->
134
  ## How to use this GPTQ model from Python code
135
 
136
+ ### Install the necessary packages
137
 
138
+ Requires: Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.
 
 
139
 
140
+ ```shell
141
+ pip3 install transformers>=4.32.0 optimum>=1.12.0
142
+ pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ # Use cu117 if on CUDA 11.7
143
  ```
144
+
145
+ If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead:
146
+
147
+ ```shell
148
  pip3 uninstall -y auto-gptq
149
  git clone https://github.com/PanQiWei/AutoGPTQ
150
  cd AutoGPTQ
151
  pip3 install .
152
  ```
153
 
154
+ ### For CodeLlama models only: you must use Transformers 4.33.0 or later.
155
+
156
+ If 4.33.0 is not yet released when you read this, you will need to install Transformers from source:
157
+ ```shell
158
+ pip3 uninstall -y transformers
159
+ pip3 install git+https://github.com/huggingface/transformers.git
160
+ ```
161
+
162
+ ### You can then use the following code
163
 
164
  ```python
165
+ from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
 
166
 
167
  model_name_or_path = "TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ"
168
+ # To use a different branch, change revision
169
+ # For example: revision="main"
170
+ model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
171
+ device_map="auto",
172
+ trust_remote_code=False,
173
+ revision="main")
174
 
175
  tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
176

177
  prompt = "Tell me about AI"
178
+ prompt_template=f'''user: {prompt}<|end_of_turn|>assistant:
179
+
180
  '''
181
 
182
  print("\n\n*** Generate:")
183
 
184
  input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
185
+ output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
186
  print(tokenizer.decode(output[0]))
187
 
188
  # Inference can also be done using transformers' pipeline
189
 
 
 
 
190
  print("*** Pipeline:")
191
  pipe = pipeline(
192
  "text-generation",
193
  model=model,
194
  tokenizer=tokenizer,
195
  max_new_tokens=512,
196
+ do_sample=True,
197
  temperature=0.7,
198
  top_p=0.95,
199
+ top_k=40,
200
+ repetition_penalty=1.1
201
  )
202
 
203
  print(pipe(prompt_template)[0]['generated_text'])
204
  ```
205
+ <!-- README_GPTQ.md-use-from-python end -->
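As an optional follow-on (an editorial sketch, not part of the original README), transformers' `TextStreamer` can print tokens as they are generated, which is convenient for interactive testing. It reuses the `model`, `tokenizer` and `input_ids` objects from the example above:

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are produced, skipping the prompt itself
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    inputs=input_ids,
    streamer=streamer,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512,
)
```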
206
 
207
+ <!-- README_GPTQ.md-compatibility start -->
208
  ## Compatibility
209
 
210
+ The files provided are tested to work with AutoGPTQ, both via Transformers and using AutoGPTQ directly. They should also work with [Occ4m's GPTQ-for-LLaMa fork](https://github.com/0cc4m/KoboldAI).
211
+
212
+ [ExLlama](https://github.com/turboderp/exllama) is compatible with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.
213
 
214
+ [Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is compatible with all GPTQ models.
215
+ <!-- README_GPTQ.md-compatibility end -->
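For TGI, a minimal client-side sketch (an editorial addition; it assumes a TGI server is already running, `huggingface_hub` is installed, and the endpoint URL shown is a placeholder):

```python
from huggingface_hub import InferenceClient

# Point this at your running Text Generation Inference endpoint
client = InferenceClient(model="http://localhost:8080")

prompt = "user: Tell me about AI<|end_of_turn|>assistant:"
print(client.text_generation(prompt, do_sample=True, temperature=0.7, top_p=0.95, max_new_tokens=512))
```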
216
 
217
  <!-- footer start -->
218
  <!-- 200823 -->
 
222
 
223
  [TheBloke AI's Discord server](https://discord.gg/theblokeai)
224
 
225
+ ## Thanks, and how to contribute
226
 
227
  Thanks to the [chirper.ai](https://chirper.ai) team!
228
 
229
+ Thanks to Clay from [gpus.llm-utils.org](https://gpus.llm-utils.org)!
230
+
231
  I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.
232
 
233
  If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.
 
239
 
240
  **Special thanks to**: Aemon Algiz.
241
 
242
+ **Patreon special mentions**: Alicia Loh, Stephen Murray, K, Ajan Kanaga, RoA, Magnesian, Deo Leter, Olakabola, Eugene Pentland, zynix, Deep Realms, Raymond Fosdick, Elijah Stavena, Iucharbius, Erik Bjäreholt, Luis Javier Navarrete Lozano, Nicholas, theTransient, John Detwiler, alfie_i, knownsqashed, Mano Prime, Willem Michiel, Enrico Ros, LangChain4j, OG, Michael Dempsey, Pierre Kircher, Pedro Madruga, James Bentley, Thomas Belote, Luke @flexchar, Leonard Tan, Johann-Peter Hartmann, Illia Dulskyi, Fen Risland, Chadd, S_X, Jeff Scroggin, Ken Nordquist, Sean Connelly, Artur Olbinski, Swaroop Kallakuri, Jack West, Ai Maven, David Ziegler, Russ Johnson, transmissions 11, John Villwock, Alps Aficionado, Clay Pascal, Viktor Bowallius, Subspace Studios, Rainer Wilmers, Trenton Dambrowitz, vamX, Michael Levine, 준교 김, Brandon Frisco, Kalila, Trailburnt, Randy H, Talal Aujan, Nathan Dryer, Vadim, 阿明, ReadyPlayerEmma, Tiffany J. Kim, George Stoitzev, Spencer Kim, Jerry Meng, Gabriel Tamborski, Cory Kujawski, Jeffrey Morgan, Spiking Neurons AB, Edmond Seymore, Alexandros Triantafyllidis, Lone Striker, Cap'n Zoog, Nikolai Manek, danny, ya boyyy, Derek Yates, usrbinkat, Mandus, TL, Nathan LeClaire, subjectnull, Imad Khwaja, webtim, Raven Klaugh, Asp the Wyvern, Gabriel Puliatti, Caitlyn Gatomon, Joseph William Delisle, Jonathan Leane, Luke Pendergrass, SuperWojo, Sebastain Graf, Will Dee, Fred von Graf, Andrey, Dan Guido, Daniel P. Andersen, Nitin Borwankar, Elle, Vitor Caleffi, biorpg, jjj, NimbleBox.ai, Pieter, Matthew Berman, terasurfer, Michael Davis, Alex, Stanislav Ovsiannikov
243
 
244
 
245
  Thank you to all my generous patrons and donaters!
 
290
 
291
  https://AlignmentLab.ai
292
 
293
+ # Prompt Template
294
+
295
+ We use our own prompt template which we call "`OpenChat Llama2 V1`".
296
+
297
+ The model is heavily conditioned to work with this format only. If the format is not followed properly, you will likely encounter issues such as run-on output that emulates a chat between a user and an assistant.
298
+
299
+
300
+ Examples:
301
+ ```
302
+ # Single-turn `OpenChat Llama2 V1`
303
+ tokenize("You are OpenOrcaChat.<|end_of_turn|>User: Hello<|end_of_turn|>Assistant:")
304
+ # [1, 887, 526, 4673, 2816, 1113, 1451, 271, 29889, 32000, 4911, 29901, 15043, 32000, 4007, 22137, 29901]
305
+
306
+ # Multi-turn `OpenChat Llama2 V1`
307
+ tokenize("You are OpenOrcaChat.<|end_of_turn|>User: Hello<|end_of_turn|>Assistant: Hi<|end_of_turn|>User: How are you today?<|end_of_turn|>Assistant:")
308
+ # [1, 887, 526, 4673, 2816, 1113, 1451, 271, 29889, 32000, 4911, 29901, 15043, 32000, 4007, 22137, 29901, 6324, 32000, 4911, 29901, 1128, 526, 366, 9826, 29973, 32000, 4007, 22137, 29901]
309
+ ```
310
+
311
+ For UIs with Prefix and Suffix fields, these will likely work:
312
+
313
+ Prefix (include a space after colon):
314
+ ```
315
+ User:
316
+ ```
317
+
318
+ Suffix (space after colon):
319
+ ```
320
+ <|end_of_turn|>\nAssistant:
321
+ ```
322
+
323
+ **Oobabooga's text-generation-webui instructions can be found [further down the page](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B#serving-with-oobabooga--text-generation-webui).**
324
+
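As a convenience sketch (an editorial addition, not part of the original card), the format above can also be assembled programmatically. The helper below is hypothetical and simply follows the single-turn and multi-turn examples shown:

```python
def build_openchat_llama2_v1_prompt(messages, system=None):
    """Assemble an `OpenChat Llama2 V1` prompt string.

    messages: list of (role, text) pairs, where role is "user" or "assistant".
    system:   optional system message placed first, e.g. "You are OpenOrcaChat."
    """
    parts = []
    if system:
        parts.append(f"{system}<|end_of_turn|>")
    for role, text in messages:
        speaker = "User" if role == "user" else "Assistant"
        parts.append(f"{speaker}: {text}<|end_of_turn|>")
    # Finish with the Assistant tag so the model continues from there
    return "".join(parts) + "Assistant:"


# Reproduces the multi-turn example above
print(build_openchat_llama2_v1_prompt(
    [("user", "Hello"), ("assistant", "Hi"), ("user", "How are you today?")],
    system="You are OpenOrcaChat.",
))
```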
325
 
326
  # Evaluation
327
 
 
331
 
332
  Average for AGIEval: 0.447
333
 
334
+ We find our scores average **~103%** of the total performance shown in the Orca paper, using the same evaluation methods as outlined in the paper.
 
335
 
336
  So we are surpassing Orca performance with <20% of the dataset size and <1/10th the training budget!
337
 
 
359
 
360
  We place #1 for all 13B models at release time!
361
 
362
+ ![OpenOrca Preview2 HuggingFace Leaderboard Internal Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2HuggingFaceLeaderboard.png "HuggingFace Leaderboard Internal Performance")
363
+
364
+ **Update Aug 10th:** The official results on the leaderboard are below.
365
+
366
+ ![OpenOrca Preview2 HuggingFace Leaderboard Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2HFLeaderboardOfficial.png "HuggingFace Leaderboard Performance")
367
+
368
+ Since our release, a new model that merges an Orca-style model with a Platypus model (trained on STEM and logic) has placed narrowly above ours, but we were #1 at release time.
369
+
370
+ Below we also highlight how our model fits relative to models of all sizes on the current (as of Aug 10th, 2023) leaderboard.
371
+
372
+ ![OpenOrca Preview2 HuggingFace Leaderboard Performance](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B/resolve/main/Images/OpenOrcaP2HFLeaderboardFull.png "HuggingFace Full Leaderboard")
373
+
374
+ Notably, performance is beyond falcon-40b-instruct, and close to LLaMA1-65B base.
375
 
376
  ## GPT4ALL Leaderboard Performance
377
 
 
398
  Please await our full releases for further training details.
399
 
400

401
  # Serving
402
 
403
  This model is most easily served with [OpenChat's](https://github.com/imoneoi/openchat) customized vLLM OpenAI-compatible API server.
404
  This is highly recommended as it is by far the fastest in terms of inference speed and is a quick and easy option for setup.
405
  We also illustrate setup of Oobabooga/text-generation-webui below. The settings outlined there will also apply to other uses of `Transformers`.
406
 
407
+ ## Serving Quantized
408
+
409
+ Pre-quantized models are now available courtesy of our friend TheBloke:
410
+
411
+ * **GGML**: https://huggingface.co/TheBloke/OpenOrcaxOpenChat-Preview2-13B-GGML
412
+ * **GPTQ**: https://huggingface.co/TheBloke/OpenOrcaxOpenChat-Preview2-13B-GPTQ
413
+
414
+ The serving instructions below apply only to the unquantized model presented in this repository.
415
+ Some notes, such as those on the prompt format, will still apply to the quantized models, though.
416
 
417
  ## Serving with OpenChat
418
 
 
433
  ## Serving with Oobabooga / text-generation-webui
434
 
435
  The model may also be loaded via [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui/) in a similar manner to other models.
436
+ See the requirements below. Note that inference with just the Transformers library is significantly slower than using the recommended OpenChat vLLM server.
437
 
438
  ### Oobabooga Key Requirements
439
 
 
459
  ```
460
  Assistant:
461
  ```
462
+ The "`Context`" field is analogous to a system prompt.
463
+ It is not necessary, but we have found good results with the example below.
464
+ System prompts used in the Orca training also work well.
465
  ```
466
  You are a helpful assistant. Please answer truthfully and write out your thinking step by step to be sure you get the right answer. If you make a mistake or encounter an error in your thinking, say so out loud and attempt to correct it. If you don't know or aren't sure about something, say so clearly. You will act as a professional logician, mathematician, and physicist. You will also act as the most appropriate type of expert to answer any particular question or solve the relevant problem; state which expert type you are, if so. Also think of any particular named expert that would be ideal to answer the relevant question or solve the relevant problem; name and act as them, if appropriate.
467
  ```
 
485
 
486
  Then you should be ready to generate!
487
 
 
488
  # Citation
489
 
490
  ```bibtex
 
506
  month = {7},
507
  }
508
  @misc{mukherjee2023orca,
509
+ title={Orca: Progressive Learning from Complex Explanation Traces of GPT-4},
510
  author={Subhabrata Mukherjee and Arindam Mitra and Ganesh Jawahar and Sahaj Agarwal and Hamid Palangi and Ahmed Awadallah},
511
  year={2023},
512
  eprint={2306.02707},
 
514
  primaryClass={cs.CL}
515
  }
516
  @misc{longpre2023flan,
517
+ title={The Flan Collection: Designing Data and Methods for Effective Instruction Tuning},
518
  author={Shayne Longpre and Le Hou and Tu Vu and Albert Webson and Hyung Won Chung and Yi Tay and Denny Zhou and Quoc V. Le and Barret Zoph and Jason Wei and Adam Roberts},
519
  year={2023},
520
  eprint={2301.13688},
521
  archivePrefix={arXiv},
522
  primaryClass={cs.AI}
523
  }
524
+ @misc{touvron2023llama,
525
+ title={Llama 2: Open Foundation and Fine-Tuned Chat Models},
526
+ author={Hugo Touvron and Louis Martin and Kevin Stone and Peter Albert and Amjad Almahairi and Yasmine Babaei and Nikolay Bashlykov and Soumya Batra and Prajjwal Bhargava and Shruti Bhosale and Dan Bikel and Lukas Blecher and Cristian Canton Ferrer and Moya Chen and Guillem Cucurull and David Esiobu and Jude Fernandes and Jeremy Fu and Wenyin Fu and Brian Fuller and Cynthia Gao and Vedanuj Goswami and Naman Goyal and Anthony Hartshorn and Saghar Hosseini and Rui Hou and Hakan Inan and Marcin Kardas and Viktor Kerkez and Madian Khabsa and Isabel Kloumann and Artem Korenev and Punit Singh Koura and Marie-Anne Lachaux and Thibaut Lavril and Jenya Lee and Diana Liskovich and Yinghai Lu and Yuning Mao and Xavier Martinet and Todor Mihaylov and Pushkar Mishra and Igor Molybog and Yixin Nie and Andrew Poulton and Jeremy Reizenstein and Rashi Rungta and Kalyan Saladi and Alan Schelten and Ruan Silva and Eric Michael Smith and Ranjan Subramanian and Xiaoqing Ellen Tan and Binh Tang and Ross Taylor and Adina Williams and Jian Xiang Kuan and Puxin Xu and Zheng Yan and Iliyan Zarov and Yuchen Zhang and Angela Fan and Melanie Kambadur and Sharan Narang and Aurelien Rodriguez and Robert Stojnic and Sergey Edunov and Thomas Scialom},
527
+ year={2023},
528
+ eprint={2307.09288},
529
+ archivePrefix={arXiv},
530
  }
531
  ```