TheBloke committed on
Commit 1430087
1 Parent(s): a0ef7d0

Upload new k-quant GGML quantised models.

Files changed (1)
  1. README.md +57 -50
README.md CHANGED
@@ -1,20 +1,6 @@
  ---
  inference: false
  license: other
- datasets:
- - QingyiSi/Alpaca-CoT
- - teknium/GPT4-LLM-Cleaned
- - teknium/GPTeacher-General-Instruct
- - metaeval/ScienceQA_text_only
- - hellaswag
- - openai/summarize_from_feedback
- - riddle_sense
- - gsm8k
- - OpenAssistant/oasst1
- language:
- - en
- library_name: transformers
- pipeline_tag: text-generation
  ---

  <!-- header start -->
@@ -45,47 +31,64 @@ GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/gger
  ## Repositories available

  * [4-bit GPTQ models for GPU inference](https://huggingface.co/TheBloke/hippogriff-30b-chat-GPTQ)
- * [4-bit, 5-bit, and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/hippogriff-30b-chat-GGML)
  * [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/openaccess-ai-collective/hippogriff-30b-chat)

- ## Prompt template

- ```
- You are a helpful assistant
- USER: prompt goes here
- ASSISTANT:
- ```
- or

- ```
- <|system|> You are a helpful assistant
- <|user|> prompt goes here
- <|model|>
- ```

- ## THE FILES IN MAIN BRANCH REQUIRES LATEST LLAMA.CPP (May 19th 2023 - commit 2d5db48)!

- llama.cpp recently made another breaking change to its quantisation methods - https://github.com/ggerganov/llama.cpp/pull/1508

- I have quantised the GGML files in this repo with the latest version. Therefore you will require llama.cpp compiled on May 19th or later (commit `2d5db48` or later) to use them.

  ## Provided files
  | Name | Quant method | Bits | Size | Max RAM required | Use case |
  | ---- | ---- | ---- | ---- | ---- | ----- |
- | hippogriff-30b.ggmlv3.q4_0.bin | q4_0 | 4 | 18.30 GB | 20.80 GB | 4-bit. |
- | hippogriff-30b.ggmlv3.q4_1.bin | q4_1 | 4 | 20.33 GB | 22.83 GB | 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
- | hippogriff-30b.ggmlv3.q5_0.bin | q5_0 | 5 | 22.37 GB | 24.87 GB | 5-bit. Higher accuracy, higher resource usage and slower inference. |
- | hippogriff-30b.ggmlv3.q5_1.bin | q5_1 | 5 | 24.40 GB | 26.90 GB | 5-bit. Even higher accuracy, resource usage and slower inference. |
- | hippogriff-30b.ggmlv3.q8_0.bin | q8_0 | 8 | 34.56 GB | 37.06 GB | 8-bit. Almost indistinguishable from float16. Huge resource use and slow. Not recommended for normal use. |

  **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

  ## How to run in `llama.cpp`

- Here is an example command line. Adjust for your tastes and needs:

  ```
- ./main -t 10 -ngl 32 -m hippogriff-30b.ggmlv3.q5_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|system|> You are a story writing assistant\n<|user|> prompt\n<|model|>"
  ```
  Change `-t 10` to the number of physical CPU cores you have. For example if your system has 8 cores/16 threads, use `-t 8`.
 
@@ -117,22 +120,26 @@ Donaters will get priority support on any and all AI/LLM/model questions and req
  * Patreon: https://patreon.com/TheBlokeAI
  * Ko-Fi: https://ko-fi.com/TheBlokeAI

- **Patreon special mentions**: Aemon Algiz, Dmitriy Samsonov, Nathan LeClaire, Trenton Dambrowitz, Mano Prime, David Flickinger, vamX, Nikolai Manek, senxiiz, Khalefa Al-Ahmad, Illia Dulskyi, Jonathan Leane, Talal Aujan, V. Lukas, Joseph William Delisle, Pyrater, Oscar Rangel, Lone Striker, Luke Pendergrass, Eugene Pentland, Sebastain Graf, Johann-Peter Hartman.

  Thank you to all my generous patrons and donaters!
  <!-- footer end -->

  # Original model card: OpenAccess AI Collective's Hippogriff 30B Chat

  # Hippogriff 30B Chat

- [<img src="https://huggingface.co/openaccess-ai-collective/hippogriff-30b-chat/resolve/main/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)

- Hippogriff 30B Chat is an experiment that builds on Manticore with new datasets, while removing a few more instruction and chat datasets. It also includes a de-duped subset of the Pygmalion dataset. It also removes all Alpaca style prompts using `###` in favor of
  chat only style prompts using `USER:`,`ASSISTANT:` as well as [pygmalion/metharme prompting](https://huggingface.co/PygmalionAI/metharme-7b#prompting) using `<|system|>, <|user|> and <|model|>` tokens.

- Questions, comments, feedback, looking to donate, or want to help? Reach out on our [Discord](https://discord.gg/EqrvvehG) or email [wing@openaccessaicollective.org](mailto:wing@openaccessaicollective.org)

  # Training Datasets

@@ -140,13 +147,13 @@ Hippogriff 30B Chat is a Llama 30B model fine-tuned on the following datasets

  - OpenAssistant/oasst1 - cleaned dataset, similar to Guanaco
  - synthetic jokes generation and explanation derived from reddit jokes dataset
- - synthetic prose generation and rewriting self-chat
  - Q&A based on provided context
  - self instruct augmented logic_inference_oa
  - de-duped pygmalion dataset, filtered down to RP data, cleaned, english only, 25%
- - [riddle_sense](https://huggingface.co/datasets/riddle_sense) - instruct augmented
  - hellaswag, updated for detailed explanations w 30K+ rows
- - [gsm8k](https://huggingface.co/datasets/gsm8k) - instruct augmented
  - [ewof/code-alpaca-instruct-unfiltered](https://huggingface.co/datasets/ewof/code-alpaca-instruct-unfiltered) synthetic self chat dataset derived from about 1000 rows
  - [subset of QingyiSi/Alpaca-CoT for roleplay and CoT](https://huggingface.co/QingyiSi/Alpaca-CoT)
  - [GPTeacher-General-Instruct](https://huggingface.co/datasets/teknium/GPTeacher-General-Instruct)
@@ -172,22 +179,22 @@ A Spaces demo is not provided for this release due to 30B models currently not f

  ## Build

- Hippogriff was built with [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) on 8xA100 80GB
- - 1.5 epochs taking approximately 12 hours. No further epochs will be released.

  ## Bias, Risks, and Limitations

  Hippogriff has not been aligned to human preferences with techniques like RLHF or deployed with in-the-loop filtering of responses like ChatGPT, so the model can produce problematic outputs (especially when prompted to do so).
  Hippogriff was fine-tuned from the base model LlaMa 30B, please refer to its model card's Limitations Section for relevant information.

- The core functionality of Llama is taking a string of text and predicting the next token. While language models are widely used for tasks other than this,
- there are a lot of unknowns with this work. When prompting Hippogriff it is important to remember that the statistically most likely next token is often
  not the token that produces the most "accurate" text. Never depend upon Hippogriff to produce factually accurate output.

  Llama was trained on the Pile, a dataset known to contain profanity, lewd, and otherwise abrasive language. Depending upon use case Hippogriff may produce
  socially unacceptable text. See Sections 5 and 6 of the Pile paper for a more detailed analysis of the biases in the Pile.

- As with all language models, it is hard to predict in advance how Hippogriff will respond to particular prompts and offensive content may occur without warning.
  We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.

  ## Examples

@@ -196,7 +203,7 @@ TBD

  ```
  USER: write an original short story of at least 300 words in the style of Zora Neale Hurston. Incorporate the following literary devices: Euphemism and Foreshadowing.
- ASSISTANT: Once upon a time, there was a black girl named Rose, whose skin shined like a pearl beneath the hot sun. She lived with her mama and daddy on their farm, where they grew crops to feed the family and neighbors. On this particular day, Rose felt a tinge of excitement as she watched her parents work diligently in their fields. 'It's almost harvest season!' she thought, imagining the delicious meals that awaited them. But, soon after, it became apparent that something sinister lurked amidst the rows of fresh produce.

  As her father hoed around a vine-covered cucumber plant, he unearthed a peculiar rock. It was the size of an acorn but had a strange shape, unlike anything he or his wife had ever seen. "What have we here?" He said, turning to his wife with curiosity and excitement. She examined the rock for herself and declared, "Why, it looks like a piece of coal!" But as both mother and daughter looked closer, it seemed to be far more than a mere piece of coal. A sense of unease crept up Rose's spine when she noticed that the rock resembled the dark face of a baby in the womb. As Rose tried to suppress a scream, her mother gasped, "Sweet Jesus! Let's bury this thing before anyone sees!" And so, they quickly hid the stone back underground, hoping that nobody would ever discover their secret.
  ```
 
  ## Repositories available

  * [4-bit GPTQ models for GPU inference](https://huggingface.co/TheBloke/hippogriff-30b-chat-GPTQ)
+ * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/hippogriff-30b-chat-GGML)
  * [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/openaccess-ai-collective/hippogriff-30b-chat)

+ <!-- compatibility_ggml start -->
+ ## Compatibility

+ ### Original llama.cpp quant methods: `q4_0, q4_1, q5_0, q5_1, q8_0`
+ I have quantised these 'original' quant method files using an older version of llama.cpp, so that they remain compatible with llama.cpp as of May 19th, commit `2d5db48`.
+
+ They should be compatible with all current UIs and libraries that use llama.cpp, such as those listed at the top of this README.
+
+ ### New k-quant methods: `q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K`
+
+ These new quantisation methods are only compatible with llama.cpp as of June 6th, commit `2d43387`.
+
+ They will NOT be compatible with koboldcpp, text-generation-webui, and other UIs and libraries yet. Support is expected to come over the next few days.
+
+ ## Explanation of the new k-quant methods
+
+ The new methods available are:
+ * GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
+ * GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
+ * GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
+ * GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
+ * GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
+ * GGML_TYPE_Q8_K - "type-0" 8-bit quantization. Only used for quantizing intermediate results. The difference to the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type.
+
+ Refer to the Provided Files table below to see what files use which methods, and how.
+ <!-- compatibility_ggml end -->
64
 
65
  ## Provided files
66
  | Name | Quant method | Bits | Size | Max RAM required | Use case |
67
  | ---- | ---- | ---- | ---- | ---- | ----- |
68
+ | hippogriff-30b.ggmlv3.q2_K.bin | q2_K | 2 | 13.60 GB | 16.10 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.vw and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
69
+ | hippogriff-30b.ggmlv3.q3_K_L.bin | q3_K_L | 3 | 17.20 GB | 19.70 GB | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
70
+ | hippogriff-30b.ggmlv3.q3_K_M.bin | q3_K_M | 3 | 15.64 GB | 18.14 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
71
+ | hippogriff-30b.ggmlv3.q3_K_S.bin | q3_K_S | 3 | 13.98 GB | 16.48 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors |
72
+ | hippogriff-30b.ggmlv3.q4_0.bin | q4_0 | 4 | 18.30 GB | 20.80 GB | Original llama.cpp quant method, 4-bit. |
73
+ | hippogriff-30b.ggmlv3.q4_1.bin | q4_1 | 4 | 20.33 GB | 22.83 GB | Original llama.cpp quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
74
+ | hippogriff-30b.ggmlv3.q4_K_M.bin | q4_K_M | 4 | 19.57 GB | 22.07 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K |
75
+ | hippogriff-30b.ggmlv3.q4_K_S.bin | q4_K_S | 4 | 18.30 GB | 20.80 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors |
76
+ | hippogriff-30b.ggmlv3.q5_0.bin | q5_0 | 5 | 22.37 GB | 24.87 GB | Original llama.cpp quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
77
+ | hippogriff-30b.ggmlv3.q5_1.bin | q5_1 | 5 | 24.40 GB | 26.90 GB | Original llama.cpp quant method, 5-bit. Even higher accuracy, resource usage and slower inference. |
78
+ | hippogriff-30b.ggmlv3.q5_K_M.bin | q5_K_M | 5 | 23.02 GB | 25.52 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K |
79
+ | hippogriff-30b.ggmlv3.q5_K_S.bin | q5_K_S | 5 | 22.37 GB | 24.87 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors |
80
+ | hippogriff-30b.ggmlv3.q6_K.bin | q6_K | 6 | 26.69 GB | 29.19 GB | New k-quant method. Uses GGML_TYPE_Q8_K - 6-bit quantization - for all tensors |
81
+ | hippogriff-30b.ggmlv3.q8_0.bin | q8_0 | 8 | 34.56 GB | 37.06 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
82
+
83
 
84
  **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
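Each "Max RAM required" figure in the table is the file size plus roughly 2.5 GB of overhead. As a rough, hypothetical heuristic (the layer count and the assumption that weights are spread evenly across layers are illustrative approximations, not measured values), offloading `n` of the model's layers with `-ngl` shifts about `n/60` of the weight data from RAM to VRAM:

```python
# Rough RAM estimate for a GGML file with partial GPU offload.
# Assumptions (for illustration only): ~2.5 GB fixed overhead, matching the
# table (e.g. 18.30 GB file -> 20.80 GB max RAM), and ~60 transformer layers
# in a 30B LLaMA model, each holding an equal share of the file.
OVERHEAD_GB = 2.5
N_LAYERS = 60

def est_ram_gb(file_size_gb, n_gpu_layers=0):
    cpu_fraction = 1 - min(n_gpu_layers, N_LAYERS) / N_LAYERS
    return round(OVERHEAD_GB + file_size_gb * cpu_fraction, 2)

print(est_ram_gb(18.30))      # no offload: 20.8, matching the q4_0 row
print(est_ram_gb(18.30, 32))  # -ngl 32 leaves ~28/60 of the weights in RAM
```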

  ## How to run in `llama.cpp`

+ I use the following command line; adjust for your tastes and needs:

  ```
+ ./main -t 10 -ngl 32 -m hippogriff-30b.ggmlv3.q5_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "USER: Write a story about llamas\nASSISTANT:"
  ```
  Change `-t 10` to the number of physical CPU cores you have. For example if your system has 8 cores/16 threads, use `-t 8`.
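If you would rather derive that number than guess it, here is a small Python heuristic. It assumes 2-way SMT (hyperthreading), which holds on most desktop x86 CPUs but not on all systems, so treat it as a starting point:

```python
import os

# llama.cpp's -t flag wants physical cores, not hyperthreads.
# os.cpu_count() reports logical CPUs; on a typical SMT-2 machine,
# halving it approximates the physical core count.
logical = os.cpu_count() or 1
threads = max(1, logical // 2)
print(f"./main -t {threads} ...")
```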

  * Patreon: https://patreon.com/TheBlokeAI
  * Ko-Fi: https://ko-fi.com/TheBlokeAI

+ **Special thanks to**: Luke from CarbonQuill, Aemon Algiz, Dmitriy Samsonov.
+
+ **Patreon special mentions**: Ajan Kanaga, Kalila, Derek Yates, Sean Connelly, Luke, Nathan LeClaire, Trenton Dambrowitz, Mano Prime, David Flickinger, vamX, Nikolai Manek, senxiiz, Khalefa Al-Ahmad, Illia Dulskyi, trip7s trip, Jonathan Leane, Talal Aujan, Artur Olbinski, Cory Kujawski, Joseph William Delisle, Pyrater, Oscar Rangel, Lone Striker, Luke Pendergrass, Eugene Pentland, Johann-Peter Hartmann.

  Thank you to all my generous patrons and donaters!
+
  <!-- footer end -->

  # Original model card: OpenAccess AI Collective's Hippogriff 30B Chat

  # Hippogriff 30B Chat

+ [<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)

+ Hippogriff 30B Chat is an experiment that builds on Manticore with new datasets, while removing a few more instruction and chat datasets. It also includes a de-duped subset of the Pygmalion dataset. It removes all Alpaca style prompts using `###` in favor of
  chat only style prompts using `USER:`,`ASSISTANT:` as well as [pygmalion/metharme prompting](https://huggingface.co/PygmalionAI/metharme-7b#prompting) using `<|system|>, <|user|> and <|model|>` tokens.

+ Questions, comments, feedback, looking to donate, or want to help? Reach out on our [Discord](https://discord.gg/PugNNHAF5r) or email [wing@openaccessaicollective.org](mailto:wing@openaccessaicollective.org)

  # Training Datasets

  - OpenAssistant/oasst1 - cleaned dataset, similar to Guanaco
  - synthetic jokes generation and explanation derived from reddit jokes dataset
+ - synthetic prose generation and rewriting self-chat
  - Q&A based on provided context
  - self instruct augmented logic_inference_oa
  - de-duped pygmalion dataset, filtered down to RP data, cleaned, english only, 25%
+ - [riddle_sense](https://huggingface.co/datasets/riddle_sense) - instruct augmented
  - hellaswag, updated for detailed explanations w 30K+ rows
+ - [gsm8k](https://huggingface.co/datasets/gsm8k) - instruct augmented
  - [ewof/code-alpaca-instruct-unfiltered](https://huggingface.co/datasets/ewof/code-alpaca-instruct-unfiltered) synthetic self chat dataset derived from about 1000 rows
  - [subset of QingyiSi/Alpaca-CoT for roleplay and CoT](https://huggingface.co/QingyiSi/Alpaca-CoT)
  - [GPTeacher-General-Instruct](https://huggingface.co/datasets/teknium/GPTeacher-General-Instruct)
 

  ## Build

+ Hippogriff was built with [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) on 8xA100 80GB
+ - 1.5 epochs taking approximately 12 hours. No further epochs will be released.

  ## Bias, Risks, and Limitations

  Hippogriff has not been aligned to human preferences with techniques like RLHF or deployed with in-the-loop filtering of responses like ChatGPT, so the model can produce problematic outputs (especially when prompted to do so).
  Hippogriff was fine-tuned from the base model LLaMA 30B; please refer to its model card's Limitations Section for relevant information.

+ The core functionality of Llama is taking a string of text and predicting the next token. While language models are widely used for tasks other than this,
+ there are a lot of unknowns with this work. When prompting Hippogriff it is important to remember that the statistically most likely next token is often
  not the token that produces the most "accurate" text. Never depend upon Hippogriff to produce factually accurate output.

  Llama was trained on the Pile, a dataset known to contain profanity, lewd, and otherwise abrasive language. Depending upon use case Hippogriff may produce
  socially unacceptable text. See Sections 5 and 6 of the Pile paper for a more detailed analysis of the biases in the Pile.

+ As with all language models, it is hard to predict in advance how Hippogriff will respond to particular prompts, and offensive content may occur without warning.
  We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.

  ## Examples

  ```
  USER: write an original short story of at least 300 words in the style of Zora Neale Hurston. Incorporate the following literary devices: Euphemism and Foreshadowing.
+ ASSISTANT: Once upon a time, there was a black girl named Rose, whose skin shined like a pearl beneath the hot sun. She lived with her mama and daddy on their farm, where they grew crops to feed the family and neighbors. On this particular day, Rose felt a tinge of excitement as she watched her parents work diligently in their fields. 'It's almost harvest season!' she thought, imagining the delicious meals that awaited them. But, soon after, it became apparent that something sinister lurked amidst the rows of fresh produce.

  As her father hoed around a vine-covered cucumber plant, he unearthed a peculiar rock. It was the size of an acorn but had a strange shape, unlike anything he or his wife had ever seen. "What have we here?" He said, turning to his wife with curiosity and excitement. She examined the rock for herself and declared, "Why, it looks like a piece of coal!" But as both mother and daughter looked closer, it seemed to be far more than a mere piece of coal. A sense of unease crept up Rose's spine when she noticed that the rock resembled the dark face of a baby in the womb. As Rose tried to suppress a scream, her mother gasped, "Sweet Jesus! Let's bury this thing before anyone sees!" And so, they quickly hid the stone back underground, hoping that nobody would ever discover their secret.
  ```