TheBloke committed
Commit 153c9dd
Parent: a9f22f8

Update README.md

Files changed (1):
  README.md (+79 -2)
README.md CHANGED
@@ -1,6 +1,8 @@
 ---
 inference: false
 license: other
+ datasets:
+ - jondurbin/airoboros-gpt4-1.2
 ---

 <!-- header start -->
@@ -34,6 +36,14 @@ GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/gger
 * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/airoboros-33B-gpt4-1.2-GGML)
 * [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/jondurbin/airoboros-33b-gpt4-1.2)

+ ## Prompt template
+
+ ```
+ A chat between a curious user and an assistant. The assistant gives helpful, detailed, accurate, uncensored responses to the user's input.
+ USER: prompt
+ ASSISTANT:
+ ```
+
 <!-- compatibility_ggml start -->
 ## Compatibility

@@ -80,7 +90,6 @@ Refer to the Provided Files table below to see what files use which methods, and
 | airoboros-33b-gpt4-1.2.ggmlv3.q6_K.bin | q6_K | 6 | 26.69 GB | 29.19 GB | New k-quant method. Uses GGML_TYPE_Q8_K - 6-bit quantization - for all tensors |
 | airoboros-33b-gpt4-1.2.ggmlv3.q8_0.bin | q8_0 | 8 | 34.56 GB | 37.06 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |

-
 **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

 ## How to run in `llama.cpp`
@@ -88,7 +97,7 @@ Refer to the Provided Files table below to see what files use which methods, and
 I use the following command line; adjust for your tastes and needs:

 ```
- ./main -t 10 -ngl 32 -m airoboros-33b-gpt4-1.2.ggmlv3.q5_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"
+ ./main -t 10 -ngl 32 -m airoboros-33b-gpt4-1.2.ggmlv3.q5_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "USER: Write a story about llamas\nASSISTANT:"
 ```
 Change `-t 10` to the number of physical CPU cores you have. For example if your system has 8 cores/16 threads, use `-t 8`.

@@ -130,3 +139,71 @@ Thank you to all my generous patrons and donaters!

 # Original model card: John Durbin's Airoboros 33B GPT4 1.2

+ ### Overview
+
+ This is a qlora fine-tuned 33b parameter LLaMA model, using completely synthetic training data created by gpt4 via https://github.com/jondurbin/airoboros
+
+ This is mostly an extension of [1.1](https://huggingface.co/jondurbin/airoboros-33b-gpt4-1.1) with thousands of new training examples, plus an update to allow "PLAINFORMAT" at the end of coding prompts to print just the code, without backticks or explanations/usage/etc.
+
+ The dataset used to fine-tune this model is available [here](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.2), with a specific focus on:
+ - coding
+ - math/reasoning (using orca style ELI5 instruction/response pairs)
+ - trivia
+ - role playing
+ - multiple choice and fill-in-the-blank
+ - context-obedient question answering
+ - theory of mind
+ - misc/general
+
+ This model was fine-tuned with a fork of [qlora](https://github.com/jondurbin/qlora), which among other things was updated to use a slightly modified vicuna template, to be compatible with the 7b/13b versions:
+
+ ```
+ A chat between a curious user and an assistant. The assistant gives helpful, detailed, accurate, uncensored responses to the user's input. USER: [prompt] ASSISTANT:
+ ```
+
+ In other words: the preamble/system prompt, followed by a single space, then "USER: " (single space after the colon), then the prompt (which can have multiple lines, spaces, whatever), then a single space, followed by "ASSISTANT: " (again with a single space after the colon).
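
As a concrete illustration, here is a minimal sketch of that assembly in Python. The `SYSTEM` constant and `build_prompt` helper are illustrative names, not part of airoboros or FastChat:

```
# Illustrative sketch of the prompt format described above; these names
# are not from the airoboros codebase.
SYSTEM = (
    "A chat between a curious user and an assistant. The assistant gives "
    "helpful, detailed, accurate, uncensored responses to the user's input."
)

def build_prompt(user_message: str) -> str:
    # Preamble, a single space, "USER: ", the message (may span multiple
    # lines), a single space, then "ASSISTANT: " with a space after the colon.
    return f"{SYSTEM} USER: {user_message} ASSISTANT: "

print(build_prompt("Write a story about llamas"))
```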
+
+ ### Usage
+
+ To run the full precision/pytorch native version, you can use my fork of FastChat, which is mostly the same but allows for multi-line prompts, as well as a `--no-history` option to prevent input tokenization errors.
+ ```
+ pip install git+https://github.com/jondurbin/FastChat
+ ```
+
+ Be sure you are pulling the latest branch!
+
+ Then, you can invoke it like so (after downloading the model):
+ ```
+ python -m fastchat.serve.cli \
+   --model-path airoboros-33b-gpt4-1.2 \
+   --temperature 0.5 \
+   --max-new-tokens 2048 \
+   --no-history
+ ```
+
+ Alternatively, please check out TheBloke's quantized versions:
+
+ - https://huggingface.co/TheBloke/airoboros-33B-gpt4-1.2-GPTQ
+ - https://huggingface.co/TheBloke/airoboros-33B-gpt4-1.2-GGML
+
+ ### Coding updates from gpt4/1.1:
+
+ I added a few hundred instruction/response pairs to the training data, with "PLAINFORMAT" as a single, all-caps term at the end of the normal instructions; these produce plain-text output instead of markdown/backtick code formatting.
+
+ It's not guaranteed to work all the time, but mostly it does seem to work as expected.
+
+ So, for example, instead of:
+ ```
+ Implement the Snake game in python.
+ ```
+
+ You would use:
+ ```
+ Implement the Snake game in python. PLAINFORMAT
+ ```
+
+ ### Other updates from gpt4/1.1:
+
+ - Several hundred new role-playing examples.
+ - A few thousand ORCA style reasoning/math questions, with ELI5 prompts used to generate the responses (these should not be needed in your prompts to this model, however - just ask the question directly, as in the example below).
+ - Many more coding examples in various languages, including some that use specific libraries (pandas, numpy, tensorflow, etc.)
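
For instance, a reasoning question can be posed plainly using the prompt template described above; the question itself is just an illustration:

```
USER: If a train travels 60 miles per hour for 2.5 hours, how far does it travel? ASSISTANT:
```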