Upload new k-quant GGML quantised models.

---
inference: false
license: other
---

<!-- header start -->

GGML files are for CPU + GPU inference using [llama.cpp](https://github.com/ggerganov/llama.cpp).

## Repositories available

* [4-bit GPTQ models for GPU inference](https://huggingface.co/TheBloke/airoboros-13b-gpt4-GPTQ)
* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/airoboros-13b-gpt4-GGML)
* [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/TheBloke/airoboros-13b-gpt4-fp16)

<!-- compatibility_ggml start -->
## Compatibility

### Original llama.cpp quant methods: `q4_0, q4_1, q5_0, q5_1, q8_0`

I have quantised these 'original' quantisation methods using an older version of llama.cpp so that they remain compatible with llama.cpp as of May 19th, commit `2d5db48`.

They should be compatible with all current UIs and libraries that use llama.cpp, such as those listed at the top of this README.

### New k-quant methods: `q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K`

These new quantisation methods are only compatible with llama.cpp as of June 6th, commit `2d43387`.

They will NOT be compatible with koboldcpp, text-generation-webui, or other UIs and libraries yet. Support is expected to come over the next few days.

## Explanation of the new k-quant methods

The new methods available are:
* GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
* GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
* GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
* GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
* GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
* GGML_TYPE_Q8_K - "type-0" 8-bit quantization. Only used for quantizing intermediate results. The difference to the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type.

Refer to the Provided Files table below to see what files use which methods, and how.
<!-- compatibility_ggml end -->
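
As a rough sanity check, several of the bpw figures above can be reproduced from the stated block layouts. The small Python sketch below is my own illustration, not llama.cpp code; it assumes 256 weights per super-block, one fp16 super-block scale for "type-0" methods, and an additional fp16 super-block min for "type-1" methods:

```
def bpw(weight_bits, blocks, block_size, scale_bits, min_bits=0, type1=False):
    """Bits per weight for one k-quant super-block, under the assumed layout."""
    weights = blocks * block_size                  # 256 for all k-quant types
    payload = weights * weight_bits                # the quantised weights themselves
    block_meta = blocks * (scale_bits + min_bits)  # per-block scales (and mins)
    super_meta = 32 if type1 else 16               # fp16 scale (+ fp16 min for "type-1")
    return (payload + block_meta + super_meta) / weights

print(bpw(3, blocks=16, block_size=16, scale_bits=6))             # 3.4375 (Q3_K)
print(bpw(6, blocks=16, block_size=16, scale_bits=8))             # 6.5625 (Q6_K)
print(bpw(4, blocks=8, block_size=32, scale_bits=6, min_bits=6,
          type1=True))                                            # 4.5 (Q4_K)

# File sizes in the table below are roughly (13B parameters * bpw / 8) bytes:
print(13.02e9 * 4.5 / 8 / 1e9)  # ~7.32 GB, matching the q4_K_S row
```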

## Provided files
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| airoboros-13b-gpt4.ggmlv3.q4_0.bin | q4_0 | 4 | 7.32 GB | 9.82 GB | Original llama.cpp quant method, 4-bit. |
| airoboros-13b-gpt4.ggmlv3.q4_1.bin | q4_1 | 4 | 8.14 GB | 10.64 GB | Original llama.cpp quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
| airoboros-13b-gpt4.ggmlv3.q5_0.bin | q5_0 | 5 | 8.95 GB | 11.45 GB | Original llama.cpp quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
| airoboros-13b-gpt4.ggmlv3.q5_1.bin | q5_1 | 5 | 9.76 GB | 12.26 GB | Original llama.cpp quant method, 5-bit. Even higher accuracy, resource usage and slower inference. |
| airoboros-13b-gpt4.ggmlv3.q8_0.bin | q8_0 | 8 | 13.83 GB | 16.33 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
| airoboros-13b.ggmlv3.q2_K.bin | q2_K | 2 | 5.43 GB | 7.93 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
| airoboros-13b.ggmlv3.q3_K_L.bin | q3_K_L | 3 | 6.87 GB | 9.37 GB | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K. |
| airoboros-13b.ggmlv3.q3_K_M.bin | q3_K_M | 3 | 6.25 GB | 8.75 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K. |
| airoboros-13b.ggmlv3.q3_K_S.bin | q3_K_S | 3 | 5.59 GB | 8.09 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors. |
| airoboros-13b.ggmlv3.q4_K_M.bin | q4_K_M | 4 | 7.82 GB | 10.32 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K. |
| airoboros-13b.ggmlv3.q4_K_S.bin | q4_K_S | 4 | 7.32 GB | 9.82 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors. |
| airoboros-13b.ggmlv3.q5_K_M.bin | q5_K_M | 5 | 9.21 GB | 11.71 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K. |
| airoboros-13b.ggmlv3.q5_K_S.bin | q5_K_S | 5 | 8.95 GB | 11.45 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors. |
| airoboros-13b.ggmlv3.q6_K.bin | q6_K | 6 | 10.68 GB | 13.18 GB | New k-quant method. Uses GGML_TYPE_Q6_K (6-bit quantization) for all tensors. |

**Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

I use the following command line; adjust for your tastes and needs:

```
./main -t 10 -ngl 32 -m airoboros-13b.ggmlv3.q5_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"
```
Change `-t 10` to the number of physical CPU cores you have. For example if your system has 8 cores/16 threads, use `-t 8`.
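
If you would rather call the model from Python, the `llama-cpp-python` bindings expose roughly the same knobs as the CLI flags above. This is a sketch under the assumption of a recent `llama-cpp-python` build (GPU offload only applies if it was compiled with GPU support); check the parameter names against your installed version:

```
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="airoboros-13b.ggmlv3.q5_0.bin",
    n_ctx=2048,       # -c 2048
    n_threads=8,      # -t: set to your number of physical CPU cores
    n_gpu_layers=32,  # -ngl 32
)

output = llm(
    "### Instruction: Write a story about llamas\n### Response:",
    max_tokens=512,      # unlike -n -1, set an explicit cap here
    temperature=0.7,     # --temp 0.7
    repeat_penalty=1.1,  # --repeat_penalty 1.1
)
print(output["choices"][0]["text"])
```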

* Patreon: https://patreon.com/TheBlokeAI
* Ko-Fi: https://ko-fi.com/TheBlokeAI

**Special thanks to**: Luke from CarbonQuill, Aemon Algiz, Dmitriy Samsonov.

**Patreon special mentions**: Ajan Kanaga, Kalila, Derek Yates, Sean Connelly, Luke, Nathan LeClaire, Trenton Dambrowitz, Mano Prime, David Flickinger, vamX, Nikolai Manek, senxiiz, Khalefa Al-Ahmad, Illia Dulskyi, trip7s trip, Jonathan Leane, Talal Aujan, Artur Olbinski, Cory Kujawski, Joseph William Delisle, Pyrater, Oscar Rangel, Lone Striker, Luke Pendergrass, Eugene Pentland, Johann-Peter Hartmann.

Thank you to all my generous patrons and donaters!

<!-- footer end -->

# Original model card: Jon Durbin's Airoboros 13B GPT4

## Overview

This is a fine-tuned 13b parameter LlaMa model, using completely synthetic training data created by gpt4 via https://github.com/jondurbin/airoboros

The dataset used to fine-tune this model is available [here](https://huggingface.co/datasets/jondurbin/airoboros-gpt4), with a specific focus on:
- trivia
- math/reasoning (although it still sucks)
- coding

This model was fine-tuned with a fork of FastChat, and therefore uses the standard vicuna template:
```
USER: [prompt] ASSISTANT:
```

*__NOTE: an earlier version claimed a context length of 4096 - this did not work! I modified the code to train with 4096, and several instructions are beyond 2048. I tested a few prompts beyond 2048, and they seem to produce fairly coherent responses with increased context length for a couple hundred tokens beyond 2048, but I did not properly test up to 4096. As it turns out, it would appear that without a massive fine-tune of the base model on a larger context window, this won't work. Sorry!__*

The most important bit, to me, is the context obedient question answering support, without extensive prompt engineering.

### Usage

```
python -m fastchat.serve.cli \
  --model-path airoboros-13b-gpt4 \
  --temperature 0.5 \
  --max-new-tokens 4096 \
  --conv-template vicuna_v1.1 \
  --no-history
```

The format for a closed-context prompt:
```
BEGININPUT
BEGINCONTEXT
[key0: value0]
[key1: value1]
ENDCONTEXT
[insert your text blocks here]
ENDINPUT
[repeat as many input blocks in this format as you want]
BEGININSTRUCTION
[insert your instruction(s)]
ENDINSTRUCTION
```

It's also helpful to add "Don't make up answers if you don't know." to your instruction block to make sure if the context is completely unrelated it doesn't make something up.

*The __only__ prompts that need this closed-context formatting are closed-context instructions. Normal questions/instructions do not!*

I know it's a bit verbose and annoying, but after much trial and error, using these explicit delimiters helps the model understand where to find the responses and how to associate specific sources with them:
- `BEGININPUT` - denotes a new input block
- `BEGINCONTEXT` - denotes the block of context (metadata key/value pairs) to associate with the current input block
- `ENDCONTEXT` - denotes the end of the metadata block for the current input
- [text] - insert whatever text you want for the input block, as many paragraphs as can fit in the context
- `ENDINPUT` - denotes the end of the current input block
- [repeat as many input blocks in this format as you want]
- `BEGININSTRUCTION` - denotes the start of the instruction(s) to respond to for all of the input blocks above
- [instruction(s)]
- `ENDINSTRUCTION` - denotes the end of the instruction set

It sometimes works without `ENDINSTRUCTION`, but by explicitly including that in the prompt, the model better understands that all of the instructions in the block should be responded to.

Here's a trivial, but important example to prove the point:
```
BEGININPUT
BEGINCONTEXT
date: 2021-01-01
url: https://web.site/123
ENDCONTEXT
In a shocking turn of events, blueberries are now green, but will be sticking with the same name.
ENDINPUT
BEGININSTRUCTION
What color are blueberries? Source?
ENDINSTRUCTION
```

And the response:
```
Blueberries are now green.
Source:
date: 2021-01-01
url: https://web.site/123
```

The prompt itself should be wrapped in the vicuna 1.1 template if you aren't using fastchat with the conv-template vicuna_v1.1 as described:

```
USER: BEGININPUT
BEGINCONTEXT
date: 2021-01-01
url: https://web.site/123
ENDCONTEXT
In a shocking turn of events, blueberries are now green, but will be sticking with the same name.
ENDINPUT
BEGININSTRUCTION
What color are blueberries? Source?
ENDINSTRUCTION
ASSISTANT:
```
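
For illustration, assembling and wrapping such a prompt programmatically might look like this hypothetical Python helper (mine, not part of airoboros or FastChat):

```
def closed_context_prompt(inputs, instructions):
    """Build a closed-context prompt from (metadata, text) pairs, following
    the delimiter scheme above. Helper name and structure are illustrative."""
    parts = []
    for context, text in inputs:          # repeat as many input blocks as you want
        parts.append("BEGININPUT")
        parts.append("BEGINCONTEXT")
        parts.extend(f"{key}: {value}" for key, value in context.items())
        parts.append("ENDCONTEXT")
        parts.append(text)                # the input block itself
        parts.append("ENDINPUT")
    parts.append("BEGININSTRUCTION")
    parts.extend(instructions)            # one or more instructions
    parts.append("ENDINSTRUCTION")        # explicitly close the instruction block
    return "\n".join(parts)

prompt = closed_context_prompt(
    inputs=[({"date": "2021-01-01", "url": "https://web.site/123"},
             "In a shocking turn of events, blueberries are now green, "
             "but will be sticking with the same name.")],
    instructions=["What color are blueberries? Source?"],
)

# Wrap in the vicuna 1.1 template when not using fastchat's conv-template:
full_prompt = f"USER: {prompt} ASSISTANT:"
```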

<details>
<summary>A more elaborate example, with a rewrite of the Michigan Wikipedia article to be fake data.</summary>

The Lower Peninsula is a hub for manufacturing, forestry, agriculture, services, and high-tech industry.
ENDINPUT
BEGININSTRUCTION
What is Michigan's capital?
Give me a JSON object with 3 facts from Foolo's article.
What are the states that border Michigan?
Reference?

</details>

NOTE: Thanks /u/tareq_al_muntasir for testing and finding an issue with many question and answer pairs in the context. If you ask a question of a document containing question/answer pairs, the model may continue generating beyond your actual question. You can "fix" it by replacing question marks with periods in the input texts. Or, you might be able to add a preamble to the prompt, like "Be sure to only respond to the instructions in the BEGININSTRUCTION block."
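
For illustration, that question-mark workaround is just a string substitution on the input documents (hypothetical snippet, not from the original card):

```
doc = "Is Lansing the capital of Michigan? Yes. Does it border Ohio? Yes."
# Neutralise question marks in the input text so the model does not treat
# them as additional instructions to answer:
safe_doc = doc.replace("?", ".")
```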

### Other stuff

#### Better coding

```
void insert(char* key, char* value) {
    /* ... */
        perror("malloc failed");
        exit(1);
    }

    strcpy(parent->key, key);
    strcpy(parent->value, value);

    parent->left = parent->right = NULL;

    if (!tree) {
        tree = parent;
    } else {
    /* ... */
}

char* search(char* key) {
    /* ... */
        }
        current = current->right;
    }

    return NULL;
}

void print_tree() {
    /* ... */
}

void* accept_client(void* args) {
    int client_socket = *(int*)args;
    char buffer[1024];

    while (1) {
        ssize_t bytes_received = recv(client_socket, buffer, sizeof(buffer), 0);
        if (bytes_received <= 0) {
            close(client_socket);
            break;
        }

        char* input = buffer;
        char* input_token;

        while ((input_token = strtok(input, " \n")) != NULL) {
            switch (input_token[0]) {
                case 'G':
                    /* ... */
                    }
                    printf("%s = %s\n", input_token, search(input_token));
                    break;

                case 'S':
                    if (!strcmp(input_token, "SET")) {
                        char* key = strtok(NULL, " \n");
                        /* ... */
                            perror("Invalid input");
                            exit(1);
                        }

                        insert(key, value);
                        printf("%s set %s to %s\n", input_token, key, value);
                    } else {
                        /* ... */
                        exit(1);
                    }
                    break;

                default:
                    perror("Invalid input");
                    exit(1);
            }

            input = strtok(NULL, " \n");
        }
    }

    return NULL;
}

int main() {
    /* ... */
        perror("socket failed");
        exit(1);
    }

    struct sockaddr_in server_addr;
    memset(&server_addr, 0, sizeof(server_addr));
    server_addr.sin_family = AF_INET;
    server_addr.sin_port = htons(8080);
    server_addr.sin_addr.s_addr = INADDR_ANY;

    if (bind(server_socket, (struct sockaddr*)&server_addr, sizeof(server_addr)) < 0) {
        perror("bind failed");
        exit(1);
    }

    if (listen(server_socket, 5) < 0) {
        perror("listen failed");
        exit(1);
    }

    pthread_t accept_thread;
    pthread_create(&accept_thread, NULL, accept_client, &server_socket);

    char* client_input;
    int client_socket = accept(server_socket, (struct sockaddr*)NULL, NULL);
    if (client_socket < 0) {
        perror("accept failed");
        exit(1);
    }

    while (1) {
        sleep(1);

        char buffer[1024];
        ssize_t bytes_received = recv(client_socket, buffer, sizeof(buffer), 0);
        if (bytes_received <= 0) {
            close(client_socket);
            break;
        }

        client_input = buffer;
        parse_input(client_input);
    }

    close(client_socket);
    pthread_join(accept_thread, NULL);

    return 0;
}
```