# GGUF

GGUF is a file format for storing models for inference with GGML and executors based on GGML. GGUF is a binary format that is designed for fast loading and saving of models, and for ease of reading. Models are traditionally developed using PyTorch or another framework, and then converted to GGUF for use in GGML.

It is a successor file format to GGML, GGMF and GGJT, and is designed to be unambiguous by containing all the information needed to load a model. It is also designed to be extensible, so that new information can be added to models without breaking compatibility.

For more information about the motivation behind GGUF, see [Historical State of Affairs](#historical-state-of-affairs).

## Specification

GGUF is a format based on the existing GGJT, but makes a few changes to make it more extensible and easier to use. The following features are desired:

- Single-file deployment: models can be easily distributed and loaded, and do not require any external files for additional information.
- Extensible: new features can be added to GGML-based executors, and new information can be added to GGUF models, without breaking compatibility with existing models.
- `mmap` compatibility: models can be loaded using `mmap` for fast loading and saving.
- Easy to use: models can be loaded and saved using a small amount of code, with no need for external libraries, regardless of the language used.
- Full information: all information needed to load a model is contained in the model file, and no additional information needs to be provided by the user.

The key difference between GGJT and GGUF is the use of a key-value structure for the hyperparameters (now referred to as metadata), rather than a list of untyped values. This allows new metadata to be added without breaking compatibility with existing models, and allows models to be annotated with additional information that may be useful for inference or for identifying the model.

### GGUF Naming Convention

GGUF files follow a naming convention of `<BaseName><SizeLabel><FineTune><Version><Encoding><Type><Shard>.gguf`, where each component is delimited by a `-` if present. Ultimately, this is intended to make it easier for humans to see the most important details of a model at a glance. It is not intended to be perfectly parsable in the field, due to the diversity of existing GGUF filenames.

The components are:
1. **BaseName**: A descriptive name for the model base type or architecture.
    - This can be derived from the GGUF metadata `general.basename`, replacing spaces with dashes.
1. **SizeLabel**: Parameter weight class (useful for leaderboards), represented as `<expertCount>x<count><scale-prefix>`.
    - This can be derived from the GGUF metadata `general.size_label` if available, or calculated if missing.
    - A rounded decimal count is supported, with a single-letter scale prefix as shown below:
      - `Q`: Quadrillion parameters.
      - `T`: Trillion parameters.
      - `B`: Billion parameters.
      - `M`: Million parameters.
      - `K`: Thousand parameters.
    - Additional `-<attributes><count><scale-prefix>` segments can be appended as needed to indicate other attributes of interest.
1. **FineTune**: A descriptive name for the model fine-tuning goal (e.g. Chat, Instruct, etc.).
    - This can be derived from the GGUF metadata `general.finetune`, replacing spaces with dashes.
1. **Version**: (Optional) Denotes the model version number, formatted as `v<Major>.<Minor>`.
    - If the model is missing a version number, assume `v1.0` (first public release).
    - This can be derived from the GGUF metadata `general.version`.
1. **Encoding**: Indicates the weight encoding scheme that was applied to the model. The content, type mixture and arrangement are, however, determined by user code and can vary depending on project needs.
1. **Type**: Indicates the kind of GGUF file and its intended purpose.
    - If missing, the file is assumed to be a typical GGUF tensor model file.
    - `LoRA`: GGUF file is a LoRA adapter.
    - `vocab`: GGUF file containing only vocab data and metadata.
1. **Shard**: (Optional) Indicates that the model has been split into multiple shards, formatted as `<ShardNum>-of-<ShardTotal>`.
    - *ShardNum*: Shard position in this model. Must be 5 digits, zero-padded.
      - Shard numbering always starts from `00001` (e.g. the first shard is `00001-of-XXXXX` rather than `00000-of-XXXXX`).
    - *ShardTotal*: Total number of shards in this model. Must be 5 digits, zero-padded.

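As a concrete illustration of how these components compose, the sketch below assembles a filename from typical `general.*` metadata values. It is only a sketch of the convention described above; the helper name and the choice of which optional components to include are illustrative, not part of the specification.

```python
def build_gguf_filename(basename: str, size_label: str, version: str = "v1.0",
                        finetune: str | None = None, encoding: str | None = None,
                        type_: str | None = None, shard: str | None = None) -> str:
    """Assemble a GGUF filename from naming-convention components (illustrative helper)."""
    # Spaces in BaseName/FineTune become dashes, per the convention above.
    parts = [basename.replace(" ", "-"), size_label]
    if finetune:
        parts.append(finetune.replace(" ", "-"))
    parts.append(version)          # assume v1.0 when the model has no version number
    if encoding:
        parts.append(encoding)
    if type_:
        parts.append(type_)        # "LoRA" or "vocab"
    if shard:
        parts.append(shard)        # e.g. "00003-of-00009"
    return "-".join(parts) + ".gguf"

# e.g. build_gguf_filename("Mixtral", "8x7B", "v0.1", encoding="KQ2")
#   -> "Mixtral-8x7B-v0.1-KQ2.gguf"
```
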
#### Validating Above Naming Convention

At a minimum, all model files should have at least BaseName, SizeLabel and Version, so that they can be easily validated as files that follow the GGUF naming convention. Without this minimum, ambiguities arise; for example, it is easy for an Encoding to be mistaken for a FineTune if Version is omitted.

To validate a filename, you can use this regular expression, which checks that at least BaseName, SizeLabel and Version are present in the correct order: `^(?<BaseName>[A-Za-z0-9\s]*(?:(?:-(?:(?:[A-Za-z\s][A-Za-z0-9\s]*)|(?:[0-9\s]*)))*))-(?:(?<SizeLabel>(?:\d+x)?(?:\d+\.)?\d+[A-Za-z](?:-[A-Za-z]+(\d+\.)?\d+[A-Za-z]+)?)(?:-(?<FineTune>[A-Za-z0-9\s-]+))?)?-(?:(?<Version>v\d+(?:\.\d+)*))(?:-(?<Encoding>(?!LoRA|vocab)[\w_]+))?(?:-(?<Type>LoRA|vocab))?(?:-(?<Shard>\d{5}-of-\d{5}))?\.gguf$`

For example:

* `Mixtral-8x7B-v0.1-KQ2.gguf`:
  - Model Name: Mixtral
  - Expert Count: 8
  - Parameter Count: 7B
  - Version Number: v0.1
  - Weight Encoding Scheme: KQ2

* `Hermes-2-Pro-Llama-3-8B-F16.gguf`:
  - Model Name: Hermes 2 Pro Llama 3
  - Expert Count: 0
  - Parameter Count: 8B
  - Version Number: v1.0
  - Weight Encoding Scheme: F16
  - Shard: N/A

* `Grok-100B-v1.0-Q4_0-00003-of-00009.gguf`:
  - Model Name: Grok
  - Expert Count: 0
  - Parameter Count: 100B
  - Version Number: v1.0
  - Weight Encoding Scheme: Q4_0
  - Shard: 3 out of 9 total shards

<details><summary>Example Node.js Regex Function</summary>

```js
#!/usr/bin/env node
const ggufRegex = /^(?<BaseName>[A-Za-z0-9\s]*(?:(?:-(?:(?:[A-Za-z\s][A-Za-z0-9\s]*)|(?:[0-9\s]*)))*))-(?:(?<SizeLabel>(?:\d+x)?(?:\d+\.)?\d+[A-Za-z](?:-[A-Za-z]+(\d+\.)?\d+[A-Za-z]+)?)(?:-(?<FineTune>[A-Za-z0-9\s-]+))?)?-(?:(?<Version>v\d+(?:\.\d+)*))(?:-(?<Encoding>(?!LoRA|vocab)[\w_]+))?(?:-(?<Type>LoRA|vocab))?(?:-(?<Shard>\d{5}-of-\d{5}))?\.gguf$/;

function parseGGUFFilename(filename) {
  const match = ggufRegex.exec(filename);
  if (!match)
    return null;
  const {BaseName = null, SizeLabel = null, FineTune = null, Version = "v1.0", Encoding = null, Type = null, Shard = null} = match.groups;
  return {BaseName: BaseName, SizeLabel: SizeLabel, FineTune: FineTune, Version: Version, Encoding: Encoding, Type: Type, Shard: Shard};
}

const testCases = [
  {filename: 'Mixtral-8x7B-v0.1-KQ2.gguf', expected: { BaseName: 'Mixtral', SizeLabel: '8x7B', FineTune: null, Version: 'v0.1', Encoding: 'KQ2', Type: null, Shard: null}},
  {filename: 'Grok-100B-v1.0-Q4_0-00003-of-00009.gguf', expected: { BaseName: 'Grok', SizeLabel: '100B', FineTune: null, Version: 'v1.0', Encoding: 'Q4_0', Type: null, Shard: "00003-of-00009"}},
  {filename: 'Hermes-2-Pro-Llama-3-8B-v1.0-F16.gguf', expected: { BaseName: 'Hermes-2-Pro-Llama-3', SizeLabel: '8B', FineTune: null, Version: 'v1.0', Encoding: 'F16', Type: null, Shard: null}},
  {filename: 'Phi-3-mini-3.8B-ContextLength4k-instruct-v1.0.gguf', expected: { BaseName: 'Phi-3-mini', SizeLabel: '3.8B-ContextLength4k', FineTune: 'instruct', Version: 'v1.0', Encoding: null, Type: null, Shard: null}},
  {filename: 'not-a-known-arrangement.gguf', expected: null},
];

testCases.forEach(({ filename, expected }) => {
  const result = parseGGUFFilename(filename);
  const passed = JSON.stringify(result) === JSON.stringify(expected);
  console.log(`${filename}: ${passed ? "PASS" : "FAIL"}`);
  if (!passed) {
    console.log(result);
    console.log(expected);
  }
});
```

</details>

### File Structure

![image](https://github.com/ggerganov/ggml/assets/1991296/c3623641-3a1d-408e-bfaf-1b7c4e16aa63)
*diagram by [@mishig25](https://github.com/mishig25) (GGUF v3)*

GGUF files are structured as follows. They use a global alignment specified in the `general.alignment` metadata field, referred to as `ALIGNMENT` below. Where required, the file is padded with `0x00` bytes to the next multiple of `general.alignment`.

Fields, including arrays, are written sequentially without alignment unless otherwise specified.

Models are little-endian by default. They can also come in big-endian for use with big-endian computers; in this case, all values (including metadata values and tensors) will also be big-endian. At the time of writing, there is no way to determine if a model is big-endian; this may be rectified in future versions. If no additional information is provided, assume the model is little-endian.

```c
enum ggml_type: uint32_t {
    GGML_TYPE_F32     = 0,
    GGML_TYPE_F16     = 1,
    GGML_TYPE_Q4_0    = 2,
    GGML_TYPE_Q4_1    = 3,
    // GGML_TYPE_Q4_2 = 4, support has been removed
    // GGML_TYPE_Q4_3 = 5, support has been removed
    GGML_TYPE_Q5_0    = 6,
    GGML_TYPE_Q5_1    = 7,
    GGML_TYPE_Q8_0    = 8,
    GGML_TYPE_Q8_1    = 9,
    GGML_TYPE_Q2_K    = 10,
    GGML_TYPE_Q3_K    = 11,
    GGML_TYPE_Q4_K    = 12,
    GGML_TYPE_Q5_K    = 13,
    GGML_TYPE_Q6_K    = 14,
    GGML_TYPE_Q8_K    = 15,
    GGML_TYPE_IQ2_XXS = 16,
    GGML_TYPE_IQ2_XS  = 17,
    GGML_TYPE_IQ3_XXS = 18,
    GGML_TYPE_IQ1_S   = 19,
    GGML_TYPE_IQ4_NL  = 20,
    GGML_TYPE_IQ3_S   = 21,
    GGML_TYPE_IQ2_S   = 22,
    GGML_TYPE_IQ4_XS  = 23,
    GGML_TYPE_I8      = 24,
    GGML_TYPE_I16     = 25,
    GGML_TYPE_I32     = 26,
    GGML_TYPE_I64     = 27,
    GGML_TYPE_F64     = 28,
    GGML_TYPE_IQ1_M   = 29,
    GGML_TYPE_COUNT,
};

enum gguf_metadata_value_type: uint32_t {
    // The value is an 8-bit unsigned integer.
    GGUF_METADATA_VALUE_TYPE_UINT8 = 0,
    // The value is an 8-bit signed integer.
    GGUF_METADATA_VALUE_TYPE_INT8 = 1,
    // The value is a 16-bit unsigned little-endian integer.
    GGUF_METADATA_VALUE_TYPE_UINT16 = 2,
    // The value is a 16-bit signed little-endian integer.
    GGUF_METADATA_VALUE_TYPE_INT16 = 3,
    // The value is a 32-bit unsigned little-endian integer.
    GGUF_METADATA_VALUE_TYPE_UINT32 = 4,
    // The value is a 32-bit signed little-endian integer.
    GGUF_METADATA_VALUE_TYPE_INT32 = 5,
    // The value is a 32-bit IEEE754 floating point number.
    GGUF_METADATA_VALUE_TYPE_FLOAT32 = 6,
    // The value is a boolean.
    // 1-byte value where 0 is false and 1 is true.
    // Anything else is invalid, and should be treated as either the model being invalid or the reader being buggy.
    GGUF_METADATA_VALUE_TYPE_BOOL = 7,
    // The value is a UTF-8 non-null-terminated string, with length prepended.
    GGUF_METADATA_VALUE_TYPE_STRING = 8,
    // The value is an array of other values, with the length and type prepended.
    //
    // Arrays can be nested, and the length of the array is the number of elements in the array, not the number of bytes.
    GGUF_METADATA_VALUE_TYPE_ARRAY = 9,
    // The value is a 64-bit unsigned little-endian integer.
    GGUF_METADATA_VALUE_TYPE_UINT64 = 10,
    // The value is a 64-bit signed little-endian integer.
    GGUF_METADATA_VALUE_TYPE_INT64 = 11,
    // The value is a 64-bit IEEE754 floating point number.
    GGUF_METADATA_VALUE_TYPE_FLOAT64 = 12,
};

// A string in GGUF.
struct gguf_string_t {
    // The length of the string, in bytes.
    uint64_t len;
    // The string as a UTF-8 non-null-terminated string.
    char string[len];
};

union gguf_metadata_value_t {
    uint8_t uint8;
    int8_t int8;
    uint16_t uint16;
    int16_t int16;
    uint32_t uint32;
    int32_t int32;
    float float32;
    uint64_t uint64;
    int64_t int64;
    double float64;
    bool bool_;
    gguf_string_t string;
    struct {
        // Any value type is valid, including arrays.
        gguf_metadata_value_type type;
        // Number of elements, not bytes.
        uint64_t len;
        // The array of values.
        gguf_metadata_value_t array[len];
    } array;
};

struct gguf_metadata_kv_t {
    // The key of the metadata. It is a standard GGUF string, with the following caveats:
    // - It must be a valid ASCII string.
    // - It must be a hierarchical key, where each segment is `lower_snake_case` and separated by a `.`.
    // - It must be at most 2^16-1 (65535) bytes long.
    // Any keys that do not follow these rules are invalid.
    gguf_string_t key;

    // The type of the value.
    // Must be one of the `gguf_metadata_value_type` values.
    gguf_metadata_value_type value_type;
    // The value.
    gguf_metadata_value_t value;
};

struct gguf_header_t {
    // Magic number to announce that this is a GGUF file.
    // Must be `GGUF` at the byte level: `0x47` `0x47` `0x55` `0x46`.
    // Your executor might do little-endian byte order, so it might be
    // checking for 0x46554747 and letting the endianness cancel out.
    // Consider being *very* explicit about the byte order here.
    uint32_t magic;
    // The version of the format implemented.
    // Must be `3` for the version described in this spec, which introduces big-endian support.
    //
    // This version should only be increased for structural changes to the format.
    // Changes that do not affect the structure of the file should instead update the metadata
    // to signify the change.
    uint32_t version;
    // The number of tensors in the file.
    // This is explicit, instead of being included in the metadata, to ensure it is always present
    // for loading the tensors.
    uint64_t tensor_count;
    // The number of metadata key-value pairs.
    uint64_t metadata_kv_count;
    // The metadata key-value pairs.
    gguf_metadata_kv_t metadata_kv[metadata_kv_count];
};

uint64_t align_offset(uint64_t offset) {
    return offset + (ALIGNMENT - (offset % ALIGNMENT)) % ALIGNMENT;
}

struct gguf_tensor_info_t {
    // The name of the tensor. It is a standard GGUF string, with the caveat that
    // it must be at most 64 bytes long.
    gguf_string_t name;
    // The number of dimensions in the tensor.
    // Currently at most 4, but this may change in the future.
    uint32_t n_dimensions;
    // The dimensions of the tensor.
    uint64_t dimensions[n_dimensions];
    // The type of the tensor.
    ggml_type type;
    // The offset of the tensor's data in this file in bytes.
    //
    // This offset is relative to `tensor_data`, not to the start
    // of the file, to make it easier for writers to write the file.
    // Readers should consider exposing this offset relative to the
    // file to make it easier to read the data.
    //
    // Must be a multiple of `ALIGNMENT`. That is, `align_offset(offset) == offset`.
    uint64_t offset;
};

struct gguf_file_t {
    // The header of the file.
    gguf_header_t header;

    // Tensor infos, which can be used to locate the tensor data.
    gguf_tensor_info_t tensor_infos[header.tensor_count];

    // Padding to the nearest multiple of `ALIGNMENT`.
    //
    // That is, if `sizeof(header) + sizeof(tensor_infos)` is not a multiple of `ALIGNMENT`,
    // this padding is added to make it so.
    //
    // This can be calculated as `align_offset(position) - position`, where `position` is
    // the position of the end of `tensor_infos` (i.e. `sizeof(header) + sizeof(tensor_infos)`).
    uint8_t _padding[];

    // Tensor data.
    //
    // This is arbitrary binary data corresponding to the weights of the model. This data should be close
    // or identical to the data in the original model file, but may be different due to quantization or
    // other optimizations for inference. Any such deviations should be recorded in the metadata or as
    // part of the architecture definition.
    //
    // Each tensor's data must be stored within this array, and located through its `tensor_infos` entry.
    // The offset of each tensor's data must be a multiple of `ALIGNMENT`, and the space between tensors
    // should be padded to `ALIGNMENT` bytes.
    uint8_t tensor_data[];
};
```

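To make the layout above concrete, here is a minimal sketch of reading the fixed-size portion of the header in Python, using only the standard library. It assumes a little-endian file and does not parse the metadata values or tensor infos; the function name is illustrative, not part of any existing library.

```python
import struct

def read_gguf_header(path: str):
    """Read magic, version, tensor_count and metadata_kv_count from a GGUF file.

    Minimal sketch for a little-endian file; metadata values and tensor infos
    are not parsed here.
    """
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic = {magic!r})")
        # uint32 version, then uint64 tensor_count and uint64 metadata_kv_count,
        # all little-endian, as in gguf_header_t above.
        version, tensor_count, metadata_kv_count = struct.unpack("<IQQ", f.read(20))
        if version != 3:
            raise ValueError(f"unsupported GGUF version: {version}")
        return version, tensor_count, metadata_kv_count

# print(read_gguf_header("model.gguf"))
```
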
## Standardized key-value pairs

The following key-value pairs are standardized. This list may grow in the future as more use cases are discovered. Where possible, names are shared with the original model definitions to make it easier to map between the two.

Not all of these are required, but they are all recommended. Keys that are required are bolded. For omitted pairs, the reader should assume that the value is unknown and either default or error as appropriate.

The community can develop their own key-value pairs to carry additional data. However, these should be namespaced with the relevant community name to avoid collisions. For example, the `rustformers` community might use `rustformers.` as a prefix for all of their keys.

If a particular community key is widely used, it may be promoted to a standardized key.

By convention, most counts/lengths/etc are `uint64` unless otherwise specified. This is to allow for larger models to be supported in the future. Some models may use `uint32` for their values; it is recommended that readers support both.

### General

#### Required

- **`general.architecture: string`**: describes what architecture this model implements. All lowercase ASCII, with only `[a-z0-9]+` characters allowed. Known values include:
  - `llama`
  - `mpt`
  - `gptneox`
  - `gptj`
  - `gpt2`
  - `bloom`
  - `falcon`
  - `mamba`
  - `rwkv`
- **`general.quantization_version: uint32`**: The version of the quantization format. Not required if the model is not quantized (i.e. no tensors are quantized). If any tensors are quantized, this _must_ be present. This is separate from the quantization scheme of the tensors themselves; the quantization version may change without changing the scheme's name (e.g. the quantization scheme is Q5_K, and the quantization version is 4).
- **`general.alignment: uint32`**: the global alignment to use, as described above. This can vary to allow for different alignment schemes, but it must be a multiple of 8. Some writers may not write the alignment; if the alignment is **not** specified, assume it is `32`.

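As an illustration of how these required keys are serialized, the sketch below encodes a handful of `general.*` pairs using the byte layout from the File Structure section (a `gguf_string_t` key, a `uint32` value type, then the value). It is a hand-rolled, little-endian sketch for illustration only; in practice a writer library (such as the `gguf` Python package shipped with llama.cpp) would normally be used.

```python
import struct

GGUF_TYPE_UINT32 = 4
GGUF_TYPE_STRING = 8

def gguf_string(s: str) -> bytes:
    """Encode a gguf_string_t: uint64 length followed by UTF-8 bytes (no terminator)."""
    data = s.encode("utf-8")
    return struct.pack("<Q", len(data)) + data

def kv_string(key: str, value: str) -> bytes:
    """Encode a string-valued metadata key-value pair."""
    return gguf_string(key) + struct.pack("<I", GGUF_TYPE_STRING) + gguf_string(value)

def kv_uint32(key: str, value: int) -> bytes:
    """Encode a uint32-valued metadata key-value pair."""
    return gguf_string(key) + struct.pack("<II", GGUF_TYPE_UINT32, value)

# The required keys for a hypothetical quantized llama-architecture model might be encoded as:
metadata = (
    kv_string("general.architecture", "llama")
    + kv_uint32("general.quantization_version", 2)
    + kv_uint32("general.alignment", 32)
)
```
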
355
+ #### General metadata
356
+
357
+ - `general.name: string`: The name of the model. This should be a human-readable name that can be used to identify the model. It should be unique within the community that the model is defined in.
358
+ - `general.author: string`: The author of the model.
359
+ - `general.version: string`: The version of the model.
360
+ - `general.organization: string`: The organization of the model.
361
+ - `general.basename: string`: The base model name / architecture of the model
362
+ - `general.finetune: string`: What has the base model been optimized toward.
363
+ - `general.description: string`: free-form description of the model including anything that isn't covered by the other fields
364
+ - `general.quantized_by: string`: The name of the individual who quantized the model
365
+ - `general.size_label: string`: Size class of the model, such as number of weights and experts. (Useful for leader boards)
366
+ - `general.license: string`: License of the model, expressed as a [SPDX license expression](https://spdx.github.io/spdx-spec/v2-draft/SPDX-license-expressions/) (e.g. `"MIT OR Apache-2.0`). Do not include any other information, such as the license text or the URL to the license.
367
+ - `general.license.name: string`: Human friendly license name
368
+ - `general.license.link: string`: URL to the license.
369
+ - `general.url: string`: URL to the model's homepage. This can be a GitHub repo, a paper, etc.
370
+ - `general.doi: string`: Digital Object Identifier (DOI) https://www.doi.org/
371
+ - `general.uuid: string`: [Universally unique identifier](https://en.wikipedia.org/wiki/Universally_unique_identifier)
372
+ - `general.repo_url: string`: URL to the model's repository such as a GitHub repo or HuggingFace repo
373
+ - `general.tags: string[]`: List of tags that can be used as search terms for a search engine or social media
374
+ - `general.languages: string[]`: What languages can the model speak. Encoded as [ISO 639](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes) two letter codes
375
+ - `general.datasets: string[]`: Links or references to datasets that the model was trained upon
376
+ - `general.file_type: uint32`: An enumerated value describing the type of the majority of the tensors in the file. Optional; can be inferred from the tensor types.
377
+ - `ALL_F32 = 0`
378
+ - `MOSTLY_F16 = 1`
379
+ - `MOSTLY_Q4_0 = 2`
380
+ - `MOSTLY_Q4_1 = 3`
381
+ - `MOSTLY_Q4_1_SOME_F16 = 4`
382
+ - `MOSTLY_Q4_2 = 5` (support removed)
383
+ - `MOSTLY_Q4_3 = 6` (support removed)
384
+ - `MOSTLY_Q8_0 = 7`
385
+ - `MOSTLY_Q5_0 = 8`
386
+ - `MOSTLY_Q5_1 = 9`
387
+ - `MOSTLY_Q2_K = 10`
388
+ - `MOSTLY_Q3_K_S = 11`
389
+ - `MOSTLY_Q3_K_M = 12`
390
+ - `MOSTLY_Q3_K_L = 13`
391
+ - `MOSTLY_Q4_K_S = 14`
392
+ - `MOSTLY_Q4_K_M = 15`
393
+ - `MOSTLY_Q5_K_S = 16`
394
+ - `MOSTLY_Q5_K_M = 17`
395
+ - `MOSTLY_Q6_K = 18`
396
+
#### Source metadata

Information about where this model came from. This is useful for tracking the provenance of the model, and for finding the original source if the model is modified. For a model that was converted from GGML, for example, these keys would point to the model that was converted from.

- `general.source.url: string`: URL to the source of the model's homepage. This can be a GitHub repo, a paper, etc.
- `general.source.doi: string`: Source Digital Object Identifier (DOI) (https://www.doi.org/).
- `general.source.uuid: string`: Source [Universally unique identifier](https://en.wikipedia.org/wiki/Universally_unique_identifier).
- `general.source.repo_url: string`: URL to the source of the model's repository, such as a GitHub or Hugging Face repo.

- `general.base_model.count: uint32`: Number of parent models.
- `general.base_model.{id}.name: string`: The name of the parent model.
- `general.base_model.{id}.author: string`: The author of the parent model.
- `general.base_model.{id}.version: string`: The version of the parent model.
- `general.base_model.{id}.organization: string`: The organization of the parent model.
- `general.base_model.{id}.url: string`: URL to the source of the parent model's homepage. This can be a GitHub repo, a paper, etc.
- `general.base_model.{id}.doi: string`: Parent Digital Object Identifier (DOI) (https://www.doi.org/).
- `general.base_model.{id}.uuid: string`: Parent [Universally unique identifier](https://en.wikipedia.org/wiki/Universally_unique_identifier).
- `general.base_model.{id}.repo_url: string`: URL to the source of the parent model's repository, such as a GitHub or Hugging Face repo.

### LLM

In the following, `[llm]` is used to fill in for the name of a specific LLM architecture. For example, `llama` for LLaMA, `mpt` for MPT, etc. If a key is mentioned in an architecture's section, it is required for that architecture, but not all keys are required for all architectures. Consult the relevant section for more information.

- `[llm].context_length: uint64`: Also known as `n_ctx`. The length of the context (in tokens) that the model was trained on. For most architectures, this is the hard limit on the length of the input. Architectures, like RWKV, that are not reliant on transformer-style attention may be able to handle larger inputs, but this is not guaranteed.
- `[llm].embedding_length: uint64`: Also known as `n_embd`. Embedding layer size.
- `[llm].block_count: uint64`: The number of blocks of attention+feed-forward layers (i.e. the bulk of the LLM). Does not include the input or embedding layers.
- `[llm].feed_forward_length: uint64`: Also known as `n_ff`. The length of the feed-forward layer.
- `[llm].use_parallel_residual: bool`: Whether or not the parallel residual logic should be used.
- `[llm].tensor_data_layout: string`: When a model is converted to GGUF, tensors may be rearranged to improve performance. This key describes the layout of the tensor data. This is not required; if not present, it is assumed to be `reference`.
  - `reference`: tensors are laid out in the same order as the original model
  - further options can be found for each architecture in their respective sections
- `[llm].expert_count: uint32`: Number of experts in MoE models (optional for non-MoE architectures).
- `[llm].expert_used_count: uint32`: Number of experts used during each token evaluation (optional for non-MoE architectures).

#### Attention

- `[llm].attention.head_count: uint64`: Also known as `n_head`. Number of attention heads.
- `[llm].attention.head_count_kv: uint64`: The number of heads per group used in Grouped-Query-Attention. If not present, or if present and equal to `[llm].attention.head_count`, the model does not use GQA.
- `[llm].attention.max_alibi_bias: float32`: The maximum bias to use for ALiBI.
- `[llm].attention.clamp_kqv: float32`: Value (`C`) to clamp the values of the `Q`, `K`, and `V` tensors between (`[-C, C]`).
- `[llm].attention.layer_norm_epsilon: float32`: Layer normalization epsilon.
- `[llm].attention.layer_norm_rms_epsilon: float32`: Layer RMS normalization epsilon.
- `[llm].attention.key_length: uint32`: The optional size of a key head, $d_k$. If not specified, it will be `n_embd / n_head` (see the sketch after this list).
- `[llm].attention.value_length: uint32`: The optional size of a value head, $d_v$. If not specified, it will be `n_embd / n_head`.

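A minimal sketch of how an executor might resolve the per-head dimensions from these keys, applying the defaults described above. The key names follow this spec; the function itself is illustrative.

```python
def resolve_head_dims(metadata: dict) -> tuple[int, int]:
    """Return (key_length, value_length), falling back to n_embd / n_head per the spec."""
    n_embd = metadata["llama.embedding_length"]          # [llm].embedding_length
    n_head = metadata["llama.attention.head_count"]      # [llm].attention.head_count
    default = n_embd // n_head
    d_k = metadata.get("llama.attention.key_length", default)
    d_v = metadata.get("llama.attention.value_length", default)
    return d_k, d_v

# e.g. for a hypothetical model with n_embd=4096 and n_head=32, the default head size is 128.
```
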
#### RoPE

- `[llm].rope.dimension_count: uint64`: The number of rotary dimensions for RoPE.
- `[llm].rope.freq_base: float32`: The base frequency for RoPE.

##### Scaling

The following keys describe RoPE scaling parameters:

- `[llm].rope.scaling.type: string`: Can be `none`, `linear`, or `yarn`.
- `[llm].rope.scaling.factor: float32`: A scale factor for RoPE to adjust the context length.
- `[llm].rope.scaling.original_context_length: uint32_t`: The original context length of the base model.
- `[llm].rope.scaling.finetuned: bool`: True if the model has been finetuned with RoPE scaling.

Note that older models may not have these keys, and may instead use the following key:

- `[llm].rope.scale_linear: float32`: A linear scale factor for RoPE to adjust the context length.

It is recommended that models use the newer keys if possible, as they are more flexible and allow for more complex scaling schemes. Executors will need to support both indefinitely.

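Since executors need to support both generations of keys, a reader can normalize the older `scale_linear` key into the newer form before use. The sketch below shows one way to do that; it is illustrative, and assumes the `llama` architecture prefix.

```python
def normalize_rope_scaling(metadata: dict) -> dict:
    """Map the legacy rope.scale_linear key onto the newer rope.scaling.* keys (illustrative)."""
    prefix = "llama.rope"
    if f"{prefix}.scaling.type" in metadata:
        # Newer keys already present; use them as-is.
        return metadata
    legacy = metadata.get(f"{prefix}.scale_linear")
    if legacy is not None:
        metadata[f"{prefix}.scaling.type"] = "linear"
        metadata[f"{prefix}.scaling.factor"] = float(legacy)
    return metadata
```
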
#### SSM

- `[llm].ssm.conv_kernel: uint32`: The size of the rolling/shift state.
- `[llm].ssm.inner_size: uint32`: The embedding size of the states.
- `[llm].ssm.state_size: uint32`: The size of the recurrent state.
- `[llm].ssm.time_step_rank: uint32`: The rank of time steps.

#### Models

The following sections describe the metadata for each model architecture. Each key specified _must_ be present.

##### LLaMA

- `llama.context_length`
- `llama.embedding_length`
- `llama.block_count`
- `llama.feed_forward_length`
- `llama.rope.dimension_count`
- `llama.attention.head_count`
- `llama.attention.layer_norm_rms_epsilon`

###### Optional

- `llama.rope.scale`
- `llama.attention.head_count_kv`
- `llama.tensor_data_layout`:
  - `Meta AI original pth`:
    ```python
    def permute(weights: NDArray, n_head: int) -> NDArray:
        return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
                       .swapaxes(1, 2)
                       .reshape(weights.shape))
    ```
- `llama.expert_count`
- `llama.expert_used_count`

##### MPT

- `mpt.context_length`
- `mpt.embedding_length`
- `mpt.block_count`
- `mpt.attention.head_count`
- `mpt.attention.alibi_bias_max`
- `mpt.attention.clip_kqv`
- `mpt.attention.layer_norm_epsilon`

##### GPT-NeoX

- `gptneox.context_length`
- `gptneox.embedding_length`
- `gptneox.block_count`
- `gptneox.use_parallel_residual`
- `gptneox.rope.dimension_count`
- `gptneox.attention.head_count`
- `gptneox.attention.layer_norm_epsilon`

###### Optional

- `gptneox.rope.scale`

##### GPT-J

- `gptj.context_length`
- `gptj.embedding_length`
- `gptj.block_count`
- `gptj.rope.dimension_count`
- `gptj.attention.head_count`
- `gptj.attention.layer_norm_epsilon`

###### Optional

- `gptj.rope.scale`

##### GPT-2

- `gpt2.context_length`
- `gpt2.embedding_length`
- `gpt2.block_count`
- `gpt2.attention.head_count`
- `gpt2.attention.layer_norm_epsilon`

##### BLOOM

- `bloom.context_length`
- `bloom.embedding_length`
- `bloom.block_count`
- `bloom.feed_forward_length`
- `bloom.attention.head_count`
- `bloom.attention.layer_norm_epsilon`

##### Falcon

- `falcon.context_length`
- `falcon.embedding_length`
- `falcon.block_count`
- `falcon.attention.head_count`
- `falcon.attention.head_count_kv`
- `falcon.attention.use_norm`
- `falcon.attention.layer_norm_epsilon`

###### Optional

- `falcon.tensor_data_layout`:

  - `jploski` (author of the original GGML implementation of Falcon):

    ```python
    # The original query_key_value tensor contains n_head_kv "kv groups",
    # each consisting of n_head/n_head_kv query weights followed by one key
    # and one value weight (shared by all query heads in the kv group).
    # This layout makes it a big pain to work with in GGML.
    # So we rearrange them here, so that we have n_head query weights
    # followed by n_head_kv key weights followed by n_head_kv value weights,
    # in contiguous fashion.

    if "query_key_value" in src:
        qkv = model[src].view(
            n_head_kv, n_head // n_head_kv + 2, head_dim, head_dim * n_head)

        q = qkv[:, :-2].reshape(n_head * head_dim, head_dim * n_head)
        k = qkv[:, [-2]].reshape(n_head_kv * head_dim, head_dim * n_head)
        v = qkv[:, [-1]].reshape(n_head_kv * head_dim, head_dim * n_head)

        model[src] = torch.cat((q, k, v)).reshape_as(model[src])
    ```

##### Mamba

- `mamba.context_length`
- `mamba.embedding_length`
- `mamba.block_count`
- `mamba.ssm.conv_kernel`
- `mamba.ssm.inner_size`
- `mamba.ssm.state_size`
- `mamba.ssm.time_step_rank`
- `mamba.attention.layer_norm_rms_epsilon`

##### RWKV

The vocabulary size is the same as the number of rows in the `head` matrix.

- `rwkv.architecture_version: uint32`: The only allowed value currently is 4. Version 5 is expected to appear some time in the future.
- `rwkv.context_length: uint64`: Length of the context used during training or fine-tuning. RWKV is able to handle a larger context than this limit, but the output quality may suffer.
- `rwkv.block_count: uint64`
- `rwkv.embedding_length: uint64`
- `rwkv.feed_forward_length: uint64`

##### Whisper

Keys that do not have types defined should be assumed to share definitions with `llm.` keys.
(For example, `whisper.context_length` is equivalent to `llm.context_length`.)
This is because they are both transformer models.

- `whisper.encoder.context_length`
- `whisper.encoder.embedding_length`
- `whisper.encoder.block_count`
- `whisper.encoder.mels_count: uint64`
- `whisper.encoder.attention.head_count`

- `whisper.decoder.context_length`
- `whisper.decoder.embedding_length`
- `whisper.decoder.block_count`
- `whisper.decoder.attention.head_count`

#### Prompting

**TODO**: Include prompt format, and/or metadata about how it should be used (instruction, conversation, autocomplete, etc).

### LoRA

**TODO**: Figure out what metadata is needed for LoRA. Probably desired features:

- match an existing model exactly, so that it can't be misapplied
- be marked as a LoRA so executors won't try to run it by itself

Should this be an architecture, or should it share the details of the original model with additional fields to mark it as a LoRA?

### Tokenizer

The following keys are used to describe the tokenizer of the model. It is recommended that model authors support as many of these as possible, as it will allow for better tokenization quality with supported executors.

#### GGML

GGML supports an embedded vocabulary that enables inference of the model, but implementations of tokenization using this vocabulary (i.e. `llama.cpp`'s tokenizer) may have lower accuracy than the original tokenizer used for the model. When a more accurate tokenizer is available and supported, it should be used instead.

It is not guaranteed to be standardized across models, and may change in the future. It is recommended that model authors use a more standardized tokenizer if possible.

- `tokenizer.ggml.model: string`: The name of the tokenizer model.
  - `llama`: Llama style SentencePiece (tokens and scores extracted from HF `tokenizer.model`)
  - `replit`: Replit style SentencePiece (tokens and scores extracted from HF `spiece.model`)
  - `gpt2`: GPT-2 / GPT-NeoX style BPE (tokens extracted from HF `tokenizer.json`)
  - `rwkv`: RWKV tokenizer
- `tokenizer.ggml.tokens: array[string]`: A list of tokens indexed by the token ID used by the model.
- `tokenizer.ggml.scores: array[float32]`: If present, the score/probability of each token. If not present, all tokens are assumed to have equal probability. If present, it must have the same length and index as `tokens`.
- `tokenizer.ggml.token_type: array[int32]`: The token type (1=normal, 2=unknown, 3=control, 4=user defined, 5=unused, 6=byte). If present, it must have the same length and index as `tokens`.
- `tokenizer.ggml.merges: array[string]`: If present, the merges of the tokenizer. If not present, the tokens are assumed to be atomic.
- `tokenizer.ggml.added_tokens: array[string]`: If present, tokens that were added after training.

##### Special tokens

- `tokenizer.ggml.bos_token_id: uint32`: Beginning of sequence marker
- `tokenizer.ggml.eos_token_id: uint32`: End of sequence marker
- `tokenizer.ggml.unknown_token_id: uint32`: Unknown token
- `tokenizer.ggml.separator_token_id: uint32`: Separator token
- `tokenizer.ggml.padding_token_id: uint32`: Padding token

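As a rough illustration of how these arrays fit together, the sketch below converts a sequence of generated token IDs back into text for a `llama`-style SentencePiece vocabulary, stopping at the EOS token. It is deliberately naive (it only handles the `▁` space marker and ignores byte tokens), and the function name is illustrative.

```python
def detokenize(metadata: dict, token_ids: list[int]) -> str:
    """Naive decode of token IDs using tokenizer.ggml.* metadata (llama-style vocab)."""
    tokens = metadata["tokenizer.ggml.tokens"]
    eos_id = metadata.get("tokenizer.ggml.eos_token_id")
    pieces = []
    for tid in token_ids:
        if tid == eos_id:
            break  # stop at the end-of-sequence marker
        pieces.append(tokens[tid])
    # SentencePiece-style vocabs mark word boundaries with "▁".
    return "".join(pieces).replace("▁", " ").lstrip()
```
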
#### Hugging Face

Hugging Face maintains its own `tokenizers` library that supports a wide variety of tokenizers. If your executor uses this library, it may be able to use the model's tokenizer directly.

- `tokenizer.huggingface.json: string`: the entirety of the HF `tokenizer.json` for a given model (e.g. <https://huggingface.co/mosaicml/mpt-7b-instruct/blob/main/tokenizer.json>). Included for compatibility with executors that support HF tokenizers directly.

#### Other

Other tokenizers may be used, but are not necessarily standardized. They may be executor-specific. They will be documented here as they are discovered/further developed.

- `tokenizer.rwkv.world: string`: an RWKV World tokenizer, like [this](https://github.com/BlinkDL/ChatRWKV/blob/main/tokenizer/rwkv_vocab_v20230424.txt). This text file should be included verbatim.
- `tokenizer.chat_template: string`: a Jinja template that specifies the input format expected by the model. For more details see: <https://huggingface.co/docs/transformers/main/en/chat_templating>

### Computation graph

This is a future extension and still needs to be discussed, and may necessitate a new GGUF version. At the time of writing, the primary blocker is the stabilization of the computation graph format.

A sample computation graph of GGML nodes could be included in the model itself, allowing an executor to run the model without providing its own implementation of the architecture. This would allow for a more consistent experience across executors, and would allow for more complex architectures to be supported without requiring the executor to implement them.

## Standardized tensor names

To minimize complexity and maximize compatibility, it is recommended that models using the transformer architecture use the following naming convention for their tensors:

### Base layers

`AA.weight` `AA.bias`

where `AA` can be:

- `token_embd`: Token embedding layer
- `pos_embd`: Position embedding layer
- `output_norm`: Output normalization layer
- `output`: Output layer

### Attention and feed-forward layer blocks

`blk.N.BB.weight` `blk.N.BB.bias`

where N signifies the block number a layer belongs to, and where `BB` could be:

- `attn_norm`: Attention normalization layer
- `attn_norm_2`: Attention normalization layer
- `attn_qkv`: Attention query-key-value layer
- `attn_q`: Attention query layer
- `attn_k`: Attention key layer
- `attn_v`: Attention value layer
- `attn_output`: Attention output layer

- `ffn_norm`: Feed-forward network normalization layer
- `ffn_up`: Feed-forward network "up" layer
- `ffn_gate`: Feed-forward network "gate" layer
- `ffn_down`: Feed-forward network "down" layer
- `ffn_gate_inp`: Expert-routing layer for the feed-forward network in MoE models
- `ffn_gate_exp`: Feed-forward network "gate" layer per expert in MoE models
- `ffn_down_exp`: Feed-forward network "down" layer per expert in MoE models
- `ffn_up_exp`: Feed-forward network "up" layer per expert in MoE models

- `ssm_in`: State space model input projections layer
- `ssm_conv1d`: State space model rolling/shift layer
- `ssm_x`: State space model selective parametrization layer
- `ssm_a`: State space model state compression layer
- `ssm_d`: State space model skip connection layer
- `ssm_dt`: State space model time step layer
- `ssm_out`: State space model output projection layer

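To show how these conventions compose in practice, here is a small sketch that generates the expected tensor names for a llama-style model given its block count. The exact set of per-block tensors varies by architecture, so the selection here is illustrative rather than normative.

```python
def llama_style_tensor_names(block_count: int) -> list[str]:
    """Generate conventional tensor names for a llama-style transformer (illustrative)."""
    names = ["token_embd.weight", "output_norm.weight", "output.weight"]
    per_block = ["attn_norm", "attn_q", "attn_k", "attn_v", "attn_output",
                 "ffn_norm", "ffn_gate", "ffn_up", "ffn_down"]
    for n in range(block_count):
        # Block-level tensors are named blk.<N>.<BB>.weight
        names.extend(f"blk.{n}.{bb}.weight" for bb in per_block)
    return names

# e.g. llama_style_tensor_names(2)[:4]
#   -> ['token_embd.weight', 'output_norm.weight', 'output.weight', 'blk.0.attn_norm.weight']
```
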
## Version History

This document is actively updated to describe the current state of the metadata, and these changes are not tracked outside of the commits.

However, the format _itself_ has changed. The following sections describe the changes to the format itself.

### v3

Adds big-endian support.

### v2

Most countable values (lengths, etc.) were changed from `uint32` to `uint64` to allow for larger models to be supported in the future.

### v1

Initial version.

## Historical State of Affairs

The following information is provided for context, but is not necessary to understand the rest of this document.

### Overview

At present, there are three GGML file formats floating around for LLMs:

- **GGML** (unversioned): baseline format, with no versioning or alignment.
- **GGMF** (versioned): the same as GGML, but with versioning. Only one version exists.
- **GGJT**: aligns the tensors to allow for use with `mmap`, which requires alignment. v1, v2 and v3 are structurally identical, but the later versions use a different quantization scheme that is incompatible with previous versions.

GGML is primarily used by the examples in `ggml`, while GGJT is used by `llama.cpp` models. Other executors may use any of the three formats, but this is not 'officially' supported.

These formats share the same fundamental structure:

- a magic number with an optional version number
- model-specific hyperparameters, including
  - metadata about the model, such as the number of layers, the number of heads, etc.
  - an `ftype` that describes the type of the majority of the tensors
    - for GGML files, the quantization version is encoded as the `ftype` divided by 1000
- an embedded vocabulary, which is a list of strings with length prepended. The GGMF/GGJT formats embed a float32 score next to the strings.
- finally, a list of tensors with their length-prepended name, type, and (aligned, in the case of GGJT) tensor data

Notably, this structure does not identify what model architecture the model belongs to, nor does it offer any flexibility for changing the structure of the hyperparameters. This means that the only way to add new hyperparameters is to add them to the end of the list, which is a breaking change for existing models.

### Drawbacks

Unfortunately, over the last few months, a few issues have become apparent with the existing models:

- There's no way to identify which model architecture a given model is for, because that information isn't present
- Similarly, existing programs cannot intelligently fail upon encountering new architectures
- Adding or removing any new hyperparameters is a breaking change, which is impossible for a reader to detect without using heuristics
- Each model architecture requires its own conversion script to its architecture's variant of GGML
- Maintaining backwards compatibility without breaking the structure of the format requires clever tricks, like packing the quantization version into the ftype, which are not guaranteed to be picked up by readers/writers, and are not consistent between the two formats

### Why not other formats?

There are a few other formats that could be used, but issues include:

- requiring additional dependencies to load or save the model, which is complicated in a C environment
- limited or no support for 4-bit quantization
- existing cultural expectations (e.g. whether or not the model is a directory or a file)
- lack of support for embedded vocabularies
- lack of control over the direction of future development

Ultimately, it is likely that GGUF will remain necessary for the foreseeable future, and it is better to have a single format that is well-documented and supported by all executors than to contort an existing format to fit the needs of GGML.