mradermacher committed on
Commit
798ca52
1 Parent(s): b3c25df

Update README.md

Files changed (1)
  1. README.md +93 -3
README.md CHANGED
@@ -11,17 +11,107 @@ language:
 
 # Mini-FAQ
 
+ ## I miss model XXX
+
+ I am not the only one making quants. For example, Lewdiculous makes high-quality imatrix quants of many
+ small models *and has a great presentation*. For small models (< 30B) I either don't bother with imatrix
+ quants, or I skip them because I saw that others have already done them, to avoid duplicating work.
+
+ Other notable people who make quants are Nexesenex, bartowski, dranger003 and Artefact2. I'm not saying
+ anything about their quality, because I have probably forgotten some really good folks in this list, and I
+ wouldn't know anyway. Model creators also often provide their own quants, and I sometimes skip models because
+ of that, even if the creator provides far fewer quants than I would.
+
+ As always, feel free to request a quant, even if somebody else has already done one, or to request an imatrix
+ version for models where I didn't provide one.
+
+ ## I miss quant type XXX
+
+ The quant types I currently do regularly are:
+
+ - static: Q8_0 IQ3_S Q4_K_S IQ3_M Q2_K Q6_K Q3_K_M Q3_K_S Q3_K_L Q4_K_M Q5_K_S Q5_K_M IQ3_XS IQ4_XS
+ - imatrix: Q2_K Q4_K_S IQ3_XXS Q3_K_M Q4_K_M IQ2_M Q6_K IQ4_XS Q3_K_S Q3_K_L Q5_K_S Q5_K_M Q4_0 IQ3_XS IQ3_S IQ3_M IQ2_XXS IQ2_XS IQ2_S IQ1_M IQ1_S
+
+ And they are generally (but not always) generated in the order above, for which there are deep reasons.
+
+ For models roughly below 10B in size, I currently generate f16 versions experimentally. Or rather, I plan to - it's still a bit hacky.
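
As an illustration only (not the actual pipeline behind these repos), producing the static list above with a local llama.cpp build could look roughly like the following Python sketch; the file names and the `./llama-quantize` path are assumptions:

```python
import subprocess

# Hypothetical input: a full-precision (f16) GGUF conversion of the model.
SOURCE_GGUF = "Model-7B.f16.gguf"

# The static quant types listed above.
STATIC_QUANTS = [
    "Q8_0", "IQ3_S", "Q4_K_S", "IQ3_M", "Q2_K", "Q6_K", "Q3_K_M",
    "Q3_K_S", "Q3_K_L", "Q4_K_M", "Q5_K_S", "Q5_K_M", "IQ3_XS", "IQ4_XS",
]

for qtype in STATIC_QUANTS:
    out = f"Model-7B.{qtype}.gguf"
    # llama.cpp's quantize tool takes: <input gguf> <output gguf> <type>
    subprocess.run(["./llama-quantize", SOURCE_GGUF, out, qtype], check=True)
```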
+
+ Older models that pre-date the introduction of new quant types will generally have them retrofitted, hopefully
+ this year - at least when multiple quant types are missing, as it is hard to justify a big model download
+ for just one quant. If you want a quant from the above list and don't want to wait, feel free to request it and
+ I will prioritize it to the best of my abilities.
+
+ I specifically do not do Q2_K_S, because I generally think it is not worth it, and IQ4_NL, because it requires
+ a lot of computation and is generally completely superseded by IQ4_XS.
+
+ You can always try to change my mind.
+
 ## What does the "-i1" mean in "-i1-GGUF"?
 
- tbd
+ "mradermacher imatrix type 1"
+
+ Originally, I had the idea of using an iterative method of imatrix generation, and wanted to see how well it
+ fares: create an imatrix from a bad quant (e.g. static Q2_K), then use the resulting model to generate a
+ possibly better imatrix. It never happened, but I think sticking to something, even if slightly wrong, is better
+ than changing it. If I make considerable changes to how I create imatrix data, I will probably bump the suffix to `-i2` and so on.
+
+ Since there is some subjectivity/choice in imatrix training data, the suffix also distinguishes my quants from
+ those by other people who made different choices.
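
For readers unfamiliar with the mechanics, one round of imatrix generation and application with llama.cpp's tools looks roughly like the sketch below; the file names are hypothetical and this is not necessarily the exact pipeline used for these quants:

```python
import subprocess

# Hypothetical file names; assumes a local llama.cpp build providing the
# llama-imatrix and llama-quantize tools.
MODEL_GGUF  = "Model-70B.Q2_K.gguf"      # model used to run the calibration text
TRAIN_TXT   = "imatrix-training.txt"     # calibration/training text
IMATRIX_OUT = "Model-70B.imatrix"

# Step 1: run the calibration text through the model and record per-tensor
# activation statistics (the "importance matrix").
subprocess.run(
    ["./llama-imatrix", "-m", MODEL_GGUF, "-f", TRAIN_TXT, "-o", IMATRIX_OUT],
    check=True,
)

# Step 2: feed that imatrix into quantization to produce a weighted ("i1") quant.
subprocess.run(
    ["./llama-quantize", "--imatrix", IMATRIX_OUT,
     "Model-70B.f16.gguf", "Model-70B.i1-IQ2_M.gguf", "IQ2_M"],
    check=True,
)
```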
+
+ ## What is the imatrix training data you use, can I have a copy?
+
+ My training data consists of about 160k tokens, about half of which are semi-random tokens (sentence fragments)
+ taken from stories; the other half is kalomaze's groups_merged.txt and a few other things. I have half- and
+ quarter-size sets for models that are too big or too stubborn.
+
+ Neither my set nor kalomaze's data contains large amounts of non-English training data, which is why I tend
+ not to generate imatrix quants for models primarily meant for non-English usage. This is a trade-off, emphasizing
+ English over other languages, but from (sparse) testing data it looks as if this doesn't actually make a big
+ difference. More data are always welcome.
+
+ Unfortunately, I do not have the rights to publish the training data, but I might be able to replicate an
+ equivalent set in the future and publish that set.
 
 ## Why are you doing this?
 
- tbd
+ Because at some point, I found that some new interesting models weren't available as GGUF anymore - my go-to
+ source, TheBloke, had vanished. So I quantized a few models for myself. At the time, it was trivial - no imatrix,
+ only a few quant types, all of them very fast to generate.
+
+ I then looked into Hugging Face more closely than just as a download source, and decided uploading would be a
+ good thing, so others don't have to redo the work on their own. I'm used to sharing most of the things I make
+ (mostly in free software), so it felt natural to contribute, even at a minor scale.
+
+ Then the number of quant types and their computational complexity exploded, and imatrix calculations became a
+ thing. This increased the time required to make such quants by an order of magnitude, and also the management overhead.
+
+ Since I was slowly improving my tooling, I grew into it at the same pace as these innovations came out. I probably
+ would not have started doing this a month later, as I would have been daunted by the complexity and work required.
 
 ## You have amazing hardware!?!?!
 
- tbd
+ I regularly see people write that, but I probably have worse hardware than they do for creating my quants. I currently
+ have access to eight servers with good upload speed. Five of them are Xeon quad-core class machines from ~2013, three are
+ Ryzen 5 hexacores. The faster the server, the less disk space it has, so I can't just put the big
+ models on the fast(er) servers.
+
+ Imatrix generation is done on my home/work/gaming computer, which has been upgraded to 96GB of DDR5 RAM and
+ originally had an RTX 4070 (since upgraded to a 4090, thanks to a generous investment by the company I work for).
+
+ I have good download speeds but bad upload speeds at home, so it's lucky that model downloads are big and imatrix
+ uploads are small.
+
+ ## How do you create imatrix files for really big models?
+
+ Through a combination of these ingenious tricks:
+
+ 1. I am not above using a low quant (e.g. Q4_K_S, IQ3_XXS or even Q2_K), reducing the size of the model.
+ 2. An NVMe drive is "only" 25-50 times slower than RAM. I lock the first 80GB of the model in RAM, and
+ then stream the remaining data from disk for every iteration (see the sketch after this section).
+ 3. Patience.
+
+ The few evaluations I have suggest that this gives good quality, and my current set-up allows me to
+ generate imatrix data for most models in fp16, 70B models in Q8_0, and almost everything else in Q4_K_S.
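
A rough, hypothetical sketch of trick 2 above (not the actual tooling behind these quants): mmap the GGUF file and mlock only its first chunk, so a separate imatrix run finds the head of the model resident in RAM and only streams the tail from NVMe. The file name and the 80GB figure are placeholders:

```python
import ctypes
import ctypes.util
import os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

# Prototypes for the two libc calls used below (64-bit Linux).
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.mlock.restype = ctypes.c_int
libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

PROT_READ = 0x1   # Linux constants
MAP_SHARED = 0x1

GGUF_PATH = "BigModel-Q4_K_S.gguf"   # placeholder file name
LOCK_BYTES = 80 * 1024 ** 3          # pin roughly the first 80GB

fd = os.open(GGUF_PATH, os.O_RDONLY)
size = os.fstat(fd).st_size
lock = min(LOCK_BYTES, size)

# Map the whole model read-only; pages are faulted in from disk on demand.
addr = libc.mmap(None, size, PROT_READ, MAP_SHARED, fd, 0)
if addr in (None, ctypes.c_void_p(-1).value):
    raise OSError(ctypes.get_errno(), "mmap failed")

# Fault in and pin only the head of the file (requires a high memlock rlimit).
if libc.mlock(ctypes.c_void_p(addr), lock) != 0:
    raise OSError(ctypes.get_errno(), "mlock failed")

print(f"locked {lock} of {size} bytes; keep this process running while "
      "the imatrix computation reads the same file")
input()  # hold the mapping (and the lock) until Enter is pressed
```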
 
 ## Why don't you use gguf-split?