---
language:
- en
---

---

# To request a quant, open a new discussion in the Community tab (preferably with the URL in the title)

---

# Mini-FAQ

## I miss model XXX

I am not the only one making quants. For example, **Lewdiculous** makes high-quality imatrix quants of many
small models *and has a great presentation*. I either don't bother with imatrix quants for small models (< 30B), or skip them
because I saw that others had already done them, to avoid duplicating work.

Other notable people who do quants are **Nexesenex**, **bartowski**, **dranger003** and **Artefact2**. I'm not saying
anything about the quality of their quants, because I have probably forgotten some really good folks in this list, and I
wouldn't know anyway.
Model creators also often provide their own quants. I sometimes skip models because of that,
even if the creator provides far fewer quants than I do.

I would, however, strongly avoid FaradayDotDev's quants at this point, as they are refusing to understand why
all their llama-3 (and some other) quants are broken.

As always, feel free to request a quant, even if somebody else already did one, or request an imatrix version
for models where I didn't provide them.

## My community discussion is missing

Most likely you brought up problems with the model and I decided I either have to re-do or simply drop the quants.
In the past, I renamed the model (so you can see my reply), but the huggingface rename function is borked and leaves the files
available under their old name, which keeps me from regenerating them (because my scripts see that they already exist).
The only fix seems to be to delete the repo, which unfortunately also deletes the community discussion.

## I miss quant type XXX

The quant types I currently do regularly are:

- static:  Q8_0 IQ3_S Q4_K_S IQ3_M Q2_K Q6_K Q3_K_M Q3_K_S Q3_K_L Q4_K_M Q5_K_S Q5_K_M IQ3_XS IQ4_XS
- imatrix: Q2_K Q4_K_S IQ3_XXS Q3_K_M Q4_K_M IQ2_M Q6_K IQ4_XS Q3_K_S Q3_K_L Q5_K_S Q5_K_M Q4_0 IQ3_XS IQ3_S IQ3_M IQ2_XXS IQ2_XS IQ2_S IQ1_M IQ1_S

And they are generally (but not always) generated in the order above, for which there are deep reasons.

For models roughly under 10B in size, I am experimentally generating f16 versions at the moment. Or plan to; it's a bit hacky.

Older models that pre-date the introduction of new quant types will generally have them retrofitted, hopefully
this year. At least when multiple quant types are missing, as it is hard to justify a big model download
for just one quant. If you want a quant from the above list and don't want to wait, feel free to request it and I will
prioritize it to the best of my abilities.

I specifically do not do Q2_K_S, because I generally think it is not worth it, and IQ4_NL, because it requires
a lot of compute and is generally completely superseded by IQ4_XS.

You can always try to change my mind.

## What does the "-i1" mean in "-i1-GGUF"?

"mradermacher imatrix type 1"

Originally, I had the idea of using an iterative method of imatrix generation, and wanted to see how well it
fares. That is, create an imatrix from a bad quant (e.g. a static Q2_K), then use the new model to generate a
possibly better imatrix. It never happened, but I think sticking to something, even if slightly wrong, is better than
changing it. If I make considerable changes to how I create imatrix data, I will probably bump it to `-i2` and so on.
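
For illustration, here is a minimal sketch of that (never-implemented) iteration, assuming llama.cpp's `quantize`
and `imatrix` tools; the file names, round count and quant type are made up for the example:

```c
/* Hypothetical sketch of the iterative imatrix idea. The tool names are
 * llama.cpp's quantize/imatrix binaries; everything else is made up. */
#include <stdlib.h>

int main(void) {
    /* Round 0: a bad static quant, with no imatrix at all. */
    system("./quantize model-f16.gguf model-q2k.gguf Q2_K");

    for (int round = 1; round <= 2; round++) {
        /* Derive an imatrix from the current (bad) quant... */
        system("./imatrix -m model-q2k.gguf -f training.txt -o imatrix.dat");
        /* ...then requantize the f16 source with it, hopefully producing
         * a slightly better model for the next round. */
        system("./quantize --imatrix imatrix.dat model-f16.gguf model-q2k.gguf Q2_K");
    }
    return 0;
}
```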

Since there is some subjectivity/choice in imatrix training data, the suffix also distinguishes my quants from
quants by other people who made different choices.

## What is the imatrix training data you use, can I have a copy?

My training data consists of about 160k tokens; about half of it is semi-random tokens (sentence fragments)
taken from stories, and the other half is kalomaze's groups_merged.txt and a few other things. I have half-size and
quarter-size sets for models that are too big or too stubborn.

Neither my set nor kalomaze's data contains large amounts of non-English training data, which is why I tend
not to generate imatrix quants for models primarily meant for non-English usage. This is a trade-off, emphasizing
English over other languages. But from (sparse) testing data, it looks as if this doesn't actually make a big
difference. More data is always welcome.

Unfortunately, I do not have the rights to publish the testing data, but I might be able to replicate an
equivalent set in the future and publish that.

## Why are you doing this?

Because at some point, I found that some new interesting models were no longer available as GGUF - my go-to
source, TheBloke, had vanished. So I quantized a few models for myself. At the time, it was trivial - no imatrix,
only a few quant types, all of them very fast to generate.

I then looked into huggingface more closely than just as a download source, and decided uploading would be a
good thing, so others don't have to redo the work on their own. I'm used to sharing most of the things I make
(mostly in free software), so it felt natural to contribute, even at a minor scale.

Then the number of quant types and their computational complexity exploded, and imatrix calculations became a thing.
This increased the time required to make such quants by an order of magnitude, and the management overhead as well.

Since I was slowly improving my tooling, I grew into it at the same pace as these innovations came out. I probably
would not have started doing this a month later, as I would have been daunted by the complexity and work required.

## You have amazing hardware!?!?!

I regularly see people write that, but I probably have worse hardware than they do for creating my quants. I currently
have access to eight servers that have good upload speed. Five of them are quad-core Xeon-class machines from ~2013, three are
Ryzen 5 hexacores. The faster the server, the less disk space it has, so I can't just put the big
models on the fast(er) servers.

Imatrix generation is done on my home/work/gaming computer, which received an upgrade to 96GB of DDR5 RAM, and
originally had an RTX 4070 (now, again, upgraded - to a 4090 - thanks to a generous investment by the company I work for).

I have good download speeds, but bad upload speeds at home, so it's lucky that model downloads are big and imatrix
uploads are small.

## How do you create imatrix files for really big models?

Through a combination of these ingenious tricks:

1. I am not above using a low quant (e.g. Q4_K_S, IQ3_XS or even Q2_K), reducing the size of the model.
2. An NVMe drive is "only" 25-50 times slower than RAM. I lock the first 80GB of the model in RAM, and
   then stream the remaining data from disk for every iteration (see the sketch after this list).
3. Patience.
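
For the curious, here is a minimal sketch of trick 2 on Linux. The file name and the 80GB figure are just
placeholders from the text above; real code would need proper error handling and a raised RLIMIT_MEMLOCK:

```c
/* Sketch: map the whole gguf, then pin only the first chunk in RAM so
 * each imatrix pass re-reads only the tail from NVMe. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char *path = "model-q8_0.gguf";  /* hypothetical file name */
    size_t pin = 80ULL << 30;              /* pin the first 80GB */

    int fd = open(path, O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0) { perror(path); return 1; }

    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    if ((size_t)st.st_size < pin) pin = st.st_size;
    if (mlock(p, pin) != 0)                /* needs a high RLIMIT_MEMLOCK */
        perror("mlock");

    /* ... run the imatrix computation against this mapping ... */

    munlock(p, pin);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```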

The few evaluations I have suggest that this gives good quality, and my current set-up allows me to
generate imatrix data for most models in fp16, 70B models in Q8_0 and almost everything else in Q4_K_S.

The trick to 3 is not actually having patience; the trick is to automate things to the point where you
normally don't have to wait for them. For example, if all goes well, quantizing a model requires just
a single command (or less) for static quants, and for imatrix quants I need to select the source gguf
and then run another command that handles download/computation/upload. Most of the time, I only have
to do something when things go wrong (which, with llama.cpp being so buggy and hard to use,
is unfortunately very frequent).

## Why don't you use gguf-split?

TL;DR: I don't have the hardware/resources for that.

Long answer: gguf-split requires a full extra copy for every quant.
Unlike what many people think, my hardware is rather outdated and not very fast. The extra processing that gguf-split requires
either runs out of space on my systems with fast disks, or takes a very long time and a lot of I/O bandwidth on the slower
disks, all of which already run at their limits. Supporting gguf-split would mean doing that extra copy for every
single quant, which my setup simply cannot sustain.

While this is the blocking reason, I also find it less than ideal that yet another incompatible file format was created that
requires special tools to manage, instead of supporting the tens of thousands of existing quants, the vast majority of which
could simply be mmapped together into memory from split files. That doesn't keep me from supporting it, but it would have
been nice to look at the existing reality and/or consult the community before throwing yet another hard-to-support format out
there without thinking.
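
To illustrate that last claim, here is a rough sketch of mapping split files back-to-back into one contiguous
region: reserve an address range, then overlay each part with MAP_FIXED. The file names are hypothetical, and
every part except the last would have to be a multiple of the page size for this to work:

```c
/* Sketch: make two naive byte-split halves of a gguf appear as one
 * contiguous read-only mapping, without ever concatenating them on disk. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char *parts[] = { "model.gguf.part1", "model.gguf.part2" };
    size_t sizes[2], total = 0;

    for (int i = 0; i < 2; i++) {
        struct stat st;
        if (stat(parts[i], &st) != 0) { perror(parts[i]); return 1; }
        sizes[i] = st.st_size;  /* must be page-aligned, except the last part */
        total += sizes[i];
    }

    /* Reserve one contiguous, inaccessible region... */
    char *base = mmap(NULL, total, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* ...then overlay each part at its running offset. */
    size_t off = 0;
    for (int i = 0; i < 2; i++) {
        int fd = open(parts[i], O_RDONLY);
        if (fd < 0 || mmap(base + off, sizes[i], PROT_READ,
                           MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED) {
            perror(parts[i]);
            return 1;
        }
        close(fd);
        off += sizes[i];
    }

    /* base[0..total) now reads like one unsplit gguf. */
    return 0;
}
```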

There are some developments to make this less of a pain, and I will revisit this issue from time to time to see if it has
become feasible.

## Why are there so many imatrix quants where llama.cpp crashes?

Because llama.cpp developers see imatrix quants as a useless gimmick, are not very interested in fixing bugs
and mostly just want to ignore the problem, and consider merged models as mostly useless. They
are more interested in blaming their users.

As a result I have given up reporting bugs, and my efforts in providing imatrix quants will be reduced.

https://github.com/ggerganov/llama.cpp/issues/6841 is the issue ticket that escalated this.