To request a quant, open a new discussion in the Community tab (if possible, with the full URL somewhere in the title or body).

You can search models, compare and download quants at https://hf.tst.eu/

You can see the current quant status at https://hf.tst.eu/status.html

Hugging Face has severely limited my uploads, so requests can be delayed.

HF seems uninterested in doing anything about it, so this is likely to stay. Some workarounds have been implemented, and they mostly work, but sometimes quants get delayed because no uploading is possible.

Mini-FAQ

I miss model XXX

I am not the only one to make quants. For example, Lewdiculous makes high-quality imatrix quants of many small models and has a great presentation. I either don't bother with imatrix quants for small models (< 30B), or avoid them because I saw others already did them, avoiding double work.

Some other notable people who do quants are Nexesenex, bartowski, RichardErkhov, dranger003 and Artefact2. I'm not saying anything about the quality of their quants, because I probably forgot some really good folks in this list, and I wouldn't even know, anyways. Model creators also often provide their own quants. I sometimes skip models because of that, even if the creator might provide far fewer quants than me.

As always, feel free to request a quant, even if somebody else already did one, or request an imatrix version for models where I didn't provide them.

My community discussion is missing

Most likely you brought up problems with the model and I decided I either have to re-do or simply drop the quants. In the past, I renamed the model (so you can see my reply), but the huggingface rename function is borked and leaves the files available under their old name, keeping me from regenerating them (because my scripts see them as already existing). The only fix seems to be to delete the repo, which unfortunately also deletes the community discussion.

I miss quant type XXX

The quant types I currently do regularly are:

static: (f16) Q8_0 Q4_K_S Q2_K Q6_K Q3_K_M Q3_K_S Q3_K_L Q4_K_M Q5_K_S Q5_K_M IQ4_XS (Q4_0_4)
imatrix: Q2_K Q4_K_S IQ3_XXS Q3_K_M (IQ4_NL) Q4_K_M IQ2_M Q6_K IQ4_XS Q3_K_S Q3_K_L Q5_K_S Q5_K_M Q4_0 IQ3_XS IQ3_S IQ3_M IQ2_XXS IQ2_XS IQ2_S IQ1_M IQ1_S (Q4_0_4_4 Q4_0_4_8 Q4_0_8_8)

And they are generally (but not always) generated in the order above, for which there are deep reasons.

For models smaller than 11B, I experimentally generate f16 versions at the moment (in the static repository).

For models smaller than 19B, imatrix IQ4_NL quants will be generated, mostly for the benefit of ARM, where it can give a speed benefit.
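
As a rough illustration of the rules above, the selection could be sketched like this. This is not the actual tooling: the function name `planned_quants` and the data layout are made up for the example, and the parenthesised ARM types (Q4_0_4 and friends) are left out because the conditions for generating them are not stated here.

```python
# Hypothetical sketch of the size rules described above, not the real scripts.
# The lists follow the generation order given in the text, without the
# parenthesised (conditional) entries.
STATIC_QUANTS = [
    "Q8_0", "Q4_K_S", "Q2_K", "Q6_K", "Q3_K_M", "Q3_K_S",
    "Q3_K_L", "Q4_K_M", "Q5_K_S", "Q5_K_M", "IQ4_XS",
]
IMATRIX_QUANTS = [
    "Q2_K", "Q4_K_S", "IQ3_XXS", "Q3_K_M", "Q4_K_M", "IQ2_M", "Q6_K",
    "IQ4_XS", "Q3_K_S", "Q3_K_L", "Q5_K_S", "Q5_K_M", "Q4_0", "IQ3_XS",
    "IQ3_S", "IQ3_M", "IQ2_XXS", "IQ2_XS", "IQ2_S", "IQ1_M", "IQ1_S",
]


def planned_quants(params_b: float) -> dict[str, list[str]]:
    """Return the quant types for a model with `params_b` billion parameters."""
    static = list(STATIC_QUANTS)
    imatrix = list(IMATRIX_QUANTS)
    if params_b < 11:
        # f16 is listed first in the static order.
        static.insert(0, "f16")
    if params_b < 19:
        # IQ4_NL sits right after Q3_K_M in the imatrix order.
        imatrix.insert(imatrix.index("Q3_K_M") + 1, "IQ4_NL")
    return {"static": static, "imatrix": imatrix}


plan = planned_quants(8)
print(plan["static"][0], "IQ4_NL" in plan["imatrix"])
```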

The (static) IQ3 quants are no longer generated, as they consistently seem to result in much lower quality quants than even static Q2_K, so it would be a disservice to offer them. Update: That might no longer be true, and they might come back.

I specifically do not do Q2_K_S, because I generally think it is not worth it (IQ2_M usually being smaller and better, albeit slower), and IQ4_NL, because it requires a lot of computing and is generally completely superseded by IQ4_XS.

Q8_0 imatrix quants do not exist - some quanters claim otherwise, but Q8_0 ggufs do not contain any tensor type that uses the imatrix data, although technically it might be possible to do so.

Older models that predate the introduction of new quant types will generally have them retrofitted on request.

You can always try to change my mind about all this, but be prepared to bring convincing data.

What does the "-i1" mean in "-i1-GGUF"?

"mradermacher imatrix type 1"

Originally, I had the idea of using an iterative method of imatrix generation, and wanted to see how well it fares. That is, create an imatrix from a bad quant (e.g. static Q2_K), then use the new model to generate a possibly better imatrix. It never happened, but I think sticking to something, even if slightly wrong, is better than changing it. If I make considerable changes to how I create imatrix data I will probably bump it to -i2 and so on.

Since there is some subjectivity/choice in imatrix training data, this also distinguishes it from quants by other people who made different choices.

What is the imatrix training data you use, can I have a copy?

My training data consists of about 160k tokens, about half of which is semi-random tokens (sentence fragments) taken from stories, the other half is kalomaze's groups_merged.txt and a few other things. I have a half and a quarter set for too big or too stubborn models.

Neither my set nor kalomaze's data contains large amounts of non-English training data, which is why I tend to not generate imatrix quants for models primarily meant for non-English usage. This is a trade-off, emphasizing English over other languages. But from (sparse) testing data it looks as if this doesn't actually make a big difference. More data are always welcome.

Unfortunately, I do not have the rights to publish the testing data, but I might be able to replicate an equivalent set in the future and publish that.

Why are you doing this?

Because at some point, I found that some new interesting models weren't available as GGUF anymore - my go-to source, TheBloke, had vanished. So I quantized a few models for myself. At the time, it was trivial - no imatrix, only a few quant types, all of them very fast to generate.

I then looked into huggingface more closely than just as a download source, and decided uploading would be a good thing, so others don't have to redo the work on their own. I'm used to sharing most of the things I make (mostly in free software), so it felt natural to contribute, even at a minor scale.

Then the number of quant types and their computational complexity exploded, and imatrix calculations became a thing. This increased the time required to make such quants by an order of magnitude, and the management overhead along with it.

Since I was slowly improving my tooling I grew into it at the same pace as these innovations came out. I probably would not have started doing this a month later, as I would have been daunted by the complexity and work required.

You have amazing hardware!?!?!

I regularly see people write that, but I probably have worse hardware than them to create my quants. I currently have access to eight servers that have good upload speed. Five of them are quad-core Xeon-class machines from ~2013, three are Ryzen 5 hexacores. The faster the server, the less disk space it has, so I can't just put the big models on the fast(er) servers.

Imatrix generation is done on my home/work/gaming computer, which received an upgrade to 96GB DDR5 RAM, and originally had an RTX 4070 (now, again, upgraded to a 4090 thanks to a generous investment by the company I work for).

I have good download speeds, but bad upload speeds at home, so it's lucky that model downloads are big and imatrix uploads are small.

How do you create imatrix files for really big models?

Through a combination of these ingenious tricks:

I am not above using a low quant (e.g. Q4_K_S, IQ3_XS or even Q2_K), reducing the size of the model.
An nvme drive is "only" 25-50 times slower than RAM. I lock the first 80GB of the model in RAM, and then stream the remaining data from disk for every iteration.
Patience.
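
To get a feel for the second trick, here is a back-of-the-envelope sketch of how much data has to come from disk on each pass. Only the 80GB locked in RAM comes from the text above; the function names, the example model size and the NVMe throughput are assumptions chosen for illustration.

```python
# Rough arithmetic only: real throughput depends on the drive, the file
# system and the access pattern, none of which are modelled here.

def streamed_gb_per_pass(model_gb: float, locked_gb: float = 80.0) -> float:
    """GB that must be read from disk on every pass over the weights."""
    return max(0.0, model_gb - locked_gb)


def read_seconds_per_pass(model_gb: float, nvme_gb_per_s: float,
                          locked_gb: float = 80.0) -> float:
    """Time spent just reading the non-resident part, for an assumed disk speed."""
    return streamed_gb_per_pass(model_gb, locked_gb) / nvme_gb_per_s


# Assumed example: a 140 GB quant and a drive sustaining 5 GB/s.
print(streamed_gb_per_pass(140))        # 60.0 GB from disk per pass
print(read_seconds_per_pass(140, 5.0))  # 12.0 seconds of reading per pass
```

A model that fits entirely into the locked region needs no streaming at all, which is why a lower quant (the first trick) helps so much.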

The few evaluations I have suggest that this gives good quality, and my current set-up allows me to generate imatrix data for most models in fp16, 70B in Q8_0 and almost everything else in Q4_K_S.

The trick to the third point is not actually having patience; the trick is to automate things to the point where you don't normally have to wait for them. For example, if all goes well, quantizing a model requires just a single command (or less) for static quants, and for imatrix quants I need to select the source gguf and then run another command which handles download/computation/upload. Most of the time, I only have to do stuff when things go wrong (which, with llama.cpp being so buggy and hard to use, is unfortunately very frequent).

What do I need to do to compute imatrix files for large models?

Use llama-imatrix to compute imatrix files.
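
A minimal sketch of what such an invocation might look like when driven from a script. The helper `imatrix_command` and all file names are made up for the example; the flags shown (`-m`, `-f`, `-o`, `-ngl`) are the commonly used llama-imatrix options, but check `llama-imatrix --help` for your llama.cpp version before relying on them.

```python
import shlex


def imatrix_command(model_path: str, dataset_path: str, output_path: str,
                    gpu_layers: int = 0) -> list[str]:
    """Build (but do not run) a llama-imatrix command line."""
    cmd = ["llama-imatrix",
           "-m", model_path,      # source gguf (possibly a lower quant)
           "-f", dataset_path,    # plain-text imatrix training data
           "-o", output_path]     # where the imatrix file is written
    if gpu_layers > 0:
        # Offload some layers to the GPU if there is VRAM to spare.
        cmd += ["-ngl", str(gpu_layers)]
    return cmd


# Hypothetical file names, printed as a shell-safe command line.
print(shlex.join(imatrix_command("model.Q8_0.gguf", "imatrix-data.txt",
                                 "model.imatrix", gpu_layers=10)))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once the paths point at real files.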

Hardware

RAM: A lot of RAM is required to compute imatrix files. Example: 512 GB is just enough to compute 405B imatrix quants in Q8.
GPU: At least 8 GB of memory.

Dataset

You want to create a dataset that is around double the size of bartowski1182's imatrix dataset. Quality is far more important than size. If you don't mind long training times, you can make it massive, but if you go beyond 1 MB there will probably be diminishing returns.
Your imatrix dataset should contain the typical output the model would generate when used for the workload you plan on using the model for. If you plan on using the model as a programming assistant, your imatrix dataset should contain the typical code you would ask it to write. The same applies to language. Our dataset is mostly English. If you use our imatrix quants in a different language, they will likely perform worse than static quants, as only a very small portion of our imatrix training data is multilingual. We only have the resources to generate a single generic set of imatrix quants, so our imatrix dataset must contain examples of every common use-case of an LLM.

Extra tips

Computing 405B imatrix quants in Q8 does not seem to have any noticeable quality impact compared to BF16, so to save on hardware requirements, use Q8.
Sometimes, a single node may not have enough RAM to compute the imatrix file. In such cases, llama-rpc inside llama.cpp can be used to combine the RAM/VRAM of multiple nodes. This approach takes longer: computing the 405B imatrix file in BF16 takes around 20 hours using 3 nodes with 512 GB, 256 GB, and 128 GB of RAM, compared to 4 hours for Q8 on a single node.
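
The numbers above can be sanity-checked with some simple arithmetic. The sketch below assumes "Q8" means Q8_0, which stores 32 weights as 32 int8 values plus one fp16 scale (8.5 bits per weight), and it only counts the weights themselves; context, compute buffers and the operating system need memory on top of that, which is presumably why 512 GB is "just enough".

```python
# Approximate weight sizes; the helper name and table are illustrative only.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "Q8_0": 8.5,   # assumption: 34 bytes per block of 32 weights
}


def weights_gb(params_billion: float, fmt: str) -> float:
    """Size of the weights in GB (10^9 bytes) for the given format."""
    return params_billion * BITS_PER_WEIGHT[fmt] / 8


q8 = weights_gb(405, "Q8_0")     # 430.3125 GB, fits into one 512 GB node
bf16 = weights_gb(405, "BF16")   # 810.0 GB, too big for any single node above
combined_ram = 512 + 256 + 128   # 896 GB across the three RPC nodes

print(f"Q8_0: {q8:.1f} GB, BF16: {bf16:.1f} GB, combined RAM: {combined_ram} GB")
print("BF16 fits across the three nodes:", bf16 <= combined_ram)
```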

Why don't you use gguf-split?

TL;DR: I don't have the hardware/resources for that.

Long answer: gguf-split requires a full copy for every quant. Unlike what many people think, my hardware is rather outdated and not very fast. The extra processing that gguf-split requires either runs out of space on my systems with fast disk, or takes a very long time and a lot of I/O bandwidth on the slower disks, all of which already run at their limits. Supporting gguf-split would mean

While this is the blocking reason, I also find it less than ideal that yet another incompatible file format was created that requires special tools to manage, instead of supporting the tens of thousands of existing quants, of which the vast majority could just be mmapped together into memory from split files. That doesn't keep me from supporting it, but it would have been nice to look at the existing reality and/or consult the community before throwing yet another hard to support format out there without thinking.

There are some developments to make this less of a pain, and I will revisit this issue from time to time to see if it has become feasible.

Update 2024-07: llama.cpp probably has most of the features needed to make this reality, but I haven't found time to test and implement it yet.

Update 2024-09: just looked at implementing it, and no, the problems that keep me from doing it are still there :(. Must have fantasized it!!?

So who is mradermacher?

Nobody has asked this, but since there are people who really deserve mention, I'll put this here. "mradermacher" is just a pseudonymous throwaway account I created to goof around, but then started to quant models. A few months later, @nicoboss joined and contributed hardware, power and general support - practically all imatrix computations are done on his computer(s). Then @Guilherme34 started to help getting access to models, and @RichardErkhov first gave us the wondrous FATLLAMA-1.7T, followed by access to his server to quant more models, likely to atone for his sins.

So you should consider "mradermacher" to be the team name for a fictional character called Michael Radermacher. There are no connections to anything else on the internet, other than an mradermacher_hf account on reddit.