---
language:
  - en
---

To request a quant, open a new discussion in the Community tab (preferably with the model URL in the title)


Mini-FAQ

I miss model XXX

I am not the only one making quants. For example, Lewdiculous makes high-quality imatrix quants of many small models and has a great presentation. I either don't bother with imatrix quants for small models (< 30B), or skip them because I saw that others had already done them, to avoid double work.

Other notable people who do quants are Nexesenex, bartowski, dranger003 and Artefact2. I'm not saying anything about the quality, because I have probably forgotten some really good folks in this list, and I wouldn't even know, anyway. Model creators also often provide their own quants. I sometimes skip models because of that, even if the creator might provide far fewer quants than I do.

As always, feel free to request a quant, even if somebody else already did one, or request an imatrix version for models where I didn't provide them.

I miss quant type XXX

The quant types I currently do regularly are:

  • static: Q8_0 IQ3_S Q4_K_S IQ3_M Q2_K Q6_K Q3_K_M Q3_K_S Q3_K_L Q4_K_M Q5_K_S Q5_K_M IQ3_XS IQ4_XS
  • imatrix: Q2_K Q4_K_S IQ3_XXS Q3_K_M Q4_K_M IQ2_M Q6_K IQ4_XS Q3_K_S Q3_K_L Q5_K_S Q5_K_M Q4_0 IQ3_XS IQ3_S IQ3_M IQ2_XXS IQ2_XS IQ2_S IQ1_M IQ1_S

And they are generally (but not always) generated in the order above, for which there are deep reasons.
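For anyone who wants to produce a missing type themselves, here is a minimal sketch of how such quants can be generated with llama.cpp. The binary name (llama-quantize), the file names and the chosen types are illustrative assumptions, not my actual tooling, and the tool may be called "quantize" in older llama.cpp builds:

```python
# Minimal sketch: produce some of the quant types listed above with llama.cpp.
# Assumes the model has already been converted to a GGUF source file (here f16)
# and, for the imatrix variants, that an imatrix file already exists.
import subprocess

SRC = "Model-f16.gguf"        # hypothetical converted source model
IMATRIX = "Model.imatrix"     # hypothetical imatrix file

for qtype in ["Q8_0", "Q4_K_S", "Q2_K"]:            # a subset of the static list
    # usage: llama-quantize <input.gguf> <output.gguf> <type>
    subprocess.run(["./llama-quantize", SRC, f"Model-{qtype}.gguf", qtype],
                   check=True)

for qtype in ["IQ3_XXS", "IQ2_M", "IQ4_XS"]:        # a subset of the imatrix list
    # --imatrix feeds importance data into the quantization
    subprocess.run(["./llama-quantize", "--imatrix", IMATRIX,
                    SRC, f"Model-i1-{qtype}.gguf", qtype],
                   check=True)
```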

For models roughly below 10B in size, I am experimentally generating f16 versions at the moment. Or plan to; it's a bit hacky.

Older models that pre-date the introduction of newer quant types will generally have them retrofitted, hopefully this year, at least when multiple quant types are missing, as it is hard to justify downloading a big model for just one quant. If you want a quant from the above list and don't want to wait, feel free to request it and I will prioritize it to the best of my abilities.

I specifically do not do Q2_K_S, because I generally think it is not worth it, and IQ4_NL, because it requires a lot of computing and is generally completely superseded by IQ4_XS.

You can always try to change my mind.

What does the "-i1" mean in "-i1-GGUF"?

"mradermacher imatrix type 1"

Originally, I had the idea of using an iterative method of imatrix generation, and wanted to see how well it fares. That is, create an imatrix from a bad quant (e.g. static Q2_K), then use the new model to generate a possibly better imatrix. It never happened, but I think sticking to something, even if slightly wrong, is better than changing it. If I make considerable changes to how I create imatrix data, I will probably bump it to -i2 and so on.

Since there is some subjectivity/choice in imatrix training data, this also distinguishes my quants from quants by other people who made different choices.

What is the imatrix training data you use, can I have a copy?

My training data consists of about 160k tokens, about half of which is semi-random tokens (sentence fragments) taken from stories, the other half is kalomaze's groups_merged.txt and a few other things. I have a half-size and a quarter-size set for models that are too big or too stubborn.

Neither my set nor kalomaze's data contain large amounts of non-English training data, which is why I tend not to generate imatrix quants for models primarily meant for non-English usage. This is a trade-off, emphasizing English over other languages. But from (sparse) testing data, it looks as if this doesn't actually make a big difference. More data is always welcome.

Unfortunately, I do not have the rights to publish the training data, but I might be able to replicate an equivalent set in the future and publish that.
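In the meantime, a roughly similar mix could be assembled along these lines; the file names, counts and selection method below are illustrative assumptions, not my actual process:

```python
# Illustrative sketch only: the real data is not public, and the file names,
# proportions and counts here are assumptions.
import random

def sentence_fragments(path, n):
    """Pull n semi-random sentence fragments out of a plain-text story file."""
    words = open(path, encoding="utf-8").read().split()
    fragments = []
    for _ in range(n):
        start = random.randrange(max(1, len(words) - 64))
        fragments.append(" ".join(words[start:start + random.randint(8, 64)]))
    return fragments

parts = sentence_fragments("stories.txt", 2000)      # roughly half: story fragments
parts += open("groups_merged.txt",                   # roughly half: kalomaze's set
              encoding="utf-8").read().splitlines()

random.shuffle(parts)
with open("imatrix-training.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(parts))
# imatrix-training.txt would then be fed to llama.cpp's imatrix tool.
```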

Why are you doing this?

Because at some point, I found that some new interesting models weren't available as GGUF anymore - my go-to source, TheBloke, had vanished. So I quantized a few models for myself. At the time, it was trivial - no imatrix, only a few quant types, all of them very fast to generate.

I then looked into huggingface more closely than just as a download source, and decided uploading would be a good thing, so others don't have to redo the work on their own. I'm used to sharing most of the things I make (mostly in free software), so it felt natural to contribute, even at a minor scale.

Then the number of quant types and their computational complexity exploded, and imatrix calculations became a thing. This increased the time required to make such quants by an order of magnitude, and also the management overhead.

Since I was slowly improving my tooling, I grew into it at the same pace as these innovations came out. I probably would not have started doing this a month later, as I would have been daunted by the complexity and work required.

You have amazing hardware!?!?!

I regularly see people write that, but I probably have worse hardware than they do to create my quants. I currently have access to eight servers with good upload speed. Five of them are quad-core Xeon-class machines from ~2013, three are Ryzen 5 hexacores. The faster the server, the less disk space it has, so I can't just put the big models on the fast(er) servers.

Imatrix generation is done on my home/work/gaming computer, which received an upgrade to 96GB DDR5 RAM, and originally had an RTX 4070 (meanwhile upgraded again, to a 4090, thanks to a generous investment by the company I work for).

I have good download speeds, but bad upload speeds at home, so it's lucky that model downloads are big and imatrix uploads are small.

How do you create imatrix files for really big models?

Through a combination of these ingenious tricks:

  1. I am not above using a low quant (e.g. Q4_K_S, IQ3_XXS or even Q2_K), reducing the size of the model.
  2. An NVMe drive is "only" 25-50 times slower than RAM. I lock the first 80GB of the model in RAM, and then stream the remaining data from disk for every iteration.
  3. Patience.

The few evaluations I have suggest that this gives good quality, and my current set-up allows me to generate imatrix data for most models in fp16, for 70B models in Q8_0, and for almost everything else in Q4_K_S.
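As a concrete illustration of trick 1, computing imatrix data from a low quant looks roughly like this with llama.cpp's imatrix tool; the binary name, flags and file names are assumptions about a recent llama.cpp build, and the partial RAM-locking from trick 2 happens outside of this command:

```python
# Sketch: compute imatrix data from a low quant of a very large model.
# Assumes a recent llama.cpp build where the tool is called llama-imatrix;
# all file names are illustrative.
import subprocess

subprocess.run([
    "./llama-imatrix",
    "-m", "BigModel-Q4_K_S.gguf",   # trick 1: a small quant instead of fp16
    "-f", "imatrix-training.txt",   # the training text described earlier
    "-o", "BigModel.imatrix",
    "-ngl", "0",                    # hypothetical: keep everything in RAM/on disk
], check=True)                      # trick 3: patience
```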

Why don't you use gguf-split?

TL;DR: I don't have the hardware/resources for that.

Long answer: gguf-split requires a full copy for every quant. Unlike what many people think, my hardware is rather outdated and not very fast. The extra processing that gguf-split requires either runs out of space on my systems with fast disks, or takes a very long time and a lot of I/O bandwidth on the slower disks, all of which already run at their limits. Supporting gguf-split would mean adding that extra copy step to every quant I make, which my current setup simply cannot absorb.

While this is the blocking reason, I also find it less than ideal that yet another incompatible file format was created that requires special tools to manage, instead of supporting the tens of thousands of existing quants, of which the vast majority could simply be mmapped together into memory from split files. That by itself wouldn't keep me from supporting it, but it would have been nice to look at the existing reality and/or consult the community before throwing yet another hard-to-support format out there without thinking.

There are some developments to make this less of a pain, and I will revisit this issue from time to time to see if it has become feasible.