Where does the imatrix come from?

#1
by treehugg3 - opened

Hi, and first of all thank you for the quants! It would be a great improvement if the README.md in this (and the other) imatrix quant repositories linked to details on the source of the training data used to compute the imatrix. Today I discovered that computing an imatrix requires a training text set, whereas previously I had assumed these all came from some "official" imatrix.

(Of course the file itself is created via llama-imatrix.)

Thanks!

All of mradermacher's imatrix files published in the past 9 months were computed inside the nico1 LXC container on my StormPeak node, using 2x RTX 4090 and 512 GB of RAM. As a dataset we use the one from bartowski, but with a lot of confidential proprietary high-quality data added, so we can offer the highest possible quality of general-purpose imatrix quants.

If you want to compute your own imatrix, you can get bartowski1182's imatrix dataset from https://gist.github.com/bartowski1182/eb213dccb3571f863da82e99418f81e8 and add your own high-quality data to it. If you want to create a general-purpose imatrix dataset, I recommend focusing especially on story writing and roleplay data, which is mostly missing from bartowski1182's dataset. If you have a specific use case in mind, you can always create an imatrix dataset that covers it and even beat our imatrix quants in that area.

It is really unfortunate that we are not allowed to share our imatrix dataset, but by using proprietary data I believe we can offer much better imatrix quants than if we relied only on non-proprietary data.
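If you want to try it yourself, a minimal llama-imatrix invocation looks roughly like this (the file names are placeholders for your own files, and it's worth checking `llama-imatrix --help` on your build since flags occasionally change):

```bash
# Compute an importance matrix from a plain-text calibration dataset.
# model-f16.gguf and calibration.txt are placeholders; -o names the output
# file that llama-quantize consumes later, and -ngl offloads layers to the GPU.
./llama-imatrix -m model-f16.gguf -f calibration.txt -o imatrix.dat -ngl 99
```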

Thank you tremendously for the thorough explanation! If possible, would you mind sharing what the approximate size of the proprietary high-quality dataset is? I'm trying to get a sense for how much data I should use. I was surprised to see the gist you linked is only 273 KB in size.

Also: by high-quality data, do you mean domain-specific data that is likely within the dataset of the original model, so we can match the activations from the original model? Or just "important" data in the distribution we aim to output from the model? I think I will try my hand at making a domain-specific dataset as you suggested. I might even try it on a big 405B model I could not hope to run at any "natively decent" quant level.

Thank you again!

If possible, would you mind sharing what the approximate size of the proprietary high-quality dataset is? I'm trying to get a sense for how much data I should use. I was surprised to see the gist you linked is only 273 KB in size.

You want to create a dataset that is around double the size of bartowski1182's imatrix dataset. Quality is far more important than size. If you don't mind long imatrix computation times, you can obviously also make it massive, but beyond roughly 1 MB there will probably be diminishing returns.
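As a rough sanity check you can simply concatenate the pieces and look at the byte count; the file names below are placeholders:

```bash
# Merge bartowski1182's calibration text with your own additions and check the size.
# Staying somewhere around 0.5-1 MB keeps imatrix computation times reasonable.
cat bartowski_calibration.txt my_story_and_rp_data.txt > my_imatrix_dataset.txt
wc -c my_imatrix_dataset.txt
```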

Here is another great public imatrix dataset, this time based on kalomaze's groups_merged.txt but with added roleplay data, serving as a great example of how to create an imatrix dataset for a roleplay-focused use case: https://huggingface.co/Lewdiculous/Datura_7B-GGUF-Imatrix/blob/main/imatrix-with-rp-format-data.txt

by high-quality data, do you mean domain-specific data that is likely within the dataset of the original model, so we can match the activations from the original model? Or just "important" data in the distribution we aim to output from the model?

Your imatrix dataset should contain the typical output the model would generate for the workload you plan on using it for. If you plan on using the model as a programming assistant, your imatrix dataset should contain the typical code you would ask it to write. The same applies to language. Our dataset is mostly English. If you use our imatrix models in a different language, they will likely perform worse than the static quants, as only a very small portion of our imatrix training data is multilingual. We only have the resources to generate a single generic set of imatrix quants per model, so our imatrix dataset must contain examples of every common LLM use case.

Keep in mind that for MoE models your imatrix dataset needs to activate all experts. This is especially difficult for models with 256 experts like Snowflake Arctic, where even our imatrix dataset only activates 255 out of 256 experts. To still compute the imatrix of such models, I created a llama.cpp fork whose mradermacher branch contains a modification that allows llama.cpp to statically quantize non-activated experts: https://github.com/nicoboss/llama.cpp/tree/mradermacher
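If you run into the same problem, building that branch works the same as building upstream llama.cpp, just from the fork. A rough sketch (the CUDA CMake option may be named differently on older checkouts, so treat the exact flag as an assumption):

```bash
# Build the mradermacher branch of the fork, which allows statically
# quantizing experts that the imatrix dataset never activated.
git clone https://github.com/nicoboss/llama.cpp
cd llama.cpp
git checkout mradermacher
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```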

I might even try it on a big 405B model I could not hope to run at any "natively decent" quant level.

Keep in mind that to compute the imatrix of such massive models you need a ton of RAM. For 405B, 512 GB is just enough to compute the imatrix in Q8. Based on the quality measurements I made, computing the imatrix on Q8 will not have any noticeable quality impact, but if you are a perfectionist you would want to do the imatrix computation in BF16. Because we only have 512 GB of RAM in one node and some models don't fit, we sometimes use the RPC functionality in llama.cpp to combine the RAM of all my nodes (512 GiB + 256 GiB + 128 GiB) and do distributed imatrix computation. Unlike the normal imatrix computation, which takes around 4 hours for 405B in Q8, this takes 20 hours for 405B in BF16. Really, all that matters for computing an imatrix is that the model fits into RAM and that you have a GPU with at least 8 GB of GPU memory.

In case you are wondering, by far the best 405B model in my opinion is the recently released 405B reasoning model, which is a reasoner finetune of my uncensored Hermes-3 405B model: https://huggingface.co/GuilhermeNaturaUmana/Nature-Reason-1
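If you want to experiment with RPC yourself, the setup is roughly: build llama.cpp with RPC enabled, start `rpc-server` on each helper node, and point the main job at them. A sketch under the assumption of a recent llama.cpp build (host names, ports and file names are placeholders; check `--help` for the exact flags on your version):

```bash
# On each helper node: expose its RAM/GPU over the network.
./rpc-server --host 0.0.0.0 --port 50052

# On the main node: list the helpers so their memory is pooled
# with the local machine for the imatrix computation.
./llama-imatrix -m Model-405B-BF16.gguf -f calibration.txt -o imatrix.dat \
  --rpc 192.168.1.11:50052,192.168.1.12:50052
```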

Keep in mind that you can always request a model under https://huggingface.co/mradermacher/model_requests/discussions and that we always provide the imatrix file, so unless you are really interested in imatrix generation itself, just request all the models you are interested in and we will do everything for you for free.

Wow, thank you so much for such a detailed and extensive response! I bookmarked this page and I immediately switched to downloading a Q8 version. Though I suppose based on your note it's not even worth trying to compute a 405B imatrix on only 128GB RAM. I was hoping the RAM usage would be per-layer, but I guess not. I was hoping to get a 405B model to fit, somehow, onto my machine with better performance than a 70B model.

I am interested in imatrix generation mostly because I want to see how well I can match performance to the (less quantized) bigger models on some specific tasks and output formats while squeezing the model size as small as I can. However, I can run 70B models at Q6_K without much issue, so maybe I will just try to find a better model or make an imatrix for a 70B model so I can offload more layers into the GPU. I'm interested in models that do not carry with them GPT-isms or GPT bias (for this reason why I'm looking at this model!), and my thought was that 405B will be slightly smarter and less repetitive.

...And the reason for all this is that I was not very successful at fine-tuning my own smaller models. Too many GPT-isms continued creeping in when I tried to train an instruct model via QLoRA (probably as you'd expect), and my full fine tunes are just too dumb and too repetitive, albeit completely free of GPT-isms (which is so nice!). So I am leaning on some bigger base models to make up for it.

The extra links are so helpful, too. I have a much better idea of what my imatrix would look like (English writing + some code in the target format). Also I'm surprised you were able to get an uncensored 405B model with only 934 instruct examples in the dataset! That's encouraging.

Though I suppose based on your note it's not even worth trying to compute a 405B imatrix on only 128GB RAM.

You unfortunately need at least 512 GiB of RAM to compute high-quality imatrix quants of a 405B model. But don't worry: you can just send me your imatrix dataset and the model you want the imatrix computed for, and I will compute it for you.

I was hoping the RAM usage would be per-layer, but I guess not. I was hoping to get a 405B model to fit, somehow, onto my machine with better performance than a 70B model.

It unfortunately all needs to fit in RAM or llama.cpp will have to stream the model from SSD which makes the imatrix computation take weeks instead of hours.

I am interested in imatrix generation mostly because I want to see how well I can match performance to the (less quantized) bigger models on some specific tasks and output formats while squeezing the model size as small as I can.

I'm personally a huge fan of 405B models and am locally running them all the time. They are far superior to 70B for my personal use case, which mainly consists of single-turn Q&A. 128 GiB of RAM is very tight for running a 405B model. Depending on the amount of GPU memory you have, the best you can fit with a small context size is likely i1-IQ2_S. Will it beat 70B despite being so heavily quantized? I would say it depends on your workload, but probably yes. 405B is a monolithic model, not an MoE, and so it tolerates heavy quantization relatively well. Keep in mind that if you have a second PC, you can use llama.cpp RPC to combine their RAM in order to run inference on much higher-quality quants.

I in fact spent months comparing the quality of different quants, including the 405B ones. Here are the results, copied from my post under https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/2#6732972fc7b41d86099eb5d9 - you can find many other quant quality measurements in that discussion as well if you are interested.

(Attached plots for Meta-Llama-3.1-405B-Instruct: KL divergence, perplexity, probability of the quant generating the same token, correct token probability, and eval results for ARC, MMLU and Winogrande.)

However, I can run 70B models at Q6_K without much issue, so maybe I will just try to find a better model or make an imatrix for a 70B model so I can offload more layers into the GPU. I'm interested in models that do not carry with them GPT-isms or GPT bias (for this reason why I'm looking at this model!), and my thought was that 405B will be slightly smarter and less repetitive.

70B is decent. Running in Q6_K is kind of stupid. Everything above i1-Q5_K_M is just placebo, as there is no way you would feel any real-world quality difference. i1-Q4_K_M is in my opinion the best trade-off between inference speed and quality.
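For reference, this is roughly how the imatrix file is applied when producing such a quant (quant type and file names are placeholders):

```bash
# Quantize using the previously computed importance matrix.
# Q4_K_M here corresponds to the i1-Q4_K_M trade-off mentioned above.
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-i1-Q4_K_M.gguf Q4_K_M
```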

And the reason for all this is that I was not very successful at fine-tuning my own smaller models. Too many GPT-isms continued creeping in when I tried to train an instruct model via QLoRA (probably as you'd expect), and my full fine tunes are just too dumb and too repetitive, albeit completely free of GPT-isms (which is so nice!). So I am leaning on some bigger base models to make up for it.

I recommend you do the following for good finetunes:

  1. 4 epochs using uncensor
  2. If a Chinese model and you want to be politically unbiased: 6 epochs over GreatFirewall-DPO
  3. If you want a human-like model: 1 epoch over Samantha-newdataset-morelarge
  4. If you want a reasoner model: 1 epoch over Reasoning-deepseek (currently single-turn only)

Here is my 405B Samantha if you want a model acting as a virtual friend: https://huggingface.co/nicoboss/Hermes-3-Llama-3.1-405B-Samantha

The extra links are so helpful, too. I have a much better idea of what my imatrix would look like (English writing + some code in the target format).

Thanks. If you need any more imatrix dataset examples just let me know. I have a small collection of them.

I'm surprised you were able to get an uncensored 405B model with only 934 instruct examples in the dataset! That's encouraging.

As with the imatrix dataset, data quality matters much more than quantity for finetuning as well. For simple concepts like uncensoring, 4 epochs over 1K examples is enough. For more complex tasks like reasoning, 1 epoch over 9K rows is required.

Neat, somehow I was not even aware of llama-rpc... this is going to change things for sure. I will try out the 405B models!

But don't worry you can just send me your imatrix dataset and the model for which you want me to compute the imatrix for and I can compute it for you.

Thank you for offering this!! I will practice a lot on 70B models first, and once I am successful at creating a decent imatrix dataset (compared to the non-imatrix quants) I'll ask!

I'm personally a huge fan of 405B models and am locally running them all the time. They are far superior to 70B for my personal use case, which mainly consists of single-turn Q&A. 128 GiB of RAM is very tight for running a 405B model. Depending on the amount of GPU memory you have, the best you can fit with a small context size is likely i1-IQ2_S. Will it beat 70B despite being so heavily quantized? I would say it depends on your workload, but probably yes. 405B is a monolithic model, not an MoE, and so it tolerates heavy quantization relatively well.

Very valuable information! Unfortunately I need a lot of context since I have very few high-quality examples and basically need to few-shot in order to get decent outputs. I hope splitting via llama-rpc will help with that.

My attempts at instruct-tuning (via SFT) on a tiny Llama-3.2-3B base model (with 30-50k unique iterations) were not very good. It was "moderately" able to follow instructions and sometimes didn't predict the EOT token and so kept going. It repeated itself very often. I'm hoping that's mostly just because of the model size.

@nicoboss many of your answers are so substantive that they really should go into some more permanent place, instead of being buried in random discussions.

@mradermacher I totally agree! And what kind of prompted this discussion was that I wish the README files had more information about the dataset source. Maybe putting imatrix information in a GitHub/HF readme file that people can contribute to, and then linking to it from the README files of the imatrix quant repositories, would be a great solution.

@treehugg3 well, you shouldn't really bring this up, because the very faq that explains the contents of the imatrix data set is already linked from every model card, including this one (and people can edit and send patches already). if you can't be bothered to read through it, no amount of additional faqs would have helped you :*)

You're right, I completely missed the FAQ section in the readme. But I promise, it's not from being lazy. The truth is I have downloaded many of your quants and have never noticed it until you brought it up now, which... is shockingly disappointing for me. If I'm not the only one who missed it several times, you could probably give it more visibility by pushing that line into the About section.

@mradermacher Here's a small contribution summarizing @nicoboss 's imatrix comments so you don't treat me as a freeloader quite as much: https://huggingface.co/mradermacher/model_requests/discussions/745 - I can't make a pull request, but if you want me to send a git diff I'd be happy to do that (screw that, I found it in the HF UI, so I'll be the first: https://huggingface.co/mradermacher/model_requests/discussions/746). Yes, I can be bothered :*) and I do appreciate that the FAQ exists.

It's totally fine to be a freeloader :)

mradermacher changed discussion status to closed
