2 abc or not 2 abc
@nicoboss now looking into the IQ4_XS.
We can always re-run everything if the rpc mode is improved. In fact, maybe the idea is so exciting that rgerganov would look into it if asked.
homing in on iq4_xs is going to be very tight, as just a few GB off is going to be a problem
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloaded 24/316 layers to GPU
llm_load_tensors: CPU buffer size = 523472.97 MiB
llm_load_tensors: CUDA0 buffer size = 19645.50 MiB
llm_load_tensors: CUDA1 buffer size = 19645.50 MiB
compute_imatrix: 130.55 seconds per pass - ETA 11 hours 23.22 minutes
| 0% 37C P0 125W / 450W | 22153MiB / 24564MiB | 70% Default |
| 0% 35C P0 69W / 450W | 20531MiB / 24564MiB | 0% Default |
Judging from actual memory usage, we might even get another 30GB or more in there.
And then it just took 5 hours. Ok, it's not done yet, but it will be done in less than 5h30m.
@nicoboss now looking into the IQ4_XS.
Awesome. Thanks a lot!
We can always re-run everything if the rpc mode is improved. In fact, maybe the idea is so exciting that rgerganov would look into it if asked.
I will experiment with RPC some more. Please keep BigLlama-3.1-1T-Instruct.Q6_K.gguf for a few days unless you need the storage for something more important.
Judging from actual memory usage, we might even get another 30GB or more in there.
You only used the two RTX 4090 GPUs, so technically you could get an additional 18 GB of GPU memory by also using the RTX 3080 + 2070s. But IQ4_XS will be good enough for now. It's better than what you used for your older large models, which you never ended up requantizing as far as I'm aware.
And then it just took 5 hours. Ok, it's not done yet, but it will be done in less than 5h30m.
Great. I see it completed successfully and is now working on the BigLlama 1T quant task. They will be great to stress test my new internet gateway, with which I have not experienced any internet issues so far.
unless you need the storage for something more important.
Well, in fact, once bigllama is quanted, I will empty out all /*pool's (it's only the source gguf).
Also, since the big models have really dried up at the moment,
you could get an additional 18 GB of GPU memory by also using the RTX 3080 + 2070s
No, because the kernel doesn't compile on the 3080, and probably also not on the 2070:
ggml_cuda_compute_forward: MUL failed
CUDA error: no kernel image is available for execution on the device
That is probably due to me forcing mmq for quality reasons (a lot of models overflow in f16 but work when mmq is forced), but I haven't verified that yet.
But IQ4_XS will be good enough for now.
Yeah, and eyeballing your graphs, IQ4_XS isn't as bad as we thought, and neither are Q3* (all non-imatrix).
They will be great to stress test my new internet gateway
I am really optimistic that it was the gateway, maybe an overheating problem. It has uploaded quite a bit so far without a hitch, more than with the old gateway at the end.
@nicoboss In other news, I improved the situation with rich1 by downloading the source repo from huggingface on nico1 and converting once more, before rsyncing from rich1. This means the rsync usually does no actual data transfer, which takes one full model upload out of rich1's upload bandwidth - a considerable improvement. And then I realised I can apply it to marco as well, which is also upload-handicapped.
Unfortunately, this greatly increases wear (and I/O usage in general) on nico1 - downloading and converting is two copies, and rsync makes another copy (--inplace doesn't work well with large files), to the point where this makes me a bit worried. It's probably not too much, considering the many other conversions/downloads nico1 already does, but it's unnecessary. Your thoughts on this are appreciated :)
@nicoboss Ah, and any idea why the IQ3 quants seem fine again in your newer measurements? With your new data, it looks like a mistake not to provide them. Or did I misinterpret?
Would have been nice if you told me earlier - I asked a few times for it, not knowing this is the issue :) But better late than never, I will add them tomorrow.
Sorry I had the impression you did all the quants you wanted me to do. Luckily all the quant download, quant quality measurement and performance measurement scripts are aware of already existing quants and so will just do the missing ones so doing some additional quants will be very little work from my side.
@nicoboss In other news, I improved the situation with rich1 by downloading the source repo from huggingface on nico1 and converting once more, before rsyncing from rich1. This means the rsync usually does no actual data transfer, which takes one full model upload out of rich1's upload bandwidth - a considerable improvement. And then I realised I can apply it to marco as well, which is also upload-handicapped.
Unfortunately, this greatly increases wear (and I/O usage in general) on nico1 - downloading and converting is two copies, and rsync makes another copy (--inplace doesn't work well with large files), to the point where this makes me a bit worried. It's probably not too much, considering the many other conversions/downloads nico1 already does, but it's unnecessary. Your thoughts on this are appreciated :)
I really couldn't care less about I/O usage and SSD wear. This is the resource I care about the least. Your SSDs are currently at 10% and 14% wearout. If we continue at the current rate, they will last for at least the next 5 years. Those SAMSUNG MZVL22T0HBLB-00B00 are the perfect SSDs for this job. They are one of the early PCIe 4.0 SSDs with 7 GB/s sequential read and 1 million IOPS, and we connected them together in RAID0 for a theoretical 14 GB/s and 2 million IOPS. Those SSDs cannot really be used for much else, as they lose data over time due to bit rot exceeding what error correction can correct if a file isn't read for a few months. During previous use of those SSDs this was especially annoying for backups, which stopped working due to those corrupted, uncorrectable sectors. To fix them, the corrupted sectors had to be overwritten/trimmed, which on an SSD is harder than you think (I had to use https://hdd.by/victoria/ on Windows XP). I'm more than happy to replace them with some decent SSDs should they ever break.
I'm mainly concerned about internet bandwidth usage and electricity. In the past 37 days I used 99.03 TB download and 221.02 TB upload traffic. Luckily my ISP has not yet complained, so I guess we should be fine. Technically they advertise unlimited internet and don't have any fair-use clause in their contract, but we'd better not push our luck. Your proposed solution should not result in any meaningful internet bandwidth increase, as we just download the model from HuggingFace instead of rich1/marco, and HF-to-GGUF conversion uses almost no electricity, making this a perfect solution. It is probably also faster to download from HuggingFace, as rich1 sometimes has slow connections.
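For a rough sense of scale, those 37-day totals work out to a fairly modest sustained rate. A back-of-the-envelope sketch (assuming the quoted figures are decimal terabytes, 1 TB = 1e12 bytes):

```python
# Average sustained throughput implied by the quoted 37-day traffic totals.
# Assumes decimal terabytes (1 TB = 1e12 bytes).

def avg_gbps(terabytes: float, days: float) -> float:
    bits = terabytes * 1e12 * 8
    seconds = days * 86400
    return bits / seconds / 1e9

download = avg_gbps(99.03, 37)   # roughly 0.25 Gbit/s sustained
upload = avg_gbps(221.02, 37)    # roughly 0.55 Gbit/s sustained
```

So the upload side averages around half a gigabit per second, around the clock, which explains why an ISP might notice even without a fair-use clause.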
Really nice to see marco doing some work again. I missed that node and already feared we might have lost it as I haven’t seen him for quite a while. There are exactly two weeks left for db1, db2, db3 and backup. They are doing quite good work now that they run two tasks in parallel, but rich1 is still faster than two of them together.
Any idea how long the queue will keep growing? We are now at around 3700 models and it just seems to keep growing despite having more workers than ever before. Have we now queued all the models in your backlog, or are there still more to be added?
I think the arm quants are identical to Q4_0 in quality - same bits, and this is reflected in e.g. KL-divergence.
I'm quite sure about that as well so I guess measuring them is not of any importance for the quality project and only matters for the performance measurements on ARM devices.
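Since KL-divergence is the check being invoked here, a minimal sketch of why "same bits" implies "same KL-divergence" (toy distributions, purely illustrative, not real quant outputs):

```python
import math

def kl_divergence(p, q):
    """KL(P||Q) between two discrete token probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two quants that store the exact same bits produce the exact same output
# distribution for every context, so each of them has the same KL-divergence
# against the reference model - and KL against itself is exactly zero.
reference = [0.7, 0.2, 0.1]      # toy reference-model token distribution
q4_0_like = [0.6, 0.25, 0.15]    # toy quantized distribution
```

In other words, measuring both Q4_0 and its ARM repack against the reference can only yield identical numbers, which is why it adds nothing to the quality project.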
I don't think I want a ranking scale alone, as I can always have another column be the ranking as well, but also express a difference in magnitude. And this can be arbitrary, e.g. (correct_token - 0.5) ** 10, or correct_token linearly scaled from 0 to 100 - an arbitrary scale where differences between values are somewhat meaningful, so you can have an idea of "how much" you lose.
I fully agree.
correct token seems to be pretty close overall.
correct_token is currently my personal favorite quality measurement. It seems to best translate into real world quality in my opinion.
@nicoboss Ah, and any idea why the IQ3 quants seem fine again in your newer measurements? With your new data, it looks like a mistake not to provide them. Or did I misinterpret?
the iq3 quants seem to have been vindicated overall, being much better than q2_k - what happened there? now looks like removing them was a mistake.
This likely is either due to the Qwen 2.5 architecture or something llama.cpp improved since we last measured. I don't really see any changes in llama.cpp that would improve low BPW static quants so I tend more towards the Qwen 2.5 architecture. When I have time, I will generate graphs for the Qwen 2.5 series measurements and redo one of the previously measured bad quants to know for sure so we can react accordingly.
so, here is what i will do: i will take something (probably correct_token, as suggested), and try to come up with an arbitrary-unit 0..100 scale or so, and use that as quality.
Sounds awesome. correct_token is already at a 0..100 scale (just cap it at 100).
then probably update the model cards, as we have the basic feature set ready now
Maybe update one to test first, so Richard and I can give feedback before you update all of them.
including a simple search (you wanted levenshtein, you got bitap).
Thanks a lot! You are so awesome.
here is what i plan: the quality column should have a selector on top which allows people to choose perplexity, kl, etc. but also other tests (e.g. winogrande) that we have. it would be ideal if this were somehow saved, but the js library i use (tabulator-tables) acts like shit and loses columns when i enable its persistence mode. and its column resizing fails most of the time, too. no clue why this is so recommended. should have gone with datatables again (which also acts up, but at least not as bad...)
i am pretty sure a lot of people would appreciate letting them choose their favourite scale for comparison.
I fully agree. Even though I like correct_token the most, I for sure want to sort by mean KLD or same token depending on my use case. All those measurements offer their own unique value, and giving the user the option to choose whichever one they like would be a perfect solution.
i will also seriously consider only having one table for all quants. pro: q8_0 should be available as part of the imatrix quants, when i use that table to select a matching quant, because it is still the highest quality "imatrix" quant, if i would generate it. and quality is directly comparable. con: static vs. imatrix is not just a one-dimensional quality question, as a predominantly english imatrix training set will have quite negative effects on other languages. or so i hear, and i have no reason to doubt it.
I recommend combining them but making them easily distinguishable, for example by keeping their different background colors and/or adding a filter. Sorting just makes much more sense when they are combined.
Also, a heads-up, this month will likely be one of the busier ones of my life, so don't worry if I seem to be quiet. If I am quiet.
I will be quite busy in the first half of December as well, but luckily I should have a lot of time in the second half, as the company I work for insisted that I take all my overtime hours before the end of the year.
only matters for the performance measurements on ARM devices.
Good point... are you actually planning for that? Wow.
Sorry I had the impression you did all the quants you wanted me to do.
Yeah, we talked past each other, I want all the quants, at least the ones I generate. Am a bit worried about the ternary quants, but I think these are lossless, so I can alias them to 100%. I do the same for f16/bf16/f32/SOURCE, i.e. just assign them the highest score for selection purposes.
I really couldn't care less about I/O usage and SSD wear.
Noted :)
Really nice to see marco doing some work again. I missed that node and already feared we might have lost it as I haven’t seen him for quite a while.
The problem is that marco has high electricity costs and is the desktop of my boss. He's very supportive, but using it AND letting him do useful work is a bit of a challenge. So I can't use it easily for automatic operation. Time will tell how it develops.
They are now doing quite good work now that they run two tasks in parallel but rich1
I don't think it gets us more than a few percent, though. Maybe it is even worse. Before, only a few quants did not result in 99% cpu usage. Well, it is what it is.
Any idea how long the queue will keep growing? We are now at around 3700 models and it just seems to keep growing despite having as many workers as never before. Have we now queued all models there are in your backlog or are there still more to be added?
I am at the end of april. So I have been through, strictly speaking, 30%. But the number of repos that are already done is steadily increasing. I also added a few other sources, which is unlikely to happen again.
The length of the queue is deceiving, however. While it grows, that is simply because I keep feeding it. Also the ordering is still crucial, with the biggest models done first. And we have crunched through a large number of them (thousands). The long tail is full of 7b models that are static. So a single 70B we do now is worth maybe 40 of these smaller ones at the end.
I would so love to reverse the queue for a while to see it shrink, but since we have limited nodes that can crunch through big models, this might lead to the small ones running out of work, wasting the big nodes on small models. When I had the nice-800 models, nico1 wasn't even finished with its queued models when the rest of the network had eaten up the whole tail of small ones.
It is more than I thought, which, I admit, is partially due to me being able to do it. But there is an enormous amount of models that still have thousands of downloads per month that have no quantisations. Not sure why that is. And I work on the theory that a static 7b quant costs nothing but space (which has recently become a more important resource, though).
Oh, and lots of these smaller models might simply fail quickly. I am already worried about the amount of manual cleanup that requires :)
So, I am not worried, still, but the queue length looks worrisome. But I think it is a combination of both looks and indeed a big task.
correct token seems to be pretty close overall. ... It seems to best translate into real world quality in my opinion.
The Q4_K_M+ quants should be quite close in real world quality, so that clearly supports your opinion :) Anyway, it's the one I have chosen by default, and plan to add the others in some way. Not sure if I have mentioned it somewhere, but currently I simply use int +(correct_token - 86.53) * 100 / (100 - 86.53), i.e. i linearly scale correct_token.
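A minimal sketch of that scaling (the 86.53 floor is the value quoted above; the function name is hypothetical):

```python
def quality_score(correct_token: float, floor: float = 86.53) -> int:
    """Linearly rescale correct_token (a percentage) so the floor maps to 0
    and 100 maps to 100, then truncate to int: the formula quoted above."""
    return int((correct_token - floor) * 100 / (100 - floor))

# The worst measured quant (at the floor) scores 0; anything closer to the
# source model's correct_token of 100 scores proportionally higher.
```

The floor is arbitrary, so the resulting unit is only meaningful for comparing quants against each other, not as an absolute percentage.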
I'll now look into adding hopefully all the remaining quants to all the qwen models.
Please do not use IQ1_S, IQ1_M, IQ2_S, IQ2_XXS, IQ2_XS or Q2_K_S quantization without an importance matrix
I remember now. We could do Q2_K_S, but we can't anymore.
About i1-Q2_K_S, from your tables it looks like a useful quant to add. I'll add it to the iquants.
Also, the fact that static IQ3 somehow works fine for qwen is worrisome for another reason: it might not be representative for most other models.
Finally, when I expand the messages, holy shit editing posts gets slow, I'll soon open another discussion :)
Other than Q2_K_S, were there others that are still missing that can be created (other than TQ*)? I looked at a few qwen2.5 repos, and they seem to be there, but I am getting easily confused...
Good point... are you actually planning for that? Wow.
I already started performance measurements of the Qwen 2.5 series of models on my Raspberry Pi 4 a few days ago. I will for sure also run it on my Nintendo Switch (ARM64 with Tegra X1 NVidia GPU) but am facing some outdated CUDA challenges there due to it running Ubuntu 18.04. I could likely just use Vulkan which is surprisingly good based on some first measurements I did on my Windows Laptop with integrated AMD APU. I might also try to run performance measurements on my phone. I already got llama.cpp compiled but NFS on Android is kind of a pain.
I even managed to get llama.cpp working on my LicheePi4A 4-core RISC-V SoC, but unfortunately it seems to not support the RISC-V vector extension, so measuring it is likely not of much use, but I might do so anyways for 0.5B and 1.5B.
Other than that, we have Threadripper and CastlePeak working on it 24/7 for the past few days collecting performance data. I also finally got Samba set up, so this evening I can start the performance measurement tasks on all my Windows laptops.
The problem is that marco has high electricity costs and is the desktop of my boss. He's very supportive, but using it AND letting him do useful work is a bit of a challenge. So I can't use it easily for automatic operation. Time will tell how it develops.
Really awesome that your boss lets you use his PC.
I have high electricity costs as well if the weather is poor, but after reducing the CPU frequency to 4 GHz things got much more efficient, aside from the massive amount of electricity required for the quality/performance measurement project, but that should be completed soon. I should start getting data from SolarEdge, but they put in a huge amount of work to make this as annoying as possible.
StormPeak is the machine I use for work as well. There is no way I would otherwise have spent so much money on a workstation. Luckily, having it do imatrix/quantization doesn't impact my ability to work on it, thanks to Linux being awesome at scheduling. This would not have been possible a few years ago, as back then maxing out the CPU caused interrupt latency to increase so much that audio started to stutter. I guess the most annoying part is not having any GPU since I started the quality/performance measurement project, but that is on me for being too lazy to pause it when I need my PC; instead I just use my company notebook to remote connect to it.
I don't think it gets us more than a few percent, though. Maybe it is even worse. Before, only a few quants did not result in 99% cpu usage. Well, it is what it is.
Even a few percent adds up quickly and is worth it over time.
I am at the end of april. So I have been through, strictly speaking, 30%. But the number of repos that are already done is steadily increasing. I also added a few other sources, which is unlikely to happen again.
So still quite a long way to go. We started with nico1 at the end of May, so the number of repos that are already done will hopefully increase even more.
The length of the queue is deceiving, however. While it grows, that is simply because I keep feeding it. Also the ordering is still crucial, with the biggest models done first. And we have crunched through a large number of them (thousands). The long tail is full of 7b models that are static. So a single 70B we do now is worth maybe 40 of these smaller ones at the end.
True, I noticed that all the massive models are getting done first.
I would so love to reverse the queue for a while to see it shrink, but since we have limited nodes that can crunch through big models, this might lead to the small ones running out of work, wasting the big nodes on small models. When I had the nice-800 models, nico1 wasn't even finished with its queued models when the rest of the network had eaten up the whole tail of small ones.
It's fine for me to do the huge ones first as they are the one I'm personally most interested in. Once we only have small models left, we should be able to get through the queue very quickly.
It is more than I thought, which, I admit, is partially due to me being able to do it. But there is an enormous amount of models that still have thousands of downloads per month that have no quantisations. Not sure why that is. And I work on the theory that a static 7b quant costs nothing but space (which has recently become a more important resource, though).
If there is demand, we for sure should offer a quant, especially if nobody else has so far. Let's just hope HuggingFace doesn't enforce any stupid storage limitations. It seems unlikely they would for us, as we hugely benefit their platform and they indicated that this limitation is mostly to prevent idiots abusing HuggingFace as their personal unlimited cloud storage. I also did some cost estimation based on the data they disclosed. They pay around $6 million/year in storage cost but $110 million per year in bandwidth cost, so storage cost is almost negligible compared to bandwidth cost.
Oh, and lots of these smaller models might simply fail quickly. I am already worried about the amount of manual cleanup that requires :)
You could just let them silently fail like Richard did, but I prefer your approach of manually looking into why each of them failed.
So, I am not worried, still, but the queue length looks worrisome. But I think it is a combination of both looks and indeed a big task.
I agree. Now that I see how many small models are at the end, the task of completing all of them seems way less overwhelming.
The Q4_K_M+ quants should be quite close in real world quality, so that clearly supports your opinion :)
Q4_K_M is what I would recommend to most casual users. It is really close to the unquantized model in terms of quality while being much smaller, so it has better performance and requires less GPU memory/RAM. I personally mainly use Q5_K_M, as I have the RAM for it and want to be sure there is near-zero quality loss, but that is more an obsession with uncompromised quality than a reasonable thing to use over Q4_K_M; I could just regenerate if unsatisfied with the answer.
Anyway, it's the one I have chosen by default, and plan to add the others in some way. Not sure if I have mentioned it somewhere, but currently I simply use int +(correct_token - 86.53) * 100 / (100 - 86.53), i.e. i linearly scale correct_token.
Sounds great!
I'll now look into adding hopefully all the remaining quants to all the qwen models.
That would be highly appreciated. But no hurry if you are too busy with other things.
End of May we started with nico1, so the number of repos that are already done will hopefully increase even more.
Wow, is it that long already :) What mostly increased then is the number of imatrix quants vs. static-only ones though, and it will ramp up slowly from there. But yeah, February was overwhelming.
After that, I want to go in the other direction (2023), with a different filtering mode (probably only rp finetunes or names i recognize and think deserve modern quants).
If ever.
True, I noticed that all the massive models are getting done first.
I would also like to point out that I have seen every single model page with my own eyes :) And this is what it looks like nowadays: http://data.plan9.de/hfus.jpg
they indicated that this limitation is mostly to prevent idiots abusing HuggingFace as their personal unlimited cloud storage.
And this is absolutely what is happening. There are lots of repositories that look like hf models, but aren't, with slightly off json files and so on.
However, I think they did this in a stupid way, which is mostly bad for them - people have started deleting stuff in panic, and introducing these "limits" without much explanation was not good. Worse, the explanation I saw was basically "the limit existed before, it just wasn't being displayed". But when I signed up, it said unlimited uploads.
I just hope they find a way of surviving, fighting abusers without losing real data and contributors.
On the other hand, enshittification is the only way forward nowadays.
They pay around $6 million/year in storage cost but $110 million per year in bandwidth cost, so storage cost is almost negligible compared to bandwidth cost.
That's good to keep in mind.