Smaller quant

#1 opened by zappa2005

Hi @brucethemoose , great stuff! Thanks for the models, I really like your Yi merges a lot. As I only own a 4080, do you think you could also do a smaller quant so that I can run it with a bigger context?

Yeah, what's a good size for 16GB? 2.63bpw, maybe?

I'm still trying to find the sweet spot between 2.4bpw and 2.63bpw. I am not sure. What do you think?

Depends on how much context you want. I can upload 2.6 and 2.4, and I will post the measurements file as well in case anyone wants to quant it themselves.
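For a rough sense of the trade-off, here's a back-of-the-envelope estimate of what fits in 16GB (the Yi-34B architecture numbers are assumptions pulled from its config, so treat this as a sketch rather than a guarantee):

```python
# Rough VRAM fit estimate for a 16 GB card. Architecture numbers are
# assumptions for Yi-34B: ~34.4B params, 60 layers, GQA with 8 KV heads
# of head dim 128.
params      = 34.4e9
bpw         = 2.6                 # quantized bits per weight
layers      = 60
kv_dim      = 8 * 128             # num_key_value_heads * head_dim
ctx         = 22_000              # target context length in tokens
cache_bytes = 1                   # 1 byte/element with the 8-bit cache, 2 for fp16

weights_gb = params * bpw / 8 / 1e9
kv_gb      = 2 * layers * kv_dim * cache_bytes * ctx / 1e9   # K and V per layer
print(f"weights ~{weights_gb:.1f} GB, KV cache ~{kv_gb:.1f} GB, "
      f"total ~{weights_gb + kv_gb:.1f} GB before overhead")
# -> weights ~11.2 GB, KV cache ~2.7 GB, total ~13.9 GB before overhead
```

So 2.6bpw with the 8-bit cache should land somewhere around 14GB at ~22k context, which leaves a little headroom for activations and the desktop.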

Thank you!

2.6 and 2.4 are linked here: https://huggingface.co/collections/brucethemoose/most-recent-merge-65742644ca03b6c514afa204

Different sizes are not hard to make, though you probably don't want to go smaller than 2.4bpw. See:

[Attached charts: exl2-1.png, exl2-2.png]
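If anyone wants to roll their own size from the measurement file, the conversion is roughly the following (a sketch based on exllamav2's convert.py; paths are placeholders and the flags may shift between versions):

```python
# Sketch of re-quantizing with the posted measurement file. Run from inside
# a checkout of the exllamav2 repo; all paths below are placeholders.
import subprocess

subprocess.run([
    "python", "convert.py",
    "-i", "/models/Yi-34B-200K-DARE-megamerge-v8",    # fp16 source model
    "-o", "/tmp/exl2-work",                           # scratch/working dir
    "-cf", "/models/megamerge-v8-2.4bpw-exl2",        # final quantized output
    "-m", "measurement.json",                         # reuse the uploaded measurements
    "-b", "2.4",                                      # target bits per weight
], check=True)
```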

Also, TheBloke has just posted experimental quantizations with a new GGUF format that uses profiling data like exllamav2:

https://huggingface.co/TheBloke/Yi-34B-200K-DARE-megamerge-v8-GGUF

This... might be very interesting, actually. llama.cpp supposedly has an 8-bit cache like exllama as well. I might post fiction oriented GGUFs depending on how well all that works.
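If those turn out to be usable, pulling a single file from that repo is straightforward (a minimal sketch; the exact .gguf filename is a guess at TheBloke's usual naming, so check the repo's file list first):

```python
# Download one GGUF file from TheBloke's repo with huggingface_hub.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/Yi-34B-200K-DARE-megamerge-v8-GGUF",
    filename="yi-34b-200k-dare-megamerge-v8.Q2_K.gguf",  # pick the quant you want
)
print(path)
```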

Thanks for your time, work and explanations. It really helps! I'll also check the GGUF and maybe run the same story to compare.

3.5 to 4.0 bpw seems to be a sweet spot for Mixtral 8x7B, but is it the same for Yi? I mean, can this be generalized?

I tried the 2.6bpw; it loads with 22k context (8-bit cache) on my 4080. However, I could not get it to work: it produced gibberish in every case, even directly in oobabooga chat. I'll try the 2.4bpw tomorrow.

The new 2-bit quants are really interesting, but as these require a llama.cpp build from Jan 4th or newer, I guess I have to wait until oobabooga updates its repo with the new requirement. I have no idea how to update it myself :-)

shrug

The impact of quantization also varies with the task, more than that graph would suggest. I've seen llama.cpp's Q6_K be unusable compared to Q8_0 for some business uses.

However, I wouldn't be surprised if 3.5bpw-4bpw is a sweet spot for Yi as well.

It should work if you install llama-cpp-python from GitHub (which is what text-generation-webui uses internally).
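Something like this should work as a quick check once it's installed (a rough sketch; the GGUF filename is a placeholder for whatever you downloaded):

```python
# Minimal smoke test with llama-cpp-python, assuming it was installed from
# GitHub (pip install git+https://github.com/abetlen/llama-cpp-python) so the
# newer quant types are supported. The model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="yi-34b-200k-dare-megamerge-v8.IQ2_XS.gguf",
    n_gpu_layers=-1,   # offload every layer to the GPU
    n_ctx=16384,       # shrink or grow to fit your VRAM
)
out = llm("The quick brown fox", max_tokens=32)
print(out["choices"][0]["text"])
```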

Alternatively, I would just recommend kobold.cpp. It's faster than ooba, with better prompt caching and support for things like dynatemp as well. I'd be using it if I weren't stuck on exllama (for the moment).

IDK why it's busted, lemme check if it loads.
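For reference, "check if it loads" just means a bare-bones script like this (a sketch following exllamav2's example scripts; API names may differ slightly between versions, and the model path is a placeholder):

```python
# Bare-bones exllamav2 load-and-generate check.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/megamerge-v8-2.6bpw-exl2"   # placeholder path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_8bit(model, lazy=True)   # 8-bit cache, allocated lazily
model.load_autosplit(cache)                     # split weights across available VRAM

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()

print(generator.generate_simple("The capital of France is", settings, 32))
```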

So the 2.6bpw works for me. I dunno how "smart" it is, but it referenced some details from a pretty huge context with coherent English, so I think it's working.

I used the absolute newest version of exllamav2 to quantize, so maybe ooba is not up to date enough?

So on second thought, you were right: something is busted with these quantizations, see:

https://huggingface.co/brucethemoose/Yi-34B-200K-DARE-megamerge-v8-31bpw-exl2-fiction/discussions/4#65a5ff9f0797583a68522df3

I am taking them down or adding a warning for now.

Thanks for your insights! I'll compare TheBloke's quants (probably the IQ2_XS, to check ooba with the new method, and the Q2_K_S).

I also checked the requirements.txt of ooba to see if I could easily use a newer llama-cpp-python myself, but it is quite detailed, with precompiled wheels for all kinds of hardware acceleration.

I'll check kobold.cpp, thanks for the hint.
