Smaller quant

#1 opened by zappa2005

Hi @brucethemoose , great stuff! Thanks for the models, I really like your Yi merges a lot. As I only own a 4080, do you think you could also do a smaller quant so that I can run it with a bigger context?

Yeah, what's a good size for 16GB? 2.63bpw, maybe?

I'm still trying to find the sweet spot between 2.4bpw and 2.63bpw. I am not sure. What do you think?

Depends on how much context you want. I can upload 2.6 and 2.4, and I will post the measurements file as well in case anyone wants to quant it themselves.
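For a rough sense of the trade-off, here's a back-of-the-envelope estimate of what fits in 16GB (the Yi-34B architecture numbers are assumptions pulled from its config, so treat this as a sketch rather than a guarantee):

```python
# Rough VRAM fit estimate for a 16 GB card. Architecture numbers are
# assumptions for Yi-34B: ~34.4B params, 60 layers, GQA with 8 KV heads
# of head dim 128.
params      = 34.4e9
bpw         = 2.6                 # quantized bits per weight
layers      = 60
kv_dim      = 8 * 128             # num_key_value_heads * head_dim
ctx         = 22_000              # target context length in tokens
cache_bytes = 1                   # 1 byte/element with the 8-bit cache, 2 for fp16

weights_gb = params * bpw / 8 / 1e9
kv_gb      = 2 * layers * kv_dim * cache_bytes * ctx / 1e9   # K and V per layer
print(f"weights ~{weights_gb:.1f} GB, KV cache ~{kv_gb:.1f} GB, "
      f"total ~{weights_gb + kv_gb:.1f} GB before overhead")
# -> weights ~11.2 GB, KV cache ~2.7 GB, total ~13.9 GB before overhead
```

So 2.6bpw with the 8-bit cache should land somewhere around 14GB at ~22k context, which leaves a little headroom for activations and the desktop.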

Thank you!

2.6 and 2.4 are linked here: https://huggingface.co/collections/brucethemoose/most-recent-merge-65742644ca03b6c514afa204

Different sizes are not hard to make, though you probably don't want to go smaller than 2.4bpw. See:

[Attached charts: exl2-1.png, exl2-2.png]
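If anyone wants to roll their own size from the measurement file, the conversion is roughly the following (a sketch based on exllamav2's convert.py; paths are placeholders and the flags may shift between versions):

```python
# Sketch of re-quantizing with the posted measurement file. Run from inside
# a checkout of the exllamav2 repo; all paths below are placeholders.
import subprocess

subprocess.run([
    "python", "convert.py",
    "-i", "/models/Yi-34B-200K-DARE-megamerge-v8",    # fp16 source model
    "-o", "/tmp/exl2-work",                           # scratch/working dir
    "-cf", "/models/megamerge-v8-2.4bpw-exl2",        # final quantized output
    "-m", "measurement.json",                         # reuse the uploaded measurements
    "-b", "2.4",                                      # target bits per weight
], check=True)
```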

Also, TheBloke has just posted experimental quantizations with a new GGUF format that uses profiling data like exllamav2:

https://huggingface.co/TheBloke/Yi-34B-200K-DARE-megamerge-v8-GGUF

This... might be very interesting, actually. llama.cpp supposedly has an 8-bit cache like exllama as well. I might post fiction oriented GGUFs depending on how well all that works.
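If those turn out to be usable, pulling a single file from that repo is straightforward (a minimal sketch; the exact .gguf filename is a guess at TheBloke's usual naming, so check the repo's file list first):

```python
# Download one GGUF file from TheBloke's repo with huggingface_hub.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/Yi-34B-200K-DARE-megamerge-v8-GGUF",
    filename="yi-34b-200k-dare-megamerge-v8.Q2_K.gguf",  # pick the quant you want
)
print(path)
```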

Thanks for your time, work and explanations. It really helps! I'll also check the GGUF and maybe run the same story to compare.

3.5 to 4.0 bpw seems to be a sweet spot for Mixtral 8x7B, but is it the same for Yi? I mean, can this be generalized?

I tried the 2.6bpw; it loads with 22k context (8-bit cache) on my 4080. However, I could not get it to work: it produced gibberish in every case, even directly in oobabooga chat. I'll try the 2.4bpw tomorrow.

The new 2-bit quants are really interesting, but as these require a llama.cpp build from Jan 4th or newer, I guess I have to wait until oobabooga updates its repo with the new requirement. I have no idea how to update it myself :-)

shrug

The impact of quantization also varies with the task, more than that graph would suggest. I've seen llama.cpp's Q6_K be unusable compared to Q8_0 for some business uses.

However, I wouldn't be surprised if 3.5bpw-4bpw is a sweet spot for Yi as well.

It should work if you install llama-cpp-python from GitHub (which is what text-generation-webui uses internally).
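Something like this should work as a quick check once it's installed (a rough sketch; the GGUF filename is a placeholder for whatever you downloaded):

```python
# Minimal smoke test with llama-cpp-python, assuming it was installed from
# GitHub (pip install git+https://github.com/abetlen/llama-cpp-python) so the
# newer quant types are supported. The model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="yi-34b-200k-dare-megamerge-v8.IQ2_XS.gguf",
    n_gpu_layers=-1,   # offload every layer to the GPU
    n_ctx=16384,       # shrink or grow to fit your VRAM
)
out = llm("The quick brown fox", max_tokens=32)
print(out["choices"][0]["text"])
```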

Alternatively, I would just recommend kobold.cpp. It's faster than ooba, with better prompt caching and support for things like dynatemp as well. I'd be using it if I weren't stuck on exllama (for the moment).

IDK why it's busted, lemme check if it loads.
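For reference, "check if it loads" just means a bare-bones script like this (a sketch following exllamav2's example scripts; API names may differ slightly between versions, and the model path is a placeholder):

```python
# Bare-bones exllamav2 load-and-generate check.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/megamerge-v8-2.6bpw-exl2"   # placeholder path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_8bit(model, lazy=True)   # 8-bit cache, allocated lazily
model.load_autosplit(cache)                     # split weights across available VRAM

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()

print(generator.generate_simple("The capital of France is", settings, 32))
```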

So the 2.6bpw works for me. I dunno how "smart" it is, but it referenced some details from a pretty huge context with coherent English, so I think it's working.

I used the absolute newest version of exllamav2 to quantize, so maybe ooba is not up to date enough?

So on second thought, you were right: something is busted with these quantizations, see:

https://huggingface.co/brucethemoose/Yi-34B-200K-DARE-megamerge-v8-31bpw-exl2-fiction/discussions/4#65a5ff9f0797583a68522df3

I am taking them down or adding a warning for now.

Thanks for your insights! I'll compare TheBloke's quants (probably the IQ2_XS, to check ooba with the new method, and the Q2_K_S).

I also checked the requirements.txt of ooba to see if I could easily use a newer llama-cpp-python myself, but it is quite detailed, with precompiled wheels for all kinds of hardware acceleration.

I'll check kobold.cpp, thanks for the hint.
