2.40bpw and 2.55bpw h6 EXL2 request.

#2
by Undi95 - opened

Hi, I've been trying for 3 days to get a usable EXL2 quant of my 70B model, but I haven't succeeded. I've tried the following:

  • 2048 length with the wikitext dataset
  • 4096 length with the wikitext dataset
  • 2048 length with the WizardLM dataset
  • 4096 length with the WizardLM dataset

And these bpw (a rough sketch of the command I'm running follows the list below):

  • 2.40bpw / h6
  • 2.55bpw / h6
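A minimal sketch of the conversion command, for reference. The paths are placeholders, and the flag names follow exllamav2's convert.py as I recall them, so check them against the version you have installed:

```python
# Minimal sketch of the EXL2 conversion (paths are hypothetical; flag names
# should be verified against the installed exllamav2 convert.py).
import subprocess

subprocess.run([
    "python", "convert.py",
    "-i", "/content/my-70b-model",        # source HF model directory (placeholder)
    "-o", "/content/exl2-work",           # working directory for temporary files
    "-cf", "/content/my-70b-2.4bpw-h6",   # output directory for the finished quant
    "-c", "/content/calibration.parquet", # calibration data (wikitext or WizardLM rows)
    "-l", "2048",                         # calibration row length (2048 or 4096)
    "-b", "2.40",                         # target bits per weight (2.40 or 2.55)
    "-hb", "6",                           # head bits (the "h6" part)
], check=True)
```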

I'm doing this on an A100 on Colab, and this specific post tells me the issue could be the A100: https://huggingface.co/AzureBlack/UtopiaXL-13B-exl2/discussions/1#6547a6f86c818bb7b5ca30f7
I'm clueless; a helping hand to guide me down the right path, or someone who can do it for me (4096 length if possible, for better output), would really be helpful.
Pinning this as I'm going crazy and can't use another GPU (shitty internet, only subscribed to Colab). I'll link you on the model page and credit you. Thank you all!

Undi95 pinned discussion

Update: got another machine to try this on.

From the tests that folks have done on TheBloke's discord, the length of the measurement dataset doesn't make much of a difference. What issues are you running into with the exl2 quants? The low bpw quants are more likely to go off the rails and generate gibberish. They're also more sensitive to the prompt format, so it helps if you can follow the prompt format more closely.
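For reference, the Alpaca format mentioned further down in this thread is the commonly used template; here is a minimal sketch of building a prompt with it (the preamble wording is the standard Alpaca one and may differ slightly from the model card):

```python
# Standard Alpaca-style prompt template (the format referenced later in this
# thread); low-bpw quants tend to be more sensitive to deviations from it.
def build_alpaca_prompt(instruction: str) -> str:
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )

print(build_alpaca_prompt("Summarize the plot of Hamlet in two sentences."))
```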

I get some letters missing, bad punctuation, and bad replies in general...
I tested Q2_K and Q3_K_S quantization on my 3090 and they seem much better; it's night and day, I don't recognize my model at 2.4/2.55 bpw.
I'm trying one last time on an L40 to see whether the A100 is at fault, because I've never seen so much damage done to a 70B by an EXL2 quant before, and yes, I use the correct prompting (Alpaca, which works on the Q2_K and Q3_K_S too).
It will finish soon, but it will take a little time to download.
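For comparison, the Q2_K / Q3_K_S quants above come out of llama.cpp's quantize tool; a minimal sketch, assuming a build whose binary is named `quantize` (newer builds call it `llama-quantize`) and placeholder file names:

```python
# Sketch of producing the Q2_K / Q3_K_S GGUF quants with llama.cpp
# (binary name depends on the build; file names are placeholders).
import subprocess

for qtype in ("Q2_K", "Q3_K_S"):
    subprocess.run([
        "./quantize",            # "llama-quantize" in newer llama.cpp builds
        "my-70b-f16.gguf",       # full-precision GGUF export of the model
        f"my-70b-{qtype}.gguf",  # quantized output file
        qtype,                   # target quantization type
    ], check=True)
```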

Don't we expect those to do better though, given that they weigh roughly an extra 30% per weight? The Q3_K_S is 3.47 bpw, and the Q2_K is barely any smaller, at least in terms of file size.
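A back-of-the-envelope check of that gap, using the bpw figures from this thread and assuming roughly 70e9 quantized weights (real files differ somewhat because of embeddings and metadata):

```python
# Rough size estimate: bits per weight x parameter count / 8 bits per byte.
params = 70e9  # assumed parameter count; the real quantized tensor count differs a bit
for name, bpw in [("EXL2 2.40bpw", 2.40), ("EXL2 2.55bpw", 2.55), ("Q3_K_S ~3.47bpw", 3.47)]:
    print(f"{name}: ~{params * bpw / 8 / 1e9:.1f} GB")
# -> roughly 21.0, 22.3 and 30.4 GB: the K-quants simply keep more bits per weight.
```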

It strikes me that there's probably a reason there was never a Q3_0, never mind a Q2_0/Q2_1.

I did it again on the L40 but there's no change; either the settings need to be changed for better results, or a higher bpw has to be used, it's the only way.
Or use GGUF...
I'm closing the thread. Thanks for all your help, even if you didn't post here but were helping on Discord or another platform!

Undi95 changed discussion status to closed
Undi95 unpinned discussion

I also find that this setting fucks things up; it's somewhat usable when set to 0:

image.png

You can also try disabling this ooba option. I think it's helped me make some 2.4bpw models stop spitting gibberish:
image.png
