smaller quant

#1
by zappa2005 - opened

Would it be possible to do a smaller quant for 16GB 4080 owners? Maybe 2.4bpw? I'd like to test it, too.
Btw, did you base it on instruct-v0.1 or instruct-v0.2?

Thank you!

zappa2005 changed discussion title from Thanks! to smaller quant

Why not ... I'll do a 2.40 for ya. :-)

The original model is here, and I'm not sure about the specifics of the recipe. I just know it (1) isn't dumber than a bag of hair, and (2) when it works, it puts out some hot stuff. https://huggingface.co/NeverSleep/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss

On the downside, it can be prone to getting stuck in a loop once you're past the context length - even at 32k. I'm not sure how to avoid it, but it's very annoying, especially since one of my chats went over 800 messages and I was really loving it. Then it fell into a loop yesterday, and I couldn't get it unstuck. I'm sure it's my fault, but I don't know how to move it on other than starting a fresh chat. Even so, 800 messages feels like a huge win compared to most other models.

https://huggingface.co/zaq-hack/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss-bpw240-h6-exl2

I think it might still be a touch too large for a 16GB card while using longer context. The problem is, we are getting further down the accuracy curve, so I don't know how far you want to go. I'm not a big fan of 2.0 ... but I'll try a 2.25 to see if that might still have enough of the original flavor in it to get you by.
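For a rough sense of why 16GB is tight, here's my back-of-the-envelope math (estimates only, not measured numbers): assuming ~46.7B total parameters for Mixtral 8x7B, the quantized weights alone come out to roughly:

```python
# Rough VRAM estimate for the quantized weights alone (no KV cache,
# no activations, no CUDA/driver overhead).
# Assumes ~46.7B parameters for Mixtral 8x7B - an approximation.
params = 46.7e9

for bpw in (2.25, 2.40, 3.00, 3.50):
    weight_gib = params * bpw / 8 / 1024**3
    print(f"{bpw:.2f} bpw -> ~{weight_gib:.1f} GiB of weights")

# 2.25 bpw -> ~12.2 GiB
# 2.40 bpw -> ~13.0 GiB
# 3.00 bpw -> ~16.3 GiB
# 3.50 bpw -> ~19.0 GiB
```

The h6 head and the KV cache sit on top of that, which is why 2.40bpw gets uncomfortable on a 16GB card once you stretch the context.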

https://huggingface.co/zaq-hack/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss-bpw225-h6-exl2

Okay: I haven't tried this one, but I hope it works great for ya!

I will check and report back! Thanks for the quants and your time, appreciated.
Btw, did you base it on instruct-v0.1 or instruct-v0.2?

Again, I'm not sure what's under the hood. You'd have to ask Undi and Ikaridev.

The 2.25bpw loads with 16k context on a 4080, and it even still passes one of my standard reasoning tests (only a few models get that right).

You: If I have 7 apples today, and ate 3 last week, how many do I have now?
AI: You currently have 7 apples. Your consumption of 3 apples from last week does not affect your current apple count.

Looks promising, I'll check the RP stuff later. Thanks again for your time!
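If anyone wants to reproduce the 16k-context load outside of a UI, something like this should do it with exllamav2. The model path, sequence length, and sampler settings below are my assumptions, following the library's standard example pattern - nothing the poster confirmed:

```python
# Minimal exllamav2 loading sketch - paths and settings are assumptions,
# adjust to your own setup.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss-bpw225-h6-exl2"
config.prepare()
config.max_seq_len = 16384          # 16k context; lower this if you run out of VRAM

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)         # splits across available GPUs if more than one
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

prompt = "If I have 7 apples today, and ate 3 last week, how many do I have now?"
print(generator.generate_simple(prompt, settings, num_tokens=128))
```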

I had a bit of time on my hands and can say that the 2.25bpw works very well on my end, no obvious shortcomings, and it ranks pretty high on my personal favorite list!

On a side note, do you know if the exl2 quants already support the new 2-bit SOTA stuff that was recently merged into llama.cpp?

https://www.reddit.com/r/LocalLLaMA/comments/19anqbc/llamacpp_now_supports_quip_2bit_quant_mixtral_in/

I actually don't know. I started playing with Aphrodite-engine on Thursday, and it doesn't even support EXL2. I've had to use GPTQ, and this model doesn't work because I can't split it across cards. That said, the inference is INSANELY fast: an 8k context response in 4 seconds, 32k context in like 9-12 seconds. It definitely changes the experience, but I've had to drop the model size down to MistralTrix. https://huggingface.co/zaq-hack/MistralTrix-v1-GPTQ
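Aphrodite serves an OpenAI-compatible API, so if you want to sanity-check those response times yourself, a quick client-side timer is enough. The base URL, port, and model name below are assumptions and depend entirely on how you launched the server:

```python
# Rough latency check against an Aphrodite (or any OpenAI-compatible) server.
# base_url, port, and model name are assumptions - match them to your launch flags.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:2242/v1", api_key="not-needed-locally")

start = time.time()
resp = client.completions.create(
    model="zaq-hack/MistralTrix-v1-GPTQ",
    prompt="Summarize the last scene in two sentences.",
    max_tokens=200,
)
print(f"{time.time() - start:.1f}s")
print(resp.choices[0].text)
```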
