Hi, I made a GPTQ quant.

#3
by Kotokin

https://huggingface.co/Kotokin/sophosympatheia_Midnight-Miqu-70B-v1.0_GPTQ32G
The group size is 32. With a GPU split of 20.5 and 23 it fits a 32k context with an 8-bit cache.
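Roughly, a group-size-32 GPTQ quant can be made like this (a sketch using the transformers GPTQ integration; the calibration dataset and exact arguments here are placeholders, not necessarily the ones I used):

```python
# Sketch: 4-bit GPTQ quantization with group size 32 via transformers + auto-gptq.
# Assumes `optimum` and `auto-gptq` are installed; the calibration dataset is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "sophosympatheia/Midnight-Miqu-70B-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)

quant_config = GPTQConfig(
    bits=4,
    group_size=32,   # smaller groups track the weights more closely, at a small file-size cost
    dataset="c4",    # calibration data (assumption)
    tokenizer=tokenizer,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,  # quantizes during load
    device_map="auto",
)

model.save_pretrained("Midnight-Miqu-70B-v1.0_GPTQ32G")
tokenizer.save_pretrained("Midnight-Miqu-70B-v1.0_GPTQ32G")
```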

I will add mine here. EXL2 3.75 bpw: https://huggingface.co/altomek/Midnight-Miqu-70B-v1.0-3.75bpw-EXL2 (uploading now!)

Nice! Thanks, you two. I added links to the model card.
EDIT: By the way, I was able to hit 32K context with a 4.0 bpw EXL2 quant this morning without using 8-bit cache and I had VRAM to spare. 23.1/24 GB on the first card and 22.4/24 GB on the second card.

Wow, nice. Was it after a fresh system restart? I can load the 4 bpw EXL2 quant, but only after a reboot and with a much shorter context. I only have 40 GB of VRAM :(

`export PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync` helps you fit a tad more. I go for more bits for the buck, even at the expense of context. When you chat up to 32k it gets slow anyway.
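If you launch from a Python script instead of the shell, the same allocator setting can be applied there; a minimal sketch (the only requirement is that it happens before torch touches CUDA):

```python
# Sketch: select the async allocator backend before torch initializes CUDA.
# Setting it after the first CUDA allocation has no effect.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:cudaMallocAsync"

import torch  # imported after the env var on purpose

print(torch.cuda.is_available())  # subsequent CUDA allocations use cudaMallocAsync
```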

+1 to what @jackboot said. The backend:cudaMallocAsync setting does wonders, and I agree with the sentiment of balancing max context with bits per weight. Context is great and all, but if the model is too dumbed down to do anything useful with it, then I think you've overshot the mark.
That being said, Midnight Miqu at 4.0 bpw still performed well.

Here's what I know so far in terms of what I can fit in my 48 GB of VRAM (2 x NVIDIA 3090s, no SLI, always with a little room to spare):

  • exl2-5.0bpw -- 10K context, 18K with 8-bit cache
  • exl2-4.65bpw -- 20K context, 32K with 8-bit cache
  • exl2-4.0bpw -- 32K context easily, 64K with 8-bit cache with room to spare. Holy cow, it works at 64K with alpha_rope 2.5 if you don't mind 0.23 tokens/s and turning your rig into a heater for your house while you're running inference. 😂
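For reference, here's roughly how I load these quants (a sketch against exllamav2's Python API; exact class names and options may differ between versions, and the model path is a placeholder):

```python
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache_8bit,   # swap for ExLlamaV2Cache to use the regular FP16 cache
    ExLlamaV2Tokenizer,
)
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "models/Midnight-Miqu-70B-v1.0-4.0bpw-EXL2"  # placeholder path
config.prepare()
config.max_seq_len = 32768  # the 32K context discussed above

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_8bit(model, lazy=True)  # 8-bit KV cache to stretch context
model.load_autosplit(cache)                    # spreads layers across both 3090s

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

print(generator.generate_simple("Once upon a midnight dreary", settings, num_tokens=64))
```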

The GPTQ just finished... it's definitely free-er than the other Miqus. ChatML, Vicuna, and Mistral formats all work. Not sure which one is best.

Haha free-er is a good way to put it. Enjoy.

Haha, I also prefer to run larger quants rather than have longer context. You can see my attempts to fit Midnight-Rose in 40 GB of VRAM: I made 3.75, 3.80, 3.85, and 3.9 bpw quants of it just to pick the one that uses the maximum VRAM without causing a GPU memory overflow.

BTW, @sophosympatheia, big thanks for that model! It is really great for everyday tasks and some writing! In my usage, Midnight-Rose is exceptionally good at text summarization; it makes correct and detailed summaries of articles, and it's one of the best models for this task. Midnight-Miqu also looks promising, but I still need more testing to see how it works for me.

I haven't used the backend:cudaMallocAsync setting yet; we'll see what wonders it can do for me. Thank you!

Passes the watermelon test (it puts them down after 2 or 3) but fails the JavaScript test. It does try to incorporate the code into the story mid-roleplay, however. Maybe a larger quant will pass. On your 5-bit quant, try asking it to give you a "hello world" in JS from a character that doesn't understand coding.

@jackboot If I'm understanding your JavaScript test correctly, a character who doesn't understand coding shouldn't respond with the code, right? The test is whether the model is smart enough to attend to that detail or whether it breaks character to be helpful. That's a subtle and challenging test. I tried it at 5.0 bpw with a character who doesn't canonically know coding, and the model had her produce the code consistently despite several rerolls of the answer. As you noted, it was at least creative about weaving that into the story and staying in character with the delivery.
I tried my 3.3 bpw quant of the 103B version on the same test and it wasn't any better until I added an explicit comment that the character does not know coding.
What a devious test. I like it.

Yes, many, many models fail it. That "Mixtral" that was two 34Bs slammed together, and Aetheria, pass it, but they have other issues.

BTW, here's the difference between the 5 bpw EXL2 and the GPTQ:

  • Midnight-Miqu-70B-v1.0_exl2_5.0bpw: perplexity 23.123085021972656
  • Midnight-Miqu-70B-v1.0_GPTQ32G: perplexity 23.966127395629883

Measured on PTB_NEW.
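(For context on what those numbers mean: perplexity is just the exponential of the mean negative log-likelihood over the evaluation tokens. A minimal sketch of the metric itself, not the exact evaluation script used here:)

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, target_ids: torch.Tensor) -> float:
    """Perplexity = exp(mean negative log-likelihood of the target tokens).

    logits:     [num_tokens, vocab_size] model outputs, one row per position
    target_ids: [num_tokens]             the token that actually follows each position
    """
    nll = F.cross_entropy(logits, target_ids, reduction="mean")
    return torch.exp(nll).item()
```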
