Thanks, Romain!

#1
by Nexesenex - opened

I'm downloading and can't wait to test them, starting with their perplexity.

If you're in the mood for another 70b model to quantize to 2-bit GGUF anytime soon, I'd suggest LZLV ( https://huggingface.co/lizpreciatior/lzlv_70b_fp16_hf ).
It is a classic which performs very well in both logical and creative tasks!

Edit : here are the perplexities at 512ctx :
Aurora-Nights-70B-v1.0-IQ2_XXS-2.12bpw.gguf,-,wikitext,4.9372,
Aurora-Nights-70B-v1.0-IQ2_XS-2.36bpw.gguf,-,wikitext,4.5800 -> best for 24GB VRAM users for long context with rope, I guess.
Aurora-Nights-70B-v1.0-Q2_K_S-2.70bpw.gguf,-,wikitext,4.5313
Aurora-Nights-70B-v1.0-Q2_K-2.95bpw.gguf,-,wikitext,4.2042

Thanks for the feedback!

I'll have a look at making the new 2-bit quants of lzlv-70b. It should be fairly quick now that the longest step (calculating the imatrix) can be offloaded to the GPU.
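For reference, here's a minimal sketch of what that imatrix step looks like when driven from Python; the binary path, calibration file and -ngl value are placeholders, not the exact command used here:

```python
# Minimal sketch (assumed paths/file names): computing an importance matrix
# with llama.cpp's `imatrix` tool, offloading the layers to the GPU via -ngl.
import subprocess

subprocess.run([
    "./imatrix",
    "-m", "lzlv-70b-f16.gguf",    # assumed GGUF conversion of the source model
    "-f", "calibration.txt",      # assumed calibration text
    "-o", "lzlv-70b.imatrix",     # output importance matrix
    "-ngl", "99",                 # offload as many layers as possible to the GPU
    "-c", "512",                  # assumed chunk/context size for the passes
], check=True)
```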

You're welcome. The model is quite coherent in IQ2_XS, and that's much better than the experience with the exllama v2 quants, even post 0.0.11. A few regens are enough to steer the model back on track when it deviates from the context, even well beyond 1,000 or 2,000 tokens.

Also, great news about the imatrix on GPU; I'll be able to toy with it soon enough on smaller models, then. Thanks for LZLV, I'm a bit short on hardware to quantize 70b models efficiently!

Here comes a relatively extensive perplexity test of your 4 quants, because they are the first 70b quants I've tested beyond ikawrakow's, and I wanted to get a good look at the quality of the new quantization types.

[Screenshot: evaluations.csv in LibreOffice Calc, 2024-01-15]

Great, thanks a lot!

Once again, you nailed it with your iMatrix, because it's quite tricky to get right.
I played with the LZLV-70B-v1.0-IQ2_XS quant. It's honestly rich and coherent for single-GPU use; I could push a few stories to 7.4k tokens (with rope 1 22277) without problems at single-GPU speed.
If you're in the mood to make more quants like this, here's what I'd suggest :

Benchs :
LZLV-70B-v1.0-IQ2_XS-2.36bpw.gguf,-,hellaswag,81.75
LZLV-70B-v1.0-IQ2_XS-2.36bpw.gguf,-,wikitext,4.4105,512

LZLV-70B-v1.0-IQ2_XXS-2.12bpw.gguf,-,hellaswag,83.25
LZLV-70B-v1.0-IQ2_XXS-2.12bpw.gguf,-,wikitext,4.7768,512

LZLV-70B-v1.0-Q2_K-2.95bpw.gguf,-,hellaswag,82.75
LZLV-70B-v1.0-Q2_K-2.95bpw.gguf,-,wikitext,4.1369,512

LZLV-70B-v1.0-Q2_K_S-2.70bpw.gguf,-,hellaswag,81.5
LZLV-70B-v1.0-Q2_K_S-2.70bpw.gguf,-,wikitext,4.3750,512

LZLV-70B-v1.0-Q3_K_S-3.47bpw.gguf,-,hellaswag,82.75
LZLV-70B-v1.0-Q3_K_S-3.47bpw.gguf,-,wikitext,3.7827,512

Thanks, I'll get started on these two, should be up relatively soon!

You beat me to Midnight Rose, I started it an hour ago but I'm still at the Q8_0.
Thanks, man, I'll use yours instead lol!

Here are some numbers :

Midnight-Rose-70B-v1.0-IQ2_XS_Art_Wiki.gguf,-,wikitext,5.4897,512,512,2024-01-09 01:40:00,,70b,Llama_2,4096,15:35,1/5.18,GGUF,Sophosympatheia,Artefact2,
Midnight-Rose-70B-v1.0-IQ2_XS_Art_Wiki.gguf,-,wikitext,4.7105,2048,2048,2024-01-09 01:40:00,,70b,Llama_2,4096,15:35,1/5.18,GGUF,Sophosympatheia,Artefact2,
Midnight-Rose-70B-v1.0-IQ2_XS_Art_Wiki.gguf,-,wikitext,4.5765,4096,4096,2024-01-09 01:40:00,,70b,Llama_2,4096,15:35,1/5.18,GGUF,Sophosympatheia,Artefact2,
Midnight-Rose-70B-v1.0-IQ2_XS_Art_Wiki.gguf,-,hellaswag,85.75,400,2024-01-09 01:40:00,,70b,Llama_2,4096,15:35,1/5.18,GGUF,Sophosympatheia,Artefact2,
Midnight-Rose-70B-v1.0-IQ2_XS_Art_Wiki.gguf,-,Winogrande,74.0331,,1267,2024-01-19 05:40:00,,01.3b,Llama_2,4096,,,GGUF,Sophosympatheia,Artefact2,

Edit : I know I'm asking a lot, but here's something else that caught my interest :
mishima/WinterGoddess-1.4x-limarpv3-70B-L2-32k.GGUF

A guy left this model without the fp16 weights. I tested it, and it works, including at long context, because it has a rope of 8, which scales down nicely to 4 and even 2 for better perplexity and HellaSwag. I chatted a bit with it and it's alright.
Could you iMatrix it, and publish the 2-bit and Q3_K_S/Q3_K_M quants?
It'll cost about 0.05 of base perplexity and 1-2 points of HellaSwag compared to the fp16, but as Aurelian is still in its fine-tuning stage, it's for now the best 70b 32k (or even 16k or 8k) model that we have.

Here are my tests of this lost gem :
Rope 8 10000
WinterGoddess-1.4x-limarpv3-70B-L2-32k.Q4_K_S.gguf,-,wikitext,4.2177,4096
WinterGoddess-1.4x-limarpv3-70B-L2-32k.Q4_K_S.gguf,-,wikitext,4.1324,6144
WinterGoddess-1.4x-limarpv3-70B-L2-32k.Q4_K_S.gguf,-,wikitext,4.3923,2048
WinterGoddess-1.4x-limarpv3-70B-L2-32k.Q4_K_S.gguf,-,wikitext,4.4945,1536
WinterGoddess-1.4x-limarpv3-70B-L2-32k.Q4_K_S.gguf,-,wikitext,4.6700,1024
WinterGoddess-1.4x-limarpv3-70B-L2-32k.Q4_K_S.gguf,-,wikitext,5.2577,512
WinterGoddess-1.4x-limarpv3-70B-L2-32k.Q4_K_S.gguf,-,hellaswag,84.5,,400
Rope 4 10000
WinterGoddess-1.4x-limarpv3-70B-L2-32k.Q4_K_S.gguf,-,wikitext,3.5762,2048
WinterGoddess-1.4x-limarpv3-70B-L2-32k.Q4_K_S.gguf,-,wikitext,4.1235,512
WinterGoddess-1.4x-limarpv3-70B-L2-32k.Q4_K_S.gguf,-,hellaswag,87.25,,400
Rope 2 10000
WinterGoddess-1.4x-limarpv3-70B-L2-32k.Q4_K_S.gguf,-,wikitext,3.3394 *327,2048
WinterGoddess-1.4x-limarpv3-70B-L2-32k.Q4_K_S.gguf,-,wikitext,3.8254,512
WinterGoddess-1.4x-limarpv3-70B-L2-32k.Q4_K_S.gguf,-,hellaswag,88,,400
Rope 1 10000
WinterGoddess-1.4x-limarpv3-70B-L2-32k.Q4_K_S.gguf,-,hellaswag,85,,400
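For anyone wanting to reproduce runs like these, here's a rough sketch using llama.cpp's perplexity tool. I'm reading "rope 8 10000" as a linear scaling factor of 8 at base 10000 (which would be --rope-freq-scale 0.125 in llama.cpp terms), and the file names are placeholders :

```python
# Sketch (assumed file names): measuring wikitext perplexity at different
# rope scalings with llama.cpp's `perplexity` tool. "Rope 8 10000" is read
# here as a linear scaling factor of 8 at base 10000, i.e. freq scale 1/8.
import subprocess

MODEL = "WinterGoddess-1.4x-limarpv3-70B-L2-32k.Q4_K_S.gguf"

for factor, ctx in ((8, 4096), (4, 2048), (2, 2048), (1, 512)):
    subprocess.run([
        "./perplexity",
        "-m", MODEL,
        "-f", "wiki.test.raw",                   # assumed wikitext test file
        "-c", str(ctx),                          # evaluation context size
        "--rope-freq-base", "10000",
        "--rope-freq-scale", str(1.0 / factor),  # llama.cpp takes the inverse of the factor
        "-ngl", "99",                            # offload layers to GPU if available
    ], check=True)
```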

https://huggingface.co/Artefact2/WinterGoddess-1.4x-70B-L2-GGUF

As for WinterGoddess-1.4x-limarpv3-70B-L2-32k, there's not much that can be done without f16 weights available.

Actually, it can be done.

  1. Requantize the Q4_K_S to Q8_0, the best base for a requant even from a smaller quant source (I tested that a while ago).
  2. Make the iMatrix of the obtained Q8_0 (I'm not sure whether rope needs to be set or not; I'd say yes, and the iMatrix would logically be calibrated on the chosen rope).
  3. Make the quants from the Q8_0 with the Q8_0 iMatrix, as sketched after this list.
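Here's a rough sketch of that pipeline scripted around llama.cpp's quantize and imatrix tools; the file names are placeholders and the flags reflect my assumption of how one would drive it :

```python
# Rough sketch (assumed file names) of the requant pipeline described above,
# built around llama.cpp's `quantize` and `imatrix` command-line tools.
import subprocess

def run(*args):
    subprocess.run(list(args), check=True)

# 1. Requantize the available Q4_K_S to Q8_0 as the working base.
run("./quantize", "--allow-requantize",
    "WinterGoddess-32k.Q4_K_S.gguf", "WinterGoddess-32k.Q8_0.gguf", "Q8_0")

# 2. Compute the importance matrix on the obtained Q8_0 (calibration text assumed).
run("./imatrix", "-m", "WinterGoddess-32k.Q8_0.gguf",
    "-f", "calibration.txt", "-o", "WinterGoddess-32k.imatrix", "-ngl", "99")

# 3. Make the low-bit quants from the Q8_0, guided by its imatrix.
for qtype in ("IQ2_XXS", "IQ2_XS", "Q3_K_S", "Q3_K_M"):
    run("./quantize", "--allow-requantize",
        "--imatrix", "WinterGoddess-32k.imatrix",
        "WinterGoddess-32k.Q8_0.gguf",
        f"WinterGoddess-32k.{qtype}.gguf", qtype)
```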

Thanks for WinterGoddess!

I don't think it's worth requantizing from Q4KS, so I won't do it. But feel free to try it yourself!

Okay, that one is for me then, and I hope I'll get it right!
I'll still come begging for quants of more appetizing models, though!



So here's what I did :

https://huggingface.co/Nexesenex/WinterGoddess-1.4x-limarpv3-70B-L2-32k-Requant.GGUF

The IQ2 quants are very slow to crunch on my hardware, so I'll do them a bit later, but the results in Q2_K and Q3_K_S are extremely satisfactory!

Here's another one in fp16 which might be worth a 2/3-bit quantization, for the vanilla Llama 2 70b experience at 32k context :
https://huggingface.co/NousResearch/Yarn-Llama-2-70b-32k

I just don't have the CPU power to make the IQ2 quants in a reasonable time ! :)

Thanks man.
I tested it, and it works like a charm.
Also :
Rope 8 :
Yarn-Llama-2-70b-32k-Q3_K_S,-,wikitext,3.6948,512
Rope 2 :
Yarn-Llama-2-70b-32k-Q3_K_S,-,wikitext,3.6868,512
Your quants are really neat!
And basically, it seems useless to lower the rope with Yarn. I love it!

Thanks! Always open to more model suggestions (70B or under).
