Some benchmarks

by Nexesenex

These benchmarks were made with llama.cpp:

abacusai_Smaug-Yi-34B-v0.1-b1989-iMat-c32_ch3250-Q4_K_M.gguf,-,Hellaswag,86,,400,2024-01-28 01:40:00,,34b,Yi,200000,,,GGUF,AbacusAI,Nexesenex,
abacusai_Smaug-Yi-34B-v0.1-b1989-iMat-c32_ch3250-Q4_K_M.gguf,-,Hellaswag_Bin,82,,400,2024-01-28 01:40:00,,34b,Yi,200000,,,GGUF,AbacusAI,Nexesenex,
abacusai_Smaug-Yi-34B-v0.1-b1989-iMat-c32_ch3250-Q4_K_M.gguf,-,Arc-Challenge,61.53846154,,299,2024-01-28 05:40:00,,34b,Yi,200000,,,GGUF,AbacusAI,Nexesenex,
abacusai_Smaug-Yi-34B-v0.1-b1989-iMat-c32_ch3250-Q4_K_M.gguf,-,Arc-Easy,82.98245614,,570,2024-01-28 05:40:00,,34b,Yi,200000,,,GGUF,AbacusAI,Nexesenex,
abacusai_Smaug-Yi-34B-v0.1-b1989-iMat-c32_ch3250-Q4_K_M.gguf,-,MMLU,42.49201278,,313,2024-01-28 05:40:00,,34b,Yi,200000,,,GGUF,AbacusAI,Nexesenex,
abacusai_Smaug-Yi-34B-v0.1-b1989-iMat-c32_ch3250-Q4_K_M.gguf,-,Thruthful-QA,44.06364749,,817,2024-01-28 05:40:00,,34b,Yi,200000,,,GGUF,AbacusAI,Nexesenex,
abacusai_Smaug-Yi-34B-v0.1-b1989-iMat-c32_ch3250-Q4_K_M.gguf,-,Winogrande,79.9526,,1267,2024-01-28 05:40:00,,34b,Yi,200000,,,GGUF,AbacusAI,Nexesenex,
abacusai_Smaug-Yi-34B-v0.1-b1989-iMat-c32_ch3250-Q4_K_M.gguf,-,wikitext,5.2535,512,512,2024-01-28 01:40:00,,34b,Yi,200000,,,GGUF,AbacusAI,Nexesenex,
abacusai_Smaug-Yi-34B-v0.1-b1989-iMat-c32_ch3250-Q4_K_M.gguf,-,wikitext,6.6019,4096,4096,2024-01-28 01:40:00,,34b,Yi,200000,,,GGUF,AbacusAI,Nexesenex,
abacusai_Smaug-Yi-34B-v0.1-b1989-iMat-c32_ch3250-Q4_K_M.gguf,-,wikitext,7.7781,8192,8192,2024-01-28 01:40:00,,34b,Yi,200000,,,GGUF,AbacusAI,Nexesenex,
abacusai_Smaug-Yi-34B-v0.1-b1989-iMat-c32_ch3250-Q4_K_M.gguf,-,wikitext,6.8504,12288,12288,2024-01-28 01:40:00,,34b,Yi,200000,,,GGUF,AbacusAI,Nexesenex,
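For anyone who wants to work with these rows programmatically, here is a minimal parsing sketch. The column semantics are inferred from the rows above (model file, test name, score/perplexity, chunk size, context/sample count, timestamp, then metadata), so treat the field positions as an assumption:

```python
# Parse the wikitext rows posted above. Column layout is inferred, not documented:
# 0=model file, 2=test name, 3=score or perplexity, 4=chunk size, 5=context, 6=timestamp.
rows = """\
abacusai_Smaug-Yi-34B-v0.1-b1989-iMat-c32_ch3250-Q4_K_M.gguf,-,wikitext,5.2535,512,512,2024-01-28 01:40:00
abacusai_Smaug-Yi-34B-v0.1-b1989-iMat-c32_ch3250-Q4_K_M.gguf,-,wikitext,6.6019,4096,4096,2024-01-28 01:40:00
abacusai_Smaug-Yi-34B-v0.1-b1989-iMat-c32_ch3250-Q4_K_M.gguf,-,wikitext,7.7781,8192,8192,2024-01-28 01:40:00
abacusai_Smaug-Yi-34B-v0.1-b1989-iMat-c32_ch3250-Q4_K_M.gguf,-,wikitext,6.8504,12288,12288,2024-01-28 01:40:00"""

ppl_by_ctx = {}
for line in rows.splitlines():
    fields = line.split(",")
    if fields[2] == "wikitext":
        # Map context length -> perplexity
        ppl_by_ctx[int(fields[5])] = float(fields[3])

print(ppl_by_ctx)
# {512: 5.2535, 4096: 6.6019, 8192: 7.7781, 12288: 6.8504}
```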

Smaug's benchmarks are very solid, but I am on the fence about its perplexity (due to Bagel's own troubles, I presume?), which rises quite high (instead of settling around 4.5 as Yi-34b models usually do at 4k and beyond), even if it stabilizes and even decreases between 8k and 12k. It's worth more tests for sure.
In any case, thanks for sharing this!

> due to Bagel's own troubles, I presume?

For me, Bagel struggles at high context, which might be showing up in the perplexity evals? And not just the DPO either, even the SFT version seems to drop off.

Perplexity aside, the evals aren't going to show this.

I also noticed the trouble with Bagel (both the DPO 34b and non-DPO 34b; I didn't try the nontoxic one) in use, and it was totally confirmed by the perplexity rocketing to 10+ at 4096 ctx, or even before, actually.

Bagel-34b-v0.2-4.65bpw-h6-exl2,-,wikitext,6.386557579040527,512,512,2024-01-02 20:27:51,,34b,Yi,200000,07:55,1.40,Exl2-2,JonDurbin,LoneStriker,
Bagel-34b-v0.2-4.65bpw-h6-exl2,-,wikitext,5.4337968826293945,512,1024,2024-01-02 20:43:57,,34b,Yi,200000,13:19,0.83,Exl2-2,JonDurbin,LoneStriker,
Bagel-34b-v0.2-4.65bpw-h6-exl2,-,wikitext,9.19597339630127,512,2048,2024-01-02 21:09:28,,34b,Yi,200000,24:36,0.45,Exl2-2,JonDurbin,LoneStriker,

bagel-dpo-34b-v0.2-Q3_K_M.gguf,-,Arc-Challenge,57.19063545,,299,2024-01-26 05:40:00,,34b,Yi,200000,,,GGUF,JonDurbin,TheBloke,
bagel-dpo-34b-v0.2-Q3_K_M.gguf,-,Arc-Easy,71.05263158,,570,2024-01-26 05:40:00,,34b,Yi,200000,,,GGUF,JonDurbin,TheBloke,
bagel-dpo-34b-v0.2-Q3_K_M.gguf,-,Hellaswag,76.25,,400,2024-01-26 01:40:00,,34b,Yi,200000,,,GGUF,JonDurbin,TheBloke,
bagel-dpo-34b-v0.2-Q3_K_M.gguf,-,Hellaswag_Bin,66.5,,400,2024-01-26 01:40:00,,34b,Yi,200000,,,GGUF,JonDurbin,TheBloke,
bagel-dpo-34b-v0.2-Q3_K_M.gguf,-,MMLU,38.97763578,,313,2024-01-26 05:40:00,,34b,Yi,200000,,,GGUF,JonDurbin,TheBloke,
bagel-dpo-34b-v0.2-Q3_K_M.gguf,-,Thruthful-QA,43.94124847,,817,2024-01-26 05:40:00,,34b,Yi,200000,,,GGUF,JonDurbin,TheBloke,
bagel-dpo-34b-v0.2-Q3_K_M.gguf,-,Winogrande,77.3481,,1267,2024-01-26 05:40:00,,34b,Yi,200000,,,GGUF,JonDurbin,TheBloke,
bagel-dpo-34b-v0.2-Q3_K_M.gguf,-,wikitext,9.8356,512,512,2024-01-26 01:40:00,,34b,Yi,200000,,,GGUF,JonDurbin,TheBloke,
bagel-dpo-34b-v0.2-Q3_K_M.gguf,-,wikitext,16.3509,4096,4096,2024-01-26 01:40:00,,34b,Yi,200000,,,GGUF,JonDurbin,TheBloke,
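One quick way to quantify the difference being discussed is the ratio of perplexity at 4096 ctx to perplexity at 512 ctx, using the figures posted in this thread. A larger ratio suggests worse long-context behavior (this "growth ratio" metric is my own framing, not something from llama.cpp):

```python
# Wikitext perplexities copied from the rows posted above.
smaug = {512: 5.2535, 4096: 6.6019}    # Smaug-Yi-34B-v0.1 Q4_K_M
bagel = {512: 9.8356, 4096: 16.3509}   # bagel-dpo-34b-v0.2 Q3_K_M

# Growth ratio from 512 -> 4096 ctx: closer to 1.0 means the model
# holds up better as context grows.
smaug_ratio = smaug[4096] / smaug[512]
bagel_ratio = bagel[4096] / bagel[512]
print(f"Smaug: x{smaug_ratio:.2f}, Bagel: x{bagel_ratio:.2f}")
# Smaug: x1.26, Bagel: x1.66
```

So even though both models' perplexity grows with context, Bagel's grows substantially faster, which matches the in-use impression above.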

That problem didn't show on Bagel 7b (even if Jon warned about context beyond 7168, and beyond 8192 we enter the sliding-window context).

Smaug tamed Bagel's behavior a bit (and for a finetune made on top of it, even on the non-DPO version, that's great!), as have many merges using Bagel, due to its partial "drowning" into the resulting merge, I guess.

Also, just wanna say I love all these perplexity/eval tests you are posting. I need to do more high-context perplexity testing myself, to figure out which 200K 34Bs "lose" their context and which don't, but it's tricky on a 24GB GPU.
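Part of why it's tricky on 24GB is the fp16 KV cache on top of the quantized weights. Here is a rough back-of-the-envelope sketch; the architecture numbers (60 layers, 8 KV heads via GQA, head dim 128) are my assumption for a Yi-34B-class model, so adjust them for the actual config:

```python
# Rough fp16 KV-cache size estimate for a Yi-34B-class model.
# Assumed architecture: 60 layers, 8 KV heads (GQA), head dim 128, fp16 cache.
def kv_cache_bytes(ctx, n_layers=60, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # K and V each store n_kv_heads * head_dim values per layer per token,
    # hence the leading factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx

for ctx in (4096, 12288, 32768):
    print(f"{ctx:>6} ctx: {kv_cache_bytes(ctx) / 2**30:.2f} GiB")
#   4096 ctx: 0.94 GiB
#  12288 ctx: 2.81 GiB
#  32768 ctx: 7.50 GiB
```

Under these assumptions, a 32k-ctx run adds about 7.5 GiB of cache on top of the model weights, which is why long-context perplexity runs on a 34B quickly exhaust a 24GB card.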

Thanks!

I'll leave you some interesting models (bench-wise) that I tested on your latest megamerge repo.
