Your benchs on 2 more models?

#1
by Nexesenex - opened

I saw a table of benchs recently.

[image: llm-results.png]

Is it yours? If yes, could you bench:

From my own LlamaCPP benches, they're good, and it shows in ST chat. I'm curious what they'd score on yours.

Nexesenex changed discussion title from Your bench to Your benchs on 2 more models?

Where did you see it? Yeah, I made it, you can find the latest version here. I'll bench those two models.

I saw it floating around on Reddit, I guess.
Thanks for the URL!
And for the benchmarking!

Reddit? Not the secret discord? Not the secret whatsapp group? Not that other, more based forum that discusses LLMs? I'm surprised.

I'm still a layman playing above my pay grade, Chuck! I don't know these exclusive places you mention! :D

Shady cryptobro vs fellow meme merger... Here are the results:
[image: benchmark results]
Neither of them significantly improves on this benchmark compared to the original Miqu if you look at the totals. Senku got a bit drier and more formal, which caused its S-score to go down significantly; Undi's Miqu somewhat improved on the P-test. Senku was less censored during the tests and needed less jailbreaking than Undi's model.

Thank you very much, man. I also noticed that Senku was a bit drier than both Miqu and Miqu DPO. I'll keep these 3 models around for sure.

If you're up for more testing, I'd suggest Kyllene 34B: https://huggingface.co/Nexesenex/TeeZee_Kyllene-Yi-34B-v1.1-iMat.GGUF

This, because the model was made with MergeMonster, which trims GPTisms and Llamaisms (or any unwanted sequence that usually sabotages a chat or a reasoning run) during the merge process itself. It benches higher than any 34B I've tested, without apparent overfit, and was surprisingly good for a 34B model in ST.

And also this one, as a second choice: https://huggingface.co/Nexesenex/cloudyu_Mixtral_34Bx2_MoE_60B-iMat.GGUF/blob/main/README.md

This, because it's highly praised on Reddit by Wolfram, who tests a lot of models extensively, and it benched higher than any other model I've ever benched, without the apparent overfit so visible at 4k context in Smaug & consorts. TheBloke's GGUF quants are good; I tried to make some with iMatrix but somehow failed to quantize properly.
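
For context, what I was attempting is roughly the usual two-step llama.cpp iMatrix workflow, sketched below (driven from Python here for convenience; the binary names, paths and calibration file are assumptions to adapt to your own build, not the exact commands I ran):

```python
# Rough sketch of the usual llama.cpp iMatrix quantization workflow.
# Paths, binary names and the calibration corpus are placeholders.
import subprocess

MODEL_F16 = "cloudyu_Mixtral_34Bx2_MoE_60B.f16.gguf"  # hypothetical full-precision GGUF
CALIB_TXT = "calibration.txt"                          # any representative text corpus
IMATRIX = "model.imatrix"                              # importance matrix output

# 1) Collect the importance matrix by running the model over the calibration text.
subprocess.run(
    ["./imatrix", "-m", MODEL_F16, "-f", CALIB_TXT, "-o", IMATRIX, "-c", "512"],
    check=True,
)

# 2) Quantize, letting the importance matrix guide which weights keep more precision.
subprocess.run(
    ["./quantize", "--imatrix", IMATRIX, MODEL_F16, "model.Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)
```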

Also, having benched a lot of these tweaked 2x7B MoE models, I found one which seems viable. It holds up to 8k context and benches very high for its size, if you have spare time for the curiosity:

https://huggingface.co/Nexesenex/TomGrc_FusionNet_7Bx2_MoE_v0.1-iMat.GGUF

[image: benchmark results]
Tested them and the results were pleasantly surprising. Except for the 2x34B, which was way too censored and complained way too much. What did they train it on? Llama-chat?

Thanks!

Kyllene is a 34B merge made with Gryphe's MergeMonster, which lets you decrease the prevalence of certain text sequences during the merging process. For me, this way of decensoring should become the standard, because it damn well works, way beyond any margin of error or lucky run. And I already had quite the baseline feel, after testing most of the other Yi-34B merges. This technique is a must imho.
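
To give an idea of the principle (this is just an illustrative sketch, not MergeMonster's actual code; the model names, phrase list and blend ratio are placeholders): a candidate blend of weights is only kept if it lowers the probability the merged model assigns to a list of unwanted phrases.

```python
# Illustrative sketch of MergeMonster-style "phrase-guided" merging, NOT the real
# MergeMonster implementation. Model names, phrases and the blend ratio are
# placeholders; both models are assumed to share the same architecture.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

UNWANTED_PHRASES = [            # sequences the merge should de-emphasize
    "As an AI language model",
    "I cannot and will not",
    "shivers down her spine",
]

def phrase_logprob(model, tokenizer, prompt, phrase):
    """Total log-probability the model assigns to `phrase` right after `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    phrase_ids = tokenizer(phrase, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, phrase_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)    # predictions for tokens 1..N
    targets = input_ids[:, 1:]
    token_lp = logprobs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[0, -phrase_ids.shape[1]:].sum().item()  # keep only the phrase tokens

def unwanted_score(model, tokenizer, prompt="Continue the story:"):
    return sum(phrase_logprob(model, tokenizer, prompt, p) for p in UNWANTED_PHRASES)

def greedy_phrase_guided_merge(base_name, donor_name, ratio=0.3):
    # Very slow done naively (one forward pass per tensor per phrase); real tools
    # work layer by layer and batch the evaluations.
    tokenizer = AutoTokenizer.from_pretrained(base_name)
    base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.float32)
    donor = AutoModelForCausalLM.from_pretrained(donor_name, torch_dtype=torch.float32)
    donor_params = dict(donor.named_parameters())

    best = unwanted_score(base, tokenizer)
    for name, param in base.named_parameters():
        original = param.data.clone()
        param.data = (1 - ratio) * original + ratio * donor_params[name].data  # trial blend
        trial = unwanted_score(base, tokenizer)
        if trial < best:             # keep only blends that lower unwanted-phrase probability
            best = trial
        else:
            param.data = original    # revert
    return base
```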

I don't know much about Mixtral 34Bx2. I don't like it much personally; it's too slow on my rig and too "Yi" after having tested Miqu (that model spoiled everything else for me). But it benched well on LlamaCPP, and Wolfram enjoyed it, so...

As for TomGrc's models, they are an enigma to me. They sit close to the world of overfit & contaminated models, but the experiments actually seem genuine enough not to give a model that goes dumb after a few thousand tokens. I gave you the best of his 2x7B MoEs, bench-wise: it benched very high for its size without a massive overfit, and it felt fine, like a good Mistral 7B, on a SillyTavern run, so I sent it your way.

As for the specifics of the models, I don't know more than their model cards!
