"Only 8bit for now"

#1
by CulturedMan - opened

The 'for now' part of your statement compels me to ask: are you planning to release other variants of this in the future? I've been wanting to try this one out for a while, but I can't find 3bpw variants of it.

I did wind up doing some benchmarks and superficially it seems pretty mid; more or less like a worse version of Echidna 0.3 in everything except WizardLM instruct perplexity.

So my answer, for now, is "no". Kooten has Echidna and Nethena in 3bpw, which should ideally perform better.

Additionally if you're all the way down to 3bpw, a Mistral tune like OpenHermes or Toppy at a higher depth would probably be a lot more engaging to work with.

Thanks for the tips. I'm curious, what are your thoughts on Utopia and NoroChronos? In my personal experience, those have performed better than Echidna 0.3, but my measure is hardly scientific. Still, they impressed me quite a bit.

I don't always use 3bpw. Sometimes I start at 4 or 5 bpw, and switch to 3 bpw as the chat goes on so as to go beyond 4k context. I'm working with a 3060.

I haven't used either, but UtopiaXL performed very poorly for me in both real-world and synthetic benchmarks. In real-world "just talk to it for a while" tests (oobabooga, 0.15 min_p sampling), my own MythoMax quant and Kooten's Nete quant are my personal favorites. I haven't used Echidna, Nethena, or Stheno much in real-world tests.
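For anyone unfamiliar with the min_p setting mentioned above: it keeps only tokens whose probability is at least `min_p` times the top token's probability, so a peaked distribution gets pruned hard while a flat one keeps more candidates. A minimal sketch (not oobabooga's actual implementation, just the idea):

```python
def min_p_filter(probs, min_p=0.15):
    """Keep tokens with probability >= min_p * (max probability), then renormalize.

    `probs` is a dict mapping token -> probability.
    """
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# Hypothetical next-token distribution: with min_p=0.15 the cutoff is
# 0.15 * 0.60 = 0.09, so the long-tail token "zx" (0.05) is dropped
# while "cat" (0.10) survives.
probs = {"the": 0.60, "a": 0.25, "cat": 0.10, "zx": 0.05}
filtered = min_p_filter(probs, min_p=0.15)
```

The appeal over a fixed top-p cutoff is that the threshold scales with how confident the model is at that step.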

Some superficial synthetic benchmarks of common mixes.
image.png
I say superficial because Tiefighter is dead last in everything by a large margin, yet people still seem to really enjoy it. I personally found it extremely frustrating to talk to, which is why I reference these tests when choosing a model to monkey with in real-world tests; they seem to at least vaguely align with my own experience. If you're someone who instead places a lot of value on wildly creative (or naughty) responses and cares less about the model adhering to the established format, then this chart is probably useless.

Additionally, if it's a 3060 12GB you should be able to easily do 5bpw at 4k context if you either enable exl2's 8-bit caching or use Flash Attention 2. I decided to let a 5-bit Stheno quant run while I do Thanksgiving stuff, so that'll be up in idk a couple hours ish.
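As rough back-of-envelope math on why that fits (assuming a 13B-parameter model, which these Llama-2 mixes are; the figures here are estimates for the weights only, ignoring activations and framework overhead):

```python
def weight_gib(n_params, bpw):
    """Approximate VRAM for quantized weights: params * bits-per-weight / 8 bytes."""
    return n_params * bpw / 8 / 2**30

# Rough weight sizes for a 13B model -- 5bpw comes out around 7.6 GiB,
# leaving roughly 4 GiB of a 12 GiB card for the KV cache and overhead;
# an 8-bit cache halves the KV footprint, which is what makes 4k fit.
for bpw in (3, 4, 5):
    print(f"{bpw} bpw: ~{weight_gib(13e9, bpw):.1f} GiB")
```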

I tried UtopiaXL as well and found it to be worse than Utopia. The addition of Tiefighter and other such models into the mix is likely what dragged it down. I most certainly do prefer adhering to the format over creative/naughty responses. It has been rather frustrating trying to find the most intelligent models. The leaderboard on this site includes models that were trained specifically to score high on the leaderboard but fail in real-life application. Meanwhile, other popular lists like Ayumi's LLM Benchmark prioritize 'naughtiness' above all else.

So, thanks for your list. I think I'll give a few of those models a try.

Also, thanks for letting me know about 8-bit caching. I had not updated Oobabooga in a while, for fear that downloading the latest update would break it again. I just updated it, and now see the 8-bit cache option.

5bit Stheno is up
