3.75bpw quant

#1
by Serpen - opened

Hey! I love what you are doing with these models. I have tested your 4bpw rpcal quant of noromaid, and so far the difference between yours and lonestriker's is almost night and day. How can the calibration dataset make this much of a difference?

I would also like to ask if it would be possible for you to do a 3.75bpw quant with this method, as that is the largest that can comfortably run on 24 GB of VRAM with decent context. I would greatly appreciate it if you could!

I would love to know how you quantify the difference. I agree: It seems a lot different when you use them, but when I have them open side-by-side to test, they seem the same. I'm clearly doing something wrong when trying to "benchmark" these. Please let me know what you think is the "night and day difference."
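If anyone wants a semi-objective check, one option is to run the same held-out roleplay text through both quants and compare perplexity. Below is a rough sketch using the standard sliding-window method from the transformers docs; it assumes a model that loads through `AutoModelForCausalLM`, so for EXL2 quants you would swap in exllamav2's loader, but the windowed loss arithmetic is the same. The model path and text file are placeholders.

```python
# Sliding-window perplexity sketch (adapted from the transformers docs).
# Compare the number this prints for two quants over the same text file.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/model"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

ids = tokenizer(open("rp_sample.txt").read(), return_tensors="pt").input_ids.to(model.device)

max_len, stride = 2048, 512
seq_len = ids.size(1)
nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_len, seq_len)
    trg_len = end - prev_end            # tokens not yet scored in a previous window
    input_ids = ids[:, begin:end]
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100     # mask tokens already scored
    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    nlls.append(loss * trg_len)
    prev_end = end
    if end == seq_len:
        break

print("perplexity:", torch.exp(torch.stack(nlls).sum() / prev_end).item())
```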

I'm not sure I'll do a 3.75, because I don't believe there would be a truly appreciable difference between a 3.75 and a 3.50 in RP output. Also, this one is still uploading at the time of this comment! :-)

Meh. Why not? It's in progress while this one uploads. What I'd really, really love to see, though, is something like Ayumi's benchmarks run on a variety of bpws and calibration datasets for the same model.

The 3.75 version will be here when the upload completes.

Thank you for doing it!

I use text-generation-webui as the backend with 15,000 context and 3 experts per token. I use SillyTavern as the frontend with these sampler settings: https://files.catbox.moe/f06wx9.json . Contrary to what the model card recommends, I consistently get better results with SillyTavern's default alpaca-roleplay template rather than ChatML. These are my instruct settings: https://files.catbox.moe/eq5y27.json . With noromaid specifically, I found that a shorter system prompt produces better results.
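For anyone who would rather script this than click through the webui, the backend part of that setup translates roughly to the exllamav2 Python API as sketched below. This is a sketch only: the model path is a placeholder, and I'm assuming `num_experts_per_token` is an overridable config field, the same way text-generation-webui's exllamav2 loader treats it.

```python
# Rough exllamav2-API equivalent of the webui backend settings above.
# Sketch only: the model path is a placeholder, and num_experts_per_token
# is assumed to be an overridable config field.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/noromaid-mixtral-3.75bpw-rpcal"  # placeholder path
config.prepare()
config.max_seq_len = 15000         # the 15k context setting
config.num_experts_per_token = 3   # 3 experts per token (Mixtral default is 2)

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)        # split across available GPUs as needed

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 1.0         # fill these in from the sampler JSON linked above
settings.top_p = 0.95
settings.token_repetition_penalty = 1.1

prompt = "### Instruction:\nDescribe the room.\n\n### Response:\n"  # alpaca-style
print(generator.generate_simple(prompt, settings, 200))
```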

For me the main difference usually pops up after about 5-6k context, when the default wikitext version seems to get stuck in a specific message formatting. Mainly it will write a ~13-word sentence that the character says in " ", then a ~12-word sentence of action in * *, then start a new paragraph repeating the same formatting. It writes two sentences per paragraph that rigidly connect to each other, and then the next paragraph will be about something unrelated. From there on, no matter what I do, it stays stuck in this very rigid format.

This almost never happens with the rpcal versions. The whole formatting and prose(?) just feels way more natural.
As for the difference between 3.5 and 3.75, in my experience quantization hits sparse models like Mixtral way harder than it hits most dense models. Mixtral takes a huge hit in its ability to conceptualize and understand the scene from even a 0.25 bpw difference. In my tests, 3.75 just has a better grasp on what's happening.

For example, if you enter a lavishly decorated room, 3.75 will usually have some idea of what that means. If a character in rags enters the scene, it will remark that their attire contrasts with the expensive curtains. Or if you enter a room with a special vault door that can only be opened by spells, 3.75 will understand that the door probably has no keyholes, while 3.5 will just try to open it with keys anyway.

As for how I test: I first run a model through a simple test I found on chub.ai called RP-test. I'll link it, since it's almost impossible to find by searching: https://chub.ai/characters/fiveroomdungeons/rp-test-7f49debb . It's not the most comprehensive test or anything, but it gives me a good idea of how well the model follows the character card and understands things, and it quickly filters out models that are just pure incoherence. After that I run Ayumi's Sarah test. Then I run what I call an insanity test: a few messages into a scenario, I'll ask the AI to do something insane, like start yelling "I'll cut you down" while chasing a character with a broadsword right after pleasantly chatting with them, or to call the pope or something. Most lesser models just kind of freeze up and don't really know what to do, while better models can run with it and weave it into a fun narrative.

Then I simply run through a few chats with some characters and more or less try to feel out how good the model is. This is the most subjective part, but I feel like after about 10k tokens you can get a good grasp of how well the model does. I mainly look out for repetition, incoherence, prose, and how well the model interacts with the scene. For example, at first I really liked Smaug 34B because it had a very nice, crisp writing style, but it would occasionally just become unhinged, a character going for a swim in the sea when the moment before they were on a mountain. Or when I tried brucethemoose's Rp-merge, it would have really nice prose but would be unable to interact with objects properly; it couldn't figure out that a goblet filled with wine should spill when tipped over. I mainly try to look for issues like these. Then finally I try to gauge how well the model uses the info on the character card, how well it adopts a character's personality, etc.

Well, I rambled a lot, but to summarize: for me the biggest difference between rpcal and wikitext is the prose and formatting, which for wikitext is more rigid, while for rpcal it's more free-flowing and pleasant. As for 3.5 vs 3.75, it's mainly the model's ability to keep track of events, locations, and characters; 3.75 just seems much smarter.

If you want to try another mixtral model, I would recommend Crunchy-Onion. It is similar to noromaid but has a very distinct writing style or 'voice' that I find very enjoyable.

A lot of what you describe are things I can see between q3 and q4, but EXL2 is a lot more precise about the details it squeezes down. The perplexity margin between a 3.50 bpw EXL2 and a 3.75 bpw EXL2 of the same model with the same calibration will be absolutely minuscule, nothing nearly as obvious as what you've described (and certainly not in my experience). A simplistic way to view it would be to say that 50% of the weights of a 3.50 model are at 3.0 bpw while the other 50% are at 4.0 bpw (I know, I know, that's not how it turns out). However, 3.75 would just be 75% at 4.0 bpw with 25% at 3.0 bpw. Again, it's tough for me to think there is a stark difference in that small margin... but you can show me, because it is uploading now. Should be up there in another 3.75 hours. :-)
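If it helps, the averaging arithmetic behind that simplification is just a weighted mean, as in this toy snippet (the 3.0/4.0 split is the same simplification as above, not how EXL2 actually assigns bits per layer):

```python
# Toy version of the mixed-precision picture above: a quant's headline bpw
# as a weighted average of a low- and a high-precision pool of weights.
# Real EXL2 picks per-layer bit widths from calibration measurements.
def effective_bpw(frac_low: float, low_bpw: float = 3.0, high_bpw: float = 4.0) -> float:
    return frac_low * low_bpw + (1.0 - frac_low) * high_bpw

print(effective_bpw(0.50))  # 3.5  -> half the weights at 3.0 bpw, half at 4.0
print(effective_bpw(0.25))  # 3.75 -> a quarter at 3.0 bpw, three quarters at 4.0
```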

Note: I always run the Sarah test. My wife's name is Sara and our daughter has DID. To me, that whole test is like a glitch in the matrix!
