Settings from the test team

#1
by sandmanbuzz - opened

The test team came up with the below settings as the best we could do. I don't think any of us are going to use this model as a daily driver, but here's what we did to salvage the model:

Samplers Preset (this is the only thing that worked at all):
https://files.catbox.moe/adpb5r.json
Temperature 1.50, min.p 0.018. It's also not a terrible idea to add a little TFS (tail-free sampling) to take the edge off, so I did that in this version of the file.
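
For anyone wondering what those two numbers do together, here is a minimal sketch of temperature scaling followed by a min-p cutoff. It is not the contents of the linked preset: the function name `min_p_filter` is hypothetical, the sampler order (temperature first) is an assumption that differs between backends, and TFS is left out.

```python
import math

def min_p_filter(logits, temperature=1.5, min_p=0.018):
    """Hypothetical helper: temperature-scale logits, then apply a min-p cutoff.

    Returns {token_index: probability}, renormalized over the surviving tokens.
    Assumes temperature is applied before min-p; real backends may order
    samplers differently.
    """
    # Higher temperature flattens the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # min-p keeps tokens whose probability is at least min_p * (top probability).
    threshold = min_p * max(probs)
    kept = {i: p for i, p in enumerate(probs) if p >= threshold}
    norm = sum(kept.values())
    return {i: p / norm for i, p in kept.items()}

# Toy example with four candidate tokens (made-up logits).
candidates = min_p_filter([5.0, 4.0, 0.0, -5.0])
print(sorted(candidates))
```

A lower min.p lets more low-probability tokens survive the cutoff, and a higher one prunes harder, so the setting directly controls how far from the top token the sampler is allowed to reach.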

Fixed Instruct Preset:
The existing preset had some wishful-thinking junk that I removed, along with some style guide stuff that shouldn't have been there (and would break lots of cards). The model also does MUCH better using Mistral instruct than it does under Alpaca, so I switched it.
https://files.catbox.moe/bjibow.json

It's probably not a bad idea to fix your context template to have [INST] [/INST] around it, too. I'm not going to bother making a preset for that; anyone can do that themselves.
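
As a rough illustration of the difference between the two instruct styles and of wrapping the context block, here is a small sketch. The function names are hypothetical, the strings are the commonly used Alpaca and Mistral instruct shapes rather than the contents of the linked presets, and exact whitespace and BOS handling vary by frontend.

```python
def alpaca_turn(instruction: str) -> str:
    """Alpaca-style turn: header lines around the instruction (assumed layout)."""
    return f"### Instruction:\n{instruction.strip()}\n\n### Response:\n"

def mistral_turn(instruction: str) -> str:
    """Mistral-instruct-style turn: instruction wrapped in [INST] ... [/INST]."""
    return f"[INST] {instruction.strip()} [/INST]"

def wrap_context(story_string: str) -> str:
    """Wrap the context template (card description, scenario, etc.) the same way."""
    return mistral_turn(story_string)

# Placeholder card text, purely for illustration.
print(wrap_context("Description of the character goes here."))
```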

Thanks, Undi!

Like I said in Discord, this is the closest Noromaid to "usable" we've seen so far. Better than the non-instruct.

NeverSleep org

You know this model is only intended for RP, right?

That's almost entirely the test use case we put it through.

NeverSleep org

OK, then I really don't understand why it failed or did badly at most tests. This model is by far the best Mistral model I've tried for RP, and some others agreed. And no, we are not going to adjust our template to use [INST] stuff, as the model was trained on Alpaca.

NeverSleep org

*Mixtral

It was really nice to see the fresh prose and card behavior coming out of it, similar to base Mixtral-instruct. However, it has the typical Noro symptoms: maddeningly bad terseness (hence the ridiculously low recommended min.p, which was the only way we could get it to stop) and a general inability to crit. Crit rate (positive reader surprise at the response) went up when we moved from Alpaca-like formatting to [INST] stuff on a lark, knowing full well the training was against Alpaca, but there's still only that tiny narrow band of min.p that is usable at all. Outside that band, it falls into terseness or plain/boring responses (with higher min.p), drifts into overly-prosaic pseudo-intellectual excessiveness (with lower min.p or higher temp), or loses traction on card instructions. We didn't find a way to get it closer to the razor edge of crit rate / card adherence / token surprise that one gets with ultra-high miro (platinum and above) on non-MoE models with Lightning instruct and Lightning 1.2 AN.

There's clearly more to come from this line of work, but overcoming the head-desk factor of Noro blurting out tiny, non-continuable responses and low crit rate is going to take some research on the prompting side.

For those folks continuing to use this model, we highly recommend one or more of the presets I provided. The sampler preset alone did help, and [INST] stuff really did get results.

It's also worth mentioning that I personally found it stronger in ERP chat segments (and in those unit tests) over a broader range of presets, but non-ERP segments of chat fell apart badly outside the tiny min.p band. Other testers seemed to prefer the narrow band. For regular users, it might be worth noting that turning up the temp sampler is something you can do when "turning up the heat" in your chat.

NeverSleep org

I understand your point. Maybe I'm just into different RP, but I personally have not encountered these problems. Almost all sampler presets in SillyTavern worked for me. If you have any example chats you could kindly provide, I could maybe catch on to the problems you were having. One thing I should note is that the majority of this model was trained on this format: *do* say.

NeverSleep org

Also, I should ask: when you say "most usable in the noromaid series", do you mean the entire series (7b, 13b, 20b), or just the Mixtral Noromaids?

Our testing of Noromaids has been pretty sporadic. Tester bandwidth and attention aren't unlimited, and it's a relatively manual (and frankly, somewhat ad hoc) process, since it's done by editorial reading and evaluation rather than by statistical analysis. The Llama Noromaids got fairly cursory examination, without a lot of fussing around with samplers and instruct, since they respond to standard miro settings. Since it was much harder to break them of terseness, testing effort tapered off pretty quickly: pretty much "yep, it's a Noro, wait for the next one."

We put a little more effort into non-instruct, but since early results looked similar to past results we threw it back into the pond. This one got focused attention at Undi's request, and by serendipity we had a nice day with multiple testers available to collaborate synchronously, which gives faster, deeper results.

The "do say" dataset definitely comes through, and in my purely personal opinion, the model is overtrained on it, which is what leads me to bounce off of Noro models. Pacing is often slow and terse, with "do" being at most a very short paragraph and "say" being a single line of dialogue. It doesn't give enough room for character expression, action, or reaction, or allow for multiple threads of conversation. Cards respond with brevity that doesn't reflect the user prompt, or even the description or AN instruction. This winds up being disappointing in early chat, especially, and limits richness and immersion. It's especially tough with some types of character RP spawners, since spawners rely on the desc instructions to fire a rich "#2" as their real FM.

Purely personal anecdote again, rather than recorded test result: cards with stronger FMs tend to do better than cards with weak FMs, as you'd expect with a lot of models. But because the options for high temp or other "digging deep for tokens" sampling are more constrained here, you can't prod your way past the "do say" training. Some of the more old-school Llama models (like Emer, Psyfighter, or Sydney) can really shine in situations like this, where the user needs the model to bootstrap a chat out of the desc.

For use cases where a paragraph of do say is what the user wants, Noromaid default does the deed. Getting a solid, 300-600 token response that doesn't quite start to trail off into white text or parroting yet, that's not what it does.
