You're missing an opportunity to boost performance.

#3
by Phil337 - opened

There is clearly something special about this LLM. However, it's still being crippled by excessive and needless alignment that's being fed to it by GPT4.

I'm all for not telling someone how to steal a car, harm someone, make drugs and so on. But the alignment extends needlessly to things that aren't remotely illegal or wrong, like celebrity gossip, and this "alignment tax" is crippling performance.

Every bit of unnecessary alignment you filter out, either manually or with the help of GPT4, will have a positive impact on performance beyond just the small bump you'll see on standardized LLM tests. Such tests are poorly designed to pick up on this kind of performance drop because they rely on simple, easy-to-grade objective questions rather than subjective analysis of stories, poems, jokes and so on.

Again, this model is unlike any other Mistral, and in a good way. But its subjective performance is behind some of the top Mistral merges I've tested, such as Trinity, as well as Solar 10.7 uncensored, and it doesn't have to be. There's something about this LLM that none of them have, so it seems a shame to cripple it by cloning GPT4's excessive alignment.

True, I'd rather have a strong villain than a weak good guy.

After running a diverse test on this LLM, including Q&A, logic, math and poem prompts, what stood out as a unique strength was its multi-turn conversational ability, something that became apparent while trying to get around its excessive alignment. It didn't just parrot prepackaged denials and explanations; it took something I wrote several turns earlier, combined it with something I had just written, and used the two to point out a contradiction on my part while making its own point. It's not smart, but there was an adaptive, human-like coherence to it.

This leads me to its biggest weaknesses. As previously mentioned, it's excessively aligned. Beyond that, it's less "intelligent" and has holes in its knowledge.

Knowledge Holes: Coverage of topics like pop-culture, which aren't covered by standardized LLM tests, is lacking. For example, every Mistral I tested returned one of Meg Ryan's three character names from the movie Joe Versus The Volcano (DeDe, Angelica or Patricia). This one returned Maggie, and after googling it and asking GPT4, I confirmed it wasn't a name the character ever went by. And when I reworded the question it returned Paris.

Less Intelligent: It struggled more than any other Mistral I've tested at solving logic problems.

For example, I have a logic problem that asks for two speed comparisons, which I littered with two irrelevant statements about how high the people involved can jump. Dumb LLMs like Falcon get tripped up by the irrelevant jumping statements, as this LLM did, but other Mistrals and all proprietary LLMs (e.g. GPT3.5 and Claude) never do. The following is what it returned. The question isn't that hard; there's really no excuse to consider jumping when the two questions only ask about running speed.

"Based on the given information, we don't have direct comparisons between Tom and either Cathy or Tina. We know Cathy outpaces Jack, while Jack is quicker than Tom. Similarly, with jumping, Tina surpasses Jack, but again, Jack isn't mentioned in relation to leaping versus Tom. To determine Tom's standing among these three specifically, we would need more context about their relative speeds or heights jumped."
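For illustration, here's a minimal sketch (not the actual test prompt; the names and facts are only the ones the model itself restated in its reply) showing that the running-speed facts alone settle the Cathy-versus-Tom comparison by simple transitivity, with the jumping statements never coming into play:

```python
# Hypothetical reconstruction using only the facts the model restated:
# Cathy outpaces Jack, and Jack is quicker than Tom.
# The jumping statements are distractors and are never consulted.
faster_than = {
    "Cathy": {"Jack"},
    "Jack": {"Tom"},
}

def is_faster(a, b, seen=None):
    """Return True if a can be shown faster than b by chaining known facts."""
    seen = seen or set()
    for x in faster_than.get(a, ()):
        if x == b or (x not in seen and is_faster(x, b, seen | {x})):
            return True
    return False

print(is_faster("Cathy", "Tom"))  # True: Cathy > Jack > Tom by transitivity
```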

In conclusion, this LLM appears to have traded IQ points for chatting ability (which oddly seems to be the case with humans as well), and also traded knowledge less covered by standardized LLM tests, such as pop-culture, for knowledge that is covered by those tests. It's also too aligned (e.g. censored and moralized), which again isn't covered by standardized LLM tests; and when it is, such as with toxicity tests, errors like not knowing, or lying about, things deemed "inappropriate" by prudes, such as sexuality or celebrity gossip, are rewarded with higher scores. Anyways, this LLM has its unique strengths, such as multi-turn chatting/arguing and story writing, but it suffers from an excessive amount of needless alignment concerning things that are neither illegal nor immoral, a notable drop in IQ, and knowledge blind spots in areas less covered by standardized tests, such as pop-culture.

"After running a diverse test on this LLM, including Q&A, logic, math and poem prompts, what stood out as a unique strength was its multi-turn conversational ability"

I'm using this to gen multi-turn data now, and I was actually coming here to make this same comment when I saw your post. Hard agree.
