[Feedback] Model Feedback for improvements

#3
by 0-hero - opened
Owner

@Wolfram (on reddit) @Phil337 @daisr @vonjack I read your feedback on other 7B models on HF. Are you interested in testing out this model? It would be amazing to have your feedback to help improve it.

v0.2 is currently in training and will be released shortly. Main changes in v0.2:

  • More code, math reasoning & function calling data
  • Trained on the new Mistral base 0.2
  • Changed function calling tokens (old -> new):
    <|begin_func|> -> <tool_call>
    <|end_func|> -> </tool_call>
    <|begin_func_response|> -> <tool_response>
    <|end_func_response|> -> </tool_response>
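
For anyone with prompts or parsers written against the v0.1 tokens, the renaming above could be applied with something like the following. This is only a minimal sketch: the four token pairs come from the list above, while `TOKEN_MAP` and `migrate_tokens` are hypothetical names made up for illustration and are not part of the model or any library.

```python
# Old -> new function calling tokens, as listed for v0.2 above.
# TOKEN_MAP and migrate_tokens are illustrative names, not an official API.
TOKEN_MAP = {
    "<|begin_func|>": "<tool_call>",
    "<|end_func|>": "</tool_call>",
    "<|begin_func_response|>": "<tool_response>",
    "<|end_func_response|>": "</tool_response>",
}


def migrate_tokens(text: str) -> str:
    """Rewrite v0.1-style function calling tokens to the v0.2 names."""
    for old, new in TOKEN_MAP.items():
        text = text.replace(old, new)
    return text


if __name__ == "__main__":
    sample = '<|begin_func|>{"name": "get_weather"}<|end_func|>'
    print(migrate_tokens(sample))
```

Plain string replacement is enough here because none of the old tokens is a substring of another.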

GGUF Quants - munish0838/Matter-0.1-7B-boost-DPO-preview-GGUF

deleted

Thanks, I'm always glad to test LLMs. I ran it through most of my tests (not all my stories, at least not yet, because that takes a long time). Overall, it performed well.

Note: I always use Q4_K_M GGUFs, in this case from munish0838. There were no obvious issues like token errors or formatting/rambling at the end of responses.

(1) To start with, the good. It re-wrote a poem. The majority of Mistrals just repeat it back word for word, or perhaps use a synonym for a word or two. It wasn't perfect, but it fixed most of the rhyme errors without changing the meaning. It also fixed all the major spelling and grammar errors in my test paragraph, which, again, isn't very common with Mistrals; they usually miss one or two.

(2) It solved my logic puzzle. This is relatively rare. It's a simple puzzle, but it asks for two comparisons (one can be determined, the other can't) and includes two irrelevant statements. Having a Mistral re-write the poem, fix all the grammar issues and solve this logic problem is rare.

(3) It wrote a Shakespearean sonnet that stuck to the topic and rhyming scheme throughout, and even included a sprinkling of old English (e.g. "doth appear"). Overall this LLM performs better than many Mistrals that score much higher on the leaderboard.

(4) Now the bad. This isn't really all that bad, but when asked to write 3 limericks, with the rhyming scheme and line count specified, it output 4 poems of 4 lines each, with alternate lines rhyming. Overall, they were good poems that stuck to the rhyming scheme (that it chose) and stayed on topic, but 3 AABBA limericks were asked for, not 4 ABAB poems. Still, the poems weren't bad, and that's more important than sticking to arbitrary rules.

(5) The primary weakness is storytelling, but even this isn't nearly as bad as with many Mistrals. This LLM can handle short, simple story prompts with ease. But in response to long, complex prompts, about 10 sentences long and filled with instructions that deliberately contradict normal storytelling momentum, it starts to trip up. Oddly, though, it keeps trying to fix its errors immediately after making them. For example, I instructed the LLM not to have a character say anything in a circumstance in which words would normally be spoken, and after having the character say something, it followed up by saying it was unheard in the moment. The best storytelling Mistrals don't make the errors in the first place. But the fact that this one keeps trying to immediately correct itself after making an error is a good sign. Bad Mistral storytellers make several absurd contradictions in response to long, complex prompts and make no attempt to fix them.

(6) Lastly, while it's not technically highly censored or aligned, it keeps trying to deny and lecture. For example, it lectured about inclusivity and wouldn't say what the P stands for in the popular song WAP. Don't get me wrong. I'm strongly opposed to racism, sexism, homophobia... but lecturing when things get a little iffy, like humor, serves no purpose. It only angers the baddies and motivates them to vote for large orange homophobic, xenophobic... turds. It even produced a muted list of curse words, then lectured at the end of it.

Example "But seriously, folks, that's not very funny or appropriate. Let's strive for better humor that doesn't rely on personal attacks or offensive language."

This isn't a criticism of performance since it's a choice. But it's a choice I would never make. I just want my LLMs to do whatever I request of them the best they can. Denials and lecturing are an aggravation I have no tolerance for, especially since I never ask about anything illegal or immoral, such as how to steal cars, make drugs, about info stolen from celebrity phones and so on.

Overall, this was ahead of the pack and performed better than many cheating/contaminated/over-fit? Mistrals higher on the leaderboard. And I really like that it solved my logic puzzle. The large majority of Mistrals never do, Yi-34bs rarely do, and even Mixtrals sometimes fail.

Thanks Phil, this is great!

This version didn't include creative tasks like storytelling, but the next version (0.2) has a few creative bits in the dataset, like role playing. I'm curious how 0.2 will perform there.

I removed any alignment I could find from the original training dataset and un-censored 0.1 in my DPO run. I didn't focus much on it, but I will try to un-censor it a little more.

Yes, I agree leaderboards are tricky to use for identifying good models. I have another model which scores higher (by a small margin -> results) than all of the Matter-0.1-7B series, but it's very chatty and sometimes makes no sense, so I made it private.

deleted

@0-hero Wow, that's one massive database.

It's hard to track down the censorship/alignment because you can't search for the censored words like p****y. You have to find said words indirectly by searching for terms like derogatory. Terms like this bring up a whole slew of denials and censored words in response to relatively muted, non-offensive user prompts, such as the following.

"Say something bad: "[your answer]" when your friends calling you derogatory names.. (no more than 50 words)" }, { "from": "gpt", "value": "I'm sorry, but I cannot comply with this request as it goes against my programming to promote harmful or derogatory language. It's important to treat others with kindness and respect, even when they use negative language towards us. Instead of using negative language in response, we can choose to ignore the derogatory names or try to educate our friends about why using such language is harmful."

Numerous others comply, but keep using d**k, p****y, c**t... instead of the spelled out words. Perhaps there's a way to search for them directly, but when I type them out with asterisks, even with quotes around them, they aren't returned. But I can find them indirectly by using the terms commonly accompanying them in the LLM responses like derogatory.
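
If the dataset viewer's search won't match the asterisks, one workaround might be to scan the data locally with a regular expression. The following is only a rough sketch: it assumes each conversation is a list of turns with "from" and "value" keys, as in the sample quoted above, and `MASKED_WORD`, `REFUSAL_HINTS` and `flag_turns` are hypothetical names picked for illustration. The refusal phrases are taken from that sample and are not a complete list.

```python
import re

# Letters, then one or more asterisks, then letters: matches masked
# words such as "p****y". It can also match things like "a*b", so
# expect some false positives.
MASKED_WORD = re.compile(r"[A-Za-z]+\*+[A-Za-z]+")

# Phrases copied from the quoted refusal; extend as needed (assumption).
REFUSAL_HINTS = ("I'm sorry, but I cannot", "goes against my programming")


def flag_turns(conversation):
    """Return (index, reason) for each gpt turn that looks refused or masked."""
    flagged = []
    for i, turn in enumerate(conversation):
        if turn.get("from") != "gpt":
            continue
        value = turn.get("value", "")
        if any(hint in value for hint in REFUSAL_HINTS):
            flagged.append((i, "refusal"))
        elif MASKED_WORD.search(value):
            flagged.append((i, "masked"))
    return flagged


if __name__ == "__main__":
    conversation = [
        {"from": "human", "value": "Say something bad."},
        {"from": "gpt", "value": "I'm sorry, but I cannot comply with this request."},
    ]
    print(flag_turns(conversation))
```

Only the gpt turns are checked, since the goal is to find responses to drop from training, not prompts.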
