Yeap

#1
by Phil337 - opened

I don't blame the leaderboard or HF because there's really nothing within reason that can be done about it.

But the flood of Mistrals scoring over 67 on the leaderboard (around the upper limit of Mistral), and up to 77, is absurd (that's far higher than Mixtral, which is far more powerful). And reading their model cards bragging about their scores is annoying. And most aren't deliberately cheating. They're just so excessively merged and fine-tuned, sometimes on a dataset with over a million entries, that all of Mistral's fringe knowledge has been scrambled.

For example, when I ask questions about 4 names from popular movies and TV shows that Mistral base and reasonable fine-tunes get right, or mostly right, the high-scoring Mistrals reliably get them wrong. This even included OpenChat and Starling. Fine-tuning on tons of user feedback might help you climb on chat arenas, but it leaves Mistral an empty shell that can no longer solve simple logic problems or answer questions along Mistral's fringes, such as character names in shows and movies, identifying songs from lyrics and so on.

Fine-tuning is meant to guide a foundational model in the right direction, not take over. And the base Chinese models are all cheating (e.g. Yi-34b doesn't have anywhere near an MMLU of 77; based on my fringe knowledge questions, its true MMLU score is around 68-70).

yep ... I AGREE

Fine-tuning on tons of user feedback might help you climb on chat arenas, but it leaves Mistral an empty shell that can no longer solve simple logic problems or answer questions along Mistral's fringes, such as character names in shows and movies, identifying songs from lyrics and so on.

I always value your observations, and I share this one. These models are made to sound so smart that you don't notice they are cheating on you most of the time. But there is more damage. When used to summarize longer contexts they often make the same kind of error: they interpret the text properly, then in something like 20% of it they insert something wrong, a sentence or a changed fact, and continue the summarization. It leads to two problems. The user gets wrong information, and/or, because the logic of that 20% of the text has been changed, reading it distorts your interpretation of what you read and makes it difficult to remember anything, since the summary has flawed logic. Using these chat-tuned things for long may not be good for your memory.

And! This drug dealer personality! Do you want more... Are you sure you want more...

This is ugly.
