Impressive...

#13 opened by ChuckMcSneed

... for 7B! Congratulations, your model has performed better than 50 models (judging by total score, or 56 if judged by SP/creativity score) on my benchmark at the moment of writing, placing you in the middle of the scoreboard. I'm not sure whether to be happy about your performance or sad that a lot of 70B models underperform so severely.

Cool, I just took a look.
The scores seem pretty low, which is great for identifying ways to improve.
Can you elaborate on B and D? Are these single-question tests?

B is multi-turn, and your model failed quite early. It tried to repeat the examples instead of doing the task, a common issue.
C is multi-turn, and your model didn't want to stop doing the task after I asked it to stop. Maybe train on long-term task memory?
D is single-turn reading comprehension (no math involved). Your model reached the right conclusion, but it clearly didn't really understand the problem. Maybe try training on problems with a twist for this one?
You can improve a lot on Poems by adding more dealignment data; your model simply decided to switch the subject halfway through on some of them. Rhymes were okay most of the time. Same problem as with Miqu.
For Styles I suggest adding some training on unorthodox speech and writing styles. Stuff like Old English, a strong German accent, speaking with a mouth full of water... (None of those are in the test, just examples.) Only ChatGPT and Miqu-120B got 100% on those.

Is your evaluation based on a single question per section?

B and C are each based on a single, very simple task, with the score determined by the earliest failure. If a model fails early, it gets 0; if it fails towards the end, it will gain 2 points in B and 1 in C. Models can also have 0.5 subtracted for including moralizing notes in their replies.
D is indeed a single-question task, which is a bit flawed.
S has 8 different styles. 1 point per style, with subtractions for flaws.
P has 6 poems. 1 point per poem, with subtractions for flaws.
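
Roughly, in code terms, the tallying could look something like this (just an illustrative sketch, not my actual process, which is done by hand; it also assumes the 0.5 deduction applies per moralizing reply, and all names are made up):

```python
# Hypothetical sketch of the rubric above, not the actual harness.
# B/C: points are awarded at milestone turns and the run stops at the
# earliest failure; each moralizing reply is assumed to cost 0.5.

def score_multi_turn(turns_passed: int, milestones: list, moralizing_replies: int = 0) -> float:
    """Score B (milestones worth up to 2 points) or C (up to 1 point)."""
    points = sum(1 for m in milestones if turns_passed >= m)
    return points - 0.5 * moralizing_replies

def score_per_item(item_scores: list) -> float:
    """Score S (8 styles) or P (6 poems): 1 point per item, minus deductions for flaws."""
    return sum(item_scores)

# Example: a model that passes the first milestone of B but fails before
# the end, and moralizes in one reply.
print(score_multi_turn(turns_passed=5, milestones=[4, 8], moralizing_replies=1))  # 0.5
print(score_per_item([1.0, 0.5, 1.0, 0.0, 1.0, 0.75]))                            # 4.25
```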

Starting out, I did something quite similar for a quick eval, but realized that in my case it was kinda pointless due to overfitting.
You may want to expand the number of examples per category to get a more precise score; a lot of those zeros could be 13, 50, or even 86.
Also, I'd recommend sorting your leaderboard by score; at first glance, mine looked the worst :)

Current testing already takes >2 hours for 70B and has to be done by hand. I don't have any more patience, so I won't expand it.

Same here, really impressive work! I don't know how you did it, but seeing a 7B model perform so well and beat some much larger models is completely unexpected. I would love to see what happens if you applied the same training to miqu-1-70b. Here is how it fares on my benchmark: https://huggingface.co/datasets/froggeric/creativity

For uncensored storytelling (sorted by story, then nsfw):

[image: benchmark results table]

For smart assistant (sorted by smart, then sfw):

[image: benchmark results table]

@ChuckMcSneed Thanks for manually evaluating my model. You must be an English major; thanks for the advice.
Do you think there's a way to automate it? It seems possible, perhaps by doing regex checks for key phrases at each step?
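
Something along these lines, for example (just a rough illustration; the phrases and the function name are made up, not an existing script):

```python
import re

# Illustrative key phrases only; a real check would need tuning per test.
KEY_PHRASES = [
    r"as an ai language model",
    r"i cannot (assist|help) with",
    r"it('s| is) important to note",
]
PATTERN = re.compile("|".join(KEY_PHRASES), re.IGNORECASE)

def flags_reply(reply: str) -> bool:
    """Return True if the reply looks like a refusal/moralizing note instead of task output."""
    return PATTERN.search(reply) is not None

print(flags_reply("As an AI language model, I can't write that."))  # True
print(flags_reply("Roses are red, violets are blue..."))            # False
```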

@froggeric Wow, very nice benchmark you've created. I really haven't been a fan of 70B+ models due to the speed and memory requirements; I just don't know how practical they are for portable or distributed use. Mistral is very easy to improve, and adding the richness and depth of the larger models into a 7B has been my current goal.

I have a lot of respect for you guys; these benchmarks are much appreciated, especially given the current state of the LLM Leaderboard.
Keep up the good work!

For B and C, I think it's definitely possible, since the tasks are so simple, but it will likely slow down the process, since some models like to complain while doing the task ("as an AI language model..."). D can only be automatically prompted; automatic evaluation is absolutely not possible. For S it's possible to automate prompting, but it needs human eval. P needs human intervention for most models (jailbreak). The main bottleneck for all those tests is currently generation (1.2 s/t for a 70B), not evaluation.

Thanks for the compliment, but I didn't major in English, or even computer science. I'm just a guy who likes messing around with LLMs.
