MT-Bench Results

by leonardlin

I was curious, so I gave it a spin with MT-Bench (HF transformers). It's on par with gpt-3.5 (but so is mlabonne/OmniBeagle-7B, so take these results for what they are).

########## First turn ##########                                                                                                    
gpt-4                       1     8.956250
Senku-70B-Full              1     8.387500
omnibeagle14-7b             1     8.325000
claude-v1                   1     8.150000
orion-14b-chat              1     8.143750
gpt-3.5-turbo               1     8.075000            

########## Second turn ##########                                                                                                                                   
gpt-4                       2     9.025000
claude-instant-v1           2     8.012658
gpt-3.5-turbo               2     7.812500
claude-v1                   2     7.650000
Senku-70B-Full              2     7.600000
omnibeagle14-7b             2     7.587500 

########## Average ##########                                                                                                                                       
gpt-4                        8.990625
Senku-70B-Full               7.993750
omnibeagle14-7b              7.956250
gpt-3.5-turbo                7.943750

And here's the spider plot. The major outlier is that OmniBeagle-7B scores higher than Senku-70B or gpt-3.5 on "reasoning", which seems rather unlikely:

[Spider plot of per-category MT-Bench scores: newplot(12).png]

Still, it looks like MT-Bench is not too far out of line with EQ-Bench, and as a SlimOrca-only tune, it seems to point to MiQu still having a lot of untapped potential.

Shinoji Research org

Very interesting, although it seems something odd is happening with reasoning (or prompting in general). I may need to do a bit more tweaking on that. I imagine the 7B model that is ahead of GPT 3.5 is probably overfit somehow.

https://huggingface.co/ShinojiResearch/Senku-70B-Full/discussions/3

I have a fair number of MT-Bench runs and will upload the .jsonl outputs soon so you can review them if you want. I used Llama-2 formatting, btw (the chat_template enforces a format that breaks). I include a system prompt: "You are a helpful assistant."
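For reference, here's a minimal Python sketch of the Llama-2 chat wrapping I mean (the actual eval harness code isn't shown here, so the helper below is illustrative rather than the exact script); the system prompt matches the one above:

# Minimal sketch of Llama-2 chat formatting with a system prompt; names are illustrative.
SYSTEM_PROMPT = "You are a helpful assistant."

def build_llama2_prompt(turns, system=SYSTEM_PROMPT):
    """turns: list of (user, assistant) pairs; assistant is None for the turn being generated."""
    prompt = ""
    for i, (user, assistant) in enumerate(turns):
        if i == 0:
            # Llama-2 places the system prompt inside the first [INST] block
            user = f"<<SYS>>\n{system}\n<</SYS>>\n\n{user}"
        prompt += f"<s>[INST] {user} [/INST]"
        if assistant is not None:
            prompt += f" {assistant} </s>"
    return prompt

# Example: a second MT-Bench turn, with the first-turn answer already in context
print(build_llama2_prompt([
    ("First-turn question goes here", "Model's first-turn answer"),
    ("Second-turn follow-up goes here", None),
]))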

I'd agree that the 7Bs are likely overfitting (the chances that a 7B is actually smarter than a 70B are, I'd say, about 0.00%). While mlabonne's merges primarily target the Nous suite, there are others purposely training for MT-Bench scores (which, of course, makes it rather useless for comparison).

I do think it should be possible to improve "reasoning"-style responses fairly easily while making sure that MT-Bench questions remain firmly out of distribution (so that it still remains a useful yardstick).

@leonardlin great job as always Leonard!

I think the MT-Bench results show that there is more untapped potential here in terms of how humans perceive the model.
Relying solely on MT-Bench of course wouldn't be helpful, but if it falls short compared to some other model, it does mean there are areas for improvement in being a better conversationalist.

Shinoji Research org

Exactly. I am actively training another version that I think will fix the prompting issues (the original axolotl config for Senku V1 used ChatML, which I think conflicts with the Mistral format). Some other people are also working on similar finetunes.
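To make the conflict concrete, here's a rough side-by-side sketch (my own illustration, not the actual axolotl config): a model tuned on one template and prompted with the other sees control tokens it never trained on.

def chatml_prompt(system, user):
    # ChatML, which the Senku V1 axolotl config reportedly used
    return (f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{user}<|im_end|>\n"
            f"<|im_start|>assistant\n")

def mistral_prompt(system, user):
    # Mistral/Llama-2 style [INST] wrapping; there's no dedicated system role,
    # so the system text is usually folded into the first user turn
    return f"<s>[INST] {system}\n\n{user} [/INST]"

print(chatml_prompt("You are a helpful assistant.", "What is 2 + 2?"))
print(mistral_prompt("You are a helpful assistant.", "What is 2 + 2?"))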

can't wait!
