MT-Bench Scores

#1 opened by leonardlin

Since I noticed OmniBeagle and still had my setup for NeuralBeagle open:

# First Turn
omnibeagle-7b             1     8.32500 
# Second Turn
omnibeagle-7b             2     7.587500
# Average
omnibeagle-7b              7.956250                                  

In context average:

gpt-4                        8.990625                                  
omnibeagle-7b                7.956250                                  
gpt-3.5-turbo                7.943750                                  
claude-instant-v1            7.905660                                  
claude-v1                    7.900000                                  
neuralbeagle14-7b            7.628125                                  
orion-14b-chat               7.415625                                  


Great work guys! Needs some more coding/math-related merges.

Those two categories are the hardest for GPT-4 to judge on MT-Bench though, so take the scores with a grain of salt
(as discussed in the MT-Bench paper).

@leonardlin how many MT-Bench runs have you done?
Do you keep the model outputs and judge annotations?

I was thinking of setting up a public & crowdsourced dataset for those because, unlike Alpaca where they publish it all,
MT-bench details are hard to come by.
Usually it's just the final numbers.

Honestly, code should be evaluated with EvalPlus (https://evalplus.github.io/leaderboard.html), and there are probably some specialized math benchmarks as well. While MT-Bench has been shown to have the highest correlation (0.89) with Chatbot Arena rankings, I still have my doubts about the reasoning capabilities of 7B-parameter models. I guess others will have to play around with it and give their feedback.
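
For context on that 0.89 figure: it comes from correlating per-model benchmark scores with Chatbot Arena rankings. A minimal sketch of such a rank-correlation calculation is below; the model names and numbers are made up purely for illustration, and the published figure may use a different statistic.

# Sketch: rank correlation between a benchmark and Arena scores.
# Model names and numbers are illustrative, not real leaderboard data.
from scipy.stats import spearmanr

mt_bench  = {"model-a": 8.0, "model-b": 7.9, "model-c": 7.6, "model-d": 7.4}
arena_elo = {"model-a": 1180, "model-b": 1155, "model-c": 1120, "model-d": 1090}

models = sorted(mt_bench)
rho, p_value = spearmanr([mt_bench[m] for m in models],
                         [arena_elo[m] for m in models])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")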

Fully agree @gblazex! I'd be happy to share our model outputs & judge annotations. We've run a few models: mistral7binstructv0.2, zephyr7B-beta, Notus7B, OpenHermes, and our latest CapybaraHermes.

@gblazex I've run... a lot of MT-Bench (and JA MT-Bench), and I do have basically all the outputs and annotations. I have some other fish to fry (a new training run, etc.), but it's very high on my list to publish a refactored llm-as-judge codebase as a public repo that lets people easily merge results/metadata, which will hopefully help build a dataset. It will also have much better inference flexibility and YAML config files, since prompting/templating can have a huge effect on benchmark scores, and it should ideally be a lot more push-button for the whole process.
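
To illustrate why the templating part matters so much, here is a minimal sketch (not from that codebase) of rendering a prompt with the chat template shipped in a model's tokenizer_config.json via Hugging Face transformers. The model name is just an example, and merged models often don't define a template at all, which is exactly the ambiguity discussed further down this thread.

# Sketch: render a prompt using the chat template from tokenizer_config.json.
# The model name is only an example; models without a chat_template will
# error or fall back, which is the templating ambiguity discussed below.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mlabonne/OmniBeagle-7B")

messages = [{"role": "user", "content": "Write a haiku about benchmarks."}]

# tokenize=False returns the formatted prompt string instead of token IDs;
# add_generation_prompt appends the assistant header so the model continues.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)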

@dvilasuero one of the things just higher on my priority list is getting Argilla properly set up for improving the datasets for our next training run :)

We (anyone interested in better MT-bench, compiling scores) should definitely try to coordinate soon!

@dvilasuero that's a great offer Daniel, thank you!
I know @abacaj @xDAN2099 @SanjiWatsuki have private MT-bench runs too, they might contribute their outputs.

I'll set up a space where people can easily upload files.

@leonardlin how does that sound?
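
As a sketch of what uploading to such a space could look like (the repo ID and file paths below are hypothetical placeholders, since no shared repo exists yet), one option is a Hugging Face dataset repo fed via huggingface_hub:

# Sketch: push a local FastChat answer file to a shared dataset repo.
# The repo_id and file paths are hypothetical placeholders.
from huggingface_hub import HfApi

api = HfApi()  # assumes you are already logged in (huggingface-cli login)
api.upload_file(
    path_or_fileobj="data/mt_bench/model_answer/omnibeagle14-7b.jsonl",
    path_in_repo="model_answer/omnibeagle14-7b.jsonl",
    repo_id="some-org/mt-bench-community-runs",  # hypothetical repo
    repo_type="dataset",
)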

Feel free to use. : )
@gblazex @dvilasuero
https://huggingface.co/xDAN-AI/xDAN-L1-Chat-RL-v1

Wow thanks @leonardlin ! Always happy to see numbers go up :)

Also very curious to see an exhaustive MT-Bench leaderboard. I'm planning to add it to LLM AutoEval.

@gblazex @dvilasuero Here's the best site I've found with benchmarks for the latest models: https://llm.extractum.io/model/mlabonne%2FOmniBeagle-7B,6F7N4LPPCLWWbfV9hmJ3Va

@leonardlin Is it possible to share the commands you used to evaluate the model? I tried running MT-Bench and got significantly worse results, but I'm not sure I used the correct parameters. Thanks in advance!

More specifically, can you detail how you handled the second turn of MT-Bench: did you use the ID of another model (and if so, which one)?

From FastChat/fastchat/llm_judge

I generate answers via:

time python gen_model_answer.py --bench-name mt_bench --model-path mlabonne/OmniBeagle-7B --model-id omnibeagle14-7b --num-gpus-total 2

And judge via

time python gen_judgment.py --bench-name mt_bench --model-list omnibeagle14-7b --judge-file data/judge_prompts.jsonl --parallel 2

And show results

python show_result.py --bench-name mt_bench

I'm looking at my scripts and I believe that's it (no changes to conversation.py either). For some of my testing I have my own codebase (part of that rewrite) where I apply custom chat templates (either the ones in tokenizer_config.json or ones I determine myself) to make sure models produce optimal output, but I just ran OmniBeagle with the defaults and got a decent score with this one.
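
For anyone planning to share those runs: the answers and judgments produced by the commands above end up as JSONL files. The paths below assume FastChat's default llm_judge layout and a GPT-4 single-answer-grading judge, so adjust them if your setup differs.

# Sketch: load the MT-Bench outputs written by the FastChat commands above.
# Paths assume the default llm_judge layout; adjust if yours differs.
import json

def read_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

answers = read_jsonl("data/mt_bench/model_answer/omnibeagle14-7b.jsonl")
judgments = read_jsonl("data/mt_bench/model_judgment/gpt-4_single.jsonl")
print(len(answers), "answers;", len(judgments), "judgment records")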

2nd turn evaluation should be automatic.

Notably, the model name is used to determine the prompt template that gets applied. I forked MT-Bench to let my models use their proper prompts during answer generation. If you used just Omnibeagle14-7B as the name, it probably used the zero-shot template, which is... probably fine? Most of these merges have no coherent prompting format due to having so many pieces merged in, unfortunately.

Second turn shouldn’t need special treatment I think.

Or do only the second-turn scores differ significantly?

As stated above, the key here is to make sure the right chat template is applied. If we don't use the right one, the second turn is the most affected.

@SanjiWatsuki very cool! Is the fork accessible?

It's nothing special; I just edited fastchat/model/model_adapter.py to apply the Alpaca prompt when the model name includes the word "maid". It's not really something worth uploading.
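
For anyone who wants to reproduce that kind of edit, here is a minimal sketch of a name-keyed adapter, assuming FastChat's BaseModelAdapter/register_model_adapter interfaces and its registered "alpaca" conversation template. The class name is made up, and "maid" is just the keyword from the comment above.

# Sketch of a custom FastChat adapter that forces the Alpaca template for
# models whose path contains a keyword. If you paste this into
# fastchat/model/model_adapter.py, register it before the final
# register_model_adapter(BaseModelAdapter) catch-all or it will never match.
from fastchat.conversation import get_conv_template
from fastchat.model.model_adapter import BaseModelAdapter, register_model_adapter

class MaidAdapter(BaseModelAdapter):
    """Use the Alpaca prompt for any model whose path contains 'maid'."""

    def match(self, model_path: str) -> bool:
        return "maid" in model_path.lower()

    def get_default_conv_template(self, model_path: str):
        return get_conv_template("alpaca")

register_model_adapter(MaidAdapter)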

@leonardlin Thanks for the info, I replaced the default conv template with alpaca (https://github.com/mlabonne/FastChat) and managed to reproduce your results.

Here's what I got with NeuralOmniBeagle-v2:

########## First turn ##########
                                    score
model                       turn         
gpt-4                       1     8.95625
OmniBeagle-7B               1     8.31250
NeuralOmniBeagle-7B-v2      1     8.24375

########## Second turn ##########
                                     score
model                       turn          
gpt-4                       2     9.025000
OmniBeagle-7B               2     7.837500
NeuralOmniBeagle-7B-v2      2     7.825000

########## Average ##########
                                score
model                                
gpt-4                        8.990625
OmniBeagle-7B                8.075000
NeuralOmniBeagle-7B-v2       8.034375
