Benchmarks are here!

#4
by 0-hero - opened

Updated ARC
[screenshot: updated benchmark table]

Original
[screenshot: original benchmark table]


!!! Actually wild! Great work on the benchmarks!

deleted
edited Apr 10

@0-hero The ARC scores don't make sense to me. Doesn't the Llama 2 70B base only score around 68 on ARC? I've read that the ARC setup used to evaluate GPT-4, Opus, and other proprietary models is very different. So, for example, if Mixtral 8x22B has an ARC score of 70.5, then GPT-4's equivalent ARC score would be about 83, not 96.3.

Yes, my bad there; I was going to add a note about that but forgot in my excitement!

[screenshot: corrected benchmark comparison]

EDIT - Added a new image

> @0-hero The ARC scores don't make sense to me. Doesn't the Llama 2 70B base only score around 68 on ARC? I've read that the ARC setup used to evaluate GPT-4, Opus, and other proprietary models is very different. So, for example, if Mixtral 8x22B has an ARC score of 70.5, then GPT-4's equivalent ARC score would be about 83, not 96.3.

Yep, that's also what I thought.
Improvement on GSM8K is hopefully great for reasoning (but still significantly behind the smaller Claude models...); MMLU shows a great improvement as well (I'm skeptical that Qwen 1.5's MMLU score is legit).

> @jphme Improvement on GSM8K is hopefully great for reasoning (but still significantly behind the smaller Claude models...)

However, the Claude models are aligned LLMs rather than foundation LLMs. Perhaps the aligned version of Mixtral-8x22B-v0.1 will be greatly improved.🧐

I wonder whether this is an older base version of Mistral Large.

Any benchmarks on code writing?

Will run EvalPlus tomorrow on any finetune if someone hasn't already done it.
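
For reference, here's a rough sketch of what such an EvalPlus (HumanEval+) run usually looks like, using the helpers documented in the EvalPlus repo; `generate()` is a placeholder for whatever inference stack the finetune uses, not part of EvalPlus itself:

```python
# Rough EvalPlus (HumanEval+) sketch; `generate` is a placeholder for the
# model/finetune under test.
from evalplus.data import get_human_eval_plus, write_jsonl

def generate(prompt: str) -> str:
    """Placeholder: return the model's completion for a HumanEval+ prompt."""
    raise NotImplementedError

problems = get_human_eval_plus()  # task_id -> {"prompt": ..., "entry_point": ...}
samples = [
    {"task_id": task_id, "completion": generate(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)

# Scoring then runs via the EvalPlus CLI, e.g.:
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
```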

We also did some benchmark runs on the NousResearch suite earlier today (by @bjoernp / DiscoResearch); note that on some of these, finetuned versions usually do significantly better than base models (thanks @clem for the reminder to post here as well). A small reproduction sketch follows the two tables below:

AGIEval (0-shot)

|            Tasks             |Version|Filter|n-shot| Metric |Value |   |Stderr|
|------------------------------|------:|------|-----:|--------|-----:|---|-----:|
|agieval_sat_math              |      1|none  |     0|acc     |0.5409|±  |0.0337|
|                              |       |none  |     0|acc_norm|0.4227|±  |0.0334|
|agieval_sat_en_without_passage|      1|none  |     0|acc     |0.5825|±  |0.0344|
|                              |       |none  |     0|acc_norm|0.4903|±  |0.0349|
|agieval_sat_en                |      1|none  |     0|acc     |0.8301|±  |0.0262|
|                              |       |none  |     0|acc_norm|0.7476|±  |0.0303|
|agieval_lsat_rc               |      1|none  |     0|acc     |0.7472|±  |0.0265|
|                              |       |none  |     0|acc_norm|0.5799|±  |0.0301|
|agieval_lsat_lr               |      1|none  |     0|acc     |0.5745|±  |0.0219|
|                              |       |none  |     0|acc_norm|0.4471|±  |0.0220|
|agieval_lsat_ar               |      1|none  |     0|acc     |0.2435|±  |0.0284|
|                              |       |none  |     0|acc_norm|0.2174|±  |0.0273|
|agieval_logiqa_en             |      1|none  |     0|acc     |0.3963|±  |0.0192|
|                              |       |none  |     0|acc_norm|0.3840|±  |0.0191|
|agieval_aqua_rat              |      1|none  |     0|acc     |0.2677|±  |0.0278|
|                              |       |none  |     0|acc_norm|0.2795|±  |0.0282|
 
Mixtral-8x22B, other tasks (0-shot)

|    Tasks    |Version|Filter|n-shot| Metric |Value |   |Stderr|
|-------------|------:|------|-----:|--------|-----:|---|-----:|
|piqa         |      1|none  |     0|acc     |0.8313|±  |0.0087|
|             |       |none  |     0|acc_norm|0.8487|±  |0.0084|
|boolq        |      2|none  |     0|acc     |0.8780|±  |0.0057|
|arc_challenge|      1|none  |     0|acc     |0.5922|±  |0.0144|
|             |       |none  |     0|acc_norm|0.6365|±  |0.0141|
|arc_easy     |      1|none  |     0|acc     |0.8577|±  |0.0072|
|             |       |none  |     0|acc_norm|0.8401|±  |0.0075|
|winogrande   |      1|none  |     0|acc     |0.7979|±  |0.0113|
|openbookqa   |      1|none  |     0|acc     |0.3640|±  |0.0215|
|             |       |none  |     0|acc_norm|0.4960|±  |0.0224|
|hellaswag    |      1|none  |     0|acc     |0.6719|±  |0.0047|
|             |       |none  |     0|acc_norm|0.8617|±  |0.0034|
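
In case anyone wants to reproduce or extend these numbers, here is a minimal sketch using the lm-evaluation-harness Python API. This assumes a >= 0.4 harness with `simple_evaluate`; the model id and dtype are assumptions (not necessarily the exact setup used for the tables above), and you'll need enough GPUs to shard the 8x22B weights:

```python
# Minimal reproduction sketch for the 0-shot tables above.
# Assumes lm-evaluation-harness >= 0.4 and multi-GPU sharding.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=mistral-community/Mixtral-8x22B-v0.1,"  # assumed repo id
        "dtype=bfloat16,parallelize=True"
    ),
    tasks=["arc_challenge", "arc_easy", "boolq", "piqa",
           "winogrande", "openbookqa", "hellaswag"],
    num_fewshot=0,
)

# Print the per-task metrics (acc, acc_norm, stderr, ...).
for task, metrics in results["results"].items():
    print(task, metrics)
```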

How much sense does it make to use base models for 0-shot benchmarks like TruthfulQA? I mean, this benchmark asks questions, which is perfect for instruction-fine-tuned models but not for base models. Or am I missing something?

This was quick, thanks for all the work! I'd love to see the numbers for GPQA 0-shot CoT or DROP 3-shot as well, please, if it's possible!

@Acrobatix It's still useful to have baseline numbers pre-instruction-tuning. Say you have two base models, A and B, and only have the budget to fine-tune one of them: if A outperforms B on the target task, it makes sense to pick A for tuning. Also, having baseline numbers confirms that your instruction tuning was set up correctly and is actually useful (because the post-tuning numbers should be substantially improved).

Are you sure this is the BASE, not the INSTRUCT version? It looks better than GPT-3.5; I am confused.


> Are you sure this is the BASE, not the INSTRUCT version? It looks better than GPT-3.5; I am confused.

The instruct version has not been released yet.

deleted

@ASHIDAKA Even though it's a foundation model, it responds surprisingly well to various prompts.

Hey @jphme what is the average score you get for AGIEval?

Oh nice, is this fine-tuned? Chat is good.

> Hey @jphme what is the average score you get for AGIEval?

52.23; see also the graphic below for a comparison with the Mixtral 8x7B and Qwen 72B base models.

[screenshot: AGIEval comparison chart]

Source - https://x.com/alpayariyak/status/1778329833514098832?s=46&t=ZC6wgu7iLucRMVlNeDgmYQ
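
For anyone checking that number against the 0-shot AGIEval table earlier in the thread: it is essentially the mean of the per-task scores. A quick sketch using the `acc` column (this gives about 52.3; the small gap to the reported 52.23 presumably comes from which metric is averaged per task or from a separate run):

```python
# Mean of the per-task AGIEval `acc` values copied from the 0-shot table above.
acc = {
    "agieval_sat_math": 0.5409,
    "agieval_sat_en_without_passage": 0.5825,
    "agieval_sat_en": 0.8301,
    "agieval_lsat_rc": 0.7472,
    "agieval_lsat_lr": 0.5745,
    "agieval_lsat_ar": 0.2435,
    "agieval_logiqa_en": 0.3963,
    "agieval_aqua_rat": 0.2677,
}
print(f"AGIEval average: {100 * sum(acc.values()) / len(acc):.2f}")  # ~52.28
```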

That chart can be misleading.
"They're all base models", so there is still room for improvement.

https://x.com/AlpayAriyak/status/1778332041261572388

What are the GPU requirements for fine-tuning this model?

Hi everyone!

We're thrilled to announce that after finishing our evaluations, the new Mixtral model is the top-performing pretrained model on the Open LLM Leaderboard! Congratulations to Mistral! 🏆👏

For a detailed breakdown, you can view the results here: results

To see how it compares with others, visit the leaderboard: leaderboard page

For an in-depth look at the evaluation process, check the "📝 About" section on the leaderboard page.
[screenshot: Open LLM Leaderboard results]

> @jphme Improvement on GSM8K is hopefully great for reasoning (but still significantly behind the smaller Claude models...)
>
> However, the Claude models are aligned LLMs rather than foundation LLMs. Perhaps the aligned version of Mixtral-8x22B-v0.1 will be greatly improved. 🧐

Do you mean the instruct version? Given that it was finally released today, I hope someone can benchmark it.

Working on MT-Bench for now.

MT-Bench

| Model                       |MT-Bench|
|-----------------------------|-------:|
| Claude 3 Opus               |    9.43|
| GPT-4-1106-Preview          |    9.32|
| Claude 3 Sonnet             |    9.18|
| WizardLM-2 8x22B            |    9.12|
| GPT-4-0314                  |    8.96|
| Mixtral-8x22B-Instruct-v0.1 |    8.66|
| zephyr-orpo-141b-A35b-v0.1  |    8.17|
| Matter-0.2-8x22B            |    8.00|
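
For anyone who wants to reproduce this or add a model to the table, here is a sketch of the usual MT-Bench workflow via FastChat's `llm_judge` scripts. The script names follow the FastChat repo; the checkout path, model path, and model id below are assumptions, and GPT-4 judging requires an OpenAI API key:

```python
# Sketch of the standard MT-Bench (FastChat llm_judge) flow, run from the
# fastchat/llm_judge directory of a FastChat checkout. Paths/ids are placeholders.
import subprocess

LLM_JUDGE_DIR = "FastChat/fastchat/llm_judge"          # assumed checkout location
MODEL_PATH = "mistralai/Mixtral-8x22B-Instruct-v0.1"   # assumed HF repo id
MODEL_ID = "mixtral-8x22b-instruct"                    # local label for answer files

# 1) Generate answers to the 80 MT-Bench questions with the model under test.
subprocess.run(["python", "gen_model_answer.py",
                "--model-path", MODEL_PATH, "--model-id", MODEL_ID],
               cwd=LLM_JUDGE_DIR, check=True)

# 2) Have GPT-4 grade each answer (single-answer grading, 1-10 per turn).
subprocess.run(["python", "gen_judgment.py", "--model-list", MODEL_ID],
               cwd=LLM_JUDGE_DIR, check=True)

# 3) Print the aggregated score, i.e. the single number reported per model above.
subprocess.run(["python", "show_result.py", "--model-list", MODEL_ID],
               cwd=LLM_JUDGE_DIR, check=True)
```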
