Benchmarks are here!

#4
by 0-hero - opened

Updated ARC
[screenshot: updated ARC benchmark comparison]

Original
[screenshot: original benchmark comparison]

Mistral Community org

!!! Actually wild! Great work on the benchmarks!

@0-hero The Arc scores don't make sense to me. Doesn't the Llama 2 70b base only have an Arc of around 68? I read that the Arc used to evaluate GPT4, Opus and other proprietary models is very different. So for example, if Mixtral 8x22b has an Arc of 70.5, then GPT4 has an equivalent Arc score of about 83, not 96.3.

Yes my bad there, I was going to add a note here. Forgot about that due to my excitement!

[screenshot: corrected benchmark table]

EDIT - Added a new image

@0-hero The Arc scores don't make sense to me. Doesn't the Llama 2 70b base only have an Arc of around 68? I read that the Arc used to evaluate GPT4, Opus and other proprietary models is very different. So for example, if Mixtral 8x22b has an Arc of 70.5, then GPT4 has an equivalent Arc score of about 83, not 96.3.

Yep, that's also what I thought.
Improvement on GSM8K is hopefully great for reasoning (but still significantly behind the smaller Claude models...); MMLU shows a great improvement as well (I'm skeptical that Qwen 1.5's MMLU score is legit).

@jphme Improvement on GSM8K is hopefully great for reasoning (but still significantly behind the smaller Claude models...);

However, the Claude models are aligned LLMs rather than foundation LLMs. Perhaps the aligned version of Mixtral-8x22B-v0.1 will be greatly improved.🧐

I wonder whether this is an older base version of Mistral Large.

Any benchmarks on code writing?

I'll run EvalPlus on a finetune tomorrow if someone hasn't already done it.
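For anyone who wants to try in the meantime, here is a minimal sketch of preparing HumanEval+ generations with the evalplus package; the `gen_solution` helper is a hypothetical placeholder for whatever inference call you use, not part of EvalPlus.

```python
# Hedged sketch: collect HumanEval+ completions for scoring with EvalPlus.
# gen_solution is a hypothetical stand-in for your own model inference call.
from evalplus.data import get_human_eval_plus, write_jsonl

def gen_solution(prompt: str) -> str:
    # call your finetuned model here and return the generated code
    raise NotImplementedError

samples = [
    dict(task_id=task_id, solution=gen_solution(problem["prompt"]))
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)
# the resulting samples.jsonl is then scored with EvalPlus's evaluate command
```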

We also did some benchmark runs on the NousResearch suite earlier today (by @bjoernp / DiscoResearch). Note that on some of these, finetuned versions usually do significantly better than base models (thanks @clem for the reminder to post here as well); a reproduction sketch follows the tables below:

0-shot

|            Tasks             |Version|Filter|n-shot| Metric |Value |   |Stderr|
|------------------------------|------:|------|-----:|--------|-----:|---|-----:|
|agieval_sat_math              |      1|none  |     0|acc     |0.5409|±  |0.0337|
|                              |       |none  |     0|acc_norm|0.4227|±  |0.0334|
|agieval_sat_en_without_passage|      1|none  |     0|acc     |0.5825|±  |0.0344|
|                              |       |none  |     0|acc_norm|0.4903|±  |0.0349|
|agieval_sat_en                |      1|none  |     0|acc     |0.8301|±  |0.0262|
|                              |       |none  |     0|acc_norm|0.7476|±  |0.0303|
|agieval_lsat_rc               |      1|none  |     0|acc     |0.7472|±  |0.0265|
|                              |       |none  |     0|acc_norm|0.5799|±  |0.0301|
|agieval_lsat_lr               |      1|none  |     0|acc     |0.5745|±  |0.0219|
|                              |       |none  |     0|acc_norm|0.4471|±  |0.0220|
|agieval_lsat_ar               |      1|none  |     0|acc     |0.2435|±  |0.0284|
|                              |       |none  |     0|acc_norm|0.2174|±  |0.0273|
|agieval_logiqa_en             |      1|none  |     0|acc     |0.3963|±  |0.0192|
|                              |       |none  |     0|acc_norm|0.3840|±  |0.0191|
|agieval_aqua_rat              |      1|none  |     0|acc     |0.2677|±  |0.0278|
|                              |       |none  |     0|acc_norm|0.2795|±  |0.0282|
 
Mixtral-8x22b

|    Tasks    |Version|Filter|n-shot| Metric |Value |   |Stderr|
|-------------|------:|------|-----:|--------|-----:|---|-----:|
|piqa         |      1|none  |     0|acc     |0.8313|±  |0.0087|
|             |       |none  |     0|acc_norm|0.8487|±  |0.0084|
|boolq        |      2|none  |     0|acc     |0.8780|±  |0.0057|
|arc_challenge|      1|none  |     0|acc     |0.5922|±  |0.0144|
|             |       |none  |     0|acc_norm|0.6365|±  |0.0141|
|arc_easy     |      1|none  |     0|acc     |0.8577|±  |0.0072|
|             |       |none  |     0|acc_norm|0.8401|±  |0.0075|
|winogrande   |      1|none  |     0|acc     |0.7979|±  |0.0113|
|openbookqa   |      1|none  |     0|acc     |0.3640|±  |0.0215|
|             |       |none  |     0|acc_norm|0.4960|±  |0.0224|
|hellaswag    |      1|none  |     0|acc     |0.6719|±  |0.0047|
|             |       |none  |     0|acc_norm|0.8617|±  |0.0034|
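For completeness, here is a rough sketch of how numbers in this format could be reproduced with EleutherAI's lm-evaluation-harness (whose table output the above matches); the model repo id, dtype, and batch size below are assumptions, not the settings actually used.

```python
# Hedged sketch: 0-shot evaluation with lm-evaluation-harness (lm-eval >= 0.4).
# Model id, dtype and batch size are assumptions, not the original settings.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistral-community/Mixtral-8x22B-v0.1,dtype=bfloat16",
    tasks=["piqa", "boolq", "arc_challenge", "arc_easy",
           "winogrande", "openbookqa", "hellaswag"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"]["arc_challenge"])  # per-task acc / acc_norm / stderr
```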

How much sense does it make to use base models for 0-shot benchmarks like TruthfulQA? I mean, this benchmark asks questions, which is perfect for instruction fine-tuned models but not for base models. Or am I missing something?

This was quick, thanks for all the work! I'd love to see the numbers for GPQA 0-shot CoT or DROP 3-shot as well, please, if it's possible!

@Acrobatix It's still useful to have baseline numbers pre-instruction tuning. Say you have two base models, A and B, and only have the budget to fine-tune one of them: if A outperforms B on the target task, it makes sense to pick A for tuning. Also, having baseline numbers will confirm that your instruction tuning was set up correctly and was useful (because the numbers post-tuning should be substantially improved).

Are you sure this is the BASE and not the INSTRUCT version? It looks better than GPT-3.5, I am confused.

Mistral Community org

Are you sure this is the BASE and not the INSTRUCT version? It looks better than GPT-3.5, I am confused.

The instruct version has not been released yet.

@ASHIDAKA Even though it's a foundation model, it responds surprisingly well to various prompts.

Hey @jphme what is the average score you get for AGIEval?

Oh nice, is this fine-tuned? The chat is good.

Hey @jphme what is the average score you get for AGIEval?

52.23; see also here for a graphic comparing it to the Mixtral 8x7B and Qwen 72B base models.

[chart: AGIEval comparison of Mixtral 8x22B, Mixtral 8x7B, and Qwen 72B base models]

Source - https://x.com/alpayariyak/status/1778329833514098832?s=46&t=ZC6wgu7iLucRMVlNeDgmYQ
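As a sanity check, here is a quick sketch of recomputing that average from the per-task acc values in the 0-shot AGIEval table above; the posted 52.23 presumably follows the NousResearch convention of mixing acc and acc_norm per sub-task, so this plain mean differs slightly.

```python
# Hedged sketch: unweighted mean of the AGIEval 0-shot acc values posted above.
# The reported 52.23 presumably mixes acc/acc_norm per sub-task, so this
# plain mean is only approximate.
agieval_acc = {
    "agieval_sat_math": 0.5409,
    "agieval_sat_en_without_passage": 0.5825,
    "agieval_sat_en": 0.8301,
    "agieval_lsat_rc": 0.7472,
    "agieval_lsat_lr": 0.5745,
    "agieval_lsat_ar": 0.2435,
    "agieval_logiqa_en": 0.3963,
    "agieval_aqua_rat": 0.2677,
}
average = sum(agieval_acc.values()) / len(agieval_acc)
print(f"AGIEval (plain acc) average: {average:.2%}")  # ~52.28%
```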

That chart can be misleading.
"They're all base models", so there is still room for improvement.

https://x.com/AlpayAriyak/status/1778332041261572388

What are the GPU requirements for finetuning this model?

Hi everyone!

We're thrilled to announce that after finishing our evaluations, the new Mixtral model is the top-performing pretrained model on the Open LLM Leaderboard! Congratulations to Mistral! 🏆👏

For a detailed breakdown, you can view the results here: results

To see how it compares with others, visit the leaderboard: leaderboard page

For an in-depth look at the evaluation process, check the "📝 About" section on the leaderboard page
[screenshot: Open LLM Leaderboard]

@jphme Improvement on GSM8K is hopefully great for reasoning (but still significantly behind the smaller Claude models...);

However, the Claude models are aligned LLMs rather than foundation LLMs. Perhaps the aligned version of Mixtral-8x22B-v0.1 will be greatly improved.🧐

Do you mean the instruct version? Given that it was finally released today, I hope someone can benchmark it.

Working on MT-Bench for now.

MT-Bench

|Model                       |MT-Bench|
|----------------------------|-------:|
|Claude 3 Opus               |    9.43|
|GPT-4-1106-Preview          |    9.32|
|Claude 3 Sonnet             |    9.18|
|WizardLM-2 8x22B            |    9.12|
|GPT-4-0314                  |    8.96|
|Mixtral-8x22B-Instruct-v0.1 |    8.66|
|zephyr-orpo-141b-A35b-v0.1  |    8.17|
|Matter-0.2-8x22B            |    8.00|
