| Category | Benchmark | Phi-3.5 Mini-Ins | Mistral-Nemo-12B-Ins-2407 | Llama-3.1-8B-Ins | Gemma-2-9B-Ins | Gemini 1.5 Flash |
|---|---|---|---|---|---|---|
| Popular aggregated benchmark | Arena Hard | 37 | 39.4 | 25.7 | 42 | 55.2 |
| | BigBench Hard CoT (0-shot) | 69 | 60.2 | 63.4 | 63.5 | 66.7 |
| | MMLU (5-shot) | 69 | 67.2 | 68.1 | 71.3 | 78.7 |
| | MMLU-Pro (0-shot, CoT) | 47.4 | 40.7 | 44 | 50.1 | 57.2 |
| Reasoning | ARC Challenge (10-shot) | 84.6 | 84.8 | 83.1 | 89.8 | 92.8 |
| | TruthfulQA (MC2) (10-shot) | 64 | 68.1 | 69.2 | 76.6 | 76.6 |
| | WinoGrande (5-shot) | 68.5 | 70.4 | 64.7 | 74 | 74.7 |
| Multilingual | Multilingual MMLU (5-shot) | 55.4 | 58.9 | 56.2 | 63.8 | 77.2 |
| Math | GSM8K (8-shot, CoT) | 86.2 | 84.2 | 82.4 | 84.9 | 82.4 |
| | MATH (0-shot, CoT) | 48.5 | 31.2 | 47.6 | 50.9 | 38 |
| Long context | Qasper | 41.9 | 30.7 | 37.2 | 13.9 | 43.5 |
| | SQuALITY | 24.3 | 25.8 | 26.2 | 0 | 23.5 |
| Code Generation | HumanEval (0-shot) | 62.8 | 63.4 | 66.5 | 61 | 74.4 |
| | MBPP (3-shot) | 69.6 | 68.1 | 69.4 | 69.3 | 77.5 |