| Category | Benchmark | Phi-3.5 Mini-Ins | Mistral-Nemo-12B-Ins-2407 | Llama-3.1-8B-Ins | Gemma-2-9B-Ins | Gemini 1.5 Flash |
|---|---|---|---|---|---|---|
| Popular aggregated benchmark | Arena Hard | 37 | 39.4 | 25.7 | 42 | 55.2 |
| | BigBench Hard CoT (0-shot) | 69 | 60.2 | 63.4 | 63.5 | 66.7 |
| | MMLU (5-shot) | 69 | 67.2 | 68.1 | 71.3 | 78.7 |
| | MMLU-Pro (0-shot, CoT) | 47.4 | 40.7 | 44 | 50.1 | 57.2 |
| Reasoning | ARC Challenge (10-shot) | 84.6 | 84.8 | 83.1 | 89.8 | 92.8 |
| | TruthfulQA (MC2) (10-shot) | 64 | 68.1 | 69.2 | 76.6 | 76.6 |
| | WinoGrande (5-shot) | 68.5 | 70.4 | 64.7 | 74 | 74.7 |
| Multilingual | Multilingual MMLU (5-shot) | 55.4 | 58.9 | 56.2 | 63.8 | 77.2 |
| Math | GSM8K (8-shot, CoT) | 86.2 | 84.2 | 82.4 | 84.9 | 82.4 |
| | MATH (0-shot, CoT) | 48.5 | 31.2 | 47.6 | 50.9 | 38 |
| Long context | Qasper | 41.9 | 30.7 | 37.2 | 13.9 | 43.5 |
| | SQuALITY | 24.3 | 25.8 | 26.2 | 0 | 23.5 |
| Code Generation | HumanEval (0-shot) | 62.8 | 63.4 | 66.5 | 61 | 74.4 |
| | MBPP (3-shot) | 69.6 | 68.1 | 69.4 | 69.3 | 77.5 |