| Benchmark | Measure | 160M MiniPile | 160M Reproduction | Percentage Difference of Means | 95% Confidence Interval | Interpretation |
|---|---|---|---|---|---|---|
| ARC-Challenge | acc | 0.2125 ± 0.0120 | 0.1894 ± 0.0115 | -10.8706 | (0.0095; -0.0577) | Difference not significant |
| MMLU | acc | 0.2699 ± 0.0037 | 0.2295 ± 0.0035 | -14.9685 | (-0.0304; -0.0504) | MiniPile better |
| HellaSwag | acc | 0.2560 ± 0.0044 | 0.2604 ± 0.0044 | 1.7188 | (0.0166; -0.0078) | Difference not significant |
| WinoGrande | acc | 0.4720 ± 0.0140 | 0.5122 ± 0.0140 | 8.5169 | (0.0790; 0.0014) | Reproduction better |
| Lambada (OpenAI) | acc | 0.0000 ± 0.0000 | 0.0000 ± 0.0000 | - | - | - |
| Lambada (OpenAI) | perplexity | 3033175.2693 ± 288926.5827 | 1854408.3999 ± 148101.5978 | -38.8625 | (-542407.4980; -1815126.2408) | Reproduction severely better |
| Lambada (Std) | acc | 0.0000 ± 0.0000 | 0.0000 ± 0.0000 | - | - | - |
| Lambada (Std) | perplexity | 27067951.3460 ± 2710040.1910 | 11927123.2514 ± 1063672.9280 | -55.9364 | (-9434663.1814; -20846993.0080) | Reproduction severely better |
| BLiMP | acc | 0.5194 ± 0.0018 | 0.5481 ± 0.0017 | 5.5256 | (0.0336; 0.0238) | Reproduction better |
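As a sanity check on the table, the "Percentage Difference of Means" and "95% Confidence Interval" columns can be reproduced from the reported mean ± standard-error pairs. The sketch below assumes a relative percent change from the MiniPile baseline and a normal-approximation interval for the difference of two independent means; the exact procedure used to build the table is not stated, so treat this as an assumption.

```python
import math

def pct_diff(base: float, repro: float) -> float:
    """Percent change from the MiniPile baseline to the reproduction
    (assumed definition of the 'Percentage Difference of Means' column)."""
    return (repro - base) / base * 100.0

def ci95(base: float, base_se: float, repro: float, repro_se: float):
    """Normal-approximation 95% CI for the difference of two means,
    returned as (upper; lower) to match the table's ordering."""
    diff = repro - base
    half = 1.96 * math.sqrt(base_se**2 + repro_se**2)
    return diff + half, diff - half

# WinoGrande row from the table: 0.4720 ± 0.0140 vs. 0.5122 ± 0.0140
print(round(pct_diff(0.4720, 0.5122), 4))                 # matches 8.5169
print(tuple(round(x, 4) for x in ci95(0.4720, 0.0140,
                                      0.5122, 0.0140)))   # matches (0.0790; 0.0014)
```

Under this reading, a row is "not significant" when the interval straddles zero, which is consistent with the ARC-Challenge and HellaSwag interpretations above.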
Model size: 162M params · Tensor type: F32 (Safetensors)
Model: Marcus2112/pythia-160m-minipile_reproduction