
GLORT2 (GLORT2 Low Rank Transformer Transformer) is a transformer model in which every single linear layer is itself another, smaller transformer. I combined q, k, and v into one operation, which means one inner transformer instead of three, to save on parameters. I also played with using a transformer on the embeddings, but it wasn't... great. The outer model is 768-dim with 10 layers, and each replaced linear layer is a 384-dim, 1-layer transformer (the embeddings and LM head are left as normal layers).

Also, sorry: I just realized there's some residue from the project I copied the model code out of, including some "expanded lm head size" stuff. Just ignore it if you're looking at the config and code; this isn't a serious project, so I don't care too much that it's there.
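If the idea is hard to picture, here's a minimal PyTorch sketch of what a "transformer as a linear layer" could look like, including the fused-qkv trick. Everything here (`TransformerLinear`, `FusedQKV`, the projection scheme) is a hypothetical reconstruction, not the actual repo code:

```python
import torch
import torch.nn as nn

class TransformerLinear(nn.Module):
    """Hypothetical sketch (not the actual GLORT2 code): a drop-in
    nn.Linear replacement that runs a tiny 1-layer, 384-dim transformer
    over the sequence instead of doing a single matmul."""
    def __init__(self, d_in: int, d_out: int, d_inner: int = 384, n_heads: int = 6):
        super().__init__()
        self.proj_in = nn.Linear(d_in, d_inner)    # down into the inner width
        layer = nn.TransformerEncoderLayer(
            d_model=d_inner, nhead=n_heads,
            dim_feedforward=4 * d_inner, batch_first=True,
        )
        self.inner = nn.TransformerEncoder(layer, num_layers=1)
        self.proj_out = nn.Linear(d_inner, d_out)  # back out to the target width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in); a real causal LM would also need to pass
        # a causal mask into the inner transformer
        return self.proj_out(self.inner(self.proj_in(x)))

class FusedQKV(nn.Module):
    """The 'qkv in one operation' trick: a single inner transformer emits
    q, k, and v together, instead of three separate ones."""
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.qkv = TransformerLinear(d_model, 3 * d_model)

    def forward(self, x: torch.Tensor):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        return q, k, v
```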

512-token strided perplexity on a Pile test set:

| Model | Params | Perplexity | Training tokens |
|---|---|---|---|
| Cerebras-GPT 111M | 111M | 21.5507 | 2.2B |
| Cerebras-GPT 256M | 256M | 15.2035 | 5.1B |
| Cerebras-GPT 590M | 590M | 12.0982 | ≈11B |
| Pythia 70M (deduped) | 95.6M | 22.3934 | 300B |
| Pythia 160M (deduped) | 213M | 13.9338 | 300B |
| Pythia 410M (deduped) | 506M | 9.6184 | 300B |
| Llama arch, same settings as Cerebras 111M | 119M | 13.8823 | 2.2B |
| Llama arch + Llama 70B embeddings, same settings | 369M | 13.5651 | 2.2B |
| GLORT2 | 205M | 13.0517 | 2.2B |
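For reference, "512-token strided perplexity" here means (I'm assuming) the usual sliding-window evaluation with a 512-token stride, roughly like this sketch; the 2048 context length and the local Pile dump are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(
    "crumb/GLORT2", trust_remote_code=True
).to(device).eval()
tok = AutoTokenizer.from_pretrained("crumb/GLORT2")

text = open("pile_test.txt").read()   # assumed: local dump of the Pile test set
enc = tok(text, return_tensors="pt")
max_length, stride = 2048, 512        # 2048 context is an assumption
seq_len = enc.input_ids.size(1)

nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end                 # only score the new tokens
    input_ids = enc.input_ids[:, begin:end].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100          # mask the overlapping context
    with torch.no_grad():
        # loss is the mean NLL over scored tokens; rescale to a sum
        nlls.append(model(input_ids, labels=target_ids).loss * trg_len)
    prev_end = end
    if end == seq_len:
        break

print(torch.exp(torch.stack(nlls).sum() / prev_end))  # perplexity
```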
| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| arc_challenge | 1 | none | 25 | acc | 0.1706 | ± 0.0110 |
| | | none | 25 | acc_norm | 0.2099 | ± 0.0119 |
| truthfulqa_mc2 | 2 | none | 0 | acc | 0.4599 | ± 0.0154 |
| winogrande | 1 | none | 5 | acc | 0.5083 | ± 0.0141 |
| hellaswag | 1 | none | 10 | acc | 0.2728 | ± 0.0044 |
| | | none | 10 | acc_norm | 0.2815 | ± 0.0045 |
| gsm8k | 2 | get-answer | 5 | exact_match | 0 | ± 0 |
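These came out of EleutherAI's lm-evaluation-harness. If you want to rerun them, something like this should be close; I'm assuming the v0.4+ Python API, with one call per task since the n-shot settings differ:

```python
# Assumed: lm-evaluation-harness v0.4+ (pip install lm-eval)
from lm_eval import simple_evaluate

for task, shots in [("arc_challenge", 25), ("truthfulqa_mc2", 0),
                    ("winogrande", 5), ("hellaswag", 10), ("gsm8k", 5)]:
    out = simple_evaluate(
        model="hf",
        model_args="pretrained=crumb/GLORT2,trust_remote_code=True",
        tasks=[task],
        num_fewshot=shots,
    )
    print(task, out["results"][task])
```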

MMLU

Mean accuracy is 0.2639, I think. All tasks below were run 5-shot (version 0, filter none, metric acc):

| Task | acc | Stderr |
|---|---|---|
| world_religions | 0.1988 | ± 0.0306 |
| virology | 0.1928 | ± 0.0307 |
| us_foreign_policy | 0.2600 | ± 0.0441 |
| sociology | 0.2438 | ± 0.0304 |
| security_studies | 0.4000 | ± 0.0314 |
| public_relations | 0.2273 | ± 0.0401 |
| professional_psychology | 0.2484 | ± 0.0175 |
| professional_medicine | 0.4485 | ± 0.0302 |
| professional_law | 0.2445 | ± 0.0110 |
| professional_accounting | 0.2482 | ± 0.0258 |
| prehistory | 0.2562 | ± 0.0243 |
| philosophy | 0.2186 | ± 0.0235 |
| nutrition | 0.2941 | ± 0.0261 |
| moral_scenarios | 0.2503 | ± 0.0145 |
| moral_disputes | 0.1965 | ± 0.0214 |
| miscellaneous | 0.2554 | ± 0.0156 |
| medical_genetics | 0.3000 | ± 0.0461 |
| marketing | 0.1966 | ± 0.0260 |
| management | 0.1942 | ± 0.0392 |
| machine_learning | 0.2321 | ± 0.0401 |
| logical_fallacies | 0.2331 | ± 0.0332 |
| jurisprudence | 0.2407 | ± 0.0413 |
| international_law | 0.3719 | ± 0.0441 |
| human_sexuality | 0.2137 | ± 0.0360 |
| human_aging | 0.2646 | ± 0.0296 |
| high_school_world_history | 0.2489 | ± 0.0281 |
| high_school_us_history | 0.2304 | ± 0.0296 |
| high_school_statistics | 0.4722 | ± 0.0340 |
| high_school_psychology | 0.3083 | ± 0.0198 |
| high_school_physics | 0.3046 | ± 0.0376 |
| high_school_microeconomics | 0.3361 | ± 0.0307 |
| high_school_mathematics | 0.2630 | ± 0.0268 |
| high_school_macroeconomics | 0.3231 | ± 0.0237 |
| high_school_government_and_politics | 0.3523 | ± 0.0345 |
| high_school_geography | 0.3384 | ± 0.0337 |
| high_school_european_history | 0.2909 | ± 0.0355 |
| high_school_computer_science | 0.2600 | ± 0.0441 |
| high_school_chemistry | 0.2709 | ± 0.0313 |
| high_school_biology | 0.3161 | ± 0.0265 |
| global_facts | 0.1800 | ± 0.0386 |
| formal_logic | 0.1667 | ± 0.0333 |
| elementary_mathematics | 0.2540 | ± 0.0224 |
| electrical_engineering | 0.3103 | ± 0.0386 |
| econometrics | 0.2895 | ± 0.0427 |
| conceptual_physics | 0.2553 | ± 0.0285 |
| computer_security | 0.1900 | ± 0.0394 |
| college_physics | 0.3431 | ± 0.0472 |
| college_medicine | 0.2312 | ± 0.0321 |
| college_mathematics | 0.1800 | ± 0.0386 |
| college_computer_science | 0.3000 | ± 0.0461 |
| college_chemistry | 0.2900 | ± 0.0456 |
| college_biology | 0.2083 | ± 0.0340 |
| clinical_knowledge | 0.2038 | ± 0.0248 |
| business_ethics | 0.2100 | ± 0.0409 |
| astronomy | 0.1908 | ± 0.0320 |
| anatomy | 0.2963 | ± 0.0394 |
| abstract_algebra | 0.2000 | ± 0.0402 |