Currently broken because of updates to the transformers library. Let me reimplement and retrain it.

GLORT2 (GLORT2 Low Rank Transformer Transformer) is a transformer model where every single linear layer is another, smaller transformer model. I combined q, k, and v into one operation, which means one transformer instead of three, to save on parameters. I played with using a transformer on the embeddings too, but it wasn't great. The outer model is 768-dim with 10 layers, and the replacements for the linear layers (everything besides the embedding and LM head) are 384-dim, 1-layer transformers.
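
To make the fused qkv idea concrete, here is a minimal pure-Python sketch. It is not the GLORT2 code: `fused_qkv` and `toy_inner` are hypothetical names, the `[q | k | v]` output layout is an assumption, and `toy_inner` is only a stand-in for the small inner transformer. The point is that one inner module is called once and its output is sliced three ways, instead of keeping three separate inner modules.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def fused_qkv(
    x: Vector, inner: Callable[[Vector], Vector], d_model: int
) -> Tuple[Vector, Vector, Vector]:
    """Call one inner module that emits 3 * d_model values and slice out q, k, v.

    The [q | k | v] ordering is an assumption made for this sketch.
    """
    out = inner(x)
    if len(out) != 3 * d_model:
        raise ValueError("inner module must return 3 * d_model values")
    q = out[:d_model]
    k = out[d_model:2 * d_model]
    v = out[2 * d_model:]
    return q, k, v


def toy_inner(x: Vector) -> Vector:
    """Hypothetical stand-in for the inner transformer: three scaled copies of x."""
    return [1.0 * a for a in x] + [2.0 * a for a in x] + [3.0 * a for a in x]


q, k, v = fused_qkv([1.0, 2.0], toy_inner, d_model=2)
```

With three separate projections you would pay for three inner models per attention block; fusing them pays for one with a wider output.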

Also, sorry: I just realized there's some residue from where I copied the model code out of my own projects, including some "expanded lm head size" stuff. Just ignore that if you're looking at the config and code. This isn't a serious project, so I don't care too much that it's there.

| model | 512-token strided perplexity on a pile test set | tokens |
|---|---|---|
| cerebras 111m | 21.550655364990234 | 2.2b |
| cerebras 256m | 15.203496932983398 | 5.1b |
| cerebras 590m | 12.098200798034668 | 11.something b |
| deduped pythia 70m (95.6M) | 22.393400192260742 | 300b |
| deduped pythia 160m (213M) | 13.933751106262207 | 300b |
| deduped pythia 410m (506M) | 9.61842155456543 | 300b |
| llama w same settings as cerebras 111m (119m) | 13.882301330566406 | 2.2b |
| llama plus w same settings as cerebras 111m and llama 70b embeddings (369m) | 13.565109252929688 | 2.2b |
| GLORT2 (205m) | 13.051741600036621 | 2.2b |

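
For readers unfamiliar with strided perplexity, here is a rough sketch of the windowing arithmetic. It is an illustration, not the script used for the numbers above: `strided_windows` and `perplexity` are hypothetical helpers, and whether "512-token" refers to the stride, the window length, or both is not stated here, so both are left as parameters. Each window scores only the tokens that no earlier window has scored, and perplexity is the exponential of the mean per-token negative log-likelihood.

```python
import math


def strided_windows(seq_len, max_length, stride):
    """Yield (begin, end, n_scored) windows that together cover seq_len tokens.

    Each window holds at most max_length tokens. Only its last n_scored
    tokens are scored; any earlier tokens in the window are context only.
    """
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        n_scored = end - prev_end  # tokens not yet scored by an earlier window
        yield begin, end, n_scored
        prev_end = end
        if end == seq_len:
            break


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Example: 1200 tokens, windows of up to 1024 tokens, moved 512 tokens at a time.
windows = list(strided_windows(1200, max_length=1024, stride=512))
```

A stride smaller than the window gives every scored token more left context at the cost of more forward passes. Setting the stride equal to the window length reduces this to plain non-overlapping chunks.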

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| arc_challenge | 1 | none | 25 | acc | 0.1706 | ± 0.0110 |
| | | none | 25 | acc_norm | 0.2099 | ± 0.0119 |
| truthfulqa_mc2 | 2 | none | 0 | acc | 0.4599 | ± 0.0154 |
| winogrande | 1 | none | 5 | acc | 0.5083 | ± 0.0141 |
| hellaswag | 1 | none | 10 | acc | 0.2728 | ± 0.0044 |
| | | none | 10 | acc_norm | 0.2815 | ± 0.0045 |
| gsm8k | 2 | get-answer | 5 | exact_match | 0 | ± 0 |

mmlu

The mean accuracy is 0.26394385964912276, I think.

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| world_religions | 0 | none | 5 | acc | 0.1988 | ± 0.0306 |
| virology | 0 | none | 5 | acc | 0.1928 | ± 0.0307 |
| us_foreign_policy | 0 | none | 5 | acc | 0.2600 | ± 0.0441 |
| sociology | 0 | none | 5 | acc | 0.2438 | ± 0.0304 |
| security_studies | 0 | none | 5 | acc | 0.4000 | ± 0.0314 |
| public_relations | 0 | none | 5 | acc | 0.2273 | ± 0.0401 |
| professional_psychology | 0 | none | 5 | acc | 0.2484 | ± 0.0175 |
| professional_medicine | 0 | none | 5 | acc | 0.4485 | ± 0.0302 |
| professional_law | 0 | none | 5 | acc | 0.2445 | ± 0.0110 |
| professional_accounting | 0 | none | 5 | acc | 0.2482 | ± 0.0258 |
| prehistory | 0 | none | 5 | acc | 0.2562 | ± 0.0243 |
| philosophy | 0 | none | 5 | acc | 0.2186 | ± 0.0235 |
| nutrition | 0 | none | 5 | acc | 0.2941 | ± 0.0261 |
| moral_scenarios | 0 | none | 5 | acc | 0.2503 | ± 0.0145 |
| moral_disputes | 0 | none | 5 | acc | 0.1965 | ± 0.0214 |
| miscellaneous | 0 | none | 5 | acc | 0.2554 | ± 0.0156 |
| medical_genetics | 0 | none | 5 | acc | 0.3000 | ± 0.0461 |
| marketing | 0 | none | 5 | acc | 0.1966 | ± 0.0260 |
| management | 0 | none | 5 | acc | 0.1942 | ± 0.0392 |
| machine_learning | 0 | none | 5 | acc | 0.2321 | ± 0.0401 |
| logical_fallacies | 0 | none | 5 | acc | 0.2331 | ± 0.0332 |
| jurisprudence | 0 | none | 5 | acc | 0.2407 | ± 0.0413 |
| international_law | 0 | none | 5 | acc | 0.3719 | ± 0.0441 |
| human_sexuality | 0 | none | 5 | acc | 0.2137 | ± 0.0360 |
| human_aging | 0 | none | 5 | acc | 0.2646 | ± 0.0296 |
| high_school_world_history | 0 | none | 5 | acc | 0.2489 | ± 0.0281 |
| high_school_us_history | 0 | none | 5 | acc | 0.2304 | ± 0.0296 |
| high_school_statistics | 0 | none | 5 | acc | 0.4722 | ± 0.0340 |
| high_school_psychology | 0 | none | 5 | acc | 0.3083 | ± 0.0198 |
| high_school_physics | 0 | none | 5 | acc | 0.3046 | ± 0.0376 |
| high_school_microeconomics | 0 | none | 5 | acc | 0.3361 | ± 0.0307 |
| high_school_mathematics | 0 | none | 5 | acc | 0.2630 | ± 0.0268 |
| high_school_macroeconomics | 0 | none | 5 | acc | 0.3231 | ± 0.0237 |
| high_school_government_and_politics | 0 | none | 5 | acc | 0.3523 | ± 0.0345 |
| high_school_geography | 0 | none | 5 | acc | 0.3384 | ± 0.0337 |
| high_school_european_history | 0 | none | 5 | acc | 0.2909 | ± 0.0355 |
| high_school_computer_science | 0 | none | 5 | acc | 0.2600 | ± 0.0441 |
| high_school_chemistry | 0 | none | 5 | acc | 0.2709 | ± 0.0313 |
| high_school_biology | 0 | none | 5 | acc | 0.3161 | ± 0.0265 |
| global_facts | 0 | none | 5 | acc | 0.1800 | ± 0.0386 |
| formal_logic | 0 | none | 5 | acc | 0.1667 | ± 0.0333 |
| elementary_mathematics | 0 | none | 5 | acc | 0.2540 | ± 0.0224 |
| electrical_engineering | 0 | none | 5 | acc | 0.3103 | ± 0.0386 |
| econometrics | 0 | none | 5 | acc | 0.2895 | ± 0.0427 |
| conceptual_physics | 0 | none | 5 | acc | 0.2553 | ± 0.0285 |
| computer_security | 0 | none | 5 | acc | 0.1900 | ± 0.0394 |
| college_physics | 0 | none | 5 | acc | 0.3431 | ± 0.0472 |
| college_medicine | 0 | none | 5 | acc | 0.2312 | ± 0.0321 |
| college_mathematics | 0 | none | 5 | acc | 0.1800 | ± 0.0386 |
| college_computer_science | 0 | none | 5 | acc | 0.3000 | ± 0.0461 |
| college_chemistry | 0 | none | 5 | acc | 0.2900 | ± 0.0456 |
| college_biology | 0 | none | 5 | acc | 0.2083 | ± 0.0340 |
| clinical_knowledge | 0 | none | 5 | acc | 0.2038 | ± 0.0248 |
| business_ethics | 0 | none | 5 | acc | 0.2100 | ± 0.0409 |
| astronomy | 0 | none | 5 | acc | 0.1908 | ± 0.0320 |
| anatomy | 0 | none | 5 | acc | 0.2963 | ± 0.0394 |
| abstract_algebra | 0 | none | 5 | acc | 0.2000 | ± 0.0402 |
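
As a sanity check on the mean quoted above, the snippet below averages the 57 per-subject accuracies from the table, in table order. It assumes the quoted figure is a plain unweighted mean over subjects (not weighted by the number of questions per subject); `accs` and `mean_acc` are just names chosen for this sketch.

```python
# Per-subject MMLU accuracies, copied from the table in the same row order.
accs = [
    0.1988, 0.1928, 0.2600, 0.2438, 0.4000, 0.2273,
    0.2484, 0.4485, 0.2445, 0.2482, 0.2562, 0.2186,
    0.2941, 0.2503, 0.1965, 0.2554, 0.3000, 0.1966,
    0.1942, 0.2321, 0.2331, 0.2407, 0.3719, 0.2137,
    0.2646, 0.2489, 0.2304, 0.4722, 0.3083, 0.3046,
    0.3361, 0.2630, 0.3231, 0.3523, 0.3384, 0.2909,
    0.2600, 0.2709, 0.3161, 0.1800, 0.1667, 0.2540,
    0.3103, 0.2895, 0.2553, 0.1900, 0.3431, 0.2312,
    0.1800, 0.3000, 0.2900, 0.2083, 0.2038, 0.2100,
    0.1908, 0.2963, 0.2000,
]

# Unweighted (macro) mean: every subject counts equally.
mean_acc = sum(accs) / len(accs)
print(len(accs), mean_acc)
```

Under that assumption the result agrees with the quoted mean up to floating-point rounding. That is close to the 0.25 you would expect from guessing on four-choice questions.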