---
datasets:
- EleutherAI/the_pile_deduplicated
language:
- en
---
Currently broken because of updates to the transformers library; let me reimplement and retrain.
GLORT2 (GLORT2 Low Rank Transformer Transformer) is a transformer model in which every single linear layer is itself a smaller transformer model. I combined the Q, K, and V projections into one operation, so each attention block needs one replacement transformer instead of three, which saves parameters. I also played with using a transformer on the embeddings, but it wasn't great. The main model is 768-dimensional with 10 layers, and each linear-layer replacement (everything besides the embeddings and the LM head) is a 384-dimensional, 1-layer transformer.
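Since the code isn't runnable right now (see the note above), here is a rough PyTorch sketch of the idea: a stand-in for `nn.Linear` that is itself a tiny transformer, plus an attention block that uses a single fused QKV replacement instead of three separate ones. The class names, head counts, and the down-project / up-project wiring here are illustrative guesses, not the exact GLORT2 code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransformerAsLinear(nn.Module):
    """Stand-in for nn.Linear: down-project to a small width, run a causally
    masked 1-layer transformer over the token sequence, project to the target
    width. The exact inner wiring is a guess, not the released code."""

    def __init__(self, d_in: int, d_out: int, d_inner: int = 384, n_heads: int = 6):
        super().__init__()
        self.down = nn.Linear(d_in, d_inner)
        self.block = nn.TransformerEncoderLayer(
            d_model=d_inner, nhead=n_heads, dim_feedforward=4 * d_inner,
            batch_first=True, norm_first=True,
        )
        self.up = nn.Linear(d_inner, d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_in)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        return self.up(self.block(self.down(x), src_mask=causal))


class GlortSelfAttention(nn.Module):
    """Causal self-attention where the fused QKV projection and the output
    projection are each a TransformerAsLinear instead of an nn.Linear."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = TransformerAsLinear(d_model, 3 * d_model)  # one inner model, not three
        self.proj = TransformerAsLinear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, -1).transpose(1, 2) for z in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(b, t, d))
```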
Also, sorry: I just realized there's some leftover code from the project I copied the model code from, including some "expanded lm head size" stuff. If you're looking at the config and code, just ignore it; this isn't a serious project, so I don't care too much that it's there.
Model | 512-token strided perplexity on a Pile test set | Training tokens |
---|---|---|
Cerebras-GPT 111M | 21.550655364990234 | 2.2B |
Cerebras-GPT 256M | 15.203496932983398 | 5.1B |
Cerebras-GPT 590M | 12.098200798034668 | 11.something B |
Pythia 70M deduped (95.6M) | 22.393400192260742 | 300B |
Pythia 160M deduped (213M) | 13.933751106262207 | 300B |
Pythia 410M deduped (506M) | 9.61842155456543 | 300B |
Llama architecture with the same settings as Cerebras-GPT 111M (119M) | 13.882301330566406 | 2.2B |
Llama architecture with the same settings as Cerebras-GPT 111M, plus Llama 70B embeddings (369M) | 13.565109252929688 | 2.2B |
GLORT2 (205M) | 13.051741600036621 | 2.2B |
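For reference, "512-token strided" perplexity here means the usual sliding-window evaluation. A minimal sketch of that procedure is below, assuming a Hugging Face causal LM and one long tensor of test-set token ids; the window size and other details are placeholders, not the exact eval script.

```python
import math
import torch


@torch.no_grad()
def strided_perplexity(model, token_ids: torch.Tensor,
                       max_length: int = 1024, stride: int = 512) -> float:
    """Sliding-window perplexity over one long token stream: each window
    re-reads up to `max_length` tokens of context, but only tokens not already
    scored by a previous window contribute to the loss."""
    device = next(model.parameters()).device
    nll_sum, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, token_ids.numel(), stride):
        end = min(begin + max_length, token_ids.numel())
        ids = token_ids[begin:end].unsqueeze(0).to(device)
        labels = ids.clone()
        new_tokens = end - prev_end
        labels[:, :-new_tokens] = -100         # mask the overlap with earlier windows
        loss = model(ids, labels=labels).loss  # mean NLL over unmasked (shifted) labels
        scored = (labels != -100).sum().item() - 1  # one label is lost to the shift
        nll_sum += loss.item() * scored
        n_scored += scored
        prev_end = end
        if end == token_ids.numel():
            break
    return math.exp(nll_sum / n_scored)
```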
Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
---|---|---|---|---|---|---|---|
arc_challenge | 1 | none | 25 | acc | 0.1706 | ± | 0.0110 |
arc_challenge | 1 | none | 25 | acc_norm | 0.2099 | ± | 0.0119 |
truthfulqa_mc2 | 2 | none | 0 | acc | 0.4599 | ± | 0.0154 |
winogrande | 1 | none | 5 | acc | 0.5083 | ± | 0.0141 |
hellaswag | 1 | none | 10 | acc | 0.2728 | ± | 0.0044 |
hellaswag | 1 | none | 10 | acc_norm | 0.2815 | ± | 0.0045 |
gsm8k | 2 | get-answer | 5 | exact_match | 0 | ± | 0 |
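These tables are in lm-evaluation-harness format; something along these lines should reproduce them with the v0.4 Python API. The repo id below is a placeholder, and exact arguments may differ slightly across harness versions.

```python
import lm_eval  # pip install lm-eval (EleutherAI lm-evaluation-harness, v0.4.x)

# n-shot settings matching the tables; "your-username/glort2" is a placeholder repo id
SHOTS = {"arc_challenge": 25, "truthfulqa_mc2": 0, "winogrande": 5,
         "hellaswag": 10, "gsm8k": 5, "mmlu": 5}

for task, n_shot in SHOTS.items():
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=your-username/glort2",
        tasks=[task],
        num_fewshot=n_shot,
        batch_size=8,
    )
    print(task, out["results"])
```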
MMLU: the mean accuracy across the subtasks below is about 0.2639, I think.
Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
---|---|---|---|---|---|---|---|
world_religions | 0 | none | 5 | acc | 0.1988 | ± | 0.0306 |
virology | 0 | none | 5 | acc | 0.1928 | ± | 0.0307 |
us_foreign_policy | 0 | none | 5 | acc | 0.2600 | ± | 0.0441 |
sociology | 0 | none | 5 | acc | 0.2438 | ± | 0.0304 |
security_studies | 0 | none | 5 | acc | 0.4000 | ± | 0.0314 |
public_relations | 0 | none | 5 | acc | 0.2273 | ± | 0.0401 |
professional_psychology | 0 | none | 5 | acc | 0.2484 | ± | 0.0175 |
professional_medicine | 0 | none | 5 | acc | 0.4485 | ± | 0.0302 |
professional_law | 0 | none | 5 | acc | 0.2445 | ± | 0.0110 |
professional_accounting | 0 | none | 5 | acc | 0.2482 | ± | 0.0258 |
prehistory | 0 | none | 5 | acc | 0.2562 | ± | 0.0243 |
philosophy | 0 | none | 5 | acc | 0.2186 | ± | 0.0235 |
nutrition | 0 | none | 5 | acc | 0.2941 | ± | 0.0261 |
moral_scenarios | 0 | none | 5 | acc | 0.2503 | ± | 0.0145 |
moral_disputes | 0 | none | 5 | acc | 0.1965 | ± | 0.0214 |
miscellaneous | 0 | none | 5 | acc | 0.2554 | ± | 0.0156 |
medical_genetics | 0 | none | 5 | acc | 0.3000 | ± | 0.0461 |
marketing | 0 | none | 5 | acc | 0.1966 | ± | 0.0260 |
management | 0 | none | 5 | acc | 0.1942 | ± | 0.0392 |
machine_learning | 0 | none | 5 | acc | 0.2321 | ± | 0.0401 |
logical_fallacies | 0 | none | 5 | acc | 0.2331 | ± | 0.0332 |
jurisprudence | 0 | none | 5 | acc | 0.2407 | ± | 0.0413 |
international_law | 0 | none | 5 | acc | 0.3719 | ± | 0.0441 |
human_sexuality | 0 | none | 5 | acc | 0.2137 | ± | 0.0360 |
human_aging | 0 | none | 5 | acc | 0.2646 | ± | 0.0296 |
high_school_world_history | 0 | none | 5 | acc | 0.2489 | ± | 0.0281 |
high_school_us_history | 0 | none | 5 | acc | 0.2304 | ± | 0.0296 |
high_school_statistics | 0 | none | 5 | acc | 0.4722 | ± | 0.0340 |
high_school_psychology | 0 | none | 5 | acc | 0.3083 | ± | 0.0198 |
high_school_physics | 0 | none | 5 | acc | 0.3046 | ± | 0.0376 |
high_school_microeconomics | 0 | none | 5 | acc | 0.3361 | ± | 0.0307 |
high_school_mathematics | 0 | none | 5 | acc | 0.2630 | ± | 0.0268 |
high_school_macroeconomics | 0 | none | 5 | acc | 0.3231 | ± | 0.0237 |
high_school_government_and_politics | 0 | none | 5 | acc | 0.3523 | ± | 0.0345 |
high_school_geography | 0 | none | 5 | acc | 0.3384 | ± | 0.0337 |
high_school_european_history | 0 | none | 5 | acc | 0.2909 | ± | 0.0355 |
high_school_computer_science | 0 | none | 5 | acc | 0.2600 | ± | 0.0441 |
high_school_chemistry | 0 | none | 5 | acc | 0.2709 | ± | 0.0313 |
high_school_biology | 0 | none | 5 | acc | 0.3161 | ± | 0.0265 |
global_facts | 0 | none | 5 | acc | 0.1800 | ± | 0.0386 |
formal_logic | 0 | none | 5 | acc | 0.1667 | ± | 0.0333 |
elementary_mathematics | 0 | none | 5 | acc | 0.2540 | ± | 0.0224 |
electrical_engineering | 0 | none | 5 | acc | 0.3103 | ± | 0.0386 |
econometrics | 0 | none | 5 | acc | 0.2895 | ± | 0.0427 |
conceptual_physics | 0 | none | 5 | acc | 0.2553 | ± | 0.0285 |
computer_security | 0 | none | 5 | acc | 0.1900 | ± | 0.0394 |
college_physics | 0 | none | 5 | acc | 0.3431 | ± | 0.0472 |
college_medicine | 0 | none | 5 | acc | 0.2312 | ± | 0.0321 |
college_mathematics | 0 | none | 5 | acc | 0.1800 | ± | 0.0386 |
college_computer_science | 0 | none | 5 | acc | 0.3000 | ± | 0.0461 |
college_chemistry | 0 | none | 5 | acc | 0.2900 | ± | 0.0456 |
college_biology | 0 | none | 5 | acc | 0.2083 | ± | 0.0340 |
clinical_knowledge | 0 | none | 5 | acc | 0.2038 | ± | 0.0248 |
business_ethics | 0 | none | 5 | acc | 0.2100 | ± | 0.0409 |
astronomy | 0 | none | 5 | acc | 0.1908 | ± | 0.0320 |
anatomy | 0 | none | 5 | acc | 0.2963 | ± | 0.0394 |
abstract_algebra | 0 | none | 5 | acc | 0.2000 | ± | 0.0402 |