| model | avg | arc | truthfulqa | hellaswag | mmlu |
|---|---|---|---|---|---|
| crumb/d1536-250MT-full | 30.61 | 23.29 | 47.10 | 26.60 | 25.46 |

Trained for 250M tokens of a planned 16G-token run. Compared to the GPT-2 suite, it lands between gpt2-124m and gpt2-345m (its parameter count is between the two as well). With 254,883,841 non-embedding parameters, this is roughly a 1 token/parameter initial saturation of the weights, to be further pretrained with ReLoRA.
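For context, ReLoRA continues pretraining by repeatedly training a low-rank adapter, merging it into the base weights, and restarting with a fresh adapter. Below is a simplified sketch of that merge-and-restart loop using `peft`; the target modules, rank, restart count, and the `get_batches` data loader are illustrative assumptions, not the exact recipe used for this model.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "crumb/d1536-250MT-full", trust_remote_code=True
)

# Hyperparameters below are illustrative; target_modules assumes
# GPT-2-style attention projections, which may not match this repo.
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"])

for restart in range(3):  # number of restarts is arbitrary here
    peft_model = get_peft_model(model, lora_cfg)
    optimizer = torch.optim.AdamW(peft_model.parameters(), lr=1e-4)
    for batch in get_batches():  # hypothetical loader yielding input_ids/labels dicts
        loss = peft_model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    # Fold the trained adapter into the base weights, then loop with a fresh one.
    model = peft_model.merge_and_unload()
```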

Note: the serverless Inference API does not yet support model repos that contain custom code, so this model must be loaded locally.
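A minimal sketch of loading the model locally with `transformers`; `trust_remote_code=True` is required because the repo ships custom modeling code, and the prompt and generation settings here are just examples.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "crumb/d1536-250MT-full", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "crumb/d1536-250MT-full", trust_remote_code=True
)

inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```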