
Mixsmol-4x400M-v0.1 by Ontocord

This is the first checkpoint (epoch 1) of Mixsmol-4x400M-v0.1. Note that this run is an experiment in data mixing; therefore, we trained the model on only 50B tokens (95% English, 5% Vietnamese) to test the following:

  • Reasoning capability through pretraining on high-quality synthetic textbook data
  • Cross-lingual understanding through pretraining on machine translation and multilingual, multi-task data

After verifying our hypotheses with this run, we will schedule a second run with more data and compute so the model can reach its full capability.
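For reference, the checkpoint can be loaded like any causal language model on the Hub. A minimal sketch, assuming the architecture is supported by the standard transformers AutoModelForCausalLM API (add trust_remote_code=True if it is not); the prompt and generation settings are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vilm/Mixsmol-4x400M-v0.1-epoch1"

# Load the epoch-1 checkpoint in BF16 (the tensor type listed on this card).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Illustrative prompt; this is a base model, so expect raw continuations.
inputs = tokenizer("The water cycle begins when", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```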

Data

  • Synthetic Textbooks: 8M samples
  • RefinedWeb: 1M samples
  • RedPajama-v2: 500K samples
  • MathPile: full dataset
  • ThePile: MiniPile Subset
  • GoodWiki
  • The Stack Smol XL
  • The Vault: train_small split
  • Instruction Pretraining: 250K samples
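
As an illustration of how a mixture like the one above might be assembled, here is a sketch using datasets.interleave_datasets with streaming. The Hub paths and sampling probabilities are hypothetical placeholders, not the recipe actually used for this run:

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical Hub paths, for illustration only; the card does not
# specify the exact source repositories or sampling ratios.
textbooks = load_dataset("nampdn-ai/tiny-textbooks", split="train", streaming=True)
refinedweb = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
minipile = load_dataset("JeanKaddour/minipile", split="train", streaming=True)

# Normalize every stream to a single "text" column so they can be interleaved.
streams = [
    textbooks.select_columns(["text"]),
    refinedweb.rename_column("content", "text").select_columns(["text"]),
    minipile.select_columns(["text"]),
]

# Sample from the sources with illustrative weights to form one mixed stream.
mixed = interleave_datasets(streams, probabilities=[0.6, 0.2, 0.2], seed=42)
print(next(iter(mixed))["text"][:200])
```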
Evaluation

| Task              | Version | Filter     | n-shot | Metric      | Value  | Stderr   |
|-------------------|---------|------------|--------|-------------|--------|----------|
| arc_challenge     | Yaml    | none       | 25     | acc         | 0.1937 | ± 0.0115 |
|                   |         | none       | 25     | acc_norm    | 0.2329 | ± 0.0124 |
| hellaswag         | Yaml    | none       | 10     | acc         | 0.2856 | ± 0.0045 |
|                   |         | none       | 10     | acc_norm    | 0.3090 | ± 0.0046 |
| mmlu              | N/A     | none       | 0      | acc         | 0.2536 | ± 0.0483 |
| - humanities      | N/A     | none       | 5      | acc         | 0.2408 | ± 0.0341 |
| - other           | N/A     | none       | 5      | acc         | 0.2475 | ± 0.0443 |
| - social_sciences | N/A     | none       | 5      | acc         | 0.2567 | ± 0.0456 |
| - stem            | N/A     | none       | 5      | acc         | 0.2756 | ± 0.0653 |
| truthfulqa_mc2    | Yaml    | none       | 0      | acc         | 0.3909 | ± 0.0148 |
| winogrande        | Yaml    | none       | 5      | acc         | 0.5107 | ± 0.014  |
| gsm8k             | Yaml    | get-answer | 5      | exact_match | 0      | ± 0      |
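
These numbers have the shape of lm-evaluation-harness output (the Version/Filter/n-shot columns and the "get-answer" filter). A sketch of how one row might be reproduced with the harness's Python API, assuming lm-eval v0.4+ is installed; the task and few-shot count mirror the arc_challenge row, while the batch size is illustrative:

```python
import lm_eval

# Evaluate the checkpoint on ARC-Challenge with 25-shot prompting,
# matching the arc_challenge row in the table above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=vilm/Mixsmol-4x400M-v0.1-epoch1,dtype=bfloat16",
    tasks=["arc_challenge"],
    num_fewshot=25,
    batch_size=8,
)
print(results["results"]["arc_challenge"])
```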

Contribution

This work is a shared contribution between Ontocord, BEE-spoke-data, and VILM.

Model size: 1.77B parameters (safetensors, BF16)
