Mixsmol-4x400M-v0.1 by Ontocord

This is the first checkpoint (Epoch 1) of Mixsmol-4x400M-v0.1. Note that this is an experiment in data mixing. Therefore, we only trained the model on 50B tokens (95% English and 5% Vietnamese) to test the following:

Reasoning capabilities through pretraining on high-quality synthetic textbook data
Cross-lingual understanding through machine translation and multilingual, multi-task pretraining

After verifying these hypotheses with this run, we will schedule a second run with more data and compute so that the model can reach its full capability.
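
As a rough illustration of the token budget described above, the following sketch works out how 50B tokens divide under a 95% / 5% split. It assumes the percentages refer to token counts (the card does not say whether they are measured in tokens or samples), and the names `split_tokens`, `TOTAL_TOKENS`, and the `en` / `vi` keys are hypothetical, chosen only for this example.

```python
# Token budget and language shares as stated in the card.
# Assumption: the 95% / 5% split is measured in tokens, not samples.
TOTAL_TOKENS = 50_000_000_000
SHARES_PERCENT = {"en": 95, "vi": 5}  # hypothetical language keys


def split_tokens(total, shares_percent):
    """Return the token count per language for integer percentage shares."""
    if sum(shares_percent.values()) != 100:
        raise ValueError("percentage shares must sum to 100")
    # Integer arithmetic avoids floating-point rounding on large counts.
    return {lang: total * pct // 100 for lang, pct in shares_percent.items()}


budget = split_tokens(TOTAL_TOKENS, SHARES_PERCENT)
for lang, count in budget.items():
    print(f"{lang}: {count / 1e9:.1f}B tokens")
```

Under that assumption, the run would see roughly 47.5B English tokens and 2.5B Vietnamese tokens.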

Data

Synthetic Textbooks: 8M samples
RefinedWeb: 1M samples
RedPajama-v2: 500K samples
MathPile: Everything
The Pile: MiniPile subset
GoodWiki
The Stack Smol XL
The Vault: train_small split
Instruction Pretraining: 250K samples
