Edit model card

Multilingual Generative Pretrained Transformer with 176B parameters with capacity for Finnish. This model is built upon pretrained BLOOM which is then further pretrained with a combined ROOTS + Finnish (without weighting) dataset for 40B tokens.

Datasets

We used a combination of multiple Finnish resources.

Sampling ratios for Finnish

Dataset Chars Ratio Weight W.Ratio
Parsebank 35.0B 16.9% 1.5 22.7%
mC4-Fi 46.3B 22.4% 1.0 20.0%
CC-Fi 79.6B 38.5% 1.0 34.4%
Fiwiki 0.8B 0.4% 3.0 1.0%
Lönnrot 0.8B 0.4% 3.0 1.0%
Yle 1.6B 0.8% 2.0 1.4%
STT 2.2B 1.1% 2.0 1.9%
ePub 13.5B 6.5% 1.0 5.8%
Lehdet 5.8B 2.8% 1.0 2.5%
Suomi24 20.6B 9.9% 1.0 8.9%
Reddit-Fi 0.7B 0.4% 1.0 0.3%
TOTAL 207.0B 100.0% N/A 100.0%

And for whole continued pretraining, ROOTS is mixed in.

Downloads last month
13