A multilingual generative pretrained transformer with 176B parameters and capacity for Finnish. The model builds on pretrained BLOOM, which is further pretrained for 40B tokens on a combined ROOTS + Finnish dataset (without weighting).
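
As a BLOOM-based causal language model, the checkpoint can in principle be loaded with the Hugging Face transformers library. The snippet below is a hedged sketch only: the repository id is a placeholder (not taken from this card), and a 176B-parameter model requires multiple GPUs or offloading to load at all.

```python
# Hedged usage sketch. The repository id below is a placeholder, not the actual
# id of this model, and a 176B-parameter checkpoint will not fit on a single
# consumer GPU (device_map="auto" requires the accelerate package).
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "organization/finnish-bloom-176b"  # placeholder repository id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    device_map="auto",   # spread layers across available GPUs / CPU
    torch_dtype="auto",  # use the dtype stored in the checkpoint
)

prompt = "Suomen pääkaupunki on"  # "The capital of Finland is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```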

Datasets

We used a combination of multiple Finnish resources.

Sampling ratios for Finnish

| Dataset   | Chars  | Ratio  | Weight | Weighted ratio |
|-----------|--------|--------|--------|----------------|
| Parsebank | 35.0B  | 16.9%  | 1.5    | 22.7%          |
| mC4-Fi    | 46.3B  | 22.4%  | 1.0    | 20.0%          |
| CC-Fi     | 79.6B  | 38.5%  | 1.0    | 34.4%          |
| Fiwiki    | 0.8B   | 0.4%   | 3.0    | 1.0%           |
| Lönnrot   | 0.8B   | 0.4%   | 3.0    | 1.0%           |
| Yle       | 1.6B   | 0.8%   | 2.0    | 1.4%           |
| STT       | 2.2B   | 1.1%   | 2.0    | 1.9%           |
| ePub      | 13.5B  | 6.5%   | 1.0    | 5.8%           |
| Lehdet    | 5.8B   | 2.8%   | 1.0    | 2.5%           |
| Suomi24   | 20.6B  | 9.9%   | 1.0    | 8.9%           |
| Reddit-Fi | 0.7B   | 0.4%   | 1.0    | 0.3%           |
| TOTAL     | 207.0B | 100.0% | N/A    | 100.0%         |

For the continued pretraining as a whole, ROOTS is mixed in with the Finnish data.
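
The weighted ratio for each Finnish corpus follows directly from its character count and upsampling weight: multiply the two and normalize over all corpora. The following is a minimal sketch that reproduces the last column of the table above; the numbers come from the table, while variable and function names are illustrative only.

```python
# Reproduce the "Weighted ratio" column: each dataset's character count is
# multiplied by its upsampling weight and normalized over the weighted total.
finnish_datasets = {
    # name: (characters in billions, upsampling weight)
    "Parsebank": (35.0, 1.5),
    "mC4-Fi":    (46.3, 1.0),
    "CC-Fi":     (79.6, 1.0),
    "Fiwiki":    (0.8, 3.0),
    "Lönnrot":   (0.8, 3.0),
    "Yle":       (1.6, 2.0),
    "STT":       (2.2, 2.0),
    "ePub":      (13.5, 1.0),
    "Lehdet":    (5.8, 1.0),
    "Suomi24":   (20.6, 1.0),
    "Reddit-Fi": (0.7, 1.0),
}

def weighted_ratios(datasets):
    """Return each dataset's share of sampling after applying its weight."""
    total = sum(chars * weight for chars, weight in datasets.values())
    return {name: chars * weight / total for name, (chars, weight) in datasets.items()}

for name, ratio in weighted_ratios(finnish_datasets).items():
    print(f"{name:<10} {ratio:6.1%}")  # e.g. "Parsebank  22.7%"
```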
