---
language:
- fi
pipeline_tag: text-generation
license: apache-2.0
---
Generative Pretrained Transformer with 1.5B parameters for Finnish.

The TurkuNLP Finnish GPT-3 models are a family of pretrained monolingual GPT-style language models based on the BLOOM architecture.
Note that these are pure language models: they are not [instruction finetuned](https://arxiv.org/abs/2203.02155) for dialogue
or question answering.

These models are intended to be used as foundation models that can, for example, be instruction finetuned to serve as modern chat models.
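
Because these are plain causal language models, they can be used directly for text continuation. Below is a minimal sketch using the Hugging Face `transformers` library; the repository id `TurkuNLP/gpt3-finnish-xl` is assumed here for the 1.5B (XL) model, so adjust it to the model size you want.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "TurkuNLP/gpt3-finnish-xl"  # assumed repository id; pick the size you need
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Plain language-model continuation; no chat or instruction formatting,
# since these models are not instruction finetuned.
prompt = "Suomi on"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```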


All models are trained for 300B tokens.


**Parameters**
| Model | Layers | Dim  | Heads | Params |
|--------|--------|------|-------|--------|
| Small  | 12     | 768  | 12    | 186M   |
| Medium | 24     | 1024 | 16    | 437M   |
| Large  | 24     | 1536 | 16    | 881M   |
| XL     | 24     | 2064 | 24    | 1.5B   |
| "3B"   | 32     | 2560 | 32    | 2.8B   |
| "8B"   | 32     | 4096 | 32    | 7.5B   |
| "13B"  | 40     | 5120 | 40    | 13.3B  |


**Datasets**

We used a combination of several Finnish resources:

* Finnish Internet Parsebank https://turkunlp.org/finnish_nlp.html
* mC4 multilingual, colossal, cleaned Common Crawl https://huggingface.co/datasets/mc4
* Common Crawl Finnish https://TODO
* Finnish Wikipedia https://fi.wikipedia.org/wiki
* Lönnrot Projekti Lönnrot http://www.lonnrot.net/
* ePub National library ”epub” collection
* Lehdet National library ”lehdet” collection
* Suomi24 The Suomi 24 Corpus 2001-2020 http://urn.fi/urn:nbn:fi:lb-2021101527
* Reddit r/Suomi submissions and comments https://www.reddit.com/r/Suomi
* STT Finnish News Agency Archive 1992-2018 http://urn.fi/urn:nbn:fi:lb-2019041501
* Yle Finnish News Archive 2011-2018 http://urn.fi/urn:nbn:fi:lb-2017070501
* Yle Finnish News Archive 2019-2020 http://urn.fi/urn:nbn:fi:lb-2021050401
* Yle News Archive Easy-to-read Finnish 2011-2018 http://urn.fi/urn:nbn:fi:lb-2019050901
* Yle News Archive Easy-to-read Finnish 2019-2020 http://urn.fi/urn:nbn:fi:lb-2021050701
* ROOTS TODO 


**Sampling ratios**

|Dataset   |  Chars |  Ratio  | Weight | W.Ratio |
|----------|--------|---------|--------|---------|
|Parsebank |  35.0B |  16.9%  |    1.5 |  22.7%  |
|mC4-Fi    |  46.3B |  22.4%  |    1.0 |  20.0%  |
|CC-Fi     |  79.6B |  38.5%  |    1.0 |  34.4%  |
|Fiwiki    |   0.8B |   0.4%  |    3.0 |   1.0%  |
|Lönnrot   |   0.8B |   0.4%  |    3.0 |   1.0%  |
|Yle       |   1.6B |   0.8%  |    2.0 |   1.4%  |
|STT       |   2.2B |   1.1%  |    2.0 |   1.9%  |
|ePub      |  13.5B |   6.5%  |    1.0 |   5.8%  |
|Lehdet    |   5.8B |   2.8%  |    1.0 |   2.5%  |
|Suomi24   |  20.6B |   9.9%  |    1.0 |   8.9%  |
|Reddit-Fi |   0.7B |   0.4%  |    1.0 |   0.3%  |
|**TOTAL** | **207.0B** | **100.0%** | **N/A** | **100.0%** |
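
The weighted ratio (W.Ratio) column can be reproduced from the character counts and weights: each dataset's share is its chars × weight, normalized by the total over all datasets. A small sketch reproducing the column (up to rounding of the figures in the table):

```python
# Reproduce the W.Ratio column: weighted share = chars * weight / sum(chars * weight).
datasets = {
    # name: (chars in billions, upsampling weight)
    "Parsebank": (35.0, 1.5),
    "mC4-Fi":    (46.3, 1.0),
    "CC-Fi":     (79.6, 1.0),
    "Fiwiki":    (0.8, 3.0),
    "Lönnrot":   (0.8, 3.0),
    "Yle":       (1.6, 2.0),
    "STT":       (2.2, 2.0),
    "ePub":      (13.5, 1.0),
    "Lehdet":    (5.8, 1.0),
    "Suomi24":   (20.6, 1.0),
    "Reddit-Fi": (0.7, 1.0),
}

total = sum(chars * weight for chars, weight in datasets.values())
for name, (chars, weight) in datasets.items():
    print(f"{name:10s} {100 * chars * weight / total:5.1f}%")
```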



More documentation and a paper are coming soon.