---
language: 
- bg
- cs
- da
- de
- el
- en
- es
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- sv
- uk
- multilingual
license: mit
---

# EuroGPT2

**NOTE: THIS IS THE ORIGINAL MEGATRON-DEEPSPEED CHECKPOINT INCLUDING OPTIMIZER STATES**

A GPT2 language model for European languages (EU-24 + Ukrainian). 
The model follows the same architecture as [OpenAI's GPT2](https://huggingface.co/gpt2/), except that it uses [rotary](https://arxiv.org/abs/2104.09864) instead of learned positional embeddings.
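
To illustrate what rotary positional embeddings do, here is a minimal pure-Python sketch following the formulation in the linked paper. It is not the code used to train this checkpoint: the function names `rope` and `dot`, the base of 10000, and the pairing of adjacent dimensions are assumptions made for illustration, and the training framework may pair dimensions differently.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles.

    Hypothetical helper for illustration; `base` and the adjacent-pair
    layout are assumptions, not settings read from this checkpoint.
    """
    dim = len(vec)
    assert dim % 2 == 0, "rotary embeddings need an even dimension"
    out = []
    for i in range(0, dim, 2):
        # Pair (i, i + 1) rotates with frequency base^(-i / dim).
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]

# Both pairs of positions are 2 apart, so the attention scores agree.
score_a = dot(rope(q, 3), rope(k, 1))
score_b = dot(rope(q, 10), rope(k, 8))
print(abs(score_a - score_b) < 1e-9)
```

The point of the sketch is that the query-key score depends only on the relative offset between positions, which is why no learned position table is needed.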

## Model settings

- parameters: 124M 
- number of layers: 12
- hidden size: 768
- number of heads: 12
- sequence length: 1024
- batch size: 168
- test perplexity (PPL) after training: 23.6 (steps: 436,940)
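
As a rough aid for reading the perplexity figure, the sketch below converts it to an average loss. It assumes the reported PPL is the exponential of the mean per-token cross-entropy in nats, which is the common convention but is not stated in this card.

```python
import math

test_ppl = 23.6  # reported test perplexity from the settings above

# Assumption: PPL = exp(mean cross-entropy in nats per token).
loss_nats = math.log(test_ppl)
loss_bits = math.log2(test_ppl)

print(f"{loss_nats:.2f} nats/token, {loss_bits:.2f} bits/token")
```

Under that assumption the loss is roughly 3.16 nats per token.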

## Training data

- [Wikimedia dumps](https://dumps.wikimedia.org/) (Wikipedia, Wikinews, Wikibooks, Wikisource, Wikivoyage; 20230301)
- [EUR-Lex](https://huggingface.co/datasets/joelito/eurlex_resources)
- [OSCAR 2023.01](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301)
- Tokens: 75,167,662,080
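
The token count can be cross-checked against the model settings. The sketch below assumes the batch size counts sequences per step and that every sequence is packed to the full sequence length; the card does not state this explicitly, but the product matches the reported figure exactly.

```python
batch_size = 168          # sequences per step (assumed unit)
sequence_length = 1024    # tokens per sequence
steps = 436_940           # training steps reported with the test PPL
reported_tokens = 75_167_662_080

tokens_per_step = batch_size * sequence_length
total_tokens = tokens_per_step * steps

print(tokens_per_step)                  # 172032
print(total_tokens == reported_tokens)  # True
```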

## Languages

Included languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Irish, Croatian, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish, and Ukrainian.

| Language | Ratio  |
| -------- | ------ |
| bg       | 5.92%  |
| cs       | 4.77%  |
| da       | 2.19%  |
| de       | 7.36%  |
| el       | 8.60%  |
| en       | 10.11% |
| es       | 6.57%  |
| et       | 1.67%  |
| fi       | 2.70%  |
| fr       | 7.18%  |
| ga       | 0.25%  |
| hr       | 1.09%  |
| hu       | 6.38%  |
| it       | 5.80%  |
| lt       | 2.01%  |
| lv       | 1.76%  |
| mt       | 1.49%  |
| nl       | 5.20%  |
| pl       | 4.82%  |
| pt       | 4.64%  |
| ro       | 2.93%  |
| sk       | 2.03%  |
| sl       | 1.54%  |
| sv       | 3.00%  |
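
A quick consistency check on the table, written as a small sketch: it sums the listed ratios and compares the table's language codes with those in the card metadata. The variable names are illustrative, and the values are copied from the table as percentages.

```python
# Ratios from the table above, in percent.
ratios = {
    "bg": 5.92, "cs": 4.77, "da": 2.19, "de": 7.36, "el": 8.60,
    "en": 10.11, "es": 6.57, "et": 1.67, "fi": 2.70, "fr": 7.18,
    "ga": 0.25, "hr": 1.09, "hu": 6.38, "it": 5.80, "lt": 2.01,
    "lv": 1.76, "mt": 1.49, "nl": 5.20, "pl": 4.82, "pt": 4.64,
    "ro": 2.93, "sk": 2.03, "sl": 1.54, "sv": 3.00,
}

# Language codes from the card metadata, without the "multilingual" tag.
metadata_codes = [
    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "ga",
    "hr", "hu", "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk",
    "sl", "sv", "uk",
]

total = round(sum(ratios.values()), 2)
missing = [code for code in metadata_codes if code not in ratios]

print(total)    # 100.01
print(missing)  # ['uk']
```

The listed ratios sum to about 100% (the extra 0.01 is consistent with rounding), and the table has no row for Ukrainian (`uk`).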

## License

MIT