---
language:
- bg
- cs
- da
- de
- el
- en
- es
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- sv
- uk
- multilingual
license: mit
---
# EuroGPT2
**NOTE: THIS IS THE ORIGINAL MEGATRON-DEEPSPEED CHECKPOINT INCLUDING OPTIMIZER STATES**
A GPT2 language model for European languages (EU-24 + Ukrainian).
The model follows the same architecture as [OpenAI's GPT2](https://huggingface.co/gpt2/), except that it uses [rotary](https://arxiv.org/abs/2104.09864) instead of learned positional embeddings.
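
To illustrate what rotary positional embeddings do, here is a minimal pure-Python sketch. It is not the Megatron-DeepSpeed implementation used for this checkpoint: the function name `rotary_embed`, the pairing of adjacent dimensions, and the frequency base of 10000 (the value from the rotary paper) are assumptions made for illustration. The sketch shows the key property that the attention score between a rotated query and key depends only on their relative offset.

```python
import math

def rotary_embed(vec, position, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle position * theta_i.

    Hypothetical helper for illustration; real implementations may pair
    dimensions differently (e.g. first half with second half).
    """
    dim = len(vec)
    assert dim % 2 == 0, "rotary embeddings need an even dimension"
    out = []
    for i in range(dim // 2):
        theta = base ** (-2.0 * i / dim)  # per-pair rotation frequency
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # 2D rotation
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy 4-dimensional query and key (made-up values).
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]

# Both pairs of positions are 4 tokens apart, so the scores should agree.
score_a = dot(rotary_embed(q, 7), rotary_embed(k, 3))
score_b = dot(rotary_embed(q, 104), rotary_embed(k, 100))
print(math.isclose(score_a, score_b, rel_tol=1e-9, abs_tol=1e-9))
```

With the settings listed below (hidden size 768, 12 heads), each attention head would operate on 64-dimensional vectors, i.e. 32 rotated pairs.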
## Model settings
- parameters: 124M
- number of layers: 12
- hidden size: 768
- number of heads: 12
- sequence length: 1024
- batch size: 168
- test PPL after training: 23.6 (steps: 436,940)
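
As a rough cross-check of the parameter count, the sketch below estimates the non-embedding parameters from the layer count and hidden size above. It assumes a standard GPT-2 block layout (fused QKV and output projections with biases, a 4x MLP with biases, two LayerNorms per block plus a final LayerNorm), which may differ in detail from this checkpoint.

```python
hidden = 768
layers = 12

# Assumed standard GPT-2 block layout (not confirmed for this checkpoint).
attention = 4 * hidden * hidden + 4 * hidden  # QKV (3h^2 + 3h) + output projection (h^2 + h)
mlp = 8 * hidden * hidden + 5 * hidden        # h -> 4h (4h^2 + 4h) and 4h -> h (4h^2 + h)
layer_norms = 2 * 2 * hidden                  # two LayerNorms, each with weight and bias

per_layer = attention + mlp + layer_norms
non_embedding = layers * per_layer + 2 * hidden  # plus the final LayerNorm

print(per_layer)      # 7087872
print(non_embedding)  # 85056000
```

Under these assumptions, roughly 85M of the stated 124M parameters sit in the transformer blocks; the remainder would mostly be token embeddings, whose vocabulary size is not stated here. Rotary embeddings add no learned position parameters.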
## Training data
- [Wikimedia dumps](https://dumps.wikimedia.org/) (Wikipedia, Wikinews, Wikibooks, Wikisource, Wikivoyage; 20230301)
- [EUR-Lex](https://huggingface.co/datasets/joelito/eurlex_resources)
- [OSCAR 2023.01](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301)
- Tokens: 75,167,662,080
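
The token count is consistent with the model settings. A short sketch, assuming the batch size is the number of sequences per training step and every sequence is filled to the full sequence length:

```python
batch_size = 168        # sequences per step (from the model settings)
sequence_length = 1024  # tokens per sequence
steps = 436_940         # training steps reported with the test PPL

tokens_per_step = batch_size * sequence_length
total_tokens = tokens_per_step * steps

print(tokens_per_step)  # 172032
print(total_tokens)     # 75167662080
```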
## Languages
Included languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Irish, Croatian, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish, and Ukrainian.
| Language | Ratio |
| -------- | ------ |
| bg | 5.92% |
| cs | 4.77% |
| da | 2.19% |
| de | 7.36% |
| el | 8.60% |
| en | 10.11% |
| es | 6.57% |
| et | 1.67% |
| fi | 2.70% |
| fr | 7.18% |
| ga | 0.25% |
| hr | 1.09% |
| hu | 6.38% |
| it | 5.80% |
| lt | 2.01% |
| lv | 1.76% |
| mt | 1.49% |
| nl | 5.20% |
| pl | 4.82% |
| pt | 4.64% |
| ro | 2.93% |
| sk | 2.03% |
| sl | 1.54% |
| sv | 3.00% |
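
The ratios in the table can be sanity-checked with a few lines of Python. The dictionary below simply transcribes the table (written with decimal points); the variable names are illustrative.

```python
# Ratios in percent, transcribed from the table above.
ratios = {
    "bg": 5.92, "cs": 4.77, "da": 2.19, "de": 7.36, "el": 8.60, "en": 10.11,
    "es": 6.57, "et": 1.67, "fi": 2.70, "fr": 7.18, "ga": 0.25, "hr": 1.09,
    "hu": 6.38, "it": 5.80, "lt": 2.01, "lv": 1.76, "mt": 1.49, "nl": 5.20,
    "pl": 4.82, "pt": 4.64, "ro": 2.93, "sk": 2.03, "sl": 1.54, "sv": 3.00,
}

total = round(sum(ratios.values()), 2)
largest = sorted(ratios, key=ratios.get, reverse=True)[:3]

print(len(ratios))  # 24
print(total)        # 100.01
print(largest)      # ['en', 'el', 'de']
```

The 24 listed rows already sum to about 100% (the 0.01 excess is rounding); the table has no row for Ukrainian (`uk`).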
## License
MIT