---
language:
- bg
- cs
- da
- de
- el
- en
- es
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- sv
- uk
- multilingual
license: mit
---

# EuroGPT2

**NOTE: THIS IS THE ORIGINAL MEGATRON-DEEPSPEED CHECKPOINT, INCLUDING OPTIMIZER STATES**

A GPT2 language model for European languages (EU-24 + Ukrainian). The model follows the same architecture as [OpenAI's GPT2](https://huggingface.co/gpt2/), apart from using [rotary](https://arxiv.org/abs/2104.09864) instead of learned positional embeddings.

## Model settings

- parameters: 124M
- number of layers: 12
- hidden size: 768
- number of heads: 12
- sequence length: 1024
- batch size: 168
- test PPL after training: 23.6 (436,940 steps)

## Training data

- [Wikimedia dumps](https://dumps.wikimedia.org/) (Wikipedia, Wikinews, Wikibooks, Wikisource, Wikivoyage; 20230301)
- [EUR-Lex](https://huggingface.co/datasets/joelito/eurlex_resources)
- [OSCAR 2023.01](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301)
- Tokens: 75,167,662,080 (436,940 steps × batch size 168 × sequence length 1,024)

## Languages

Included languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Irish, Croatian, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish, and Ukrainian.

| Language | Ratio  |
| -------- | ------ |
| bg       | 5.92%  |
| cs       | 4.77%  |
| da       | 2.19%  |
| de       | 7.36%  |
| el       | 8.60%  |
| en       | 10.11% |
| es       | 6.57%  |
| et       | 1.67%  |
| fi       | 2.70%  |
| fr       | 7.18%  |
| ga       | 0.25%  |
| hr       | 1.09%  |
| hu       | 6.38%  |
| it       | 5.80%  |
| lt       | 2.01%  |
| lv       | 1.76%  |
| mt       | 1.49%  |
| nl       | 5.20%  |
| pl       | 4.82%  |
| pt       | 4.64%  |
| ro       | 2.93%  |
| sk       | 2.03%  |
| sl       | 1.54%  |
| sv       | 3.00%  |

## License

MIT
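
## Rotary position embeddings (sketch)

Since the only architectural change from GPT2 is swapping learned positional embeddings for rotary ones, here is a minimal, self-contained sketch of how RoPE rotates feature pairs by position-dependent angles. The function name and tensor layout are illustrative assumptions, not taken from the actual training code.

```python
import torch

def apply_rotary(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate pairs of feature dimensions by position-dependent angles (RoPE).

    x: (batch, seq_len, n_heads, head_dim) -- layout is an illustrative assumption.
    """
    _, seq_len, _, head_dim = x.shape
    # One frequency per pair of dimensions, as in Su et al. (2021).
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    pos = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.einsum("s,f->sf", pos, inv_freq)  # (seq_len, head_dim/2)
    sin = angles.sin()[None, :, None, :]             # broadcast over batch and heads
    cos = angles.cos()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # 2D rotation of each (x1, x2) pair by its angle.
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)
```

The rotation is applied to queries and keys inside each attention layer before the dot product, so relative position information enters the attention scores directly and no position parameters are learned.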
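
## Parameter count, back of the envelope

The settings above are the standard GPT2-small configuration, and the 124M figure can be roughly reproduced from them. The vocabulary size is not stated on this card, so GPT2's 50,257 is used below purely as a placeholder assumption:

```python
# Rough parameter count for a GPT2-style model with the settings above.
# NOTE: the vocabulary size is an assumption; this card does not state it.
hidden, layers, vocab = 768, 12, 50257

per_layer = 12 * hidden**2      # attention (4*h^2) + MLP (8*h^2), biases ignored
blocks = layers * per_layer     # ~84.9M
embeddings = vocab * hidden     # ~38.6M (tied input/output embeddings)

print(f"{(blocks + embeddings) / 1e6:.1f}M")  # ~123.5M, close to the stated 124M
```

Biases and layer norms account for most of the remaining difference; with rotary embeddings there is no learned position-embedding matrix to add.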
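
## Approximate tokens per language

The ratios in the table translate directly into per-language token counts over the 75.2B training tokens; a quick sketch for a few entries (the rest follow the same arithmetic):

```python
# Approximate training tokens per language: ratio x total token count.
total_tokens = 75_167_662_080

ratios = {"en": 10.11, "el": 8.60, "de": 7.36, "ga": 0.25}  # subset of the table
for lang, pct in ratios.items():
    print(f"{lang}: {pct / 100 * total_tokens / 1e9:.2f}B tokens")
# en: 7.60B tokens ... ga: 0.19B tokens
```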