scheiblr committed on
Commit
07d3808
1 Parent(s): ade9447

Update README.md

Files changed (1)
  1. README.md +104 -4
README.md CHANGED
@@ -1,7 +1,107 @@
1
- # Gottbert-base
2
 
3
- BERT model trained solely on the German portion of the OSCAR data set.
4
 
5
- [Paper: GottBERT: a pure German Language Model](https://arxiv.org/abs/2012.02110)
6
 
7
- Authors: Raphael Scheible, Fabian Thomczyk, Patric Tippmann, Victor Jaravine, Martin Boeker
1
+ # GottBERT: A pure German language model
2
 
3
+ GottBERT is the first German-only RoBERTa model, pre-trained on the German portion of the first released OSCAR dataset. This model aims to provide enhanced natural language processing (NLP) performance for the German language across various tasks, including Named Entity Recognition (NER), text classification, and natural language inference (NLI). GottBERT has been developed in two versions: a **base model** and a **large model**, tailored specifically for German-language tasks.
4
 
5
+ - **Model Type**: RoBERTa
6
+ - **Language**: German
7
+ - **Base Model**: 12 layers, 125 million parameters
8
+ - **Large Model**: 24 layers, 355 million parameters
9
+ - **License**: MIT
10
 
11
+ ---
12
+
13
+ ## Pretraining Details
14
+
15
+ - **Corpus**: German portion of the OSCAR dataset (Common Crawl).
16
+ - **Data Size**:
17
+   - Unfiltered: 145GB (~459 million documents)
18
+   - Filtered: 121GB (~382 million documents)
19
+ - **Preprocessing**: Filtering corrected encoding errors (e.g., erroneous umlauts) and removed spam and non-German documents using language detection and syntactic filtering.
20
+
21
+ ### Filtering Metrics
22
+ - **Stopword Ratio**: Detects spam and meaningless content.
23
+ - **Punctuation Ratio**: Detects abnormal punctuation patterns.
24
+ - **Upper Token Ratio**: Identifies documents with excessive uppercase tokens (often noisy content).
25
+
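The three ratios above can be sketched in a few lines of Python. This is a minimal illustration, not the filtering code used for GottBERT: the stopword list, the whitespace tokenization, the reading of "upper token" as an all-caps token, and the helper name `filtering_metrics` are all assumptions, and this README gives no thresholds.

```python
import string

# Hypothetical, deliberately tiny German stopword list (assumption).
GERMAN_STOPWORDS = {
    "der", "die", "das", "und", "ist", "in", "zu", "den",
    "nicht", "von", "mit", "ein", "eine", "es", "auf",
}


def filtering_metrics(text: str) -> dict:
    """Compute illustrative per-document ratios for spam/noise detection."""
    tokens = text.split()
    if not tokens:
        return {
            "stopword_ratio": 0.0,
            "punctuation_ratio": 0.0,
            "upper_token_ratio": 0.0,
        }

    # Strip surrounding ASCII punctuation so "Test." is compared as "Test".
    words = [token.strip(string.punctuation) for token in tokens]

    stopwords = sum(word.lower() in GERMAN_STOPWORDS for word in words)
    punctuation = sum(char in string.punctuation for char in text)
    # Assumption: an "upper token" is a token written entirely in capitals.
    upper_tokens = sum(word.isupper() for word in words)

    return {
        "stopword_ratio": stopwords / len(tokens),
        "punctuation_ratio": punctuation / len(text),
        "upper_token_ratio": upper_tokens / len(tokens),
    }


if __name__ == "__main__":
    print(filtering_metrics("Das ist ein Test."))
    print(filtering_metrics("KAUFEN KAUFEN JETZT !!!"))
```

A document with very few stopwords or a very high share of all-caps tokens would be a candidate for removal; where exactly to draw the line is a tuning decision not covered here.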
26
+ ## Training Configuration
27
+ - **Framework**: [Fairseq](https://github.com/scheiblr/fairseq/tree/TPUv4_very_old)
28
+ - **Hardware**:
29
+   - Base Model: 256 TPUv3 pod/128 TPUv4 pod
30
+   - Large Model: 128 TPUv4 pod
31
+ - **Training Time**:
32
+   - Base Model: 1.2 days
33
+   - Large Model: 5.7 days
34
+ - **Batch Size**: 8k tokens
35
+ - **Learning Rate**:
36
+   - Base: Peak LR = 0.0004
37
+   - Large: Peak LR = 0.00015
38
+ - **Training Iterations**: 100k steps with a 10k warm-up phase
39
+
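To make the schedule concrete, here is a minimal sketch of the learning rate per step. The README only gives the peak learning rate, the 10k warm-up, and the 100k total steps; the linear warm-up from zero and the linear decay to zero afterwards are assumptions borrowed from the common RoBERTa recipe, and `learning_rate` is a hypothetical helper, not a Fairseq function.

```python
def learning_rate(step: int, peak_lr: float,
                  warmup_steps: int = 10_000,
                  total_steps: int = 100_000) -> float:
    """Linear warm-up to peak_lr, then linear decay to zero (assumed shape)."""
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        # Warm-up phase: scale linearly from 0 up to the peak.
        return peak_lr * step / warmup_steps
    # Decay phase: scale linearly from the peak down to 0 at total_steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    for step in (0, 5_000, 10_000, 55_000, 100_000):
        base_lr = learning_rate(step, peak_lr=0.0004)
        large_lr = learning_rate(step, peak_lr=0.00015)
        print(f"step {step:>7}: base={base_lr:.6f} large={large_lr:.6f}")
```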
40
+ ## Evaluation and Results
41
+ GottBERT was evaluated across various downstream tasks:
42
+ - **NER**: CoNLL 2003, GermEval 2014
43
+ - **Text Classification**: GermEval 2018 (coarse & fine), 10kGNAD
44
+ - **NLI**: German subset of XNLI
45
+
46
+ Metrics:
47
+ - **NER and Text Classification**: F1 Score
48
+ - **NLI**: Accuracy
49
+
50
+
51
+ Details:
52
+ - Unless stated otherwise, results refer to the best checkpoint as selected by perplexity. $\text{†}$ denotes the last checkpoint at 100k optimization steps.
53
+ - The model from our [pre-print](https://arxiv.org/abs/2012.02110v1) was moved from uklfr/gottbert-base to [tum/gottbert_base_last]().
54
+ - $\mathrm{f}$ stands for filtered and marks the models trained on the filtered OSCAR portion.
55
+ - **Bold** values indicate the best performing model within one architecture (base, large); <ins>underlined</ins> values indicate the second best.
56
+
57
+
58
+ | Model | NLI | GermEval 14 | CoNLL | GermEval 2018 Coarse | GermEval 2018 Fine | 10kGNAD |
59
+ |--------------------------------------------|--------------|-----------------|----------|-----------|---------|------------|
60
+ | $\mathrm{GottBERT}_{\mathrm{base}}$ | 80.82 | 87.55 | <ins>85.93</ins> | 78.17 | 53.30 | 89.64 |
61
+ | $\mathrm{GottBERT}_{\mathrm{base}}^{\text{†}}$ | 81.04 | 87.48 | 85.61 | <ins>78.18</ins> | **53.92** | 90.27 |
62
+ | $^{\mathrm{f}}\mathrm{GottBERT}_{\mathrm{base}}$ | 80.56 | <ins>87.57</ins> | **86.14** | **78.65** | 52.82 | 89.79 |
63
+ | $^{\mathrm{f}}\mathrm{GottBERT}_{\mathrm{base}}^{\text{†}}$ | 80.74 | **87.59** | 85.66 | 78.08 | 52.39 | 89.92 |
64
+ | $\mathrm{GELECTRA}_{\mathrm{base}}$ | **81.70** | 86.91 | 85.37 | 77.26 | 50.07 | 89.02 |
65
+ | $\mathrm{GBERT}_{\mathrm{base}}$ | 80.06 | 87.24 | 85.16 | 77.37 | 51.51 | <ins>90.30</ins> |
66
+ | $\mathrm{dbmdzBERT}$ | 68.12 | 86.82 | 85.15 | 77.46 | 52.07 | **90.34** |
67
+ | $\mathrm{GermanBERT}$ | 78.16 | 86.53 | 83.87 | 74.81 | 47.78 | 90.18 |
68
+ | $\mathrm{XLM\text{-}R}_{\mathrm{base}}$ | 79.76 | 86.14 | 84.46 | 77.13 | 50.54 | 89.81 |
69
+ | $\mathrm{mBERT}$ | 77.03 | 86.67 | 83.18 | 73.54 | 48.32 | 88.90 |
70
+ | $\mathrm{GottBERT}_{\mathrm{large}}$ | 82.46 | 88.20 | <ins>86.78</ins> | 79.40 | 54.61 | 90.24 |
71
+ | $^{\mathrm{f}}\mathrm{GottBERT}_{\mathrm{large}}$ | 83.31 | 88.13 | 86.30 | 79.32 | 54.70 | 90.31 |
72
+ | $\mathrm{GottBERT}_{\mathrm{large}}^{\text{†}}$ | 82.79 | <ins>88.27</ins> | 86.28 | 78.96 | 54.72 | 90.17 |
73
+ | $\mathrm{GELECTRA}_{\mathrm{large}}$ | **86.33** | <ins>88.72</ins> | <ins>86.78</ins> | **81.28** | <ins>56.17</ins> | **90.97** |
74
+ | $\mathrm{GBERT}_{\mathrm{large}}$ | <ins>84.21</ins> | <ins>88.72</ins> | **87.19** | <ins>80.84</ins> | **57.37** | <ins>90.74</ins> |
75
+ | $\mathrm{XLM\text{-}R}_{\mathrm{large}}$ | 84.07 | **88.83** | 86.54 | 79.05 | 55.06 | 90.17 |
76
+
77
+ ## Model Architecture
78
+ - **Base Model**: 12 layers, 125M parameters, 52k token vocabulary.
79
+ - **Large Model**: 24 layers, 355M parameters, 52k token vocabulary.
80
+
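As a rough sanity check, the parameter counts can be reproduced from the layer count and vocabulary size. The hidden sizes (768 / 1024), feed-forward sizes (3072 / 4096), 514 position embeddings, and an exact vocabulary of 52,000 entries are assumptions taken from the standard RoBERTa configurations, not from this README. The sketch counts the encoder only and ignores the language-modeling head.

```python
def encoder_parameters(layers: int, hidden: int, ffn: int,
                       vocab: int = 52_000, positions: int = 514) -> int:
    """Approximate parameter count of a RoBERTa-style encoder (assumed layout)."""
    # Token, position, and token-type embeddings plus the embedding LayerNorm.
    embeddings = vocab * hidden + positions * hidden + hidden + 2 * hidden
    # Query, key, value, and output projections (weights + biases) plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden
    # Two feed-forward projections (weights + biases) plus LayerNorm.
    feed_forward = hidden * ffn + ffn + ffn * hidden + hidden + 2 * hidden
    return embeddings + layers * (attention + feed_forward)


base = encoder_parameters(layers=12, hidden=768, ffn=3072)
large = encoder_parameters(layers=24, hidden=1024, ffn=4096)

if __name__ == "__main__":
    print(f"base:  {base / 1e6:.1f}M parameters")
    print(f"large: {large / 1e6:.1f}M parameters")
```

Under these assumptions both totals land within about 1% of the 125M and 355M figures quoted above, with the token embeddings accounting for a large share of the base model.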
81
+ ### Tokenizer
82
+ - **Type**: GPT-2 Byte-Pair Encoding (BPE)
83
+ - **Vocabulary Size**: 52k subword tokens
84
+ - **Trained on**: 40GB subsample of the unfiltered German OSCAR corpus.
85
+
86
+ ## Limitations
87
+ - **Filtered vs Unfiltered Data**: Training on filtered data yielded only minor improvements, not significant enough to justify filtering in every case.
88
+ - **Computation Limitations**: Fixed memory allocation on TPUs required processing the data as a single stream, unlike GPU training, which preserves document boundaries. Training was performed in 32-bit mode due to framework limitations, which increased memory usage.
89
+
90
+ ## Citations
91
+ If you use GottBERT in your research, please cite the following paper:
92
+ ```bibtex
93
+ @misc{scheible2020gottbertpuregermanlanguage,
94
+ title={GottBERT: a pure German Language Model},
95
+ author={Raphael Scheible and Fabian Thomczyk and Patric Tippmann and Victor Jaravine and Martin Boeker},
96
+ year={2020},
97
+ eprint={2012.02110},
98
+ archivePrefix={arXiv},
99
+ primaryClass={cs.CL},
100
+ url={https://arxiv.org/abs/2012.02110},
101
+ }
102
+ ```
103
+
104
+ ## Contact
105
+ For any questions or issues regarding the GottBERT model, feel free to reach out to:
106
+
107
+ - Raphael Scheible: [raphael.scheible@tum.de](mailto:raphael.scheible@tum.de)