Update README.md

4e1f078 verified 1 day ago

6.16 kB

	---
	license: mit
	language:
	- de
	tags:
	- RoBERTa
	- GottBERT
	- BERT
	---
	# GottBERT: A pure German language model

	GottBERT is the first German-only RoBERTa model, pre-trained on the German portion of the first released OSCAR dataset. This model aims to provide enhanced natural language processing (NLP) performance for the German language across various tasks, including Named Entity Recognition (NER), text classification, and natural language inference (NLI). GottBERT has been developed in two versions: a base model and a large model, tailored specifically for German-language tasks.

	- Model Type: RoBERTa
	- Language: German
	- Base Model: 12 layers, 125 million parameters
	- Large Model: 24 layers, 355 million parameters
	- License: MIT

	---

	## Pretraining Details

	- Corpus: German portion of the OSCAR dataset (Common Crawl).
	- Data Size:
	- Unfiltered: 145GB (~459 million documents)
	- Filtered: 121GB (~382 million documents)
	- Preprocessing: Filtering included correcting encoding errors (e.g., erroneous umlauts), removing spam and non-German documents using language detection and syntactic filtering.

	### Filtering Metrics
	- Stopword Ratio: Detects spam and meaningless content.
	- Punctuation Ratio: Detects abnormal punctuation patterns.
	- Upper Token Ratio: Identifies documents with excessive uppercase tokens (often noisy content).

	## Training Configuration
	- Framework: [Fairseq](https://github.com/scheiblr/fairseq/tree/TPUv4_very_old)
	- Hardware:
	- Base Model: 256 TPUv3 pod/128 TPUv4 pod
	- Large Model: 128 TPUv4 pod
	- Training Time:
	- Base Model: 1.2 days
	- Large Model: 5.7 days
	- Batch Size: 8k tokens
	- Learning Rate:
	- Base: Peak LR = 0.0004
	- Large: Peak LR = 0.00015
	- Training Iterations: 100k steps with a 10k warm-up phase

	## Evaluation and Results
	GottBERT was evaluated across various downstream tasks:
	- NER: CoNLL 2003, GermEval 2014
	- Text Classification: GermEval 2018 (coarse & fine), 10kGNAD
	- NLI: German subset of XNLI

	Mertics:
	- NER and Text Classification: F1 Score
	- NLI: Accuracy


	Details:
	- bold values indicate the best performing model within one architecure (base, large), <ins>undescored</ins> values the second best.


	\| Model \| Accuracy NLI \| GermEval\_14 F1 \| CoNLL F1 \| Coarse F1 \| Fine F1 \| 10kGNAD F1 \|
	\|-------------------------------------\|--------------\|----------------\|----------\|-----------\|---------\|------------\|
	\| [GottBERT_base_best](https://huggingface.co/TUM/GottBERT_base_best) \| 80.82 \| 87.55 \| <ins>85.93</ins> \| 78.17 \| 53.30 \| 89.64 \|
	\| [GottBERT_base_last](https://huggingface.co/TUM/GottBERT_base_last) \| 81.04 \| 87.48 \| 85.61 \| <ins>78.18</ins> \| 53.92 \| 90.27 \|
	\| [GottBERT_filtered_base_best](https://huggingface.co/TUM/GottBERT_filtered_base_best) \| 80.56 \| <ins>87.57</ins> \| 86.14 \| 78.65 \| 52.82 \| 89.79 \|
	\| [GottBERT_filtered_base_last](https://huggingface.co/TUM/GottBERT_filtered_base_last) \| 80.74 \| 87.59 \| 85.66 \| 78.08 \| 52.39 \| 89.92 \|
	\| GELECTRA_base \| 81.70 \| 86.91 \| 85.37 \| 77.26 \| 50.07 \| 89.02 \|
	\| GBERT_base \| 80.06 \| 87.24 \| 85.16 \| 77.37 \| 51.51 \| 90.30 \|
	\| dbmdzBERT \| 68.12 \| 86.82 \| 85.15 \| 77.46 \| 52.07 \| 90.34 \|
	\| GermanBERT \| 78.16 \| 86.53 \| 83.87 \| 74.81 \| 47.78 \| 90.18 \|
	\| XLM-R_base \| 79.76 \| 86.14 \| 84.46 \| 77.13 \| 50.54 \| 89.81 \|
	\| mBERT \| 77.03 \| 86.67 \| 83.18 \| 73.54 \| 48.32 \| 88.90 \|
	\| [GottBERT_large](https://huggingface.co/TUM/GottBERT_large) \| 82.46 \| 88.20 \| <ins>86.78</ins> \| 79.40 \| 54.61 \| 90.24 \|
	\| [GottBERT_filtered_large_best](https://huggingface.co/TUM/GottBERT_filtered_large_best) \| 83.31 \| 88.13 \| 86.30 \| 79.32 \| 54.70 \| 90.31 \|
	\| [GottBERT_filtered_large_last](https://huggingface.co/TUM/GottBERT_filtered_large_last) \| 82.79 \| <ins>88.27</ins> \| 86.28 \| 78.96 \| 54.72 \| 90.17 \|
	\| GELECTRA_large \| 86.33 \| <ins>88.72</ins> \| <ins>86.78</ins> \| 81.28 \| <ins>56.17</ins> \| 90.97 \|
	\| GBERT_large \| <ins>84.21</ins> \| <ins>88.72</ins> \| 87.19 \| <ins>80.84</ins> \| 57.37 \| <ins>90.74</ins> \|
	\| XLM-R_large \| 84.07 \| 88.83 \| 86.54 \| 79.05 \| 55.06 \| 90.17 \|


	## Model Architecture
	- Base Model: 12 layers, 125M parameters, 52k token vocabulary.
	- Large Model: 24 layers, 355M parameters, 52k token vocabulary.

	### Tokenizer
	- Type: GPT-2 Byte-Pair Encoding (BPE)
	- Vocabulary Size: 52k subword tokens
	- Trained on: 40GB subsample of the unfiltered German OSCAR corpus.

	## Limitations
	- Filtered vs Unfiltered Data: Minor improvements seen with filtered data, but not significant enough to justify filtering in every case.
	- Computation Limitations: Fixed memory allocation on TPUs required processing data as a single stream, unlike GPU training which preserves document boundaries. Training was performed in 32-bit mode due to framework limitations, increasing memory usage.

	## Fairseq Checkpoints
	Get the fairseq checkpoints [here](https://drive.proton.me/urls/CFSGE8ZK9R#1F1G727lv77k).

	## Citations
	If you use GottBERT in your research, please cite the following paper:
	```bibtex
	@misc{scheible2020gottbertpuregermanlanguage,
	title={GottBERT: a pure German Language Model},
	author={Raphael Scheible and Fabian Thomczyk and Patric Tippmann and Victor Jaravine and Martin Boeker},
	year={2020},
	eprint={2012.02110},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2012.02110},
	}
	```

	---
	license: mit
	language:
	- de
	tags:
	- RoBERTa
	- GottBERT
	- BERT
	---
	# GottBERT: A pure German language model

	GottBERT is the first German-only RoBERTa model, pre-trained on the German portion of the first released OSCAR dataset. This model aims to provide enhanced natural language processing (NLP) performance for the German language across various tasks, including Named Entity Recognition (NER), text classification, and natural language inference (NLI). GottBERT has been developed in two versions: a base model and a large model, tailored specifically for German-language tasks.

	- Model Type: RoBERTa
	- Language: German
	- Base Model: 12 layers, 125 million parameters
	- Large Model: 24 layers, 355 million parameters
	- License: MIT

	---

	## Pretraining Details

	- Corpus: German portion of the OSCAR dataset (Common Crawl).
	- Data Size:
	- Unfiltered: 145GB (~459 million documents)
	- Filtered: 121GB (~382 million documents)
	- Preprocessing: Filtering included correcting encoding errors (e.g., erroneous umlauts), removing spam and non-German documents using language detection and syntactic filtering.

	### Filtering Metrics
	- Stopword Ratio: Detects spam and meaningless content.
	- Punctuation Ratio: Detects abnormal punctuation patterns.
	- Upper Token Ratio: Identifies documents with excessive uppercase tokens (often noisy content).

	## Training Configuration
	- Framework: [Fairseq](https://github.com/scheiblr/fairseq/tree/TPUv4_very_old)
	- Hardware:
	- Base Model: 256 TPUv3 pod/128 TPUv4 pod
	- Large Model: 128 TPUv4 pod
	- Training Time:
	- Base Model: 1.2 days
	- Large Model: 5.7 days
	- Batch Size: 8k tokens
	- Learning Rate:
	- Base: Peak LR = 0.0004
	- Large: Peak LR = 0.00015
	- Training Iterations: 100k steps with a 10k warm-up phase

	## Evaluation and Results
	GottBERT was evaluated across various downstream tasks:
	- NER: CoNLL 2003, GermEval 2014
	- Text Classification: GermEval 2018 (coarse & fine), 10kGNAD
	- NLI: German subset of XNLI

	Mertics:
	- NER and Text Classification: F1 Score
	- NLI: Accuracy


	Details:
	- bold values indicate the best performing model within one architecure (base, large), <ins>undescored</ins> values the second best.


	\| Model \| Accuracy NLI \| GermEval\_14 F1 \| CoNLL F1 \| Coarse F1 \| Fine F1 \| 10kGNAD F1 \|
	\|-------------------------------------\|--------------\|----------------\|----------\|-----------\|---------\|------------\|
	\| [GottBERT_base_best](https://huggingface.co/TUM/GottBERT_base_best) \| 80.82 \| 87.55 \| <ins>85.93</ins> \| 78.17 \| 53.30 \| 89.64 \|
	\| [GottBERT_base_last](https://huggingface.co/TUM/GottBERT_base_last) \| 81.04 \| 87.48 \| 85.61 \| <ins>78.18</ins> \| 53.92 \| 90.27 \|
	\| [GottBERT_filtered_base_best](https://huggingface.co/TUM/GottBERT_filtered_base_best) \| 80.56 \| <ins>87.57</ins> \| 86.14 \| 78.65 \| 52.82 \| 89.79 \|
	\| [GottBERT_filtered_base_last](https://huggingface.co/TUM/GottBERT_filtered_base_last) \| 80.74 \| 87.59 \| 85.66 \| 78.08 \| 52.39 \| 89.92 \|
	\| GELECTRA_base \| 81.70 \| 86.91 \| 85.37 \| 77.26 \| 50.07 \| 89.02 \|
	\| GBERT_base \| 80.06 \| 87.24 \| 85.16 \| 77.37 \| 51.51 \| 90.30 \|
	\| dbmdzBERT \| 68.12 \| 86.82 \| 85.15 \| 77.46 \| 52.07 \| 90.34 \|
	\| GermanBERT \| 78.16 \| 86.53 \| 83.87 \| 74.81 \| 47.78 \| 90.18 \|
	\| XLM-R_base \| 79.76 \| 86.14 \| 84.46 \| 77.13 \| 50.54 \| 89.81 \|
	\| mBERT \| 77.03 \| 86.67 \| 83.18 \| 73.54 \| 48.32 \| 88.90 \|
	\| [GottBERT_large](https://huggingface.co/TUM/GottBERT_large) \| 82.46 \| 88.20 \| <ins>86.78</ins> \| 79.40 \| 54.61 \| 90.24 \|
	\| [GottBERT_filtered_large_best](https://huggingface.co/TUM/GottBERT_filtered_large_best) \| 83.31 \| 88.13 \| 86.30 \| 79.32 \| 54.70 \| 90.31 \|
	\| [GottBERT_filtered_large_last](https://huggingface.co/TUM/GottBERT_filtered_large_last) \| 82.79 \| <ins>88.27</ins> \| 86.28 \| 78.96 \| 54.72 \| 90.17 \|
	\| GELECTRA_large \| 86.33 \| <ins>88.72</ins> \| <ins>86.78</ins> \| 81.28 \| <ins>56.17</ins> \| 90.97 \|
	\| GBERT_large \| <ins>84.21</ins> \| <ins>88.72</ins> \| 87.19 \| <ins>80.84</ins> \| 57.37 \| <ins>90.74</ins> \|
	\| XLM-R_large \| 84.07 \| 88.83 \| 86.54 \| 79.05 \| 55.06 \| 90.17 \|


	## Model Architecture
	- Base Model: 12 layers, 125M parameters, 52k token vocabulary.
	- Large Model: 24 layers, 355M parameters, 52k token vocabulary.

	### Tokenizer
	- Type: GPT-2 Byte-Pair Encoding (BPE)
	- Vocabulary Size: 52k subword tokens
	- Trained on: 40GB subsample of the unfiltered German OSCAR corpus.

	## Limitations
	- Filtered vs Unfiltered Data: Minor improvements seen with filtered data, but not significant enough to justify filtering in every case.
	- Computation Limitations: Fixed memory allocation on TPUs required processing data as a single stream, unlike GPU training which preserves document boundaries. Training was performed in 32-bit mode due to framework limitations, increasing memory usage.

	## Fairseq Checkpoints
	Get the fairseq checkpoints [here](https://drive.proton.me/urls/CFSGE8ZK9R#1F1G727lv77k).

	## Citations
	If you use GottBERT in your research, please cite the following paper:
	```bibtex
	@misc{scheible2020gottbertpuregermanlanguage,
	title={GottBERT: a pure German Language Model},
	author={Raphael Scheible and Fabian Thomczyk and Patric Tippmann and Victor Jaravine and Martin Boeker},
	year={2020},
	eprint={2012.02110},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2012.02110},
	}
	```