---
language:
  - de
pipeline_tag: fill-mask
tags:
  - parliamentary protocols
  - political texts
widget:
  - text: Diese Themen gehören nicht ins [MASK].
---

⚠️ This version is only trained on around 5 million sentences (perplexity with domain adaptation: 3.38; without: 13.38). The final version, trained on around 30 million sentences, will be available soon.

🚀 ParlBERT-v2 is a more general version of ParlBERT-v1, including texts from both the federal and the state level in Germany; the first version was trained on state-level texts only.

# ParlBERT v2

This model is based on the German BERT (GBERT) architecture, using "deepset/gbert-base" as the base model. It has been trained on over 30 million German political sentences from the GerParCor corpus (Abrami et al. 2022) for three epochs to provide a domain-adapted language model for German political texts, and it can be used in a variety of applications dealing with such texts.

## 📚 Dataset

"GerParCor is a genre-specific corpus of (predominantly historical) German-language parliamentary protocols from three centuries and four countries, including state and federal level data." (Abrami et al. 2022)

## 🤖 Model training

During training, a masked language modeling objective was used with a token masking probability of 15%. Training ran for three epochs, i.e. the entire dataset was passed through the model three times.
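
The following is a minimal sketch of such a domain-adaptation setup with the Hugging Face `Trainer`. The toy sentences, batch size, and output directory are illustrative assumptions, not the actual GerParCor training configuration; only the 15% masking probability, the three epochs, and the "deepset/gbert-base" base model are taken from this card.

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("deepset/gbert-base")
model = AutoModelForMaskedLM.from_pretrained("deepset/gbert-base")

# Toy stand-in for the parliamentary corpus (illustrative only)
texts = [
    "Diese Themen gehören nicht ins Parlament.",
    "Der Bundestag berät heute über den Haushalt.",
]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Mask 15% of the tokens, as described above
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="parlbert-adaptation",   # assumed name
    num_train_epochs=3,                 # three passes over the corpus
    per_device_train_batch_size=8,      # illustrative value
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```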

## 👨‍💻 Model Use

```python
from transformers import pipeline

# full hub id of this model: chkla/parlbert-german-v2
model = pipeline('fill-mask', model='chkla/parlbert-german-v2')
model("Diese Themen gehören nicht ins [MASK].")
```

## ⚠️ Limitations

The German ParlBERT has limitations and potential biases. The GerParCor corpus contains only texts from the political domain, so the model may not perform well on texts from other domains; in particular, it may not be suitable for analyzing social media posts or other informal text genres. The training data is derived from contemporary German political texts, which may reflect certain biases or perspectives: for instance, the corpus includes texts from specific political parties or interest groups, which may lead to over- or underrepresentation of certain viewpoints. To address these limitations and potential biases, users are encouraged to evaluate the model's performance on their specific use case and to carefully consider how representative the training data is for their target text domain.

🐦 Twitter: @chklamm