---
language: de
---

Welcome to ParlBERT-Topic-German!

🏷 Model description

This model was trained on ~10k manually annotated political interpellations (📚 Breunig/Schnatterer 2019), labeled with Comparative Agendas Project topics, to classify text into one of twenty labels (see the annotation codebook).

🗃 Dataset

| Party | Speeches | Tokens |
|---|---:|---:|
| CDU/CSU | 7,635 | 4,862,654 |
| SPD | 5,321 | 3,158,315 |
| AfD | 3,465 | 1,844,707 |
| FDP | 3,067 | 1,593,108 |
| The Greens | 2,866 | 1,522,305 |
| The Left | 2,671 | 1,394,089 |
| cross-bencher | 200 | 86,170 |
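For a quick sense of scale, the table can be summarized as average tokens per speech for each party. A minimal sketch, using only the figures from the table above:

```python
# Speeches and token counts per party, taken from the table above.
corpus = {
    "CDU/CSU": (7_635, 4_862_654),
    "SPD": (5_321, 3_158_315),
    "AfD": (3_465, 1_844_707),
    "FDP": (3_067, 1_593_108),
    "The Greens": (2_866, 1_522_305),
    "The Left": (2_671, 1_394_089),
    "cross-bencher": (200, 86_170),
}

for party, (speeches, tokens) in corpus.items():
    # Average speech length in tokens, rounded to whole tokens.
    print(f"{party}: {tokens / speeches:.0f} tokens/speech")
```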

🏃🏼‍♂️Model training

ParlBERT-Topic was fine-tuned for topic classification on the interpellations dataset from the Comparative Agendas Project, starting from a domain-adapted model (masked language modeling with mlm_probability=0.15). We used the Hugging Face Trainer for fine-tuning.
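The domain-adaptation step relies on masked language modeling: a fraction of tokens (here 15%) is hidden and the model learns to reconstruct them. A toy illustration of the masking step, not the actual training code:

```python
import random

def mask_tokens(tokens, mlm_probability=0.15, mask_token="[MASK]", seed=0):
    """Replace ~15% of tokens with [MASK], as in BERT-style MLM pre-training."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_probability:
            masked.append(mask_token)
            labels.append(tok)   # the model must predict the original token
        else:
            masked.append(tok)
            labels.append(None)  # not part of the MLM loss
    return masked, labels

masked, labels = mask_tokens("Wir fragen die Bundesregierung nach dem Haushalt".split())
```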

🤖 Use

```python
from transformers import pipeline

pipeline_classification_topics = pipeline(
    "text-classification",
    model="chkla/parlbert-topics-german",
    tokenizer="bert-base-german-cased",
    return_all_scores=False,
    device=0,
)

text = "Sachgebiet Ausschließliche Gesetzgebungskompetenz des Bundes über die Zusammenarbeit des Bundes und der Länder zum Schutze der freiheitlichen demokratischen Grundordnung, des Bestandes und der Sicherheit des Bundes oder eines Landes Wir fragen die Bundesregierung"

pipeline_classification_topics(text)  # Government
```
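The pipeline returns a list of `{"label": ..., "score": ...}` dicts; with `return_all_scores=True` all twenty topics are scored at once, and the top labels can be picked manually. A small helper sketch (the example scores below are made up for illustration, not real model output):

```python
def top_labels(scores, k=1):
    """Return the k highest-scoring entries from a pipeline score list."""
    ranked = sorted(scores, key=lambda s: s["score"], reverse=True)
    return ranked[:k]

# Illustrative output shape only -- real scores come from the pipeline above.
example = [
    {"label": "Government", "score": 0.91},
    {"label": "Defense", "score": 0.05},
    {"label": "International", "score": 0.04},
]
print(top_labels(example))  # [{'label': 'Government', 'score': 0.91}]
```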

📊 Evaluation

The model was evaluated on a held-out evaluation set (20% of the data):

| Label | F1 | Support |
|---|---:|---:|
| International | 80.0 | 1,126 |
| Defense | 85.0 | 1,099 |
| Government | 71.3 | 989 |
| International | 76.5 | 978 |
| International | 76.6 | 845 |
| International | 86.0 | 800 |
| International | 67.1 | 0.8021 |
| International | 78.6 | 0.8021 |
| International | 78.2 | 0.8021 |
| International | 64.4 | 0.8021 |
| International | 81.0 | 0.8021 |
| International | 69.1 | 0.8021 |
| International | 62.8 | 0.8021 |
| International | 76.3 | 0.8021 |
| International | 49.2 | 0.8021 |
| International | 63.0 | 0.8021 |
| International | 71.6 | 0.8021 |
| International | 79.6 | 0.8021 |
| International | 61.5 | 0.8021 |
| International | 45.4 | 0.8021 |
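Per-label F1 scores like these are commonly summarized as macro-F1, the unweighted mean across all labels. A minimal sketch with illustrative values (not an official aggregate for this model):

```python
def macro_f1(per_label_f1):
    """Unweighted mean of per-label F1 scores (macro average)."""
    return sum(per_label_f1) / len(per_label_f1)

# Illustrative per-label F1 values only.
scores = [80.0, 85.0, 71.3, 76.5]
print(f"macro-F1: {macro_f1(scores):.2f}")  # macro-F1: 78.20
```

Macro averaging treats every label equally regardless of support, so rare labels influence the aggregate as much as frequent ones; a support-weighted mean would give the opposite emphasis.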

⚠️ Intended Uses & Potential Limitations

The model can only be a starting point for diving into the exciting field of policy topic classification in political texts. Be aware, however, that models are often highly topic-dependent: this model may perform less well on topics and text types not represented in the training set.

👥 Cite

```bibtex
@article{klamm2022frameast,
  title={FrameASt: A Framework for Second-level Agenda Setting in Parliamentary Debates through the Lense of Comparative Agenda Topics},
  author={Klamm, Christopher and Rehbein, Ines and Ponzetto, Simone},
  journal={ParlaCLARIN III at LREC2022},
  year={2022}
}
```

🐦 Twitter: @chklamm