metadata

language: german
widget:
  - text: >-
      It has been determined that the amount of greenhouse gases have decreased
      by almost half because of the prevalence in the utilization of nuclear
      power.

Welcome to ParlBERT-Topic-German!

🤖 Model description

This model was trained on ~10k manually annotated political requests (📚 Stab et al. 2018), labeled with Comparative Agendas Project topics, to classify text into one of twenty labels: 🏷 TOPIC1 (0) and TOPIC2 (1) ...
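A minimal usage sketch follows. It assumes the model is published on the HuggingFace Hub under the id `chkla/parlbert-topic-german` (taken from the repository name) and that the `transformers` library is installed. The helper name `classify` is hypothetical and only serves this illustration.

```python
# Hub id assumed from the repository name; adjust if the model lives elsewhere.
MODEL_ID = "chkla/parlbert-topic-german"


def classify(texts, model_id=MODEL_ID):
    """Return one {'label': ..., 'score': ...} dict per input text."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    # The text-classification pipeline loads the tokenizer and model from the Hub.
    classifier = pipeline("text-classification", model=model_id)
    return classifier(list(texts))
```

Calling `classify(["..."])` downloads the model on first use and returns the top label with its score for each input. The label names depend on the model's configuration and are not listed in this card.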

🗃 Dataset

The dataset (📚 Stab et al. 2018) consists of sentences labeled as ARGUMENT (~11k) if they include a relevant reason for supporting or opposing a topic, or as NON-ARGUMENT (~14k) if they do not include reasons. The authors focus on controversial topics, i.e., topics that include "an obvious polarity to the possible outcomes", and compile a final set of eight controversial topics: abortion, school uniforms, death penalty, marijuana legalization, nuclear energy, cloning, gun control, and minimum wage.

TOPIC	ARGUMENT	NON-ARGUMENT
abortion	2,213	2,427
school uniforms	325	1,734
death penalty	325	2,083
marijuana legalization	325	1,262
nuclear energy	325	2,118
cloning	325	1,494
gun control	325	1,889
minimum wage	325	1,346

🏃🏼‍♂️Model training

ParlBERT-Topic was fine-tuned from ParlBERT (available on HuggingFace) for topic classification, using a questions dataset from the Comparative Agendas Project. We used the HuggingFace Trainer with the following hyperparameters:

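from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",  # assumed path; the original card does not specify one
    num_train_epochs=2,
    learning_rate=2.3102e-06,
    seed=8,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
)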

📊 Evaluation

The model was evaluated on a held-out evaluation set (20% of the data):

Model	Accuracy	F1	Recall (arg)	Recall (non-arg)	Precision (arg)	Precision (non-arg)
RoBERTArg	0.8193	0.8021	0.8463	0.7986	0.7623	0.8719

The confusion matrix on the same evaluation set:

	ARGUMENT	NON-ARGUMENT
ARGUMENT	2213	558
NON-ARGUMENT	325	1790
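As a sanity check, the summary metrics can be recomputed from the confusion matrix counts. The sketch below assumes that rows are true labels and columns are predictions, which the card does not state. It therefore refers to rows and columns by position rather than by class name.

```python
# Counts copied from the confusion matrix above.
# Assumption: rows are true labels, columns are predictions (not stated in the card).
matrix = [
    [2213, 558],
    [325, 1790],
]

n_classes = len(matrix)
total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(n_classes))
accuracy = correct / total

# Diagonal share of each row (recall, if rows are the true labels).
row_rates = [matrix[i][i] / sum(matrix[i]) for i in range(n_classes)]
# Diagonal share of each column (precision, if columns are the predictions).
col_rates = [
    matrix[i][i] / sum(row[i] for row in matrix) for i in range(n_classes)
]

print(f"accuracy: {accuracy:.4f}")
print("row rates:", [f"{r:.4f}" for r in row_rates])
print("column rates:", [f"{c:.4f}" for c in col_rates])
```

The accuracy computed this way, 4003 / 4886, rounds to the 0.8193 reported in the table. It does not depend on the row and column orientation, whereas the per-row and per-column rates do.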

⚠️ Intended Uses & Potential Limitations

This model is only a starting point for exploring policy topic classification in political texts. Be aware that models are often highly topic-dependent, so this one may perform worse on topics and text types not included in the training set.

Enjoy and stay tuned! 🚀

🐦 Twitter: @chklamm