license: mit

Model description

An xlm-roberta-large model fine-tuned on ~1.6 million annotated statements from the Manifesto Corpus (version 2023a). The model can be used to categorize any type of text into 56 different political topics according to the Manifesto Project's coding scheme (Handbook 4). It works for all languages that xlm-roberta-large was pretrained on (overview), but it will perform best for the 38 languages of the Manifesto Corpus on which it was fine-tuned:

Language Language Language Language Language
armenian bosnian bulgarian catalan croatian
czech danish dutch english estonian
finnish french galician georgian german
greek hebrew hungarian icelandic italian
japanese korean latvian lithuanian macedonian
montenegrin norwegian polish portuguese romanian
russian serbian slovak slovenian spanish
swedish turkish ukrainian

The context model variant additionally incorporates the sentences surrounding a statement, which improves the classification results for ambiguous sentences (see Training Procedure for details).

Important

We slightly modified the classification head of the XLMRobertaForSequenceClassification model (we removed the tanh activation and the intermediate linear layer), as that improved the model performance for this task considerably. To load the full model correctly, pass the trust_remote_code=True argument to the from_pretrained method.

How to use

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("manifesto-project/xlm-roberta-political-56topics-context-2023a", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

sentence = "These principles are under threat."
context = "Human rights and international humanitarian law are fundamental pillars of a secure global system. These principles are under threat. Some of the world's most powerful states choose to sell arms to human-rights abusing states."
# For sentences without additional context, just use the sentence itself as the context.
# Example: context = "These principles are under threat."


inputs = tokenizer(sentence,
                   context,
                   return_tensors="pt",
                   max_length=300,  # we limited the input to 300 tokens during fine-tuning
                   padding="max_length",
                   truncation=True
                   )

logits = model(**inputs).logits

probabilities = torch.softmax(logits, dim=1).tolist()[0]
probabilities = {model.config.id2label[index]: round(probability * 100, 2) for index, probability in enumerate(probabilities)}
probabilities = dict(sorted(probabilities.items(), key=lambda item: item[1], reverse=True))
print(probabilities)
# {'201 - Freedom and Human Rights': 90.76, '107 - Internationalism: Positive': 5.82, '105 - Military: Negative': 0.66...

predicted_class = model.config.id2label[logits.argmax().item()]
print(predicted_class)
# 201 - Freedom and Human Rights
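
If only the most likely categories are of interest, the sorted `probabilities` dictionary from the snippet above can be trimmed further. The following is a minimal sketch in plain Python; the helper name `top_k_labels` and the `min_percent` cut-off are illustrative assumptions and not part of the model or of the transformers API.

```python
def top_k_labels(probabilities, k=3, min_percent=1.0):
    """Return up to k (label, percent) pairs whose percent is >= min_percent.

    `probabilities` maps label -> percentage, as built in the snippet above.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [(label, pct) for label, pct in ranked[:k] if pct >= min_percent]


# Values copied from the example output above (remaining categories omitted).
example = {
    "201 - Freedom and Human Rights": 90.76,
    "107 - Internationalism: Positive": 5.82,
    "105 - Military: Negative": 0.66,
}
print(top_k_labels(example))
# [('201 - Freedom and Human Rights', 90.76), ('107 - Internationalism: Positive', 5.82)]
```

The third category is dropped here because it falls below the hypothetical 1% cut-off; set `min_percent=0` to keep every one of the top k entries.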

Training Procedure

The model was trained on all quasi-sentences of the Manifesto Corpus (version 2023a) except for 10% that were held out for the final test and evaluation. This results in a training dataset of 1,601,329 quasi-sentences. Because the context-including model input carries a risk of data leakage between train and test data, we did not split quasi-sentences randomly. Instead, we randomly split the dataset at the manifesto level, so that 1,779 manifestos and all their quasi-sentences were assigned to the train set and 198 to the test set.
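
A manifesto-level split of this kind can be sketched as follows. This is an illustration and not the original training code: the record layout `(manifesto_id, sentence)`, the function name `split_by_manifesto`, and the seed are assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_manifesto(records, test_fraction=0.1, seed=42):
    """Split (manifesto_id, sentence) records so that no manifesto spans both sets."""
    by_manifesto = defaultdict(list)
    for manifesto_id, sentence in records:
        by_manifesto[manifesto_id].append(sentence)

    # Shuffle the manifesto ids, not the individual sentences.
    manifesto_ids = sorted(by_manifesto)
    random.Random(seed).shuffle(manifesto_ids)

    n_test = max(1, round(len(manifesto_ids) * test_fraction))
    held_out = set(manifesto_ids[:n_test])

    train = [(m, s) for m, s in records if m not in held_out]
    test = [(m, s) for m, s in records if m in held_out]
    return train, test


# Toy data: 10 hypothetical manifestos with 3 quasi-sentences each.
records = [(f"manifesto_{i}", f"sentence {j}") for i in range(10) for j in range(3)]
train, test = split_by_manifesto(records)
train_ids = {m for m, _ in train}
test_ids = {m for m, _ in test}
print(len(train), len(test), train_ids & test_ids)
# 27 3 set()
```

Because whole manifestos are assigned to one side, a test sentence can never appear inside the context window of a training sentence.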

As training parameters, we used the following settings: learning rate: 1e-5, weight decay: 0.01, epochs: 1, batch size: 4, gradient accumulation steps: 4 (effective batch size: 16).
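
As a quick check of these settings, the arithmetic below derives the effective batch size and the approximate number of optimizer steps for the single epoch. It assumes training on a single device and that a final partial batch still triggers an optimizer step; neither detail is stated above.

```python
import math

# Values stated in this model card.
train_size = 1_601_329            # quasi-sentences in the training set
batch_size = 4
gradient_accumulation_steps = 4
epochs = 1

effective_batch_size = batch_size * gradient_accumulation_steps
# Assumption: single device, last partial batch still yields an optimizer step.
optimizer_steps = math.ceil(train_size / effective_batch_size) * epochs

print(effective_batch_size, optimizer_steps)
# 16 100084
```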

Context

To adapt the model to the task of classifying statements in manifestos, we made some modifications to the traditional training setup. Given that human annotators in the Manifesto Project are encouraged to use surrounding sentences to interpret ambiguous statements, we combined statements with their context for our model's input. Specifically, we used a sentence-pair input: the statement to be classified is followed by the separator tokens and then by a larger context of 200 tokens in which the statement is embedded. Here is an example:

"<s> We must right the wrongs in our democracy, </s> </s> To turn this crisis into a crucible, from which we will forge a stronger, brighter, and more equitable future. We must right the wrongs in our democracy, redress the systemic injustices that have long plagued our society, throw open the doors of opportunity for all Americans and reinvent our institutions at home and our leadership abroad. </s>"

The second part, which contains the context, is greedily filled until it contains 200 tokens. Our tests showed that including the context improved the performance of the classification model considerably (about 7 percentage points of accuracy, 0.57 vs. 0.64 in the Model Performance table). We also tried other approaches: a duo of two XLM-RoBERTa models, where one receives the sentence and the other the context, and a shared-layer model, where both inputs are fed separately through the same model. Both variants performed similarly to our sentence-pair approach but led to higher complexity and computing costs, which is why we ultimately opted for the sentence-pair input to include the surrounding context.
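
The greedy filling can be approximated as in the sketch below. The exact procedure used for training is not documented here, so the alternating before/after order, the function name `build_context`, and the whitespace token count (a stand-in for the XLM-RoBERTa subword tokenizer) are all assumptions.

```python
def build_context(sentences, target_index, max_tokens=200, count_tokens=None):
    """Greedily grow a context window around sentences[target_index].

    Neighbouring sentences are added alternately before and after the target
    until neither neighbour fits into the remaining token budget.
    """
    if count_tokens is None:
        # Stand-in: whitespace tokens instead of XLM-RoBERTa subword tokens.
        def count_tokens(text):
            return len(text.split())

    left = right = target_index
    used = count_tokens(sentences[target_index])
    grew = True
    while grew:
        grew = False
        for side in ("before", "after"):
            candidate = left - 1 if side == "before" else right + 1
            if not 0 <= candidate < len(sentences):
                continue  # no more sentences on this side
            cost = count_tokens(sentences[candidate])
            if used + cost > max_tokens:
                continue  # this neighbour does not fit
            used += cost
            if side == "before":
                left = candidate
            else:
                right = candidate
            grew = True
    return " ".join(sentences[left:right + 1])


sentences = [
    "To turn this crisis into a crucible.",
    "We must right the wrongs in our democracy.",
    "We will reinvent our institutions.",
]
# With a tiny budget of 15 whitespace tokens only the preceding sentence fits.
print(build_context(sentences, target_index=1, max_tokens=15))
```

The returned string would then be passed as the second argument of the tokenizer call shown in How to use, with the target sentence as the first argument.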

Model Performance

The model was evaluated on a test set of 199,046 annotated manifesto statements.

Overall

Model Accuracy Top2_Acc Top3_Acc Precision Recall F1_Macro MCC Cross-Entropy
Sentence Model 0.57 0.73 0.81 0.49 0.43 0.45 0.55 1.5
Context Model 0.64 0.81 0.88 0.54 0.52 0.53 0.62 1.15
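
The Top2_Acc and Top3_Acc columns are read here in the usual sense of top-k accuracy: a prediction counts as correct if the annotated category is among the k highest-scoring categories. That reading is an assumption, and the sketch below uses hypothetical toy predictions, not model output.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k):
    """Share of items whose gold label is among the k highest-ranked predictions."""
    hits = sum(
        gold in ranked[:k]
        for ranked, gold in zip(ranked_predictions, gold_labels)
    )
    return hits / len(gold_labels)


# Hypothetical rankings (best first) for four statements, with their gold codes.
ranked = [
    ["201", "107", "105"],
    ["504", "503", "506"],
    ["411", "416", "501"],
    ["107", "201", "108"],
]
gold = ["201", "503", "501", "305"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(ranked, gold, k))
# 1 0.25
# 2 0.5
# 3 0.75
```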

Categories

Category Precision Recall F1 n_test(%) n_predicted(%)
101 0.50 0.48 0.49 0.30% 0.29%
102 0.56 0.61 0.58 0.09% 0.10%
103 0.51 0.36 0.42 0.28% 0.20%
104 0.78 0.81 0.79 1.57% 1.64%
105 0.69 0.70 0.69 0.34% 0.34%
106 0.59 0.57 0.58 0.33% 0.32%
107 0.68 0.66 0.67 2.24% 2.17%
108 0.66 0.68 0.67 1.20% 1.24%
109 0.52 0.39 0.45 0.17% 0.13%
110 0.63 0.68 0.65 0.36% 0.38%
201 0.58 0.59 0.59 2.16% 2.20%
202 0.62 0.63 0.62 3.25% 3.28%
203 0.46 0.47 0.47 0.19% 0.19%
204 0.61 0.37 0.46 0.25% 0.15%
301 0.66 0.71 0.68 2.13% 2.29%
302 0.38 0.25 0.30 0.17% 0.11%
303 0.58 0.60 0.59 5.12% 5.31%
304 0.67 0.65 0.66 1.38% 1.34%
305 0.59 0.57 0.58 2.32% 2.22%
401 0.45 0.36 0.40 1.50% 1.21%
402 0.61 0.58 0.59 2.73% 2.60%
403 0.56 0.51 0.53 3.59% 3.25%
404 0.30 0.15 0.20 0.58% 0.28%
405 0.43 0.51 0.47 0.18% 0.21%
406 0.38 0.46 0.42 0.26% 0.31%
407 0.56 0.52 0.54 0.40% 0.38%
408 0.28 0.17 0.21 1.34% 0.79%
409 0.37 0.21 0.27 0.24% 0.14%
410 0.53 0.50 0.52 2.22% 2.08%
411 0.73 0.75 0.74 8.32% 8.53%
412 0.26 0.20 0.22 0.58% 0.45%
413 0.49 0.63 0.55 0.29% 0.37%
414 0.58 0.55 0.56 1.38% 1.32%
415 0.14 0.23 0.18 0.05% 0.07%
416 0.52 0.49 0.50 2.45% 2.35%
501 0.69 0.78 0.73 4.77% 5.35%
502 0.78 0.84 0.81 3.08% 3.32%
503 0.61 0.63 0.62 5.96% 6.11%
504 0.71 0.76 0.74 10.05% 10.76%
505 0.46 0.37 0.41 0.69% 0.55%
506 0.78 0.82 0.80 5.42% 5.72%
507 0.45 0.26 0.33 0.14% 0.08%
601 0.52 0.46 0.49 1.79% 1.57%
602 0.35 0.34 0.34 0.24% 0.24%
603 0.65 0.68 0.67 1.36% 1.42%
604 0.62 0.48 0.54 0.57% 0.44%
605 0.72 0.74 0.73 4.22% 4.33%
606 0.56 0.48 0.51 1.45% 1.23%
607 0.57 0.67 0.62 1.08% 1.25%
608 0.48 0.48 0.48 0.41% 0.41%
701 0.62 0.66 0.64 3.35% 3.59%
702 0.42 0.30 0.35 0.08% 0.06%
703 0.75 0.87 0.80 2.65% 3.07%
704 0.43 0.32 0.37 0.57% 0.43%
705 0.38 0.33 0.35 0.80% 0.69%
706 0.43 0.37 0.39 1.35% 1.16%
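
The F1 column is the harmonic mean of precision and recall. The short sketch below recomputes it for three rows of the table above; because the published precision and recall values are rounded to two decimals, a recomputed F1 can differ from the reported one by 0.01 in some other rows.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# category -> (precision, recall, reported F1), copied from the table above.
rows = {
    "101": (0.50, 0.48, 0.49),
    "104": (0.78, 0.81, 0.79),
    "404": (0.30, 0.15, 0.20),
}
for category, (precision, recall, reported) in rows.items():
    print(category, round(f1_score(precision, recall), 2), reported)
```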