license: mit
Model description
An xlm-roberta-large model fine-tuned on all ~1.8 million annotated statements contained in the Manifesto Corpus (version 2023a). The model can be used to categorize any type of text into 56 different political categories according to the Manifesto Project's coding scheme (Handbook 4).
The context model variant additionally incorporates the surrounding sentences of a statement to improve the classification results for ambiguous sentences. During fine-tuning we collected the surrounding sentences of a statement and merged them with the statement itself to provide the larger context of a sentence as the second part of a sentence pair input. We limited the statement itself to 100 tokens and the context of the statement to 200 tokens.
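The pairing of a statement with its surrounding sentences can be sketched roughly as follows. This is an illustrative sketch, not the preprocessing code used for fine-tuning: the `build_pair` and `truncate` helpers, the window of one neighbouring sentence on each side, and the whitespace-based token counting are all assumptions. The actual limits of 100 and 200 tokens refer to tokens produced by the XLM-RoBERTa tokenizer, not to words.

```python
from typing import List, Tuple

MAX_STATEMENT_TOKENS = 100  # statement budget stated in the model description
MAX_CONTEXT_TOKENS = 200    # context budget stated in the model description


def truncate(text: str, max_tokens: int) -> str:
    """Keep at most max_tokens whitespace-separated tokens.

    Assumption: whitespace tokens stand in for subword tokens here.
    """
    return " ".join(text.split()[:max_tokens])


def build_pair(sentences: List[str], index: int, window: int = 1) -> Tuple[str, str]:
    """Return (statement, context) for the sentence at `index`.

    The context merges the statement with up to `window` neighbouring
    sentences on each side (hypothetical window size).
    """
    start = max(0, index - window)
    end = min(len(sentences), index + window + 1)
    statement = truncate(sentences[index], MAX_STATEMENT_TOKENS)
    context = truncate(" ".join(sentences[start:end]), MAX_CONTEXT_TOKENS)
    return statement, context


document = [
    "Human rights and international humanitarian law are fundamental pillars of a secure global system.",
    "These principles are under threat.",
    "Some of the world's most powerful states choose to sell arms to human-rights abusing states.",
]
statement, context = build_pair(document, index=1)
print(statement)
print(context)
```

The resulting `statement` and `context` strings correspond to the first and second part of the sentence pair input passed to the tokenizer.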
Important
We slightly modified the classification head of the XLMRobertaForSequenceClassification model (removed the tanh activation and the intermediate linear layer), as that improved the model performance for this task considerably. To load the full model correctly, you have to pass the trust_remote_code=True argument in the from_pretrained call.
How to use
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("manifesto-project/xlm-roberta-political-56topics-context-2023a", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
sentence = "These principles are under threat."
context = "Human rights and international humanitarian law are fundamental pillars of a secure global system. These principles are under threat. Some of the world's most powerful states choose to sell arms to human-rights abusing states."
inputs = tokenizer(
    sentence,
    context,
    return_tensors="pt",
    max_length=300,  # we limited the input to 300 tokens during fine-tuning
    padding="max_length",
    truncation=True,
)
logits = model(**inputs).logits
probabilities = torch.softmax(logits, dim=1).tolist()[0]
probabilities = {model.config.id2label[index]: round(probability * 100, 2) for index, probability in enumerate(probabilities)}
probabilities = dict(sorted(probabilities.items(), key=lambda item: item[1], reverse=True))
print(probabilities)
# {'201 - Freedom and Human Rights': 90.76, '107 - Internationalism: Positive': 5.82, '105 - Military: Negative': 0.66...
predicted_class = model.config.id2label[logits.argmax().item()]
print(predicted_class)
# 201 - Freedom and Human Rights
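When only the most likely categories matter, the probability dictionary can be reduced to a short list. The following is a minimal sketch under stated assumptions: `top_k` and its `min_share` parameter are hypothetical helpers, not part of the model or of transformers, and the example values are copied from the sample output above.

```python
from typing import Dict, List, Tuple


def top_k(
    probabilities: Dict[str, float], k: int = 3, min_share: float = 0.0
) -> List[Tuple[str, float]]:
    """Return up to k (label, percentage) pairs, highest percentage first.

    Entries below min_share (in percent) are dropped. Hypothetical helper.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [(label, share) for label, share in ranked[:k] if share >= min_share]


# Values taken from the sample output shown above.
example = {
    "201 - Freedom and Human Rights": 90.76,
    "107 - Internationalism: Positive": 5.82,
    "105 - Military: Negative": 0.66,
}
print(top_k(example, k=2))
```

A threshold such as `min_share=1.0` would additionally drop categories that receive less than one percent, which may help when reviewing ambiguous sentences.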