
metadata

license: mit

Model description

An xlm-roberta-large model fine-tuned on all ~1.8 million annotated statements contained in the Manifesto Corpus (version 2023a). The model can be used to categorize any type of text into 56 different political topics according to the Manifesto Project's coding scheme (Handbook 4).

The context model variant additionally incorporates the sentences surrounding a statement to improve the classification results for ambiguous sentences. During fine-tuning, we collected the sentences surrounding each statement and merged them with the statement itself. This merged text is passed as the second part of a sentence-pair input and provides the larger context of the statement. We limited the statement itself to 100 tokens and its context to 200 tokens.
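The pairing described above can be sketched in plain Python. This is a minimal illustration, not the authors' preprocessing code: the helper name build_sentence_pair and the window parameter (how many neighbouring sentences to take on each side) are assumptions, since the number of surrounding sentences used during fine-tuning is not stated here. Token-level truncation is left to the tokenizer and is not shown.

```python
def build_sentence_pair(sentences, index, window=2):
    """Return (statement, context) for sentences[index].

    `window` is a hypothetical setting: the number of neighbouring
    sentences taken on each side of the statement.
    """
    start = max(0, index - window)
    end = min(len(sentences), index + window + 1)
    statement = sentences[index]
    # The context contains the statement itself plus its neighbours,
    # mirroring the sentence/context pair in the "How to use" section.
    context = " ".join(sentences[start:end])
    return statement, context


sentences = [
    "Human rights and international humanitarian law are fundamental pillars of a secure global system.",
    "These principles are under threat.",
    "Some of the world's most powerful states choose to sell arms to human-rights abusing states.",
]
statement, context = build_sentence_pair(sentences, index=1, window=1)
print(statement)
print(context)
```

When a statement has no neighbours (a one-sentence document), this helper returns the statement itself as the context, which matches the fallback recommended in the usage example.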

Important

We slightly modified the classification head of the XLMRobertaForSequenceClassification model (we removed the tanh activation and the intermediate linear layer), as that improved the model performance for this task considerably. To load the full model correctly, pass the trust_remote_code=True argument to the from_pretrained method.
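To make the change concrete, here is a rough PyTorch sketch of what such a simplified head could look like. It is an illustration under assumptions, not the code shipped with the model: the class name SimplifiedClassificationHead is hypothetical, and keeping a dropout layer before the projection is an assumption. The stock RoBERTa head applies dropout, a dense layer, tanh, dropout, and an output projection; the sketch drops the dense layer and the tanh. The hidden size of 1024 corresponds to xlm-roberta-large, and 56 is the number of topics.

```python
import torch
from torch import nn


class SimplifiedClassificationHead(nn.Module):
    """Hypothetical head: dropout followed by a single linear projection."""

    def __init__(self, hidden_size=1024, num_labels=56, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)  # assumption: dropout is kept
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, features):
        # features: (batch, seq_len, hidden_size); use the first (<s>) token.
        x = features[:, 0, :]
        x = self.dropout(x)
        # No intermediate dense layer and no tanh before the projection.
        return self.out_proj(x)


head = SimplifiedClassificationHead().eval()
logits = head(torch.zeros(2, 300, 1024))
print(logits.shape)  # torch.Size([2, 56])
```

Because the real head is defined in the model repository rather than in the transformers library, loading it still requires trust_remote_code=True.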

How to use

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("manifesto-project/xlm-roberta-political-56topics-context-2023a", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

sentence = "These principles are under threat."
context = "Human rights and international humanitarian law are fundamental pillars of a secure global system. These principles are under threat. Some of the world's most powerful states choose to sell arms to human-rights abusing states."
# For sentences without additional context, just use the sentence itself as the context.
# Example: context = "These principles are under threat."


inputs = tokenizer(sentence,
                   context,
                   return_tensors="pt",
                   max_length=300,  # we limited the input to 300 tokens during fine-tuning
                   padding="max_length",
                   truncation=True
                   )

logits = model(**inputs).logits

probabilities = torch.softmax(logits, dim=1).tolist()[0]
probabilities = {model.config.id2label[index]: round(probability * 100, 2) for index, probability in enumerate(probabilities)}
probabilities = dict(sorted(probabilities.items(), key=lambda item: item[1], reverse=True))
print(probabilities)
# {'201 - Freedom and Human Rights': 90.76, '107 - Internationalism: Positive': 5.82, '105 - Military: Negative': 0.66...

predicted_class = model.config.id2label[logits.argmax().item()]
print(predicted_class)
# 201 - Freedom and Human Rights