license: mit

Model description

An xlm-roberta-large model fine-tuned on all ~1.8 million annotated statements in the Manifesto Corpus (version 2023a). The model can be used to classify any type of text into 56 political categories according to the Manifesto Project's coding scheme (Handbook 4).

The context model variant additionally incorporates the sentences surrounding a statement to improve the classification of ambiguous sentences. During fine-tuning, we collected the surrounding sentences of each statement and merged them with the statement itself; this larger context is passed as the second part of a sentence-pair input. We limited the statement itself to 100 tokens and its context to 200 tokens.
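As an illustration of the pairing described above, here is a minimal sketch of how a statement and its context could be assembled from a list of sentences. The helper `build_pair`, the `window` of one neighbouring sentence on each side, and the whitespace-based token count are all assumptions made for this example; the actual fine-tuning pipeline and the number of surrounding sentences it used are not documented here, and the real limits apply to tokenizer tokens rather than words.

```python
def build_pair(sentences, index, window=1,
               max_statement_tokens=100, max_context_tokens=200,
               tokenize=str.split):
    """Return (statement, context) for sentences[index].

    Hypothetical helper: `window` and the whitespace tokenizer are
    illustrative stand-ins, not the settings used for fine-tuning.
    """
    # Truncate the statement on its own (crude word-level proxy for tokens).
    statement_tokens = tokenize(sentences[index])[:max_statement_tokens]

    # The context contains the neighbouring sentences AND the statement itself.
    start = max(0, index - window)
    end = min(len(sentences), index + window + 1)
    context_tokens = tokenize(" ".join(sentences[start:end]))[:max_context_tokens]

    return " ".join(statement_tokens), " ".join(context_tokens)


sentences = [
    "Human rights and international humanitarian law are fundamental pillars of a secure global system.",
    "These principles are under threat.",
    "Some of the world's most powerful states choose to sell arms to human-rights abusing states.",
]

statement, context = build_pair(sentences, index=1)
print(statement)
print(context)
```

The two returned strings can then be passed to the tokenizer as the first and second element of a sentence pair.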

Important

We slightly modified the classification head of the XLMRobertaForSequenceClassification model (removed the tanh activation and the intermediate linear layer), as this considerably improved model performance for this task. To load the full model correctly, you have to pass the trust_remote_code=True argument to the from_pretrained call.
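To make the change more concrete, the following is a rough sketch of what a head without the intermediate linear layer and tanh activation might look like in PyTorch. The class name `SimplifiedClassificationHead`, the retained dropout layer, and the dropout rate are assumptions for illustration; the code actually shipped with the model (and loaded via trust_remote_code=True) may differ.

```python
import torch
from torch import nn


class SimplifiedClassificationHead(nn.Module):
    """Hypothetical head: dropout followed by a single linear projection.

    The stock RoBERTa head applies dense -> tanh -> out_proj; this sketch
    drops the dense layer and the tanh. Keeping dropout is an assumption.
    """

    def __init__(self, hidden_size=1024, num_labels=56, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, features):
        # Use the representation of the first token (<s>) as the sequence summary.
        x = features[:, 0, :]
        x = self.dropout(x)
        return self.out_proj(x)


head = SimplifiedClassificationHead()
head.eval()  # disable dropout for a deterministic forward pass

# Dummy encoder output: batch of 2, 300 tokens, hidden size 1024.
dummy_features = torch.zeros(2, 300, 1024)
logits = head(dummy_features)
print(logits.shape)  # torch.Size([2, 56])
```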

How to use

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("manifesto-project/xlm-roberta-political-56topics-context-2023a", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

sentence = "These principles are under threat."
context = "Human rights and international humanitarian law are fundamental pillars of a secure global system. These principles are under threat. Some of the world's most powerful states choose to sell arms to human-rights abusing states."

inputs = tokenizer(sentence,
                   context,
                   return_tensors="pt",
                   max_length=300,  # we limited the input to 300 tokens during fine-tuning
                   padding="max_length",
                   truncation=True
                   )

logits = model(**inputs).logits

probabilities = torch.softmax(logits, dim=1).tolist()[0]
probabilities = {model.config.id2label[index]: round(probability * 100, 2) for index, probability in enumerate(probabilities)}
probabilities = dict(sorted(probabilities.items(), key=lambda item: item[1], reverse=True))
print(probabilities)
# {'201 - Freedom and Human Rights': 90.76, '107 - Internationalism: Positive': 5.82, '105 - Military: Negative': 0.66...

predicted_class = model.config.id2label[logits.argmax().item()]
print(predicted_class)
# 201 - Freedom and Human Rights