---
license: mit
---

## Model description

An xlm-roberta-large model fine-tuned on all ~1.8 million annotated statements contained in the [Manifesto Corpus](https://manifesto-project.wzb.eu/information/documents/corpus) (version 2023a). The model can be used to categorize any type of text into 56 political topics according to the Manifesto Project's coding scheme ([Handbook 4](https://manifesto-project.wzb.eu/coding_schemes/mp_v4)).

This context model variant additionally incorporates the sentences surrounding a statement to improve classification of ambiguous sentences. During fine-tuning, we collected the sentences surrounding each statement and merged them with the statement itself, so that the larger context is provided as the second part of a sentence-pair input. We limited the statement itself to 100 tokens and its context to 200 tokens.

**Important**

We slightly modified the classification head of the `XLMRobertaForSequenceClassification` model (removing the tanh activation and the intermediate linear layer), as this considerably improved the model's performance on this task. To load the full model correctly, pass `trust_remote_code=True` to the `from_pretrained` method.

## How to use

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "manifesto-project/xlm-roberta-political-56topics-context-2023a",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

sentence = "These principles are under threat."
context = "Human rights and international humanitarian law are fundamental pillars of a secure global system. These principles are under threat. Some of the world's most powerful states choose to sell arms to human-rights abusing states."

# For sentences without additional context, just use the sentence itself as the context.
# Example: context = "These principles are under threat."

inputs = tokenizer(
    sentence,
    context,
    return_tensors="pt",
    max_length=300,  # we limited the input to 300 tokens during fine-tuning
    padding="max_length",
    truncation=True,
)

logits = model(**inputs).logits

probabilities = torch.softmax(logits, dim=1).tolist()[0]
probabilities = {
    model.config.id2label[index]: round(probability * 100, 2)
    for index, probability in enumerate(probabilities)
}
probabilities = dict(sorted(probabilities.items(), key=lambda item: item[1], reverse=True))
print(probabilities)
# {'201 - Freedom and Human Rights': 90.76, '107 - Internationalism: Positive': 5.82, '105 - Military: Negative': 0.66...

predicted_class = model.config.id2label[logits.argmax().item()]
print(predicted_class)
# 201 - Freedom and Human Rights
```
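To make the classification-head modification concrete, here is a minimal sketch of the simplified head described above. The default `XLMRobertaForSequenceClassification` head applies a dense layer, a tanh activation, dropout, and an output projection to the `<s>` (CLS) token representation; the variant used here keeps only dropout and the final linear projection. The class name and dropout value below are illustrative assumptions — the actual implementation ships with the model and is loaded via `trust_remote_code=True`.

```python
import torch
import torch.nn as nn

class SimplifiedClassificationHead(nn.Module):
    """Illustrative sketch: a classification head without the tanh
    activation and the intermediate dense layer of the default head."""

    def __init__(self, hidden_size: int, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: encoder output of shape (batch, seq_len, hidden_size)
        x = features[:, 0, :]          # take the <s> (CLS) token representation
        x = self.dropout(x)
        return self.out_proj(x)        # logits of shape (batch, num_labels)

# xlm-roberta-large has a hidden size of 1024; the model predicts 56 topics.
head = SimplifiedClassificationHead(hidden_size=1024, num_labels=56)
logits = head(torch.randn(2, 16, 1024))
print(logits.shape)  # torch.Size([2, 56])
```

This single-projection head maps the pooled representation straight to the 56 topic logits, which is the change that improved performance on this task.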