---
license: mit
---

## Model description

An xlm-roberta-large model fine-tuned on all ~1.8 million annotated statements contained in the [manifesto corpus](https://manifesto-project.wzb.eu/information/documents/corpus) (version 2023a).
The model can be used to categorize any type of text into 56 different political topics according to the Manifesto Project's coding scheme ([Handbook 4](https://manifesto-project.wzb.eu/coding_schemes/mp_v4)).

The context model variant additionally incorporates the surrounding sentences of a statement to improve the classification results for ambiguous sentences.
During fine-tuning, we collected the surrounding sentences of a statement and merged them with the statement itself; this larger context is passed as the second part of a sentence-pair input.
We limited the statement itself to 100 tokens and the context of the statement to 200 tokens.

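
The construction of the context described above can be sketched as follows. This is only an illustration and not the authors' preprocessing code: the helper name `build_context`, the `window` parameter, and the choice of one neighbouring sentence on each side are assumptions, since the description only states that the surrounding sentences are merged with the statement. The 100- and 200-token limits are not enforced in this sketch.

```python
def build_context(sentences, index, window=1):
    """Merge a statement with its neighbouring sentences into one string.

    `window` (an assumed parameter) is the number of sentences taken from
    each side of the statement at position `index`.
    """
    start = max(0, index - window)
    end = min(len(sentences), index + window + 1)
    return " ".join(sentences[start:end])


sentences = [
    "Human rights and international humanitarian law are fundamental pillars of a secure global system.",
    "These principles are under threat.",
    "Some of the world's most powerful states choose to sell arms to human-rights abusing states.",
]

statement = sentences[1]
context = build_context(sentences, 1)
# `statement` is the first and `context` the second part of the sentence pair.
```

With `window=0` the helper returns the statement itself, which matches the recommendation to use the sentence as its own context when no surrounding text is available.
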
**Important**

We slightly modified the classification head of the `XLMRobertaForSequenceClassification` model (removed the tanh activation and the intermediate linear layer), as that improved the model performance for this task considerably.
To correctly load the full model, include the `trust_remote_code=True` argument when calling the `from_pretrained` method.

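
To illustrate what this modification amounts to, here is a minimal PyTorch sketch of a classification head without the intermediate linear layer and the tanh activation. It is not the code shipped with the model (that code is loaded through `trust_remote_code=True`). The class name `SimplifiedClassificationHead`, the dropout rate, and the use of the first token's hidden state are assumptions modelled on the standard RoBERTa head in `transformers`. The hidden size of 1024 is that of xlm-roberta-large.

```python
import torch
from torch import nn


class SimplifiedClassificationHead(nn.Module):
    """Hypothetical head: dropout followed by a single linear projection."""

    def __init__(self, hidden_size=1024, num_labels=56, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)  # dropout rate is an assumption
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, features):
        # features: encoder output of shape (batch, sequence_length, hidden_size)
        x = features[:, 0, :]  # hidden state of the first token (<s>)
        x = self.dropout(x)
        return self.out_proj(x)  # logits of shape (batch, num_labels)


head = SimplifiedClassificationHead()
dummy_features = torch.randn(2, 300, 1024)  # random stand-in for encoder output
logits = head(dummy_features)
print(logits.shape)  # torch.Size([2, 56])
```
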
## How to use

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# trust_remote_code=True is required to load the modified classification head
model = AutoModelForSequenceClassification.from_pretrained("manifesto-project/xlm-roberta-political-56topics-context-2023a", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

sentence = "These principles are under threat."
context = "Human rights and international humanitarian law are fundamental pillars of a secure global system. These principles are under threat. Some of the world's most powerful states choose to sell arms to human-rights abusing states."
# For sentences without additional context, just use the sentence itself as the context.
# Example: context = "These principles are under threat."

# Encode the statement and its context as a sentence pair
inputs = tokenizer(sentence,
                   context,
                   return_tensors="pt",
                   max_length=300,  # we limited the input to 300 tokens during fine-tuning
                   padding="max_length",
                   truncation=True
                   )

logits = model(**inputs).logits

# Convert the logits to percentages per label, sorted from most to least likely
probabilities = torch.softmax(logits, dim=1).tolist()[0]
probabilities = {model.config.id2label[index]: round(probability * 100, 2) for index, probability in enumerate(probabilities)}
probabilities = dict(sorted(probabilities.items(), key=lambda item: item[1], reverse=True))
print(probabilities)
# {'201 - Freedom and Human Rights': 90.76, '107 - Internationalism: Positive': 5.82, '105 - Military: Negative': 0.66...

predicted_class = model.config.id2label[logits.argmax().item()]
print(predicted_class)
# 201 - Freedom and Human Rights
```