tburst's picture
Update README.md
de11376
|
raw
history blame
No virus
2.89 kB
---
license: mit
---
## Model description
An xlm-roberta-large model fine-tuned on all ~1,8 million annotated statements contained in the [manifesto corpus](https://manifesto-project.wzb.eu/information/documents/corpus) (version 2023a).
The model can be used to categorize any type of text into 56 different political topics according to the Manifesto Project's coding scheme ([Handbook 4](https://manifesto-project.wzb.eu/coding_schemes/mp_v4)).
The context model variant additionally incorporates the surrounding sentences of a statement to improve the classification results for ambiguous sentences.
During fine-tuning we collected the surrounding sentences of a statement and merged them with the statement itself to provide the larger context of a sentence as the second part of a sentence pair input.
We limited the statement itself to 100 tokens and the context of the statement to 200 tokens.
**Important**
We slightly modified the Classification Head of the `XLMRobertaModelForSequenceClassification` model (removed the tanh activation and the intermediate linear layer) as that improved the model performance for this task considerably.
To correctly load the full model, include the `trust_remote_code=True` argument when using the `from_pretrained method`.
## How to use
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("manifesto-project/xlm-roberta-political-56topics-context-2023a", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
sentence = "These principles are under threat."
context = "Human rights and international humanitarian law are fundamental pillars of a secure global system. These principles are under threat. Some of the world's most powerful states choose to sell arms to human-rights abusing states."
# For sentences without additional context, just use the sentence itself as the context.
# Example: context = "These principles are under threat."
inputs = tokenizer(sentence,
context,
return_tensors="pt",
max_length=300, #we limited the input to 300 tokens during finetuning
padding="max_length",
truncation=True
)
logits = model(**inputs).logits
probabilities = torch.softmax(logits, dim=1).tolist()[0]
probabilities = {model.config.id2label[index]: round(probability * 100, 2) for index, probability in enumerate(probabilities)}
probabilities = dict(sorted(probabilities.items(), key=lambda item: item[1], reverse=True))
print(probabilities)
# {'201 - Freedom and Human Rights': 90.76, '107 - Internationalism: Positive': 5.82, '105 - Military: Negative': 0.66...
predicted_class = model.config.id2label[logits.argmax().item()]
print(predicted_class)
# 201 - Freedom and Human Rights
```