Update README.md
---
license: mit
---

## Model description

An xlm-roberta-large model fine-tuned on all ~1.8 million annotated statements contained in the [manifesto corpus](https://manifesto-project.wzb.eu/information/documents/corpus) (version 2023a).
The model can be used to categorize any type of text into [56 different political categories](https://manifesto-project.wzb.eu/coding_schemes/mp_v4) according to the Manifesto Project's coding scheme (Handbook 4).

The context model variant additionally uses the surrounding sentences of a statement to improve classification results for ambiguous sentences.
During fine-tuning we collected the surrounding sentences of each statement and combined them with the statement itself, providing the larger context as the second part of a sentence-pair input.
We limited the statement itself to 100 tokens and its context to 200 tokens.
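The context construction described above can be sketched roughly as follows. This is a simplified illustration only: `build_context` and the fixed window of surrounding sentences are hypothetical, not the project's actual preprocessing code, which bounds the context by the 200-token limit rather than a sentence count.

```python
def build_context(sentences, index, max_before=2, max_after=2):
    """Combine a statement with its surrounding sentences to form the
    second part of the sentence-pair input (hypothetical sketch)."""
    before = sentences[max(0, index - max_before):index]
    after = sentences[index + 1:index + 1 + max_after]
    return " ".join(before + [sentences[index]] + after)

sentences = [
    "Human rights and international humanitarian law are fundamental pillars of a secure global system.",
    "These principles are under threat.",
    "Some of the world's most powerful states choose to sell arms to human-rights abusing states.",
]
statement = sentences[1]
context = build_context(sentences, 1)
print(context)
# Human rights and international humanitarian law are fundamental pillars of a secure global system. These principles are under threat. Some of the world's most powerful states choose to sell arms to human-rights abusing states.
```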

**Important**

We slightly modified the classification head of the XLMRobertaModelForSequenceClassification model (removed the tanh activation and the intermediate linear layer), as that considerably improved the model's performance on this task. To load the full model correctly, you have to pass the `trust_remote_code=True` argument in the `from_pretrained` call.
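The modification corresponds roughly to the head sketched below. This is for illustration only: the stock RoBERTa classification head applies an intermediate dense layer and a tanh activation before the output projection, both of which are dropped here; the actual implementation ships with the model and is loaded via `trust_remote_code=True`.

```python
import torch.nn as nn

class SimplifiedClassificationHead(nn.Module):
    """Sketch of a classification head without the tanh activation and
    the intermediate linear layer of the stock RoBERTa head."""

    def __init__(self, hidden_size, num_labels, dropout_prob=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout_prob)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, features):
        x = features[:, 0, :]  # representation of the <s> (first) token
        x = self.dropout(x)
        return self.out_proj(x)
```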

## How to use

```python
import torch  # needed for the softmax call below
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("manifesto-project/xlm-roberta-large-56topics-context-2023a", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

sentence = "These principles are under threat."
context = "Human rights and international humanitarian law are fundamental pillars of a secure global system. These principles are under threat. Some of the world's most powerful states choose to sell arms to human-rights abusing states."

# Encode the statement and its context as a sentence pair
inputs = tokenizer(sentence,
                   context,
                   return_tensors="pt",
                   max_length=300,  # we limited the input to 300 tokens during fine-tuning
                   padding="max_length",
                   truncation=True)

logits = model(**inputs).logits

# Convert the logits to per-category percentages, sorted in descending order
probabilities = torch.softmax(logits, dim=1).tolist()[0]
probabilities = {model.config.id2label[index]: round(probability * 100, 2) for index, probability in enumerate(probabilities)}
probabilities = dict(sorted(probabilities.items(), key=lambda item: item[1], reverse=True))
print(probabilities)
# {'201 - Freedom and Human Rights': 90.76, '107 - Internationalism: Positive': 5.82, '105 - Military: Negative': 0.66, ...}

predicted_class = model.config.id2label[logits.argmax().item()]
print(predicted_class)
# 201 - Freedom and Human Rights
```
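Since the snippet sorts the probabilities in descending order, extracting the top few categories is a one-liner with `itertools.islice`. The values below are the example scores from the output shown above, not a fresh model run.

```python
from itertools import islice

# Example values taken from the output shown above
probabilities = {
    "201 - Freedom and Human Rights": 90.76,
    "107 - Internationalism: Positive": 5.82,
    "105 - Military: Negative": 0.66,
}

top_2 = dict(islice(probabilities.items(), 2))
print(top_2)
# {'201 - Freedom and Human Rights': 90.76, '107 - Internationalism: Positive': 5.82}
```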