---
license: bigscience-openrail-m
---
## Model description

An xlm-roberta-large model fine-tuned on ~1.6 million annotated statements from the [Manifesto Corpus](https://manifesto-project.wzb.eu/information/documents/corpus) (version 2023a).
The model can be used to categorize any type of text into 56 different political topics according to the Manifesto Project's coding scheme ([Handbook 4](https://manifesto-project.wzb.eu/coding_schemes/mp_v4)).
It works for all languages the xlm-roberta model is pretrained on ([overview](https://github.com/facebookresearch/fairseq/tree/main/examples/xlmr#introduction)), but it will perform best for the 38 languages contained in the Manifesto Corpus:

||||||
|------|------|------|------|------|
|armenian|bosnian|bulgarian|catalan|croatian|
|czech|danish|dutch|english|estonian|
|finnish|french|galician|georgian|german|
|greek|hebrew|hungarian|icelandic|italian|
|japanese|korean|latvian|lithuanian|macedonian|
|montenegrin|norwegian|polish|portuguese|romanian|
|russian|serbian|slovak|slovenian|spanish|
|swedish|turkish|ukrainian| | |


The context model variant additionally incorporates the sentences surrounding a statement to improve the classification results for ambiguous sentences (see Training Procedure for details).

**Important**

We slightly modified the classification head of the `XLMRobertaForSequenceClassification` model (removed the tanh activation and the intermediate linear layer), as that improved the model performance for this task considerably.
To correctly load the full model, include the `trust_remote_code=True` argument when using the `from_pretrained` method.

## How to use

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("manifesto-project/manifestoberta-xlm-roberta-56policy-topics-context-2023-1-1", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

sentence = "These principles are under threat."
context = "Human rights and international humanitarian law are fundamental pillars of a secure global system. These principles are under threat. Some of the world's most powerful states choose to sell arms to human-rights abusing states."
# For sentences without additional context, just use the sentence itself as the context.
# Example: context = "These principles are under threat."


inputs = tokenizer(sentence,
                   context,
                   return_tensors="pt",
                   max_length=300,  # we limited the input to 300 tokens during fine-tuning
                   padding="max_length",
                   truncation=True
                   )

logits = model(**inputs).logits

probabilities = torch.softmax(logits, dim=1).tolist()[0]
probabilities = {model.config.id2label[index]: round(probability * 100, 2) for index, probability in enumerate(probabilities)}
probabilities = dict(sorted(probabilities.items(), key=lambda item: item[1], reverse=True))
print(probabilities)
# {'201 - Freedom and Human Rights': 90.76, '107 - Internationalism: Positive': 5.82, '105 - Military: Negative': 0.66...

predicted_class = model.config.id2label[logits.argmax().item()]
print(predicted_class)
# 201 - Freedom and Human Rights
```
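
When only the most likely topics are of interest, the label-to-percentage dictionary printed above can be reduced to a short ranked list. The following is a minimal sketch, assuming labels follow the `"<code> - <name>"` pattern seen in the example output; the helper name `top_k` is hypothetical and not part of the model or of `transformers`.

```python
def top_k(probabilities, k=3):
    """Return the k most probable topics as (code, name, percent) triples.

    `probabilities` maps labels such as "201 - Freedom and Human Rights"
    to percentages, as in the dictionary printed in the usage example.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    result = []
    for label, percent in ranked[:k]:
        # Split "201 - Freedom and Human Rights" into code and name.
        code, _, name = label.partition(" - ")
        result.append((code, name, percent))
    return result


# Values copied from the example output above (truncated to three topics).
example = {
    "107 - Internationalism: Positive": 5.82,
    "201 - Freedom and Human Rights": 90.76,
    "105 - Military: Negative": 0.66,
}
top_two = top_k(example, k=2)
print(top_two)
# [('201', 'Freedom and Human Rights', 90.76), ('107', 'Internationalism: Positive', 5.82)]
```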

## Training Procedure

The model was trained on all quasi-sentences of the Manifesto Corpus (version 2023a), minus 10% that were held out for the final test and evaluation results.
This results in a training dataset of 1,601,329 quasi-sentences.
Because the model input includes context, randomly splitting quasi-sentences into train and test data would risk data leakage: a test sentence could appear in the context of a training sentence.
Instead, we randomly split the dataset at the manifesto level, so that 1,779 manifestos and all their quasi-sentences were assigned to the train set and 198 to the test set.
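
A split at the manifesto level can be sketched as follows. This is an illustration only, not the code used for training: the function name `split_by_manifesto`, the row format, and the seed are assumptions, and the toy data stands in for the corpus.

```python
import random
from collections import defaultdict


def split_by_manifesto(rows, test_fraction=0.1, seed=0):
    """Split (manifesto_id, quasi_sentence) rows so no manifesto spans both sets."""
    by_manifesto = defaultdict(list)
    for manifesto_id, sentence in rows:
        by_manifesto[manifesto_id].append(sentence)

    # Shuffle manifesto ids (not sentences), then cut off the test share.
    manifesto_ids = sorted(by_manifesto)
    random.Random(seed).shuffle(manifesto_ids)
    n_test = max(1, round(len(manifesto_ids) * test_fraction))
    test_ids, train_ids = manifesto_ids[:n_test], manifesto_ids[n_test:]

    train = [(m, s) for m in train_ids for s in by_manifesto[m]]
    test = [(m, s) for m in test_ids for s in by_manifesto[m]]
    return train, test


# Toy data: 20 hypothetical manifestos with 5 quasi-sentences each.
rows = [(f"manifesto_{i}", f"sentence {j}") for i in range(20) for j in range(5)]
train, test = split_by_manifesto(rows)

train_manifestos = {m for m, _ in train}
test_manifestos = {m for m, _ in test}
print(len(train_manifestos), len(test_manifestos))
```

Because whole manifestos are assigned to one side, the context window of a training sentence can never contain a test sentence.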

As training parameters, we used the following settings: learning rate: 1e-5, weight decay: 0.01, epochs: 1, batch size: 4, gradient accumulation steps: 4 (effective batch size: 16).
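
As a quick sanity check on these settings, the effective batch size and the approximate number of optimizer updates can be worked out directly. The sketch assumes a single device and that the last partial batch is kept; neither is stated in the text.

```python
import math

n_train = 1_601_329      # quasi-sentences in the training set (from the text)
batch_size = 4
grad_accum_steps = 4
epochs = 1

# Gradients from 4 batches of 4 examples are accumulated before each update.
effective_batch_size = batch_size * grad_accum_steps

# Forward/backward passes per epoch, and the resulting optimizer updates.
micro_batches = math.ceil(n_train / batch_size)
optimizer_steps = math.ceil(micro_batches / grad_accum_steps) * epochs

print(effective_batch_size, micro_batches, optimizer_steps)
```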

### Context 

To adapt the model to the task of classifying statements in manifestos, we made some modifications to the traditional training setup.
Given that human annotators in the Manifesto Project are encouraged to use surrounding sentences to interpret ambiguous statements, we combined statements with their context for our model's input.
Specifically, we used a sentence-pair input: the single statement to be classified is followed by the separator token and then by a larger context of 200 tokens in which the statement is embedded.
Here is an example:

*"`<s>` We must right the wrongs in our democracy, `</s>` `</s>` To turn this crisis into a crucible, from which we will forge a stronger, brighter, and more equitable future. We must right the wrongs in our democracy, redress the systemic injustices that have long plagued our society, throw open the doors of opportunity for all Americans and reinvent our institutions at home and our leadership abroad. `</s>`".*


The second part, which contains the context, is greedily filled until it contains 200 tokens.
Our tests showed that including the context improved the performance of the classification model considerably (by about 7 percentage points of accuracy).
We also tried other approaches: using two XLM-RoBERTa models as a duo, where one receives the sentence and the other the context, and a shared-layer model, where both inputs are fed separately through the same model.
Both variants performed similarly to our sentence-pair approach but led to higher complexity and computing costs, which is why we ultimately opted for the sentence-pair approach to include the surrounding context.
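
One way to picture the greedy filling of the context is the sketch below. It is an assumption-laden illustration, not the original preprocessing code: the text does not say in which order neighbouring sentences are added, so the sketch alternates between the preceding and the following sentence, and it counts whitespace-separated words as a stand-in for tokenizer tokens. The name `build_context` is hypothetical.

```python
def build_context(sentences, index, max_tokens=200,
                  count_tokens=lambda s: len(s.split())):
    """Grow a window around sentences[index] until the token budget is used up."""
    left = right = index
    used = count_tokens(sentences[index])
    grew = True
    while grew:
        grew = False
        # Try to add the preceding sentence, then the following one.
        if left > 0:
            cost = count_tokens(sentences[left - 1])
            if used + cost <= max_tokens:
                left, used, grew = left - 1, used + cost, True
        if right < len(sentences) - 1:
            cost = count_tokens(sentences[right + 1])
            if used + cost <= max_tokens:
                right, used, grew = right + 1, used + cost, True
    return " ".join(sentences[left:right + 1])


doc = [
    "To turn this crisis into a crucible, from which we will forge a stronger, brighter, and more equitable future.",
    "We must right the wrongs in our democracy,",
    "redress the systemic injustices that have long plagued our society,",
    "throw open the doors of opportunity for all Americans",
]
full_context = build_context(doc, index=1)                  # everything fits in 200
small_context = build_context(doc, index=1, max_tokens=20)  # tight budget for the demo
print(small_context)
```

The statement and the resulting context would then be passed to the tokenizer as a pair, as in the usage example in "How to use".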


## Model Performance

The model was evaluated on a test set of 199,046 annotated manifesto statements.

### Overall

| Model | Accuracy | Top2_Acc | Top3_Acc | Precision | Recall | F1_Macro | MCC | Cross-Entropy |
|-------|:--------:|:--------:|:--------:|:---------:|:------:|:--------:|:---:|:-------------:|
| [Sentence Model](https://huggingface.co/manifesto-project/manifestoberta-xlm-roberta-56policy-topics-sentence-2023-1-1) | 0.57 | 0.73 | 0.81 | 0.49 | 0.43 | 0.45 | 0.55 | 1.5 |
| [Context Model](https://huggingface.co/manifesto-project/manifestoberta-xlm-roberta-56policy-topics-context-2023-1-1) | 0.64 | 0.81 | 0.88 | 0.54 | 0.52 | 0.53 | 0.62 | 1.15 |
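
The table does not define Top2_Acc and Top3_Acc. Assuming they denote the usual top-k accuracy (a prediction counts as correct when the gold category is among the k highest-scoring categories), the metric can be sketched as follows; the scores and gold labels are made-up toy values.

```python
def top_k_accuracy(scores, gold, k):
    """Share of examples whose gold label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, gold):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(gold)


# Toy scores for three classes (one row per example).
scores = [
    [0.7, 0.2, 0.1],  # gold 0: found at rank 1
    [0.3, 0.5, 0.2],  # gold 0: found at rank 2
    [0.1, 0.6, 0.3],  # gold 0: found only at rank 3
    [0.2, 0.3, 0.5],  # gold 2: found at rank 1
]
gold = [0, 0, 0, 2]

acc_top1 = top_k_accuracy(scores, gold, k=1)
acc_top2 = top_k_accuracy(scores, gold, k=2)
acc_top3 = top_k_accuracy(scores, gold, k=3)
print(acc_top1, acc_top2, acc_top3)
```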

### Categories

|Category|Precision|Recall|F1|n_test(%)|n_predicted(%)|
|:------|:-----------:|:----:|:----:|:-----:|:-----:|
|101|0.50|0.48|0.49|0.30%|0.29%|
|102|0.56|0.61|0.58|0.09%|0.10%|
|103|0.51|0.36|0.42|0.28%|0.20%|
|104|0.78|0.81|0.79|1.57%|1.64%|
|105|0.69|0.70|0.69|0.34%|0.34%|
|106|0.59|0.57|0.58|0.33%|0.32%|
|107|0.68|0.66|0.67|2.24%|2.17%|
|108|0.66|0.68|0.67|1.20%|1.24%|
|109|0.52|0.39|0.45|0.17%|0.13%|
|110|0.63|0.68|0.65|0.36%|0.38%|
|201|0.58|0.59|0.59|2.16%|2.20%|
|202|0.62|0.63|0.62|3.25%|3.28%|
|203|0.46|0.47|0.47|0.19%|0.19%|
|204|0.61|0.37|0.46|0.25%|0.15%|
|301|0.66|0.71|0.68|2.13%|2.29%|
|302|0.38|0.25|0.30|0.17%|0.11%|
|303|0.58|0.60|0.59|5.12%|5.31%|
|304|0.67|0.65|0.66|1.38%|1.34%|
|305|0.59|0.57|0.58|2.32%|2.22%|
|401|0.45|0.36|0.40|1.50%|1.21%|
|402|0.61|0.58|0.59|2.73%|2.60%|
|403|0.56|0.51|0.53|3.59%|3.25%|
|404|0.30|0.15|0.20|0.58%|0.28%|
|405|0.43|0.51|0.47|0.18%|0.21%|
|406|0.38|0.46|0.42|0.26%|0.31%|
|407|0.56|0.52|0.54|0.40%|0.38%|
|408|0.28|0.17|0.21|1.34%|0.79%|
|409|0.37|0.21|0.27|0.24%|0.14%|
|410|0.53|0.50|0.52|2.22%|2.08%|
|411|0.73|0.75|0.74|8.32%|8.53%|
|412|0.26|0.20|0.22|0.58%|0.45%|
|413|0.49|0.63|0.55|0.29%|0.37%|
|414|0.58|0.55|0.56|1.38%|1.32%|
|415|0.14|0.23|0.18|0.05%|0.07%|
|416|0.52|0.49|0.50|2.45%|2.35%|
|501|0.69|0.78|0.73|4.77%|5.35%|
|502|0.78|0.84|0.81|3.08%|3.32%|
|503|0.61|0.63|0.62|5.96%|6.11%|
|504|0.71|0.76|0.74|10.05%|10.76%|
|505|0.46|0.37|0.41|0.69%|0.55%|
|506|0.78|0.82|0.80|5.42%|5.72%|
|507|0.45|0.26|0.33|0.14%|0.08%|
|601|0.52|0.46|0.49|1.79%|1.57%|
|602|0.35|0.34|0.34|0.24%|0.24%|
|603|0.65|0.68|0.67|1.36%|1.42%|
|604|0.62|0.48|0.54|0.57%|0.44%|
|605|0.72|0.74|0.73|4.22%|4.33%|
|606|0.56|0.48|0.51|1.45%|1.23%|
|607|0.57|0.67|0.62|1.08%|1.25%|
|608|0.48|0.48|0.48|0.41%|0.41%|
|701|0.62|0.66|0.64|3.35%|3.59%|
|702|0.42|0.30|0.35|0.08%|0.06%|
|703|0.75|0.87|0.80|2.65%|3.07%|
|704|0.43|0.32|0.37|0.57%|0.43%|
|705|0.38|0.33|0.35|0.80%|0.69%|
|706|0.43|0.37|0.39|1.35%|1.16%|

## Citation

Please cite the model as follows:

Burst, Tobias / Lehmann, Pola / Franzmann, Simon / Al-Gaddooa, Denise / Ivanusch, Christoph / Regel, Sven / Riethmüller, Felicia / Weßels, Bernhard / Zehnter, Lisa (2023): manifestoberta. Version 56topics.context.2023.1.1. Berlin: Wissenschaftszentrum Berlin für Sozialforschung (WZB) / Göttingen: Institut für Demokratieforschung (IfDem). https://doi.org/10.25522/manifesto.manifestoberta.56topics.context.2023.1.1  

```bib
@misc{Burst:2023,
  Address = {Berlin / Göttingen},
  Author = {Burst, Tobias AND Lehmann, Pola AND Franzmann, Simon AND Al-Gaddooa, Denise AND Ivanusch, Christoph AND Regel, Sven AND Riethmüller, Felicia AND Weßels, Bernhard AND Zehnter, Lisa},
  Publisher = {Wissenschaftszentrum Berlin für Sozialforschung / Göttinger Institut für Demokratieforschung},
  Title = {manifestoberta. Version 56topics.context.2023.1.1},
  doi = {10.25522/manifesto.manifestoberta.56topics.context.2023.1.1},
  url = {https://doi.org/10.25522/manifesto.manifestoberta.56topics.context.2023.1.1},  
  Year = {2023},
}
```