Update README.md
README.md
CHANGED
@@ -1,3 +1,83 @@
|
|
1 |
-
---
|
2 |
-
license: bigscience-openrail-m
|
3 |
-
---
license: bigscience-openrail-m
widget:
- text: >-
    We will restore funding to the Global Environment Facility and the
    Intergovernmental Panel on Climate Change.
---

## Model description
An xlm-roberta-large model fine-tuned on ~1.7 million annotated statements contained in the [Manifesto Corpus](https://manifesto-project.wzb.eu/information/documents/corpus) (version 2023a).
The model can be used to categorize any type of text into 56 different political topics according to the Manifesto Project's coding scheme ([Handbook 4](https://manifesto-project.wzb.eu/coding_schemes/mp_v4)).
It works for all languages the xlm-roberta model is pretrained on ([overview](https://github.com/facebookresearch/fairseq/tree/main/examples/xlmr#introduction)), but it will perform best for the 38 languages contained in the Manifesto Corpus:

||||||
|------|------|------|------|------|
|armenian|bosnian|bulgarian|catalan|croatian|
|czech|danish|dutch|english|estonian|
|finnish|french|galician|georgian|german|
|greek|hebrew|hungarian|icelandic|italian|
|japanese|korean|latvian|lithuanian|macedonian|
|montenegrin|norwegian|polish|portuguese|romanian|
|russian|serbian|slovak|slovenian|spanish|
|swedish|turkish|ukrainian| | |

## How to use

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("manifesto-project/manifestoberta-xlm-roberta-56policy-topics-sentence-2024-1-1")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

sentence = "We will restore funding to the Global Environment Facility and the Intergovernmental Panel on Climate Change, to support critical climate science research around the world"

inputs = tokenizer(sentence,
                   return_tensors="pt",
                   max_length=200,  # we limited the input to 200 tokens during fine-tuning
                   padding="max_length",
                   truncation=True
                   )

logits = model(**inputs).logits

probabilities = torch.softmax(logits, dim=1).tolist()[0]
probabilities = {model.config.id2label[index]: round(probability * 100, 2) for index, probability in enumerate(probabilities)}
probabilities = dict(sorted(probabilities.items(), key=lambda item: item[1], reverse=True))
print(probabilities)
# {'501 - Environmental Protection: Positive': 67.28, '411 - Technology and Infrastructure': 15.19, '107 - Internationalism: Positive': 13.63, '416 - Anti-Growth Economy: Positive': 2.02...

predicted_class = model.config.id2label[logits.argmax().item()]
print(predicted_class)
# 501 - Environmental Protection: Positive
```
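
When only the few most likely categories are of interest, the ranking step can be separated from the model call. The following is a minimal sketch in plain Python, assuming the logits of one sentence have already been converted to a flat list (for example with `model(**inputs).logits[0].tolist()`). `top_k_labels` and the toy values are hypothetical and are not part of the model or of `transformers`.

```python
import math


def top_k_labels(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs for one sentence.

    `logits` is a flat list of floats for a single input;
    `id2label` maps class indices to label names (e.g. model.config.id2label).
    """
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probabilities = [value / total for value in exps]
    # Rank class indices by descending probability and keep the first k.
    ranked = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
    return [(id2label[i], probabilities[i]) for i in ranked[:k]]


# Toy values for illustration only; these are not real model outputs.
toy_id2label = {
    0: "107 - Internationalism: Positive",
    1: "411 - Technology and Infrastructure",
    2: "501 - Environmental Protection: Positive",
}
toy_logits = [1.0, 0.5, 3.0]
print(top_k_labels(toy_logits, toy_id2label, k=2))
```

With the real model, the same helper could be called as `top_k_labels(logits[0].tolist(), model.config.id2label)`.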

## Model Performance

The model was evaluated on a test set of 200,920 annotated manifesto statements.

### Overall

| | Accuracy | Top2_Acc | Top3_Acc | Precision| Recall | F1_Macro | MCC | Cross-Entropy |
|-------------------------------------------------------------------------------------------------------|:--------:|:--------:|:--------:|:--------:|:------:|:--------:|:---:|:-------------:|
| [Sentence Model](https://huggingface.co/manifesto-project/manifestoberta-xlm-roberta-56policy-topics-sentence-2024-1-1)| 0.57 | 0.73 | 0.81 | 0.48 | 0.43 | 0.45 | 0.55| 1.47 |
| [Context Model](https://huggingface.co/manifesto-project/manifestoberta-xlm-roberta-56policy-topics-context-2024-1-1) | 0.64 | 0.81 | 0.88 | 0.55 | 0.52 | 0.53 | 0.63| 1.15 |

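
The table does not define `Top2_Acc` and `Top3_Acc`. A common reading, assumed here, is the share of test statements whose annotated category appears among the model's two or three most probable predictions. The sketch below illustrates that definition in plain Python; `top_k_accuracy` and the toy data are hypothetical and are not the evaluation code behind the reported numbers.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k):
    """Share of examples whose gold label is among the first k ranked predictions."""
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold_labels:
        raise ValueError("at least one example is required")
    # An example counts as a hit if its gold label is in the top-k slice.
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_predictions, gold_labels))
    return hits / len(gold_labels)


# Toy data: category codes ranked from most to least probable for three statements.
toy_predictions = [
    ["501", "411", "107"],
    ["411", "501", "107"],
    ["107", "416", "501"],
]
toy_gold = ["501", "501", "501"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(toy_predictions, toy_gold, k))
```

Under this definition, top-1 accuracy coincides with ordinary accuracy, and the value cannot decrease as `k` grows.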

### Citation

Please cite the model as follows:

Burst, Tobias / Lehmann, Pola / Franzmann, Simon / Al-Gaddooa, Denise / Ivanusch, Christoph / Regel, Sven / Riethmüller, Felicia / Weßels, Bernhard / Zehnter, Lisa (2024): manifestoberta. Version 56topics.sentence.2024.1.1. Berlin: Wissenschaftszentrum Berlin für Sozialforschung (WZB) / Göttingen: Institut für Demokratieforschung (IfDem). https://doi.org/10.25522/manifesto.manifestoberta.56topics.sentence.2024.1.1

```bib
@misc{Burst:2024,
  Address = {Berlin / Göttingen},
  Author = {Burst, Tobias AND Lehmann, Pola AND Franzmann, Simon AND Al-Gaddooa, Denise AND Ivanusch, Christoph AND Regel, Sven AND Riethmüller, Felicia AND Weßels, Bernhard AND Zehnter, Lisa},
  Publisher = {Wissenschaftszentrum Berlin für Sozialforschung / Göttinger Institut für Demokratieforschung},
  Title = {manifestoberta. Version 56topics.sentence.2024.1.1},
  doi = {10.25522/manifesto.manifestoberta.56topics.sentence.2024.1.1},
  url = {https://doi.org/10.25522/manifesto.manifestoberta.56topics.sentence.2024.1.1},
  Year = {2024}
}
```