denisealg committed on
Commit
8d54fcf
1 Parent(s): 10e12d3

Update README.md

  1. README.md +83 -3
README.md CHANGED
@@ -1,3 +1,83 @@
- ---
- license: bigscience-openrail-m
- ---
+ ---
+ license: bigscience-openrail-m
+ widget:
+ - text: >-
+     We will restore funding to the Global Environment Facility and the
+     Intergovernmental Panel on Climate Change.
+ ---
+
+ ## Model description
+ An xlm-roberta-large model fine-tuned on ~1.7 million annotated statements from the [Manifesto Corpus](https://manifesto-project.wzb.eu/information/documents/corpus) (version 2023a).
+ The model can be used to categorize any type of text into 56 different political topics according to the Manifesto Project's coding scheme ([Handbook 4](https://manifesto-project.wzb.eu/coding_schemes/mp_v4)).
+ It works for all languages the xlm-roberta model is pretrained on ([overview](https://github.com/facebookresearch/fairseq/tree/main/examples/xlmr#introduction)), but it will perform best for the 38 languages contained in the Manifesto Corpus:
+
+ ||||||
+ |------|------|------|------|------|
+ |Armenian|Bosnian|Bulgarian|Catalan|Croatian|
+ |Czech|Danish|Dutch|English|Estonian|
+ |Finnish|French|Galician|Georgian|German|
+ |Greek|Hebrew|Hungarian|Icelandic|Italian|
+ |Japanese|Korean|Latvian|Lithuanian|Macedonian|
+ |Montenegrin|Norwegian|Polish|Portuguese|Romanian|
+ |Russian|Serbian|Slovak|Slovenian|Spanish|
+ |Swedish|Turkish|Ukrainian| | |
+
+ ## How to use
+
+ ```python
+ import torch
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ model = AutoModelForSequenceClassification.from_pretrained("manifesto-project/manifestoberta-xlm-roberta-56policy-topics-sentence-2024-1-1")
+ tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
+
+ sentence = "We will restore funding to the Global Environment Facility and the Intergovernmental Panel on Climate Change, to support critical climate science research around the world"
+
+ inputs = tokenizer(
+     sentence,
+     return_tensors="pt",
+     max_length=200,  # the input was limited to 200 tokens during fine-tuning
+     padding="max_length",
+     truncation=True,
+ )
+
+ # Inference only, so gradient tracking is not needed
+ with torch.no_grad():
+     logits = model(**inputs).logits
+
+ # Convert the logits to percentages, keyed by category label and sorted in descending order
+ probabilities = torch.softmax(logits, dim=1).tolist()[0]
+ probabilities = {model.config.id2label[index]: round(probability * 100, 2) for index, probability in enumerate(probabilities)}
+ probabilities = dict(sorted(probabilities.items(), key=lambda item: item[1], reverse=True))
+ print(probabilities)
+ # {'501 - Environmental Protection: Positive': 67.28, '411 - Technology and Infrastructure': 15.19, '107 - Internationalism: Positive': 13.63, '416 - Anti-Growth Economy: Positive': 2.02...
+
+ # The single most likely category
+ predicted_class = model.config.id2label[logits.argmax().item()]
+ print(predicted_class)
+ # 501 - Environmental Protection: Positive
+ ```
+
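When only the few most likely categories are of interest, the ranking step in the snippet above can be reduced to a top-k selection. The following is a minimal sketch in plain Python and not part of the model's API: `top_k_categories` is a hypothetical helper, and the logits and the three-label `example_id2label` mapping are made-up illustrative values (the real model distinguishes 56 categories).

```python
import math


def top_k_categories(logits, id2label, k=3):
    """Return the k most probable (label, percentage) pairs for one input.

    `logits` is a plain list of floats, e.g. `model(**inputs).logits[0].tolist()`.
    """
    # Numerically stable softmax: shift by the maximum before exponentiating
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    percentages = [round(100 * e / total, 2) for e in exps]

    # Pair each percentage with its label and keep the k largest
    ranked = sorted(
        ((id2label[index], pct) for index, pct in enumerate(percentages)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Illustrative values only; this is not real model output
example_id2label = {
    0: "107 - Internationalism: Positive",
    1: "411 - Technology and Infrastructure",
    2: "501 - Environmental Protection: Positive",
}
example_logits = [0.5, 0.6, 2.1]
print(top_k_categories(example_logits, example_id2label, k=2))
```

With the real model, `model.config.id2label` would take the place of `example_id2label`.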
+
+ ## Model Performance
+
+ The model was evaluated on a test set of 200,920 annotated manifesto statements.
+
+ ### Overall
+
+ | | Accuracy | Top2_Acc | Top3_Acc | Precision | Recall | F1_Macro | MCC | Cross-Entropy |
+ |---|:--------:|:--------:|:--------:|:---------:|:------:|:--------:|:---:|:-------------:|
+ | [Sentence Model](https://huggingface.co/manifesto-project/manifestoberta-xlm-roberta-56policy-topics-sentence-2024-1-1) | 0.57 | 0.73 | 0.81 | 0.48 | 0.43 | 0.45 | 0.55 | 1.47 |
+ | [Context Model](https://huggingface.co/manifesto-project/manifestoberta-xlm-roberta-56policy-topics-context-2024-1-1) | 0.64 | 0.81 | 0.88 | 0.55 | 0.52 | 0.53 | 0.63 | 1.15 |
+
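The Top2_Acc and Top3_Acc columns are not defined in this README. Assuming they follow the usual top-k convention (a prediction counts as correct when the annotated category is among the k highest-scoring ones), they could be computed roughly as sketched below. `top_k_accuracy` is a hypothetical helper and the scores are toy values, not model output.

```python
def top_k_accuracy(score_rows, true_indices, k):
    """Share of examples whose true class index is among the k highest scores."""
    if not score_rows or len(score_rows) != len(true_indices):
        raise ValueError("score_rows and true_indices must be non-empty and of equal length")
    hits = 0
    for scores, true_index in zip(score_rows, true_indices):
        # Indices of the k largest scores for this example
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        if true_index in top_k:
            hits += 1
    return hits / len(true_indices)


# Toy scores for three statements and three categories (illustrative only)
toy_scores = [
    [0.7, 0.2, 0.1],  # true class 0 is ranked first
    [0.3, 0.5, 0.2],  # true class 0 is ranked second
    [0.1, 0.3, 0.6],  # true class 0 is ranked last
]
toy_truth = [0, 0, 0]

for k in (1, 2, 3):
    print(k, top_k_accuracy(toy_scores, toy_truth, k))
```

Under this reading, plain accuracy is the k = 1 case, which is consistent with the values in the table rising from Accuracy to Top2_Acc to Top3_Acc.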
+ ### Citation
+
+ Please cite the model as follows:
+
+ Burst, Tobias / Lehmann, Pola / Franzmann, Simon / Al-Gaddooa, Denise / Ivanusch, Christoph / Regel, Sven / Riethmüller, Felicia / Weßels, Bernhard / Zehnter, Lisa (2024): manifestoberta. Version 56topics.sentence.2024.1.1. Berlin: Wissenschaftszentrum Berlin für Sozialforschung (WZB) / Göttingen: Institut für Demokratieforschung (IfDem). https://doi.org/10.25522/manifesto.manifestoberta.56topics.sentence.2024.1.1
+
+ ```bib
+ @misc{Burst:2024,
+   Address = {Berlin / Göttingen},
+   Author = {Burst, Tobias AND Lehmann, Pola AND Franzmann, Simon AND Al-Gaddooa, Denise AND Ivanusch, Christoph AND Regel, Sven AND Riethmüller, Felicia AND Weßels, Bernhard AND Zehnter, Lisa},
+   Publisher = {Wissenschaftszentrum Berlin für Sozialforschung / Göttinger Institut für Demokratieforschung},
+   Title = {manifestoberta. Version 56topics.sentence.2024.1.1},
+   doi = {10.25522/manifesto.manifestoberta.56topics.sentence.2024.1.1},
+   url = {https://doi.org/10.25522/manifesto.manifestoberta.56topics.sentence.2024.1.1},
+   Year = {2024},
+ }
+ ```