tburst committed
Commit bbce29e
1 Parent(s): bd3bb19

Update README.md

Files changed (1)
  1. README.md +40 -3
README.md CHANGED
@@ -3,11 +3,48 @@ license: mit
---
## Model description

- An xlm-roberta-large model fine-tuned on the classification of all ~1,8 million annotated statements contained in the [manifesto corpus](https://manifesto-project.wzb.eu/information/documents/corpus) (version 2023a), each labelled with one of 56 different political categories. The context model variant additionally utilizes the surrounding sentences of a statement to improve its classification.

**Important**

- We slightly modified the Classification Head of the XLMRobertaModelForSequenceClassification model (removed the tahn activation and the intermediate linear layer) as that improved the model performance for this task considerably. To correctly load the full model you have to provide the ```trust_remote_code=True``` argument in the from_pretrained loading call.

-
---
## Model description

+ An xlm-roberta-large model fine-tuned on all ~1.8 million annotated statements contained in the [manifesto corpus](https://manifesto-project.wzb.eu/information/documents/corpus) (version 2023a).
+ The model can be used to categorize any type of text into [56 different political categories](https://manifesto-project.wzb.eu/coding_schemes/mp_v4) according to the Manifesto Project's coding scheme (Handbook 4).
+
+ The context model variant additionally utilizes the surrounding sentences of a statement to improve the classification results for ambiguous sentences.
+ During fine-tuning, we collected the surrounding sentences of a statement and combined them with the statement itself, so that the larger context of the sentence is provided as the second part of a sentence-pair input.
+ We limited the statement itself to 100 tokens and the context of the statement to 200 tokens.

**Important**

+ We slightly modified the classification head of the XLMRobertaForSequenceClassification model (removed the tanh activation and the intermediate linear layer), as that improved the model performance for this task considerably. To correctly load the full model, you have to provide the ```trust_remote_code=True``` argument in the from_pretrained loading call.
+
+
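To make the note above more concrete, the sketch below shows roughly what such a simplified head could look like. This is an illustration only (the class name SimplifiedClassificationHead is made up here, and the standard Roberta-style head is assumed as the starting point); the actual implementation ships with the model's remote code.

```python
import torch
from torch import nn

class SimplifiedClassificationHead(nn.Module):
    """Rough sketch: the usual intermediate dense layer and tanh activation are dropped,
    leaving only dropout and the output projection over the 56 categories."""

    def __init__(self, hidden_size: int, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        x = features[:, 0, :]      # representation of the <s> (CLS) token
        x = self.dropout(x)
        return self.out_proj(x)    # logits

# example dimensions for xlm-roberta-large with 56 categories
head = SimplifiedClassificationHead(hidden_size=1024, num_labels=56)
```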
+ ## How to use
+
+ ```python
+ import torch
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ model = AutoModelForSequenceClassification.from_pretrained("manifesto-project/xlm-roberta-large-56topics-context-2023a", trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
+
+ sentence = "These principles are under threat."
+ context = "Human rights and international humanitarian law are fundamental pillars of a secure global system. These principles are under threat. Some of the world's most powerful states choose to sell arms to human-rights abusing states."
+
+ # the statement and its surrounding context are passed as a sentence pair
+ inputs = tokenizer(sentence,
+                    context,
+                    return_tensors="pt",
+                    max_length=300,  # we limited the input to 300 tokens during fine-tuning
+                    padding="max_length",
+                    truncation=True
+                    )
+
+ logits = model(**inputs).logits
+
+ probabilities = torch.softmax(logits, dim=1).tolist()[0]
+ probabilities = {model.config.id2label[index]: round(probability * 100, 2) for index, probability in enumerate(probabilities)}
+ probabilities = dict(sorted(probabilities.items(), key=lambda item: item[1], reverse=True))
+ print(probabilities)
+ # {'201 - Freedom and Human Rights': 90.76, '107 - Internationalism: Positive': 5.82, '105 - Military: Negative': 0.66...
+
+ predicted_class = model.config.id2label[logits.argmax().item()]
+ print(predicted_class)
+ # 201 - Freedom and Human Rights
+ ```
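As a side note on the token budgets mentioned above (100 tokens for the statement, 200 tokens for its context during fine-tuning): if you want to mirror them at inference time, one possible approach is to truncate each part separately before building the sentence pair. The snippet below is only a sketch, reusing sentence, context, and tokenizer from the example above; truncate_to_tokens is a hypothetical helper, not part of the model card.

```python
def truncate_to_tokens(text, tokenizer, max_tokens):
    # encode without special tokens, cut to the token budget, and decode back to text
    ids = tokenizer.encode(text, add_special_tokens=False, truncation=True, max_length=max_tokens)
    return tokenizer.decode(ids)

sentence_short = truncate_to_tokens(sentence, tokenizer, 100)  # statement budget used in fine-tuning
context_short = truncate_to_tokens(context, tokenizer, 200)    # context budget used in fine-tuning

inputs = tokenizer(sentence_short, context_short, return_tensors="pt",
                   max_length=300, padding="max_length", truncation=True)
```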