Update README.md
README.md
CHANGED
@@ -6,9 +6,7 @@ license: mit
 An xlm-roberta-large model fine-tuned on ~1,6 million annotated statements contained in the [manifesto corpus](https://manifesto-project.wzb.eu/information/documents/corpus) (version 2023a).
 The model can be used to categorize any type of text into 56 different political topics according to the Manifesto Project's coding scheme ([Handbook 4](https://manifesto-project.wzb.eu/coding_schemes/mp_v4)).
 
-The context model variant additionally incorporates the surrounding sentences of a statement to improve the classification results for ambiguous sentences.
-During fine-tuning we collected the surrounding sentences of a statement and merged them with the statement itself to provide the larger context of a sentence as the second part of a sentence pair input.
-We limited the statement itself to 100 tokens and the context of the statement to 200 tokens.
+The context model variant additionally incorporates the surrounding sentences of a statement to improve the classification results for ambiguous sentences. (See Training Procedure for details)
 
 **Important**
 
@@ -50,7 +48,7 @@ print(predicted_class)
 # 201 - Freedom and Human Rights
 ```
 
-## Training
+## Training Procedure
 
 Training of the model took place on all quasi-sentences of the Manifesto Corpus (version 2023a), minus 10% that were kept out of training for the final test and evaluation results.
 This results in a training dataset of 1,601,329 quasi-sentences.
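
The lines removed from the README in this change describe the input format of the context variant: the statement is the first part of a sentence pair, and the surrounding sentences merged with the statement form the second part, limited to 100 and 200 tokens respectively. A minimal sketch of that construction might look like the following. It is not the authors' preprocessing code. The names `build_pair`, `truncate`, and `window` are hypothetical, the number of surrounding sentences is not stated in the README, and whitespace-separated words stand in for the subword tokens of the real tokenizer.

```python
from typing import List, Tuple


def truncate(text: str, max_tokens: int) -> str:
    """Keep at most max_tokens whitespace-separated words.

    Assumption: words are a stand-in for the model's subword tokens.
    """
    return " ".join(text.split()[:max_tokens])


def build_pair(
    sentences: List[str],
    index: int,
    window: int = 2,                  # hypothetical: not given in the README
    max_statement_tokens: int = 100,  # limit stated in the README
    max_context_tokens: int = 200,    # limit stated in the README
) -> Tuple[str, str]:
    """Return (statement, context) for the sentence at `index`.

    The context merges the surrounding sentences with the statement itself.
    """
    statement = sentences[index]
    start = max(0, index - window)
    end = min(len(sentences), index + window + 1)
    context = " ".join(sentences[start:end])
    return (
        truncate(statement, max_statement_tokens),
        truncate(context, max_context_tokens),
    )


sentences = [
    "We will cut taxes.",
    "We protect civil liberties.",
    "Schools need funding.",
]
statement, context = build_pair(sentences, 1, window=1)
```

With a Hugging Face tokenizer, the two strings would typically be passed as the text and text pair arguments, for example `tokenizer(statement, context)`, so that the model sees them as one sentence-pair input. Note that this sketch truncates the context from the end, which could cut off the statement when the preceding sentences are long; how the original training code handled that case is not described here.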