manifesto-project/manifestoberta-xlm-roberta-56policy-topics-context-2023-1-1

Text Classification

tburst committed on Sep 26, 2023

Commit 77c985c • 1 Parent(s): 4bfa20b

Update README.md

Files changed (1)

README.md +2 -4

@@ -6,9 +6,7 @@ license: mit
 An xlm-roberta-large model fine-tuned on ~1,6 million annotated statements contained in the [manifesto corpus](https://manifesto-project.wzb.eu/information/documents/corpus) (version 2023a).
 The model can be used to categorize any type of text into 56 different political topics according to the Manifesto Project's coding scheme ([Handbook 4](https://manifesto-project.wzb.eu/coding_schemes/mp_v4)).
-The context model variant additionally incorporates the surrounding sentences of a statement to improve the classification results for ambiguous sentences.
-During fine-tuning we collected the surrounding sentences of a statement and merged them with the statement itself to provide the larger context of a sentence as the second part of a sentence pair input.
-We limited the statement itself to 100 tokens and the context of the statement to 200 tokens.
+The context model variant additionally incorporates the surrounding sentences of a statement to improve the classification results for ambiguous sentences. (See Training Procedure for details)
 **Important**
@@ -50,7 +48,7 @@ print(predicted_class)
 # 201 - Freedom and Human Rights
 ```
-## Training procedure
+## Training Procedure
 Training of the model took place on all quasi-sentences of the Manifesto Corpus (version 2023a), minus 10% that were kept out of training for the final test and evaluation results.
 This results in a training dataset of 1,601,329 quasi-sentences.
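
The README text touched by this commit describes the context variant's input as a sentence pair: the statement (limited to 100 tokens) and the statement merged with its surrounding sentences (limited to 200 tokens). The following is a minimal sketch of that pairing, not the authors' preprocessing code. It assumes whitespace splitting as a stand-in for the model's subword tokenizer, and `build_pair` and its `window` parameter are hypothetical names introduced only for illustration.

```python
from typing import List, Tuple

MAX_STATEMENT_TOKENS = 100  # statement limit stated in the README text
MAX_CONTEXT_TOKENS = 200    # context limit stated in the README text


def build_pair(sentences: List[str], index: int, window: int = 1) -> Tuple[str, str]:
    """Return (statement, context) for the sentence at `index`.

    Assumptions: whitespace tokens approximate the real subword tokens, and
    `window` (number of neighbours taken on each side) is a hypothetical knob.
    """
    # First segment: the statement itself, truncated to the statement limit.
    statement_tokens = sentences[index].split()[:MAX_STATEMENT_TOKENS]

    # Second segment: neighbours merged with the statement, truncated to the context limit.
    start = max(0, index - window)
    neighbourhood = sentences[start:index + window + 1]
    context_tokens = " ".join(neighbourhood).split()[:MAX_CONTEXT_TOKENS]

    return " ".join(statement_tokens), " ".join(context_tokens)


sentences = [
    "We will cut taxes.",
    "Freedom of speech must be protected.",
    "Courts must stay independent.",
]
statement, context = build_pair(sentences, 1)
print(statement)
print(context)
```

With a real tokenizer, the two returned strings would presumably be passed as the first and second segment of a sentence-pair input, with truncation handled in subword tokens rather than whitespace tokens.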