manifesto-project
/

manifestoberta-xlm-roberta-56policy-topics-context-2023-1-1

Text Classification

tburst committed on Sep 26, 2023

Commit

ca3c910

1 Parent(s): 056f873

Update README.md

README.md +21 -0

@@ -50,6 +50,27 @@ print(predicted_class)
 # 201 - Freedom and Human Rights
 ```
 ## Model Performance
 The model was evaluated on a test set of 186,276 annotated manifesto statements (10% of the whole corpus).

 # 201 - Freedom and Human Rights
 ```
+## Training procedure
+The model was trained on all quasi-sentences of the Manifesto Corpus (version 2023a), except for 10% that were held out of training for the final test and evaluation results.
+This results in a training dataset of 1,601,329 quasi-sentences.
+Because our context-including model input risks data leakage between train and test data, we did not split quasi-sentences randomly into train and test data.
+Instead, we randomly split the dataset at the manifesto level, so that 1,779 manifestos and all their quasi-sentences were assigned to the train set and 198 to the test set.
+As training parameters, we used the following settings: learning rate: 1e-5, weight decay: 0.01, epochs: 1, batch size: 4, gradient accumulation steps: 4 (effective batch size: 16).
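The manifesto-level split described in this section can be sketched as follows. This is a minimal illustration, not the authors' actual code: the function name `split_by_manifesto`, the `(manifesto_id, text)` row layout, and the seed are assumptions made for the example.

```python
import random
from typing import List, Tuple

Row = Tuple[str, str]  # (manifesto_id, quasi_sentence), a hypothetical layout


def split_by_manifesto(
    rows: List[Row], test_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[Row], List[Row]]:
    """Split rows so that every manifesto lands entirely in train or in test."""
    # Shuffle the unique manifesto ids, not the individual quasi-sentences.
    manifesto_ids = sorted({manifesto_id for manifesto_id, _ in rows})
    random.Random(seed).shuffle(manifesto_ids)

    # Hold out a fraction of the manifestos (at least one) for testing.
    n_test = max(1, round(len(manifesto_ids) * test_fraction))
    test_ids = set(manifesto_ids[:n_test])

    train = [row for row in rows if row[0] not in test_ids]
    test = [row for row in rows if row[0] in test_ids]
    return train, test
```

Because whole manifestos are assigned to one side, a test statement's surrounding context can never contain sentences that were seen during training.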
+### Context
+To adapt the model to the task of classifying statements in manifestos, we made some modifications to the traditional training setup.
+Given that human annotators in the Manifesto Project are encouraged to use surrounding sentences to interpret ambiguous statements, we combined statements with their context for our model's input.
+Specifically, we used a sentence-pair input: the single statement to be classified comes first, followed by the separator token, followed by the larger context of 200 tokens in which that statement is embedded.
+Here is an example: "`<s>` We must right the wrongs in our democracy, `</s>` To turn this crisis into a crucible, from which we will forge a stronger, brighter, and more equitable future. We must right the wrongs in our democracy, redress the systemic injustices that have long plagued our society, throw open the doors of opportunity for all Americans and reinvent our institutions at home and our leadership abroad. `</s>`".
+The second part, which contains the context, is greedily filled until it contains 200 tokens.
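One way to picture the greedy filling of the context is the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_pair_input` is hypothetical, tokens are approximated by whitespace-separated words (the model itself counts XLM-RoBERTa subword tokens), and the order in which neighbouring sentences are added is a guess, since the source does not specify it.

```python
from typing import Callable, List, Tuple


def build_pair_input(
    sentences: List[str],
    target_idx: int,
    max_context_tokens: int = 200,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> Tuple[str, str]:
    """Return (statement, context) for a sentence-pair input.

    The context starts as the statement itself and grows outward, adding
    whole neighbouring sentences for as long as the token budget allows.
    """
    lo = hi = target_idx
    used = count_tokens(sentences[target_idx])
    grew = True
    while grew:
        grew = False
        # Try the preceding sentence first ...
        if lo > 0:
            cost = count_tokens(sentences[lo - 1])
            if used + cost <= max_context_tokens:
                lo, used, grew = lo - 1, used + cost, True
        # ... then the following one.
        if hi < len(sentences) - 1:
            cost = count_tokens(sentences[hi + 1])
            if used + cost <= max_context_tokens:
                hi, used, grew = hi + 1, used + cost, True
    return sentences[target_idx], " ".join(sentences[lo : hi + 1])


doc = [
    "To turn this crisis into a crucible.",
    "We must right the wrongs in our democracy.",
    "We will reinvent our institutions at home.",
]
# A tiny budget of 16 "tokens" keeps the example easy to follow.
statement, context = build_pair_input(doc, target_idx=1, max_context_tokens=16)
```

The resulting pair would then be passed to the tokenizer as two segments, which inserts the special tokens around and between the statement and its context.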
+Our tests showed that including the context improved the performance of the classification model considerably (~8% accuracy).
+We also tried other approaches: a pair of XLM-RoBERTa models, where one receives the sentence and the other the context, and a shared-layer model, where both inputs are fed separately through the same model.
+Both variants performed similarly to our sentence-pair approach but led to higher complexity and computing costs, which is why we ultimately opted for the sentence-pair approach to include the surrounding context.
 ## Model Performance
 The model was evaluated on a test set of 186,276 annotated manifesto statements (10% of the whole corpus).