MoritzLaurer's picture
MoritzLaurer HF staff
Update README.md
0a573f2
---
language:
- en
tags:
- text-classification
metrics:
- accuracy (balanced)
- F1 (weighted)
widget:
- text: "Covid-19 outbreaks raise concerns over more infectious strain"
---
# Policy-DistilBERT-7d
## Model description
This model was trained on 129.669 manually annotated sentences to classify text into one of seven political categories: 'Economy', 'External Relations', 'Fabric of Society', 'Freedom and Democracy', 'Political System', 'Welfare and Quality of Life' or 'Social Groups'.
## Intended uses & limitations
#### How to use the model
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "MoritzLaurer/policy-distilbert-7d"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
text = "The new variant first detected in southern England in September is blamed for sharp rises in levels of positive tests in recent weeks in London, south-east England and the east of England"
input = tokenizer(text, truncation=True, return_tensors="pt")
output = model(input["input_ids"])
# the output corresponds to the following labels:
# 0: external relations, 1: freedom and democracy, 2: political system, 3: economy, 4: welfare and quality of life, 5: fabric of society, 6: social groups
# output to dictionary
prediction = torch.softmax(output["logits"][0], -1).tolist()
label_names = ["external relations", "freedom and democracy", "political system", "economy", "welfare and quality of life", "fabric of society", "social groups"]
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
print(prediction)
#{'external relations': 0.0, 'freedom and democracy': 0.0, 'political system': 0.9, 'economy': 0.4,
# 'welfare and quality of life': 98.3, 'fabric of society': 0.3, 'social groups': 0.0}
```
### Training data
Policy-DistilBERT-7d was trained on the English-speaking subset of the [Manifesto Project Dataset (MPDS2020a)](https://manifesto-project.wzb.eu/datasets). The model was trained on 129.669 sentences from 164 political manifestos from 55 political parties in 8 English-speaking countries (Australia, Canada, Ireland, Israel, New Zealand, South Africa, United Kingdom, United States). The manifestos were published between 1992 - 2019.
The Manifesto Project mannually annotates individual sentences from political party manifestos in 7 main political domains: 'Economy', 'External Relations', 'Fabric of Society', 'Freedom and Democracy', 'Political System', 'Welfare and Quality of Life' or 'Social Groups' - see the [codebook](https://manifesto-project.wzb.eu/down/data/2020b/codebooks/codebook_MPDataset_MPDS2020b.pdf) for the exact definitions of each domain.
### Training procedure
`distilbert-base-uncased` was trained using the Hugging Face trainer with the following hyperparameters. The hyperparameters were determined using a hyperparameter search on a 15% validation set.
```
training_args = TrainingArguments(
num_train_epochs=5, # total number of training epochs
learning_rate=4e-05,
per_device_train_batch_size=4, # batch size per device during training
per_device_eval_batch_size=4, # batch size for evaluation
warmup_steps=500, # number of warmup steps for learning rate scheduler
weight_decay=0.02, # strength of weight decay
fp16=True # mixed precision training
)
```
### Eval results
The model was evaluated using 15% of the sentences (85-15 train-test split).
accuracy (balanced) | F1 (weighted) | precision | recall | accuracy (not balanced)
-------|---------|----------|---------|----------
0.745 | 0.773 | 0.772 | 0.771 | 0.771
Please note that the label distribution in the dataset is imbalanced:
```
Welfare and Quality of Life 0.327225
Economy 0.259191
Fabric of Society 0.111800
Political System 0.095081
Social Groups 0.094371
External Relations 0.063724
Freedom and Democracy 0.048608
```
[Balanced accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html) and [weighted F1](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html) were therefore used to evaluate model performance.
## Limitations and bias
The model was trained on sentences in political manifestos from parties in the 8 countries mentioned above between 1992-2019, manually annotated by the [Manifesto Project](https://manifesto-project.wzb.eu/information/documents/information). The model output therefore reproduces the limitations of the dataset in terms of country coverage, time span, domain definitions and potential biases of the annotators - as any supervised machine learning model would. Applying the model to other types of data (other types of texts, countries etc.) probably reduces performance.
### BibTeX entry and citation info
```bibtex
@unpublished{
title={Policy-DistilBERT},
author={Moritz Laurer},
year={2020},
note = {Unpublished paper}
}
```