---
language:
- en
tags:
- text-classification
metrics:
- accuracy (balanced)
- F1 (weighted)
widget:
- text: "Covid-19 outbreaks raise concerns over more infectious strain"
---

# Policy-DistilBERT-7d

## Model description

This model was trained on 129,669 manually annotated sentences to classify text into one of seven political categories: 'Economy', 'External Relations', 'Fabric of Society', 'Freedom and Democracy', 'Political System', 'Welfare and Quality of Life' or 'Social Groups'.

## Intended uses & limitations

#### How to use the model

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "MoritzLaurer/policy-distilbert-7d"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "The new variant first detected in southern England in September is blamed for sharp rises in levels of positive tests in recent weeks in London, south-east England and the east of England"

inputs = tokenizer(text, truncation=True, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    output = model(**inputs)

# the output logits correspond to the following labels:
# 0: external relations, 1: freedom and democracy, 2: political system, 3: economy,
# 4: welfare and quality of life, 5: fabric of society, 6: social groups

# convert the logits to a {label: probability in %} dictionary
prediction = torch.softmax(output["logits"][0], -1).tolist()
label_names = ["external relations", "freedom and democracy", "political system", "economy",
               "welfare and quality of life", "fabric of society", "social groups"]
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
print(prediction)
# {'external relations': 0.0, 'freedom and democracy': 0.0, 'political system': 0.9, 'economy': 0.4,
#  'welfare and quality of life': 98.3, 'fabric of society': 0.3, 'social groups': 0.0}
```

### Training data

Policy-DistilBERT-7d was trained on the English-speaking subset of the [Manifesto Project Dataset (MPDS2020a)](https://manifesto-project.wzb.eu/datasets): 129,669 sentences from 164 political manifestos published between 1992 and 2019 by 55 political parties in 8 English-speaking countries (Australia, Canada, Ireland, Israel, New Zealand, South Africa, United Kingdom, United States).

The Manifesto Project manually annotates individual sentences from political party manifestos in 7 main political domains: 'Economy', 'External Relations', 'Fabric of Society', 'Freedom and Democracy', 'Political System', 'Welfare and Quality of Life' and 'Social Groups' - see the [codebook](https://manifesto-project.wzb.eu/down/data/2020b/codebooks/codebook_MPDataset_MPDS2020b.pdf) for the exact definition of each domain.

### Training procedure

`distilbert-base-uncased` was trained using the Hugging Face trainer with the following hyperparameters, which were determined through a hyperparameter search on a 15% validation set.

```python
training_args = TrainingArguments(
    num_train_epochs=5,             # total number of training epochs
    learning_rate=4e-05,
    per_device_train_batch_size=4,  # batch size per device during training
    per_device_eval_batch_size=4,   # batch size for evaluation
    warmup_steps=500,               # number of warmup steps for learning rate scheduler
    weight_decay=0.02,              # strength of weight decay
    fp16=True,                      # mixed precision training
)
```
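For context, the arguments above plug into the Hugging Face `Trainer` roughly as follows. This is a minimal sketch, not the author's exact training script: `train_dataset` and `eval_dataset` are placeholders for the tokenized, annotated manifesto sentences (distributed by the Manifesto Project, not with this card), and `TrainingArguments` typically also needs an `output_dir` for checkpoints.

```python
from transformers import AutoModelForSequenceClassification, Trainer

# one output class per Manifesto Project domain
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=7
)

trainer = Trainer(
    model=model,
    args=training_args,           # the TrainingArguments shown above
    train_dataset=train_dataset,  # placeholder: tokenized 85% training split
    eval_dataset=eval_dataset,    # placeholder: tokenized 15% validation split
)
trainer.train()
```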
### Eval results

The model was evaluated on the remaining 15% of the sentences (85-15 train-test split).

accuracy (balanced) | F1 (weighted) | precision | recall | accuracy (not balanced)
--------------------|---------------|-----------|--------|------------------------
0.745 | 0.773 | 0.772 | 0.771 | 0.771

Please note that the label distribution in the dataset is imbalanced:

```
Welfare and Quality of Life    0.327225
Economy                        0.259191
Fabric of Society              0.111800
Political System               0.095081
Social Groups                  0.094371
External Relations             0.063724
Freedom and Democracy          0.048608
```

[Balanced accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html) and [weighted F1](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html) were therefore used to evaluate model performance.

## Limitations and bias

The model was trained on sentences from political party manifestos published between 1992 and 2019 in the 8 countries mentioned above, manually annotated by the [Manifesto Project](https://manifesto-project.wzb.eu/information/documents/information). As with any supervised machine learning model, the output therefore reproduces the limitations of the dataset in terms of country coverage, time span, domain definitions and potential annotator biases. Applying the model to other types of data (other text genres, other countries etc.) will likely reduce performance.

### BibTeX entry and citation info

```bibtex
@unpublished{laurer2020policydistilbert,
    title  = {Policy-DistilBERT},
    author = {Moritz Laurer},
    year   = {2020},
    note   = {Unpublished paper}
}
```