PARTYPRESS multilingual

A model fine-tuned on texts in seven languages (German, Danish, English, Dutch, Polish, Spanish, and Swedish) from nine countries (Austria, Denmark, Germany, Ireland, Netherlands, Poland, Spain, Sweden, UK), based on bert-base-multilingual-cased. Used in Erfort et al. (2023), building on the PARTYPRESS database. For the downstream task of classifying press releases from political parties into 23 unique policy areas, we achieve a performance comparable to expert human coders.

Model description

The PARTYPRESS multilingual model builds on bert-base-multilingual-cased and adds a supervised component: it was fine-tuned on texts labeled by humans. The labels indicate 23 different political issue categories derived from the Comparative Agendas Project (CAP):

Code	Issue
1	Macroeconomics
2	Civil Rights
3	Health
4	Agriculture
5	Labor
6	Education
7	Environment
8	Energy
9	Immigration
10	Transportation
12	Law and Crime
13	Social Welfare
14	Housing
15	Domestic Commerce
16	Defense
17	Technology
18	Foreign Trade
19.1	International Affairs
19.2	European Union
20	Government Operations
23	Culture
98	Non-thematic
99	Other

Model variations

We also provide monolingual models for each of the nine countries covered by the PARTYPRESS database. The model can be easily extended to other languages, country contexts, or time periods by fine-tuning it with minimal additional labeled texts.

Intended uses & limitations

The main use of the model is for text classification of press releases from political parties. It may also be useful for other political texts.

The classification can then be used to measure which issues parties are discussing in their communication.

How to use

This model can be used directly with a pipeline for text classification:

>>> from transformers import pipeline
>>> tokenizer_kwargs = {"padding": True, "truncation": True, "max_length": 512}
>>> partypress = pipeline(
...     "text-classification",
...     model="cornelius/partypress-multilingual",
...     tokenizer="cornelius/partypress-multilingual",
...     **tokenizer_kwargs,
... )
>>> partypress([
...     "We urgently need to fight climate change and reduce carbon emissions. This is what our party stands for.",
...     "We urge all parties to end the violence and come to the table. This conflict between the two countries must end.",
...     "Así, “el trabajo de los militares españoles está al servicio de España y de los demás países”, que participan en esta misión por mandato de la OTAN, ha recordado.",
...     "Dass es immer noch einen Gender-Pay-Gap gibt, geht auf das Konto dieser Regierung.",
... ])

[{'label': '7 - Environment', 'score': 0.9664431810379028},
 {'label': '19.1 - International Affairs', 'score': 0.9851641654968262},
 {'label': '16 - Defense', 'score': 0.986809492111206},
 {'label': '2 - Civil Rights', 'score': 0.9799079895019531}]
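
To turn predictions like these into issue shares, the label strings first need to be split into a code and an issue name. The following is a minimal sketch and not part of the original release: `parse_label` and `issue_shares` are hypothetical helper names, and the sample predictions are made up for illustration. Codes are kept as strings because 19.1 and 19.2 are not integers.

```python
from collections import Counter


def parse_label(label):
    """Split a label such as '19.1 - International Affairs' into (code, issue)."""
    code, issue = label.split(" - ", 1)
    return code, issue


def issue_shares(predictions):
    """Return the share of texts assigned to each issue code."""
    counts = Counter(parse_label(p["label"])[0] for p in predictions)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


# Hypothetical predictions in the list-of-dicts format shown above
# (labels follow the card's format; scores are placeholders).
predictions = [
    {"label": "7 - Environment", "score": 0.97},
    {"label": "19.1 - International Affairs", "score": 0.99},
    {"label": "16 - Defense", "score": 0.99},
    {"label": "7 - Environment", "score": 0.91},
]

shares = issue_shares(predictions)
print(shares)  # {'7': 0.5, '19.1': 0.25, '16': 0.25}
```

In practice one would probably group the predictions by party and time period before computing shares. That grouping is left out here to keep the sketch short.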

Limitations and bias

The model was trained with data from parties in nine countries. For use in other countries, the model may be further fine-tuned. Without further fine-tuning, the performance of the model may be lower.

The model may have biased predictions. We discuss some biases by country, party, and over time in the release paper for the PARTYPRESS database. For example, the performance is highest for press releases from Ireland (75%) and lowest for Poland (55%).

Training data

The PARTYPRESS multilingual model was fine-tuned on 27,243 press releases in seven languages from 68 European parties in nine countries. The press releases were labeled by two expert human coders per country.

For the training data of the underlying model, please refer to bert-base-multilingual-cased.

Training procedure

Preprocessing

For the preprocessing, please refer to bert-base-multilingual-cased.

Pretraining

For the pretraining, please refer to bert-base-multilingual-cased.

Fine-tuning

We fine-tuned the model using 27,243 labeled press releases from political parties in seven languages.

Training Hyperparameters

The batch size was 12 for training and 2 for testing, and the model was trained for four epochs. All other hyperparameters were left at the defaults of the transformers library.
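
As a rough guide to what these settings imply, the number of optimizer steps can be worked out from the batch size and the size of the labeled set. This is a back-of-the-envelope sketch under stated assumptions: no batches are dropped, the five folds (from the five-fold cross validation mentioned under Evaluation results) are as equal in size as possible, and a final model would be trained on all 27,243 texts. The card does not confirm any of these details, and `steps_per_epoch` is a hypothetical helper.

```python
import math

N_LABELED = 27_243      # labeled press releases (from this card)
TRAIN_BATCH_SIZE = 12   # from this card
EPOCHS = 4              # from this card
N_FOLDS = 5             # five-fold cross validation (from this card)


def steps_per_epoch(n_examples, batch_size):
    """Number of batches needed to see every example once (last batch may be smaller)."""
    return math.ceil(n_examples / batch_size)


# Assumption: folds are near-equal, so the largest test fold has ceil(N / 5) texts.
largest_test_fold = math.ceil(N_LABELED / N_FOLDS)
cv_train_size = N_LABELED - largest_test_fold

cv_steps = steps_per_epoch(cv_train_size, TRAIN_BATCH_SIZE)
full_steps = steps_per_epoch(N_LABELED, TRAIN_BATCH_SIZE)

print(cv_train_size, cv_steps, cv_steps * EPOCHS)   # 21794 1817 7268
print(N_LABELED, full_steps, full_steps * EPOCHS)   # 27243 2271 9084
```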

Framework versions

Transformers 4.28.0
TensorFlow 2.12.0
Datasets 2.12.0
Tokenizers 0.13.3

Evaluation results

In a five-fold cross validation on our downstream task, the fine-tuned model achieves the following results, which are comparable to the performance of our expert human coders:

Accuracy	Precision	Recall	F1 score
69.52	67.99	67.60	66.77

Note that the classification task is difficult because some topics, such as environment and energy, are often hard to tell apart.
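
For readers who want to compute the same four metrics on their own labeled sample, the sketch below does so in plain Python. It assumes macro-averaging, meaning an unweighted mean over categories. The card does not state which averaging scheme was used, so this is an assumption, and `macro_metrics` is a hypothetical helper name. The toy labels deliberately confuse codes 7 (Environment) and 8 (Energy).

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 (assumed averaging scheme)."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Made-up example: hand-coded issue codes vs. model predictions.
y_true = ["7", "7", "8", "8", "16", "16"]
y_pred = ["7", "8", "8", "8", "16", "7"]

metrics = macro_metrics(y_true, y_pred)
print({name: round(value, 4) for name, value in metrics.items()})
# {'accuracy': 0.6667, 'precision': 0.7222, 'recall': 0.6667, 'f1': 0.6556}
```

Under macro-averaging, the F1 score can be lower than both precision and recall, as in this toy example, because it is the mean of per-category F1 scores rather than the harmonic mean of the averaged precision and recall.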

When we aggregate the shares of text for each issue, we find that the root-mean-square error is very low (0.29).
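
The aggregate comparison can be sketched as follows: compute the share of texts per issue from the predictions and from the hand-coded labels, then take the root-mean-square error across issues. This is an illustration only. The card does not specify the unit of the shares or the level of aggregation (for example per party or per period), the numbers below are made up, and `rmse` is a hypothetical helper.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two {issue code: share} mappings.

    Issues missing from one mapping are treated as having a share of 0.
    """
    codes = sorted(set(predicted) | set(observed))
    squared_errors = [
        (predicted.get(code, 0.0) - observed.get(code, 0.0)) ** 2 for code in codes
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up shares (as proportions) for three issue codes.
predicted_shares = {"7": 0.50, "8": 0.20, "16": 0.30}
observed_shares = {"7": 0.45, "8": 0.25, "16": 0.30}

error = rmse(predicted_shares, observed_shares)
print(round(error, 4))  # 0.0408
```

Individual misclassifications can cancel out in the aggregate, which is why the error on issue shares can be small even when text-level accuracy is moderate.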

BibTeX entry and citation info

@article{erfort_partypress_2023,
  author    = {Cornelius Erfort and
               Lukas F. Stoetzer and
               Heike Klüver},
  title     = {The PARTYPRESS Database: A new comparative database of parties’ press releases},
  journal   = {Research and Politics},
  volume    = {10},
  number    = {3},
  year      = {2023},
  doi       = {10.1177/20531680231183512},
  URL       = {https://doi.org/10.1177/20531680231183512}
}

Erfort, C., Stoetzer, L. F., & Klüver, H. (2023). The PARTYPRESS Database: A new comparative database of parties’ press releases. Research & Politics, 10(3). https://doi.org/10.1177/20531680231183512

Further resources

GitHub: cornelius-erfort/partypress

Research and Politics Dataverse: Replication Data for: The PARTYPRESS Database: A New Comparative Database of Parties’ Press Releases

Acknowledgements

Research for this contribution is part of the Cluster of Excellence "Contestations of the Liberal Script" (EXC 2055, Project-ID: 390715649), funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy. Cornelius Erfort is moreover grateful for generous funding provided by the DFG through the Research Training Group DYNAMICS (GRK 2458/1).

Contact

Cornelius Erfort

Humboldt-Universität zu Berlin

corneliuserfort.de