Model Card for hybrinfox/ukraine-operation_propaganda-detection-EN
This model identifies propaganda about the invasion of Ukraine in press articles.
Model Details
Model Description
The model is a fine-tuned version of roberta-base (https://huggingface.co/roberta-base) on the Propagandist Pseudo-News dataset (https://github.com/hybrinfox/ppn).
- Owned by: Airbus Defence and Space
- Developed for: HYBRINFOX consortium (Airbus Defence and Space - Paris Sciences et Lettres, Ecole Normale Supérieure Ulm, Institut Jean-Nicod - Université de Rennes, Inria, IRISA, Mondeca)
- Funded by: French National Research Agency (ANR-21-ASIA-0003)
- Model type: Text classification
- Language(s) (NLP): en
- License: CC BY-NC 4.0
- Finetuned from model: roberta-base
Uses
Direct Use
The model can be used directly to classify press articles written in English about the invasion of Ukraine or related topics. The output corresponds to the probability of belonging to each class: 0 for regular press articles and 1 for propagandist articles.
Out-of-Scope Use
This model should not be used to categorize news sources as propagandist or not, but it can help identify pro-Russian narratives and Russian values. The model is not trained to identify the authors' intentions and should not be used to draw such conclusions.
Bias, Risks, and Limitations
This model has been trained with articles from different sources, but all articles in the propaganda class share the same narrative. Moreover, all articles cover the same topic of the Russo-Ukrainian conflict. The model is not infallible and should not be used to make critical decisions when judging an article, its authors, or the corresponding news outlet.
Recommendations
We recommend using this model for research purposes only, and always cross-checking its predictions against other informed sources before drawing any conclusion.
How to Get Started with the Model
Use the code below to get started with the model.
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-classification", model="hybrinfox/ukraine-operation_propaganda-detection-EN")
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("hybrinfox/ukraine-operation_propaganda-detection-EN")
model = AutoModelForSequenceClassification.from_pretrained("hybrinfox/ukraine-operation_propaganda-detection-EN")
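A minimal usage sketch follows; the example sentence is illustrative, and the LABEL_0/LABEL_1 names assume the checkpoint ships with the default label mapping (0 = regular press, 1 = propaganda):
# Run the classifier on a short English excerpt (illustrative text, not from the dataset)
text = "Western media keep hiding the truth about the special operation."
print(pipe(text, top_k=None))
# e.g. [{'label': 'LABEL_1', 'score': 0.99}, {'label': 'LABEL_0', 'score': 0.01}]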
Training Details
Training Data
The model has been trained using the data from the Propagandist Pseudo-News dataset, available at https://github.com/hybrinfox/ppn, for the positive class. Additional articles on the same topic, but from mainstream sources, have been used for the negative class. Please read the paper for more details.
Training Procedure
Training Hyperparameters
- Training regime:
  - Batch size: 8
  - Learning rate: 5e-5
  - Number of fine-tuning epochs: 3
  - Optimizer: Adam with default settings
  - Loss function: binary cross-entropy
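For reference, here is a minimal sketch of how an equivalent run could be set up with the Hugging Face Trainer using the hyperparameters above. It is not the authors' exact training script: the datasets are placeholders, and Trainer defaults to AdamW with cross-entropy, which approximates the Adam and binary cross-entropy settings reported here.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# train_dataset is assumed to be a tokenized dataset with input_ids,
# attention_mask and labels (0 = regular press, 1 = propaganda).
args = TrainingArguments(
    output_dir="propaganda-detector",
    per_device_train_batch_size=8,  # batch size 8
    learning_rate=5e-5,             # learning rate 5e-5
    num_train_epochs=3,             # 3 fine-tuning epochs
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()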
Evaluation
During training, the model was monitored with the training metrics as well as the validation loss.
Testing Data, Factors & Metrics
Testing Data
The previously described dataset was split into train/validation/test sets with an 80/10/10 ratio. The reported results are on the test set, after using the training set for training and the validation set for monitoring the model's learning.
Metrics
The reported metrics are the F1 scores and losses on the three sets.
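These scores can be reproduced from model predictions with scikit-learn; a minimal sketch with placeholder labels and probabilities:
from sklearn.metrics import f1_score, log_loss

y_true = [0, 0, 1, 1]           # placeholder gold labels
y_prob = [0.1, 0.2, 0.9, 0.8]   # placeholder predicted P(propaganda)
y_pred = [int(p >= 0.5) for p in y_prob]
print("F1:", f1_score(y_true, y_pred))
print("Loss:", log_loss(y_true, y_prob))  # binary cross-entropy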
Results
| Split | Loss | F1 score |
|---|---|---|
| Train | 0.0004 | 1.0000 |
| Val | 0.0170 | 0.9985 |
| Test | 0.0329 | 0.9970 |
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: T4
- Hours used: 0.3
- Cloud Provider: GCP
- Compute Region: europe-west1
- Carbon Emitted: 0.01 kg CO2 eq
Because we fine-tuned a general foundation model rather than training from scratch, the environmental impact of training our propaganda detector is negligible: the equivalent of about 40 meters traveled by an internal combustion engine car. The low-carbon energy used in the compute region also helped reduce the environmental impact of the training.
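As a rough sanity check, the figure can be approximated with the usual energy-times-carbon-intensity estimate; the T4 power draw and grid intensity below are assumed values, not figures from this card:
# Back-of-the-envelope CO2 estimate (assumed values, not from the card)
gpu_power_kw = 0.070  # assumed NVIDIA T4 TDP of about 70 W
hours = 0.3           # training time reported above
intensity = 0.5       # assumed kg CO2 eq per kWh of electricity
print(gpu_power_kw * hours * intensity)  # roughly 0.01, consistent with the value above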
Citation
Géraud Faye, Benjamin Icard, Morgane Casanova, Julien Chanson, François Maine, François Bancilhon, Guillaume Gadek, Guillaume Gravier, and Paul Égré. 2024. Exposing propaganda: an analysis of stylistic cues comparing human annotations and machine classification. In Proceedings of the Third Workshop on Understanding Implicit and Underspecified Language, pages 62–72, Malta. Association for Computational Linguistics.
BibTeX:
@inproceedings{faye-etal-2024-exposing,
title = "Exposing propaganda: an analysis of stylistic cues comparing human annotations and machine classification",
author = "Faye, G{\'e}raud and
Icard, Benjamin and
Casanova, Morgane and
Chanson, Julien and
Maine, Fran{\c{c}}ois and
Bancilhon, Fran{\c{c}}ois and
Gadek, Guillaume and
Gravier, Guillaume and
{\'E}gr{\'e}, Paul",
editor = "Pyatkin, Valentina and
Fried, Daniel and
Stengel-Eskin, Elias and
Liu, Alisa and
Pezzelle, Sandro",
booktitle = "Proceedings of the Third Workshop on Understanding Implicit and Underspecified Language",
month = mar,
year = "2024",
address = "Malta",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.unimplicit-1.6",
pages = "62--72",
}
Model Card Authors
HYBRINFOX consortium
Model Card Contact