gbert-CTA-w-synth

gbert-CTA-w-synth is a fine-tuned version of the German BERT model (GBERT) designed to detect Calls to Action (CTAs) in political Instagram content. It was developed to analyze political mobilization strategies during the 2021 German Federal Election, focusing on Instagram stories and posts.

The model is trained on both real-world and synthetic data to mitigate class imbalance and improve performance. It detects explicit and implicit CTAs in text derived from multimodal content: captions, Optical Character Recognition (OCR) text from images, and video transcriptions.

Model Description

Base Model: deepset/gbert-large
Fine-tuned on: German Instagram content, including captions, OCR text, and transcriptions
Synthetic Data: Augmented with synthetic training data generated using OpenAI’s GPT-4 to address class imbalance.
Task: Binary classification of CTA presence or absence in Instagram posts and stories.

For video transcriptions, we used bofenghuang/whisper-large-v2-cv11-german, a fine-tuned version of OpenAI's Whisper model adapted for the German language.

Performance

The model was evaluated against human-annotated ground-truth labels using five-fold cross-validation to assess how well it generalizes. It achieved the following scores:

Macro F1 score: 0.93
Binary F1 score: 0.89
Precision: 0.98
Recall: 0.81
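
As a quick consistency check, the binary F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two values listed above. This is a minimal sketch: the function name `f1_from_precision_recall` is illustrative and not part of the model's code, and because the reported figures are rounded (and may be averaged over folds), only approximate agreement should be expected.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported precision and recall from the list above
f1 = f1_from_precision_recall(0.98, 0.81)
print(round(f1, 2))  # 0.89
```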

The evaluation was based on a dataset containing 1,388 documents annotated by nine contributors. Disagreements were resolved using majority decisions.
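
Majority-decision resolution can be sketched as follows. This is an illustration only: the function name `majority_label`, the use of one boolean label per annotator, and the tie handling (returning `None` so the document can be reviewed again) are assumptions, since the source does not state how many annotators labeled each document or how ties were treated.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[bool]) -> Optional[bool]:
    """Return the label chosen by most annotators, or None on a tie or no votes."""
    counts = Counter(votes)
    true_votes, false_votes = counts[True], counts[False]
    if true_votes == false_votes:
        return None  # assumption: ties are left unresolved in this sketch
    return true_votes > false_votes


print(majority_label([True, True, False]))   # True
print(majority_label([False, True, False]))  # False
print(majority_label([True, False]))         # None
```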

Usage

This model is intended for computational social science and political communication research, specifically for studying how political actors mobilize audiences on social media. It is effective for detecting Calls to Action in German-language social media content.

How to Use

You can use this model with the transformers library in Python:

from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
tokenizer = BertTokenizer.from_pretrained('chaichy/gbert-CTA-w-synth')
model = BertForSequenceClassification.from_pretrained('chaichy/gbert-CTA-w-synth')

# Tokenize input, truncating to BERT's 512-token limit
inputs = tokenizer("Input text here", return_tensors="pt", truncation=True, max_length=512)

# Get classification results; gradients are not needed for inference
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predicted_class = torch.argmax(logits, dim=1)

# 0 for absence, 1 for presence of CTA
print(f"Predicted class: {predicted_class.item()}")
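
A post or story can carry several text fields (caption, OCR text, transcription). One possible way to obtain a post-level decision is to classify each non-empty field separately and flag the post if any field contains a CTA. The sketch below assumes this "any field" rule, which the model card does not confirm, and uses hypothetical names (`post_has_cta`, `classify`, `keyword_stub`). Here `classify` stands for any function that returns 1 for a CTA and 0 otherwise, for example a thin wrapper around the model call shown above.

```python
from typing import Callable, Dict, Optional


def post_has_cta(
    fields: Dict[str, Optional[str]],
    classify: Callable[[str], int],
) -> bool:
    """Return True if any non-empty text field is classified as a CTA (label 1)."""
    for text in fields.values():
        if text and text.strip() and classify(text) == 1:
            return True
    return False


# Stand-in classifier for demonstration only; replace with a model-backed function.
def keyword_stub(text: str) -> int:
    return 1 if "wählen" in text.lower() else 0


post = {
    "caption": "Unser Programm für die Zukunft.",
    "ocr": "Am 26. September wählen gehen!",
    "transcription": None,
}
print(post_has_cta(post, keyword_stub))  # True
```

Keeping the classifier as a parameter means the aggregation rule can be tested without loading the model, and the rule itself can be swapped out if a different post-level definition is preferred.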

Data

The model was trained on Instagram content collected during the 2021 German Federal Election campaign. This included:

Captions: Text accompanying images or videos in posts.
OCR text: Text extracted from images using Optical Character Recognition (OCR).
Transcriptions: Text extracted from video audio, using bofenghuang/whisper-large-v2-cv11-german.

The dataset contains both explicit and implicit CTAs, labeled as binary values (True/False). To handle class imbalance, we generated synthetic training data based on the original human-annotated dataset. The synthetic examples were created with OpenAI’s GPT-4o, which generated new CTAs in a political communication style consistent with the real-world examples.

Ethical Considerations

The training data was collected from publicly available Instagram posts and stories shared by verified political accounts during the 2021 German Federal Election. No personal or sensitive data was included.

Citation

If you use this model, please cite the following:

@misc{achmanndenkler2024detectingcallsactionmultimodal,
      title={Detecting Calls to Action in Multimodal Content: Analysis of the 2021 German Federal Election Campaign on Instagram}, 
      author={Michael Achmann-Denkler and Jakob Fehle and Mario Haim and Christian Wolff},
      year={2024},
      eprint={2409.02690},
      archivePrefix={arXiv},
      primaryClass={cs.SI},
      url={https://arxiv.org/abs/2409.02690}, 
}
