|
--- |
|
library_name: transformers |
|
tags: |
|
- covid |
|
- misinformation |
|
- youtube |
|
- detector |
|
license: apache-2.0 |
|
language: |
|
- en |
|
--- |
|
|
|
# COVID-19 Misinformation Detection Tool for YouTube Videos |
|
|
|
This model is a fine-tuned version of [DeBERTa-v3-large](https://huggingface.co/microsoft/deberta-v3-large), trained to detect COVID-19 misinformation in YouTube videos.
|
|
|
## Model Description |
|
|
|
Given a YouTube video's metadata (e.g., title, description, transcript, tags), the model predicts one of three numeric labels: opposing COVID-19 misinformation (0), neutral information (1), or supporting COVID-19 misinformation (2).
|
|
|
To learn more about these labels, please refer to the paper: [Algorithmic Behaviors Across Regions: A Geolocation Audit of YouTube Search for COVID-19 Misinformation between the United States and South Africa](https://arxiv.org/abs/2409.10168). The video dataset used to train and evaluate the model is available in the [GitHub repository](https://github.com/social-comp/YouTubeAuditGeolocation-data).
|
|
|
## Training Hyperparameters |
|
The following hyperparameters were used during training: |
|
- OPTIMIZER: Adam, with a cross-entropy loss function
|
- LEARNING_RATE = 5e-6 |
|
- TRAIN_BATCH_SIZE = 4 |
|
- WEIGHT_DECAY = 1e-04
|
- VALIDATION_BATCH_SIZE = 4 |
|
- TEST_BATCH_SIZE = 4 |
|
- NUM_EPOCHS = 5 |
|
- MIN_SAVE_EPOCH = 2 |
|
|
|
The dataset was split 80-10-10 across the train (N=2180), validation (N=272), and test set (N=273). The model was fine-tuned on a single NVIDIA A40 GPU. |
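
For illustration, a minimal fine-tuning sketch under these hyperparameters is shown below. It is not the authors' exact training script: `model` and `train_dataset` are assumed to be a sequence-classification model and a tokenized dataset prepared as described in the next section, and the checkpoint-saving logic is likewise an assumption.

```python
# Minimal fine-tuning sketch under the hyperparameters above (illustrative only;
# `model` and `train_dataset` are assumed to be prepared as in the next section,
# and the checkpointing step is an assumption, not the authors' exact script).
import torch
from torch.utils.data import DataLoader

LEARNING_RATE = 5e-6
WEIGHT_DECAY = 1e-4
TRAIN_BATCH_SIZE = 4
NUM_EPOCHS = 5
MIN_SAVE_EPOCH = 2

optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
loss_fn = torch.nn.CrossEntropyLoss()
train_loader = DataLoader(train_dataset, batch_size=TRAIN_BATCH_SIZE, shuffle=True)

model.train()
for epoch in range(NUM_EPOCHS):
    for batch in train_loader:
        optimizer.zero_grad()
        outputs = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
        )
        loss = loss_fn(outputs.logits, batch["labels"])
        loss.backward()
        optimizer.step()
    # Save checkpoints only after the minimum epoch threshold.
    if epoch + 1 >= MIN_SAVE_EPOCH:
        model.save_pretrained(f"checkpoint-epoch-{epoch + 1}")
```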
|
|
|
## How to Get Started with the Model |
|
|
|
To get started, initialize the model with the AutoTokenizer and AutoModelForSequenceClassification classes. For the tokenizer, set `use_fast` to False; when tokenizing inputs, use a maximum length of 1024 with padding set to "max_length" and truncation set to True. For the model, set `num_labels` to 3 (see the sketch below).
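
A minimal loading sketch is shown below; it starts from the base `microsoft/deberta-v3-large` checkpoint for illustration, so point it at the fine-tuned weights in this repository where applicable.

```python
# Minimal loading sketch; "microsoft/deberta-v3-large" is the base checkpoint
# and should be replaced with the fine-tuned checkpoint where applicable.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "microsoft/deberta-v3-large"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)
model.eval()
```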
|
|
|
Next, for each video in your dataset, concatenate its title, description, transcript, and tags in the following manner:
|
|
|
```python
# Build the model input from the video's metadata fields
# (named `input_text` here to avoid shadowing Python's built-in `input`).
input_text = (
    'VIDEO TITLE: ' + title
    + '\nVIDEO DESCRIPTION: ' + description
    + '\nVIDEO TRANSCRIPT: ' + transcript
    + '\nVIDEO TAGS: ' + tags
)
```
|
|
|
Thus, each video in your dataset should have its input metadata formatted in the structure above. Finally, tokenize the input and feed the tokenized input into the model to obtain one of three predicted labels; the predicted label is the index of the largest output logit (argmax):

```python
_, pred_idx = outputs.logits.max(dim=1)
```
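
Putting these steps together, a minimal end-to-end inference sketch might look like the following; it assumes the `tokenizer` and `model` loaded above and an `input_text` string built in the format shown earlier.

```python
# End-to-end inference sketch (assumes `tokenizer`, `model`, and `input_text`
# from the steps above; illustrative, not the authors' exact script).
import torch

encoded = tokenizer(
    input_text,
    max_length=1024,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**encoded)

_, pred_idx = outputs.logits.max(dim=1)
print(pred_idx.item())  # 0: opposing, 1: neutral, 2: supporting misinformation
```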
|
|
|
## Training Data |
|
|
|
The video dataset used to train and evaluate the model is available in the [GitHub repository](https://github.com/social-comp/YouTubeAuditGeolocation-data).
|
|
|
To summarize, the dataset was annotated by Amazon Mechanical Turk (AMT) workers and the paper's authors. Please refer to the paper for more information on the training data and its annotation process. |
|
|
|
The videos in the dataset were labeled with the following 7 classes: "Opposing COVID-19 Misinformation (-1)," "Neutral COVID-19 Information (0)," "Supporting COVID-19 Misinformation (1)," "On the COVID-19 origins in Wuhan, China (2)," "Irrelevant (3)," "Video in a language other than English (4)," and "URL not accessible (5)." However, as explained in the paper, we normalized the 7 classes to 3 classes based on their stance on COVID-19 misinformation: supporting, neutral, and opposing (see the subsection "Consolidating from 5-classes to 3-classes" in the paper for more information).
|
|
|
Since the classifier's predicted index (pred_idx) must be non-negative, we shifted the 3-point annotation labels by adding one. Thus, the classifier outputs the following label values: opposing COVID-19 misinformation (0), neutral (1), and supporting COVID-19 misinformation (2).
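
For convenience, the correspondence between the classifier's output index, the original annotation labels, and the stance can be written as a small lookup; the names below are illustrative and not from the original codebase.

```python
# Illustrative lookup tables (names are assumptions, not from the original code).
IDX_TO_ANNOTATION = {0: -1, 1: 0, 2: 1}  # classifier index -> annotation label
IDX_TO_STANCE = {
    0: "opposing COVID-19 misinformation",
    1: "neutral",
    2: "supporting COVID-19 misinformation",
}

stance = IDX_TO_STANCE[pred_idx.item()]
```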
|
|
|
### Results |
|
|
|
The model achieved an accuracy, weighted F1-score, and macro F1-score of 0.85 on the test set. |
|
|
|
|
|
## Citation
|
If you use this model or the dataset from the GitHub repository in your research, please cite our work:
|
|
|
```bibtex |
|
@misc{jung2024algorithmicbehaviorsregionsgeolocation, |
|
title={Algorithmic Behaviors Across Regions: A Geolocation Audit of YouTube Search for COVID-19 Misinformation between the United States and South Africa}, |
|
author={Hayoung Jung and Prerna Juneja and Tanushree Mitra}, |
|
year={2024}, |
|
eprint={2409.10168}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CY}, |
|
url={https://arxiv.org/abs/2409.10168}, |
|
} |
|
``` |
|
|