Fake News Classifier - Finetuned: 'distilbert-base-cased'

This model is finetuned on a large dataset of hand-labeled short statements from politifact.com's API.
Relevant columns of the data (speaker, statement, etc.) are concatenated and tokenized to create the model input.
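
The concatenation step can be sketched as follows. This is a minimal illustration, not the author's preprocessing code: only the `speaker` and `statement` column names come from this card, while the `build_input` helper, the field order, and the `" | "` separator are assumptions made for the example.

```python
from typing import Mapping, Sequence

# Only "speaker" and "statement" are named in the card; other columns are unknown.
FIELDS: Sequence[str] = ("speaker", "statement")
# Hypothetical separator; the card does not say how the columns are joined.
SEPARATOR = " | "


def build_input(record: Mapping[str, object],
                fields: Sequence[str] = FIELDS,
                sep: str = SEPARATOR) -> str:
    """Join the non-empty fields of one record into a single input string."""
    parts = [str(record[f]).strip() for f in fields if record.get(f)]
    return sep.join(parts)


example = {"speaker": "Jane Doe", "statement": "Taxes fell last year."}
print(build_input(example))
```

Whatever format was used during finetuning should also be used at inference time, since the card notes that the model works best on inputs that resemble the training distribution.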

The text is tokenized using the 'distilbert-base-cased' HuggingFace tokenizer.
For training, the text is truncated to a block size of 200 tokens.
Max length padding is used to maintain consistent input data shape.
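
A sketch of this tokenization setup with the HuggingFace `transformers` library is shown below. The tokenizer name and the block size of 200 come from this card; the `load_tokenizer` and `encode` helper names are hypothetical, and the actual training script may differ in its details.

```python
from typing import Sequence

TOKENIZER_NAME = "distilbert-base-cased"
BLOCK_SIZE = 200  # block size stated in the card


def load_tokenizer(name: str = TOKENIZER_NAME):
    """Load the pretrained tokenizer (downloads it on first use)."""
    # Imported lazily so that defining the helpers does not require the library.
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(name)


def encode(texts: Sequence[str], tokenizer, block_size: int = BLOCK_SIZE):
    """Truncate each text to block_size tokens and pad shorter ones up to it."""
    return tokenizer(
        list(texts),
        truncation=True,       # cut inputs longer than block_size
        padding="max_length",  # pad every input to exactly block_size
        max_length=block_size,
    )
```

Typical usage would be `encode(["Jane Doe | Taxes fell last year."], load_tokenizer())`, which returns `input_ids` and `attention_mask` entries of the same length for every input.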

The model that is finetuned is the DistilBERT model, 'distilbert-base-cased'.
This is a small and fast text classifier, well suited to real-time inference!
- 40% fewer parameters than the base BERT model.
- 60% faster while preserving over 95% of the base BERT model's performance, as measured on the GLUE benchmark.
The intuition for using the cased model is to capture some patterns in the writing style (capitalization, punctuation).
- This information may be relevant for detecting fake news sources.
- Writing styles may be relevant (as we see in clickbait titles with capitalization).
This model performs well at flagging misinformation (fake news), especially when the input format is similar to the training distribution.
Overall, its performance is worse than that of the finetuned 'distilbert-base-uncased' model, as the training data is less clean.
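
For inference, a minimal sketch using the `transformers` text-classification pipeline might look like the following. It assumes the model is published on the Hub as `caballeroch/FakeNewsClassifierDistilBert-cased` (read from this page's header) and that it loads as a standard sequence-classification checkpoint; the `load_classifier` and `classify` helpers are hypothetical names.

```python
from typing import List, Sequence, Tuple

# Assumed Hub id, taken from the page header rather than from documented usage.
MODEL_ID = "caballeroch/FakeNewsClassifierDistilBert-cased"
BLOCK_SIZE = 200  # same block size as used for training


def load_classifier(model_id: str = MODEL_ID):
    """Build a text-classification pipeline (downloads the model on first use)."""
    # Imported lazily so that defining the helpers does not require the library.
    from transformers import pipeline
    return pipeline("text-classification", model=model_id)


def classify(texts: Sequence[str], classifier) -> List[Tuple[str, str, float]]:
    """Return (text, predicted label, score) for each input text."""
    results = classifier(list(texts), truncation=True, max_length=BLOCK_SIZE)
    return [(t, r["label"], float(r["score"])) for t, r in zip(texts, results)]
```

Calling `classify(["Jane Doe | Taxes fell last year."], load_classifier())` returns one tuple per input. The label names depend on the model's configuration and are not documented in this card, so they should be checked before being interpreted as "fake" or "real".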

caballeroch/FakeNewsClassifierDistilBert-cased