
Fake News Classifier - Finetuned: 'distilbert-base-cased'

LIAR Dataset


  • This model is finetuned on the LIAR dataset, a collection of roughly 12.8K hand-labeled short statements gathered from politifact.com's API.
  • Relevant columns of the data (speaker, statement, etc.) are concatenated and tokenized to create the model input.
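The concatenation step can be sketched as below. This is only an illustration: the card names "speaker, statement, etc." without listing the exact columns, their order, or the separator, so `INPUT_COLUMNS`, the single-space separator, and the helper `build_model_input` are all assumptions, and the example row is invented.

```python
# Assumed column choice and order; the model card does not give the full list.
INPUT_COLUMNS = ("speaker", "statement")


def build_model_input(row, columns=INPUT_COLUMNS, sep=" "):
    """Join the non-empty values of the chosen columns into one string."""
    parts = [str(row[c]).strip() for c in columns if row.get(c)]
    return sep.join(p for p in parts if p)


# Invented row, shaped like a LIAR record, for illustration only.
example_row = {
    "speaker": "jane-doe",
    "statement": "The city budget doubled last year.",
    "subject": "budget",
}
model_input = build_model_input(example_row)
```

Missing or empty columns are skipped rather than rendered as "None", so rows with incomplete metadata still yield a clean string.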

DistilBERT Cased Tokenizer


  • The text is tokenized using the 'distilbert-base-cased' HuggingFace tokenizer.
  • For training, the text is truncated to a block size of 200 tokens.
  • Max length padding is used to maintain consistent input data shape.
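A minimal sketch of these tokenizer settings, assuming the standard HuggingFace `transformers` tokenizer call. The helper names `load_tokenizer` and `tokenize_texts` are hypothetical, and the author's actual preprocessing script may differ.

```python
BLOCK_SIZE = 200  # block size stated in the model card


def load_tokenizer(name="distilbert-base-cased"):
    """Load the pretrained tokenizer (requires `transformers`; downloads on first use)."""
    # Imported lazily so tokenize_texts can be used with any compatible callable.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(name)


def tokenize_texts(tokenizer, texts, block_size=BLOCK_SIZE):
    """Truncate each text to block_size tokens and pad shorter ones to that length."""
    return tokenizer(
        texts,
        truncation=True,        # cut sequences longer than block_size
        padding="max_length",   # pad shorter sequences up to block_size
        max_length=block_size,
    )
```

With a real tokenizer, `tokenize_texts(load_tokenizer(), ["some statement"])` returns `input_ids` and `attention_mask` lists that are each exactly 200 entries long, which keeps every batch the same shape.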

DistilBERT Cased Model


  • The model that is finetuned is the DistilBERT model, 'distilbert-base-cased'.
  • DistilBERT is a small and fast model, well suited to real-time inference!
    • 40% fewer parameters than the base BERT model.
    • 60% faster while preserving over 95% of the base BERT model's performance, as measured on the GLUE benchmark.
  • The intuition for using the cased model is to capture stylistic patterns that lowercasing would erase, chiefly capitalization.
    • This information may be relevant for detecting fake news sources.
    • Writing styles may be relevant (as we see in clickbait titles with capitalization).
  • This model performs well at flagging misinformation (fake news), especially when the input format is similar to the training distribution.
  • Overall, the performance is worse than that of the finetuned 'distilbert-base-uncased' model, as the training data is less clean.
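For inference, something like the following sketch might work, assuming the checkpoint loads through the standard `transformers` text-classification pipeline under the repo name `caballeroch/FakeNewsClassifierDistilBert-cased`. The helpers `load_classifier` and `flag_statements` are hypothetical, and the label names the model returns are not documented in this card, so they are passed through unchanged.

```python
MODEL_ID = "caballeroch/FakeNewsClassifierDistilBert-cased"  # assumed Hub repo name


def load_classifier(model_id=MODEL_ID):
    """Build a text-classification pipeline (requires `transformers`; downloads weights)."""
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def flag_statements(classifier, texts, block_size=200):
    """Return (text, label, score) triples, truncating inputs to the training block size."""
    results = classifier(texts, truncation=True, max_length=block_size)
    return [(text, r["label"], r["score"]) for text, r in zip(texts, results)]
```

Since the model is reported to work best on inputs that resemble the training distribution, it is probably worth formatting each text the same way as the training inputs (speaker and statement concatenated) before calling `flag_statements(load_classifier(), texts)`.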

Dataset used to train caballeroch/FakeNewsClassifierDistilBert-cased