
Fake News Classifier - Finetuned: 'distilbert-base-cased'

LIAR Dataset


  • This model is finetuned on a large dataset of hand-labeled short statements from politifact.com's API.
  • Relevant columns of the data (speaker, statement, etc.) are concatenated and tokenized to create the model input.
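The concatenation step described above can be sketched roughly as follows. This is a minimal illustration, not the author's preprocessing code: the helper name `build_model_input`, the choice of exactly two columns, the single-space separator, and the sample record are all assumptions made for the example.

```python
def build_model_input(row, columns=("speaker", "statement"), sep=" "):
    """Join the selected columns of one record into a single input string.

    Hypothetical helper: the column list and separator are assumptions,
    since the model card only says "speaker, statement, etc.".
    """
    parts = []
    for name in columns:
        value = row.get(name)
        if value is None:
            continue  # skip columns that are missing for this record
        text = str(value).strip()
        if text:
            parts.append(text)
    return sep.join(parts)


# Made-up record for illustration only (not taken from the LIAR dataset).
example = {
    "speaker": "jane-doe",
    "statement": "Taxes went up last year.",
    "label": "half-true",
}

text = build_model_input(example)
print(text)
```

Only the columns listed in `columns` reach the model input, so the label stays out of the text. Which extra columns to include, and in what order, is a design choice the card does not spell out.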

DistilBERT Cased Tokenizer


  • The text is tokenized using the 'distilbert-base-cased' HuggingFace tokenizer.
  • For training, the tokenized text is truncated to a block size of 200 tokens.
  • Max-length padding is used so that every input has the same shape.
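To make the truncation and padding behaviour concrete, here is a small pure-Python sketch of what "block size 200 with max-length padding" does to a list of token IDs. The function name `pad_or_truncate`, the pad ID of 0, and the toy token IDs are assumptions for illustration. The sketch also ignores special tokens, which a real tokenizer handles when truncating.

```python
def pad_or_truncate(token_ids, block_size=200, pad_id=0):
    """Return (input_ids, attention_mask), both of length block_size.

    Hypothetical helper: pad_id=0 is an assumption, and special tokens
    are not treated specially here.
    """
    ids = list(token_ids)[:block_size]      # cut anything past the block size
    attention_mask = [1] * len(ids)         # 1 marks real tokens
    n_pad = block_size - len(ids)
    input_ids = ids + [pad_id] * n_pad      # right-pad up to the block size
    attention_mask = attention_mask + [0] * n_pad  # 0 marks padding
    return input_ids, attention_mask


# Toy ID sequences (placeholders, not real vocabulary IDs).
short_ids, short_mask = pad_or_truncate([5, 6, 7, 8])
long_ids, long_mask = pad_or_truncate(range(1, 301))

print(len(short_ids), sum(short_mask))
print(len(long_ids), sum(long_mask))
```

With the Hugging Face tokenizer, the same effect is usually requested by passing `padding="max_length"`, `truncation=True`, and `max_length=200` in the tokenizer call. The card does not say whether the author used exactly those arguments.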

DistilBERT Cased Model


  • The model that is finetuned is the DistilBERT model, 'distilbert-base-cased'.
  • DistilBERT is a small, fast model, which makes the resulting text classifier well suited to real-time inference!
    • 40% fewer parameters than the base BERT model.
    • 60% faster while preserving over 95% of the base BERT model's performance.
  • The intuition for using the cased model is to capture some patterns in the writing style (capitalization, punctuation).
    • This information may be relevant for detecting fake news sources.
    • Writing styles may be relevant (as we see in clickbait titles with capitalization).
  • This model performs well in flagging misinformation (fake news), especially if the format is similar to the training distribution.
  • Overall, the performance is worse than that of the finetuned 'distilbert-base-uncased' model, as the training data is less clean.

Dataset used to train caballeroch/FakeNewsClassifierDistilBert-cased