Fake News Classifier - Finetuned: 'distilbert-base-uncased'

LIAR Dataset

  • This model is finetuned on the LIAR dataset, a large collection of hand-labeled short statements gathered from politifact.com's API.
  • The data went through several text cleaning stages:
    1. Lower-case standardization for improved 'uncased' model performance.
    2. Mixed letter/digit word removal.
    3. Stopword removal.
    4. Extra space trimming.
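
The four cleaning stages above could be sketched roughly as follows. This is a minimal illustration, not the author's actual preprocessing code: the function name `clean_statement` and the tiny `STOPWORDS` set are assumptions made for the example, since the model card does not say which stopword list or regular expressions were used.

```python
import re

# Hypothetical, deliberately small stopword list; the list used for the
# actual model is not given in the model card.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "of", "to", "in", "on", "and", "that", "for"}


def clean_statement(text: str) -> str:
    """Apply the four cleaning stages to one statement."""
    # 1. Lower-case standardization.
    text = text.lower()
    # Splitting on whitespace also prepares stage 4 (extra space trimming).
    tokens = text.split()
    # 2. Drop words that mix letters and digits (e.g. "hb1234").
    tokens = [
        t for t in tokens
        if not (re.search(r"[a-z]", t) and re.search(r"\d", t))
    ]
    # 3. Stopword removal.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Re-join with single spaces, which trims any extra whitespace.
    return " ".join(tokens)


print(clean_statement("The  Senate passed HB1234 in 2019 "))  # senate passed 2019
```

Note that purely numeric words such as "2019" survive stage 2 in this sketch; only words containing both letters and digits are removed.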

DistilBERT Uncased Tokenizer

  • The text is tokenized using the 'distilbert-base-uncased' HuggingFace tokenizer.
  • For training, the text is truncated to a block size of 200 tokens.
  • Max length padding is used to maintain consistent input data shape.
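
To make the truncation and padding behaviour concrete, here is a small pure-Python sketch of what happens to a list of token ids. The helper `truncate_and_pad` is hypothetical and written only for illustration; `pad_id=0` is assumed to match the `[PAD]` id of the 'distilbert-base-uncased' vocabulary. With the HuggingFace tokenizer itself, the equivalent is typically requested by passing `padding="max_length"`, `truncation=True`, and `max_length=200` when calling the tokenizer.

```python
def truncate_and_pad(token_ids, block_size=200, pad_id=0):
    """Return (input_ids, attention_mask), both exactly block_size long.

    Simplified sketch: unlike a real tokenizer, it does not re-insert
    special tokens (such as a trailing [SEP]) after truncating.
    """
    # Truncate anything beyond the block size.
    ids = list(token_ids)[:block_size]
    # Real tokens get mask 1, padding positions get mask 0.
    mask = [1] * len(ids)
    n_pad = block_size - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


ids, mask = truncate_and_pad([101, 7592, 102], block_size=5)
print(ids)   # [101, 7592, 102, 0, 0]
print(mask)  # [1, 1, 1, 0, 0]
```

Padding every example to the same length keeps each batch a fixed shape, at the cost of some wasted computation on short statements.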

DistilBERT Uncased Model

  • The finetuned model is DistilBERT, 'distilbert-base-uncased'.
  • This is a small and fast model, well suited to real-time text classification!
    • 40% fewer parameters than the base BERT model.
    • 60% faster while preserving over 95% of the base BERT model's performance.
  • This model outperforms the finetuned 'distilbert-base-cased' model by over 5% in average F1-score.
    • This improvement comes mainly from the lower learning rate and improved data preprocessing.
    • These modifications allow for a smoother training curve and convergence.

Dataset used to train caballeroch/FakeNewsClassifierDistilBert-uncased