
Fake News Classifier - Finetuned: 'distilbert-base-uncased'

LIAR Dataset


  • This model is finetuned on a large dataset of hand-labeled short statements from politifact.com's API.
  • Data went through a series of text cleaning stages such as:
    1. Lower-case standardization for improved 'uncased' model performance.
    2. Mixed letter/digit word removal.
    3. Stopword removal.
    4. Extra space trimming.
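The four cleaning stages above can be sketched roughly as follows. This is a minimal illustration, not the author's actual preprocessing code: the `clean_statement` name, the `MIXED_TOKEN` regular expression, and the tiny `STOPWORDS` set are all assumptions, since the card does not say which stopword list or matching rule was used.

```python
import re

# Hypothetical, deliberately small stopword set; the list actually used
# for this model is not specified in the model card.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "of", "in", "to", "and", "that"}

# Matches a run of word characters that contains at least one letter
# and at least one digit (e.g. "hr1234"). Applied after lower-casing.
MIXED_TOKEN = re.compile(r"\b(?=\w*[a-z])(?=\w*\d)\w+\b")


def clean_statement(text: str) -> str:
    """Apply the four cleaning stages in order (illustrative sketch)."""
    text = text.lower()                                       # 1. lower-case
    text = MIXED_TOKEN.sub(" ", text)                         # 2. drop mixed letter/digit words
    words = [w for w in text.split() if w not in STOPWORDS]   # 3. drop stopwords
    return " ".join(words)                                    # 4. collapse extra spaces


print(clean_statement("The  Senate passed HR1234 in 2019"))
```

Purely numeric tokens such as "2019" are kept here because only words mixing letters and digits are removed; punctuation handling is left out of this sketch.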

DistilBERT Uncased Tokenizer


  • The text is tokenized using the 'distilbert-base-uncased' HuggingFace tokenizer.
  • For training, the text is truncated to a block size of 200 tokens.
  • Max length padding is used to maintain consistent input data shape.
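To make the truncation and max-length padding concrete, here is a small pure-Python sketch of what happens to a list of token IDs. The `pad_or_truncate` helper and the pad ID of 0 are assumptions for illustration, and the sketch ignores special-token handling; with the HuggingFace tokenizer, the same effect is normally requested through the `padding="max_length"`, `truncation=True`, and `max_length=200` arguments.

```python
from typing import List, Tuple

BLOCK_SIZE = 200  # block size stated in the model card
PAD_ID = 0        # assumed padding token ID, for illustration only


def pad_or_truncate(
    token_ids: List[int],
    block_size: int = BLOCK_SIZE,
    pad_id: int = PAD_ID,
) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), each exactly block_size long."""
    kept = token_ids[:block_size]          # truncate sequences that are too long
    n_pad = block_size - len(kept)         # number of padding positions needed
    input_ids = kept + [pad_id] * n_pad    # pad short sequences on the right
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


ids, mask = pad_or_truncate([101, 7592, 102], block_size=5)
print(ids, mask)
```

Because every example ends up with the same length, batches can be stacked into a single fixed-shape tensor, and the attention mask tells the model which positions to ignore.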

DistilBERT Uncased Model


  • The model that is finetuned is the DistilBERT model, 'distilbert-base-uncased'.
  • This is a small and fast text classifier, perfect for real-time inference!
    • 40% fewer parameters than the base BERT model.
    • 60% faster while preserving 95% of the base BERT model's performance.
  • This model outperforms the finetuned 'distilbert-base-cased' by over 5% in average F1-score.
    • This improvement comes mainly from the lower learning rate and improved data preprocessing.
    • These modifications allow for a smoother training curve and convergence.

Dataset used to train caballeroch/FakeNewsClassifierDistilBert-uncased