---
datasets:
- liar
metrics:
- accuracy
- f1
- precision
- recall
---

# Fake News Classifier - Finetuned: 'distilbert-base-uncased'

#### **LIAR Dataset**
***
- This model is finetuned on the LIAR dataset, a large collection of hand-labeled short statements gathered from PolitiFact.com's API.
- The text went through a series of cleaning stages:
  1. Lower-casing, to match the 'uncased' model's vocabulary.
  2. Removal of words that mix letters and digits.
  3. Stopword removal.
  4. Trimming of extra whitespace.

#### **DistilBERT Uncased Tokenizer**
***
- The text is tokenized with the **'distilbert-base-uncased'** Hugging Face tokenizer.
- For training, inputs are truncated to a block size of 200 tokens.
- Max-length padding keeps the input shape consistent across batches.

#### **DistilBERT Uncased Model**
***
- The finetuned model is DistilBERT, **'distilbert-base-uncased'**.
- It is a small, fast text classifier, well suited to real-time inference:
  - 40% fewer parameters than the base BERT model.
  - 60% faster, while preserving 95% of base BERT's performance.
- This model outperforms a finetuned 'distilbert-base-cased' by over 5% in average F1-score.
- The improvement comes mainly from a lower learning rate and improved data preprocessing, which give a smoother training curve and better convergence.
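The cleaning stages described above could be sketched roughly as follows. This is a minimal illustration, not the card's actual preprocessing code: the function name `clean_statement` and the tiny stopword set are placeholders (a real pipeline would use a full stopword list, e.g. NLTK's).

```python
import re

# Illustrative stopword set only; swap in a full list (e.g. NLTK's) in practice.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "that"}

def clean_statement(text: str) -> str:
    """Apply the card's cleaning stages to one short statement."""
    # 1. Lower-case to match the 'uncased' vocabulary.
    text = text.lower()
    # 2. Drop tokens that mix letters and digits (e.g. '5x').
    tokens = [
        t for t in text.split()
        if not (re.search(r"[a-z]", t) and re.search(r"\d", t))
    ]
    # 3. Remove stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Re-joining on single spaces also trims extra whitespace.
    return " ".join(tokens)
```

For example, `clean_statement("The  economy grew 5x faster in 2019")` drops the stopwords and the mixed token `5x`, keeping the purely numeric `2019`.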
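The truncation-and-padding step described in the tokenizer section amounts to the following logic, shown here as a plain-Python sketch rather than the tokenizer's own implementation. The helper name `pad_and_truncate` and the default `pad_id=0` are assumptions for illustration (in practice the tokenizer's `truncation=True, padding="max_length", max_length=200` arguments handle this).

```python
def pad_and_truncate(ids: list, block_size: int = 200, pad_id: int = 0) -> list:
    """Cut a token-id sequence to block_size, then right-pad to exactly block_size.

    pad_id=0 is an assumed placeholder for the tokenizer's [PAD] token id.
    """
    ids = ids[:block_size]                         # truncate long sequences
    return ids + [pad_id] * (block_size - len(ids))  # pad short ones
```

Every sequence comes out exactly `block_size` long, which is what keeps the batched input shape consistent.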