---
language:
  - Python
tags:
  - NLP
  - Fake News Detection
  - XLM RoBERTa
datasets:
  - https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset
metrics:
  - Accuracy
  - F1-score
---

# Write-up:

## Link to Hugging Face model:
https://huggingface.co/Sajib-006/fake_news_detection_xlmRoberta

## Model Description:
    * Used the pretrained XLM-RoBERTa base model.
    * Added a classifier layer on top of the XLM-RoBERTa encoder.
    * For tokenization, I used a maximum sequence length of 512 (the longest input the model can handle). A minimal setup sketch is given after this list.
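
A minimal sketch of the setup described above, assuming the Hugging Face `transformers` library (illustrative only, not the exact training code behind the released model):

```python
from transformers import XLMRobertaTokenizer, XLMRobertaForSequenceClassification

# Pretrained XLM-RoBERTa base with a sequence-classification head (fake vs. real)
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
model = XLMRobertaForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)

# Tokenize with the maximum sequence length the model can handle (512)
batch = tokenizer(
    ["Example news article text ..."],
    truncation=True,
    padding="max_length",
    max_length=512,
    return_tensors="pt",
)
logits = model(**batch).logits  # shape: (batch_size, 2)
```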
    
## Result:
    * Using the BERT base uncased (English) model, accuracy was near 85% (on the full dataset).
    * Using the XLM-RoBERTa base model, accuracy was almost 100% (on a 2k-sample subset). A small evaluation sketch follows this list.
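
A small sketch of how the reported Accuracy and F1-score could be computed on the hold-out predictions, assuming scikit-learn (the labels and predictions below are placeholders, not the real outputs):

```python
from sklearn.metrics import accuracy_score, f1_score

# Placeholder labels/predictions (1 = fake, 0 = real); substitute the real hold-out outputs
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
```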
    
## Limitations:
    * Pretrained XLM-RoBERTa is a heavy model. Training it on the full dataset (44k+ samples) was not possible on the Google Colab free tier, so I used a small 2k-sample subset for my experiment.
    * Since accuracy and F1-score were almost 100% on that 2,000-sample set, I did not try to analyse misclassified examples.
    * I could not run the model on the whole dataset because the Colab free tier has RAM and disk restrictions and does not provide a GPU for long sessions; training XLM-RoBERTa on the full dataset would take a very long time.
    * Because a single epoch took a long time, I saved a checkpoint after epoch 1 and resumed training from those weights for the second epoch (a sketch of this pattern is shown after this list). After 2 epochs the model showed almost 100% accuracy, so I did not train further.
    * A clearer picture would require running on the full dataset. I had some ideas for a better model but could not implement them due to the hardware restrictions mentioned above and time constraints; my ideas are given below.
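
A minimal sketch of the save-and-resume pattern from the checkpoint bullet above, assuming the Hugging Face `transformers` API; the Google Drive path is a placeholder:

```python
from transformers import XLMRobertaTokenizer, XLMRobertaForSequenceClassification

CKPT_DIR = "/content/drive/MyDrive/fake_news_ckpt"  # placeholder path on Google Drive

# --- end of the first Colab session, after epoch 1 ---
model = XLMRobertaForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
# ... train for one epoch ...
model.save_pretrained(CKPT_DIR)      # persist the fine-tuned weights
tokenizer.save_pretrained(CKPT_DIR)

# --- start of a fresh Colab session, for epoch 2 ---
model = XLMRobertaForSequenceClassification.from_pretrained(CKPT_DIR)
tokenizer = XLMRobertaTokenizer.from_pretrained(CKPT_DIR)
# ... resume training for the second epoch ...
```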

## Ideas to improve on the full dataset:
    *   Using XLM-RoBERTa large instead of base could improve results.
    *   Adding a dense layer and a dropout layer to reduce overfitting (though my result shows 100% accuracy on the hold-out test set, so no overfitting is apparent there).
    *   Adding a convolutional layer after the encoder may work even better (a rough sketch follows this list).
    *   Combinations of different convolutional layers could be tried to check whether accuracy increases further.
    *   Hyperparameter tuning of these layers to get the best result.
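
A rough PyTorch sketch of the convolution-plus-dropout head idea above; this was not implemented for the released model, and the layer sizes are placeholder choices:

```python
import torch
import torch.nn as nn
from transformers import XLMRobertaModel

class ConvClassifierHead(nn.Module):
    """Encoder + 1-D convolution + dropout + dense head (illustrative only)."""

    def __init__(self, num_labels: int = 2, dropout: float = 0.3):
        super().__init__()
        self.encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")
        hidden = self.encoder.config.hidden_size          # 768 for the base model
        self.conv = nn.Conv1d(hidden, 256, kernel_size=3, padding=1)
        self.dropout = nn.Dropout(dropout)
        self.dense = nn.Linear(256, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                               # (batch, seq_len, hidden)
        x = self.conv(hidden_states.transpose(1, 2))      # (batch, 256, seq_len)
        x = torch.relu(x).max(dim=-1).values              # global max-pool over tokens
        return self.dense(self.dropout(x))                # (batch, num_labels)
```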