Dataset:

https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset

Write up:

Link to Hugging Face model:

https://huggingface.co/Sajib-006/fake_news_detection_xlmRoberta

Model Description:

* Used the pretrained XLM-RoBERTa base model.
* Added a classification layer on top of the pretrained encoder.
* For tokenization, I used a maximum sequence length of 512 tokens (the maximum the model can handle).
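
A minimal sketch of the setup described above, assuming the Hugging Face `transformers` library is used. The checkpoint name `xlm-roberta-base`, the two-label setup, and the helper names `load_tokenizer_and_model` and `encode` are assumptions for illustration and may differ from the actual training code.

```python
MODEL_NAME = "xlm-roberta-base"  # assumed checkpoint for "XLM-RoBERTa base"
MAX_LENGTH = 512                 # maximum sequence length used in this write-up
NUM_LABELS = 2                   # assumption: binary fake/real labels


def load_tokenizer_and_model():
    """Load the pretrained encoder with a newly initialised classification layer."""
    # Imported here so the heavy dependency is only needed when this runs.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=NUM_LABELS
    )
    return tokenizer, model


def encode(tokenizer, texts):
    """Tokenize a batch of article texts, truncating and padding to MAX_LENGTH."""
    return tokenizer(
        texts,
        truncation=True,       # cut articles longer than MAX_LENGTH tokens
        padding="max_length",  # pad shorter articles up to MAX_LENGTH
        max_length=MAX_LENGTH,
        return_tensors="pt",   # PyTorch tensors
    )
```

Truncating at 512 tokens means the end of long articles is dropped, which is a simple but lossy choice.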

Result:

* Using the BERT base uncased English model, accuracy was about 85% (on all samples).
* Using the XLM-RoBERTa base model, accuracy was almost 100% (on only 2k samples).
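
For reference, accuracy and F1-score for a binary classifier can be computed as in the sketch below. The function name `accuracy_and_f1` and the convention that label `1` is the positive class are assumptions for illustration, not taken from the original notebook.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels; `positive` marks the positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
    return accuracy, f1


# Toy labels, not results from the experiment.
acc, f1 = accuracy_and_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(f"accuracy={acc:.3f} f1={f1:.3f}")
```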

Limitations:

* Pretrained XLM-RoBERTa is a heavy model. Training it on the full dataset (44k+ samples) was not possible on the free version of Google Colab, so I used a small sample of 2k examples for my experiment.
* Accuracy and F1-score were almost 100% on the 2k-sample dataset, so I did not look for misclassified examples.
* The free version of Colab has RAM and disk restrictions and does not provide a GPU for long sessions, while training XLM-RoBERTa on the full dataset would take a very long time.
* Because a single epoch took a long time, I saved a checkpoint after the first epoch, then reloaded the weights and trained for a second epoch. After 2 epochs the model showed almost 100% accuracy, so I stopped training.
* Running on the full dataset would have given a clearer picture. I had some ideas for a better model but could not implement them because of the hardware restrictions mentioned above and time constraints. My ideas are given below.
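
The write-up does not say how the 2k examples were drawn. As one possible approach, the sketch below takes a class-balanced random sample from a list of `(text, label)` pairs. The function name `balanced_subsample`, the fixed seed, and the balancing itself are assumptions for illustration.

```python
import random
from collections import defaultdict


def balanced_subsample(examples, total=2000, seed=42):
    """Draw up to `total` examples, split evenly across the labels present.

    `examples` is a list of (text, label) pairs. A fixed seed keeps the
    sample reproducible across Colab sessions.
    """
    if not examples:
        raise ValueError("examples must not be empty")

    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    per_class = total // len(by_label)
    rng = random.Random(seed)
    sample = []
    for label in sorted(by_label):
        items = by_label[label]
        # Take fewer items if a class has less than per_class examples.
        sample.extend(rng.sample(items, min(per_class, len(items))))

    rng.shuffle(sample)  # avoid all examples of one class coming first
    return sample
```

Fixing the seed also makes it possible to resume from a saved checkpoint on exactly the same subset.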

Ideas to improve on full dataset:

*   Using XLM-RoBERTa large instead of base may improve results.
*   Adding a dense layer and a dropout layer to reduce overfitting (though my result shows almost 100% accuracy on the hold-out test set, so there seems to be no overfitting).
*   Adding a convolutional layer after the encoder may work even better.
*   A combination of more complex convolutional layers could be added to check whether accuracy increases further.
*   Hyperparameter tuning of the layers to get the best result.
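
As a rough illustration of the convolution, dense, and dropout ideas above, here is a sketch of a classification head in PyTorch that could sit on top of the encoder's token-level outputs. These ideas were not implemented in this experiment, so the class name `ConvClassifierHead`, the layer sizes, the kernel size, and the max-pooling step are all assumptions; only the hidden size of 768 corresponds to XLM-RoBERTa base.

```python
import torch
from torch import nn


class ConvClassifierHead(nn.Module):
    """Conv1d over token embeddings, masked max-pooling, then dense + dropout."""

    def __init__(self, hidden_size=768, conv_channels=128, kernel_size=3,
                 dropout=0.1, num_labels=2):
        super().__init__()
        # An odd kernel_size with this padding keeps the sequence length unchanged.
        self.conv = nn.Conv1d(hidden_size, conv_channels, kernel_size,
                              padding=kernel_size // 2)
        self.dense = nn.Linear(conv_channels, conv_channels)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(conv_channels, num_labels)

    def forward(self, hidden_states, attention_mask):
        # hidden_states: (batch, seq_len, hidden_size) from the encoder
        # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding;
        # each row is assumed to contain at least one real token.
        x = torch.relu(self.conv(hidden_states.transpose(1, 2)))  # (batch, channels, seq_len)
        mask = attention_mask.unsqueeze(1).bool()                 # (batch, 1, seq_len)
        x = x.masked_fill(~mask, float("-inf"))                   # ignore padded positions
        x = x.max(dim=2).values                                   # (batch, channels)
        x = self.dropout(torch.relu(self.dense(x)))
        return self.out(x)                                        # (batch, num_labels) logits


# Shape check with random tensors standing in for encoder outputs.
head = ConvClassifierHead()
logits = head(torch.randn(2, 16, 768), torch.ones(2, 16, dtype=torch.long))
print(logits.shape)
```

The kernel size, channel count, and dropout rate are the kind of hyperparameters that would need tuning on the full dataset.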