|
--- |
|
datasets: |
|
- thehamkercat/telegram-spam-ham |
|
- ucirvine/sms_spam |
|
- SetFit/enron_spam |
|
base_model: |
|
- FacebookAI/roberta-base |
|
pipeline_tag: text-classification |
|
license: mit |
|
language: |
|
- en |
|
metrics: |
|
- accuracy |
|
model-index:

- name: roBERTa-spam-detection

  results:

  - task:

      type: text-classification

    dataset:

      name: ucirvine/sms_spam

      type: ucirvine/sms_spam

    metrics:

    - name: Accuracy

      type: accuracy

      value: 0.9503

    source:

      name: Validation via the ucirvine/sms_spam dataset in Google Colab
|
library_name: transformers |
|
--- |
|
# Is Spam All We Need? A RoBERTa-Based Approach to Spam Detection
|
## Intro |
|
This is largely inspired by mshenoda's roberta-spam Hugging Face model (https://huggingface.co/mshenoda/roberta-spam).
|
|
|
However, instead of fine-tuning on all the data sources the original author used, I fine-tuned only on the Telegram and Enron spam/ham datasets. The idea was to diversify the data sources, prevent overfitting to the original distribution, and have a fun exploratory NLP experiment. Fine-tuning was done by replicating the sentiment analysis Google Colab example linked from the RoBERTa resources page (https://huggingface.co/docs/transformers/main/en/model_doc/roberta#resources).
|
|
|
**NOTE**: This was done for an interview project, so if you find this by chance... hopefully it helps you too, but know there are **definitely** better resources out there... and that this was done in the span of one evening.
|
|
|
## Metrics |
|
**Accuracy**: 0.9503 |
|
Thrilling, I know; I also just got the chills, especially since my performance is arguably worse than the original author's.
|
|
|
Granted, I only ran it for one epoch, and the data is drawn from different distributions. I'm sure it would've been more "accurate" had I just trained on the SMS data, but diversity is good. And it's fun to see how these choices impact the final result!
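
If you want to sanity-check that number, here's a minimal sketch of re-scoring the model on the SMS set (the real evaluation lives in the notebook linked at the bottom; the column names and the `LABEL_n` naming below are assumptions):

```python
from datasets import load_dataset
from transformers import pipeline

# ucirvine/sms_spam ships as a single "train" split; 0 = ham, 1 = spam.
sms = load_dataset("ucirvine/sms_spam", split="train")
clf = pipeline("text-classification", model="ggrizzly/roBERTa-spam-detection")

# Assumes the default LABEL_0 / LABEL_1 output names; map them back to 0 / 1.
preds = clf(sms["sms"], truncation=True)
pred_ids = [int(p["label"].split("_")[-1]) for p in preds]

accuracy = sum(p == y for p, y in zip(pred_ids, sms["label"])) / len(sms)
print(f"accuracy = {accuracy:.4f}")  # reported: 0.9503
```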
|
|
|
## Model Output |
|
- 0 is ham |
|
- 1 is spam |
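
For example, a quick way to query the model with the `transformers` pipeline (the output label names assume the default `LABEL_n` convention):

```python
from transformers import pipeline

clf = pipeline("text-classification", model="ggrizzly/roBERTa-spam-detection")

print(clf("Hey, are we still on for lunch tomorrow?"))
print(clf("WINNER!! You have been selected for a free prize. Text CLAIM now!"))
# LABEL_0 -> ham, LABEL_1 -> spam
```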
|
|
|
## Dataset(s) |
|
|
|
The dataset is composed of messages labeled as ham or spam (0 or 1), merged from *two* data sources (a merge sketch follows the list):
|
|
|
1. Telegram Spam Ham: https://huggingface.co/datasets/thehamkercat/telegram-spam-ham/tree/main

2. Enron Spam: https://huggingface.co/datasets/SetFit/enron_spam/tree/main (only the message column and labels were used)
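
A rough sketch of that merge with the `datasets` library (the column names here are assumptions; check each dataset's viewer before running):

```python
from datasets import load_dataset, concatenate_datasets

telegram = load_dataset("thehamkercat/telegram-spam-ham", split="train")
enron = load_dataset("SetFit/enron_spam", split="train")

# Normalize both sources to (text, label) with 0 = ham and 1 = spam.
telegram = telegram.map(
    lambda row: {"text": row["text"], "label": int(row["text_type"] == "spam")},
    remove_columns=telegram.column_names,
)
enron = enron.map(
    lambda row: {"text": row["message"], "label": row["label"]},
    remove_columns=enron.column_names,
)

# (If the label dtypes differ, you may need to cast the features to match.)
train_ds = concatenate_datasets([telegram, enron]).shuffle(seed=42)
```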
|
|
|
The dataset used for testing came from the original Kaggle competition (as part of the interview project this was for):
|
|
|
1. SMS Spam Collection: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
|
|
|
## Dataset Class Distribution |
|
|
|
|        | Total | Training      | Testing     |
|:------:|:-----:|:-------------:|:-----------:|
| Counts | 59267 | 53693 (90.6%) | 5574 (9.4%) |
|
|
|
|          | Total | Spam          | Ham           | Set   | % of Set |
|:--------:|:-----:|:-------------:|:-------------:|:-----:|:--------:|
| Enron    | 33345 | 16852 (50.5%) | 16493 (49.5%) | Train | 62.1%    |
| Telegram | 20348 | 6011 (29.5%)  | 14337 (70.5%) | Train | 37.9%    |
| SMS      | 5574  | 747 (13.4%)   | 4827 (86.6%)  | Test  | 100%     |
|
|
|
|          | Distribution of number of characters per class label (100 bins) | Distribution of number of words per class label (100 bins) |
|:--------:|:----------------------------------------------------------------:|:-----------------------------------------------------------:|
| SMS |  |  |
| Enron (limiting a few outliers) |  |  |
| Telegram |  |  |
|
|
|
^ Note the tails; these are very interesting distributions. More importantly, it's good to see [Benford's law](https://en.wikipedia.org/wiki/Benford's_law) alive and well in these.
|
|
|
## Architecture |
|
The model is a fine-tuned roberta-base for binary text classification.
|
|
|
roberta-base: https://huggingface.co/roberta-base |
|
|
|
paper: https://arxiv.org/abs/1907.11692 |
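
For the curious, a minimal sketch of the fine-tuning setup (the hyperparameters below are assumptions, not the exact values from the notebook; `train_ds` is the merged dataset from the sketch in the Dataset(s) section):

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "FacebookAI/roberta-base",
    num_labels=2,  # 0 = ham, 1 = spam
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = train_ds.map(tokenize, batched=True)

# One epoch, matching the card; batch size and learning rate are guesses.
args = TrainingArguments(
    output_dir="roberta-spam",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
).train()
```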
|
|
|
## Code |
|
https://huggingface.co/ggrizzly/roBERTa-spam-detection/resolve/main/roberta_spam_classifier_fine_tuning_google_collab.ipynb |