mshenoda/roberta-spam

Text Classification


mshenoda committed on Jun 4, 2023

Commit 45ab82d (1 parent: 9e6d1e5)

Update README.md

Files changed (1)

README.md +4 -3


@@ -6,9 +6,10 @@ Spam messages frequently carry malicious links or phishing attempts posing signi
 ## Dataset
 The dataset is composed of messages labeled by ham or spam, merged from three data sources:
-1.	SMS Spam Collection https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
-2.	Telegram Spam Ham https://huggingface.co/datasets/thehamkercat/telegram-spam-ham/tree/main
-3.	Enron Spam:  https://huggingface.co/datasets/SetFit/enron_spam/tree/main (only used message column and labels)
 The prepare script for enron is available at https://github.com/mshenoda/roberta-spam/tree/main/data/enron.
 The data is split 80% train 10% validation, and 10% test sets; the scripts used to split and merge of the three data sources are available at: https://github.com/mshenoda/roberta-spam/tree/main/data/utils.

 ## Dataset
 The dataset is composed of messages labeled by ham or spam, merged from three data sources:
+1. SMS Spam Collection https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
+2. Telegram Spam Ham https://huggingface.co/datasets/thehamkercat/telegram-spam-ham/tree/main
+3. Enron Spam:  https://huggingface.co/datasets/SetFit/enron_spam/tree/main (only used message column and labels)
 The prepare script for enron is available at https://github.com/mshenoda/roberta-spam/tree/main/data/enron.
 The data is split 80% train 10% validation, and 10% test sets; the scripts used to split and merge of the three data sources are available at: https://github.com/mshenoda/roberta-spam/tree/main/data/utils.
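
The merge and 80/10/10 split described in the README can be sketched roughly as below. This is only an illustration, not the repository's actual scripts (those live under `data/utils`). The function names `merge_sources` and `split_80_10_10`, the `(text, label)` tuple layout, and the fixed seed are all assumptions made for the example.

```python
import random


def merge_sources(*sources):
    """Concatenate several lists of (text, label) rows into one list."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged


def split_80_10_10(rows, seed=0):
    """Shuffle rows deterministically, then cut 80% / 10% / 10%.

    Integer arithmetic avoids floating-point rounding surprises;
    any remainder rows end up in the test split.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 8 // 10
    n_val = n // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical stand-ins for the three sources (SMS, Telegram, Enron).
sms = [(f"sms message {i}", "ham") for i in range(50)]
telegram = [(f"telegram message {i}", "spam") for i in range(30)]
enron = [(f"enron message {i}", "ham") for i in range(20)]

all_rows = merge_sources(sms, telegram, enron)
train, val, test = split_80_10_10(all_rows)
print(len(train), len(val), len(test))
```

A stratified split, which keeps the ham/spam ratio equal across the three sets, may be preferable when the classes are imbalanced; the README does not say whether the original scripts stratify.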