# Spam Detection Project README

A spam detection project that fine-tunes DistilBERT to classify messages as spam or not spam and serves the trained model through an interactive Gradio interface.

## Table of Contents

- [Data Collection and Preprocessing](#data-collection-and-preprocessing)
  - [Mount Google Drive](#mount-google-drive)
  - [Install Required Libraries](#install-required-libraries)
  - [Load and Prepare Data](#load-and-prepare-data)
  - [Prepare Data Labels](#prepare-data-labels)
  - [Split Data](#split-data)
- [Model Building and Training](#model-building-and-training)
  - [Initialize Tokenizer](#initialize-tokenizer)
  - [Tokenize Data](#tokenize-data)
  - [Create TensorFlow Datasets](#create-tensorflow-datasets)
  - [Define Training Arguments](#define-training-arguments)
  - [Initialize and Train Model](#initialize-and-train-model)
- [Model Evaluation and Inference](#model-evaluation-and-inference)
  - [Evaluate Model](#evaluate-model)
  - [Save Trained Model](#save-trained-model)
- [Interactive Gradio Interface](#interactive-gradio-interface)
  - [Inference on Sample Text](#inference-on-sample-text)
  - [Create Gradio Interface](#create-gradio-interface)

## Data Collection and Preprocessing

### Mount Google Drive
Mount Google Drive in Google Colab so the notebook can read files stored there. The cell below is commented out; the rest of the notebook reads its data from `/content/`.

In [3]:
# from google.colab import drive
# drive.mount('/content/drive')

### Install Required Libraries
Install the required libraries (`datasets`, `transformers`, and `gradio`) with pip.

In [4]:
! pip install datasets transformers gradio

Collecting datasets
  Downloading datasets-2.14.4-py3-none-any.whl (519 kB)
Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting gradio
  Downloading gradio-3.40.1-py3-none-any.whl (20.0 MB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)

### Load and Prepare Data
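Read the data from three sources: a tab-separated Kaggle file (`kaggle.txt`), the Hugging Face dataset `Deysi/spam-detection-dataset`, and two CSV files (`train.csv` and `test.csv`). The Kaggle label `ham` is renamed to `not_spam`, and all sources are concatenated into a single DataFrame with `label` and `message` columns.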


In [5]:
import pandas as pd
from datasets import load_dataset

drive_data_path = "/content/"

def read_data():
    # Read Kaggle data
    df_kaggle = pd.read_csv(drive_data_path + 'kaggle.txt', sep='\t', names=["label", "message"])
    df_kaggle.loc[df_kaggle['label'] == 'ham', 'label'] = 'not_spam'

    # Load data from Hugging Face dataset
    data_ = load_dataset("Deysi/spam-detection-dataset")
    texts_train = [item['text'] for item in data_["train"]]
    labels_train = [item['label'] for item in data_["train"]]
    df_hugging_face_train = pd.DataFrame({'label': labels_train, 'message': texts_train})
    texts_test = [item['text'] for item in data_["test"]]
    labels_test = [item['label'] for item in data_["test"]]
    df_hugging_face_test = pd.DataFrame({'label': labels_test, 'message': texts_test})

    # Concatenate Hugging Face dataset train and test data
    df_hugging_face = pd.concat([df_hugging_face_train, df_hugging_face_test], ignore_index=True)

    # Read CSV file data
    df_csv_train = pd.read_csv(drive_data_path + "train.csv")
    df_csv_train = df_csv_train[['label', 'text']]
    df_csv_train = df_csv_train.rename(columns={'text': 'message'})
    df_csv_test = pd.read_csv(drive_data_path + "test.csv")
    df_csv_test = df_csv_test[['label', 'text']]
    df_csv_test = df_csv_test.rename(columns={'text': 'message'})

    # Concatenate all dataframes
    df = pd.concat([df_kaggle, df_hugging_face, df_csv_train, df_csv_test], ignore_index=True)

    return df

df = read_data()

# Display value counts of labels
print(df['label'].value_counts())

not_spam    15625
spam        11747
Name: label, dtype: int64


### Prepare Data Labels
Convert the string labels to binary values (1 for `spam`, 0 for `not_spam`). `pd.get_dummies` with `drop_first=True` drops the `not_spam` column and keeps only the `spam` indicator.

In [6]:
X, y = list(df['message']), list(df['label'])

In [7]:
y = list(pd.get_dummies(y, drop_first=True)['spam'].astype(int))  # cast so labels are 0/1 even if pandas returns booleans
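
The same encoding can also be written as an explicit mapping, which makes the meaning of each class id visible and fails loudly on an unexpected label. This is a minimal sketch rather than the notebook's method; `LABEL_TO_ID`, `encode_labels`, and the toy labels are hypothetical names introduced for illustration.

```python
# Hypothetical toy labels standing in for df['label'].
labels = ["not_spam", "spam", "spam", "not_spam"]

# Explicit class ids: 1 means spam, 0 means not spam.
LABEL_TO_ID = {"not_spam": 0, "spam": 1}


def encode_labels(labels):
    """Map string labels to integer class ids; raises KeyError on an unknown label."""
    return [LABEL_TO_ID[label] for label in labels]


y_toy = encode_labels(labels)
print(y_toy)  # [0, 1, 1, 0]
```
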

### Split Data
Split the data into training (80%) and test (20%) sets with `train_test_split` from scikit-learn, using a fixed `random_state` for reproducibility.

In [8]:
from sklearn.model_selection import train_test_split
print("X,y shape: ",len(X),len(y))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

X,y shape:  27372 27372
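
As a rough check on what the split produces, the sizes of the two sets can be worked out by hand. The sketch below assumes the test portion is rounded up to a whole number of examples, which is my understanding of how scikit-learn treats a fractional `test_size`; `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_train, n_test = split_sizes(27372, 0.20)
print(n_train, n_test)  # 21897 5475
```
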


## Model Building and Training

### Initialize Tokenizer
Load the pretrained `distilbert-base-uncased` fast tokenizer from the Hugging Face Transformers library.

In [9]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

### Tokenize Data
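Tokenize the training and test texts with the tokenizer. `truncation=True` cuts texts that exceed the model's maximum input length, and `padding=True` pads each batch to the length of its longest sequence.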

In [10]:
train_encodings = tokenizer(X_train, truncation=True, padding=True)
test_encodings = tokenizer(X_test, truncation=True, padding=True)
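
To make truncation and padding concrete, here is a toy sketch that uses a whitespace "tokenizer" instead of DistilBERT's WordPiece vocabulary. Everything in it (`toy_encode`, `max_length=6`, the ids) is a hypothetical stand-in; only the shape of the result, parallel `input_ids` and `attention_mask` lists, mirrors what the real tokenizer returns.

```python
def toy_encode(texts, max_length=6, pad_id=0):
    """Whitespace tokenizer that truncates to max_length and pads to the longest sequence."""
    vocab = {}
    sequences = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            # Ids start at 1 so that 0 stays reserved for padding.
            ids.append(vocab.setdefault(word, len(vocab) + 1))
        sequences.append(ids[:max_length])  # truncation

    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * pad)             # padding
        attention_mask.append([1] * len(seq) + [0] * pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = toy_encode(["win a free prize now", "hi there"])
print(batch["input_ids"])       # [[1, 2, 3, 4, 5], [6, 7, 0, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 0, 0, 0]]
```

The attention mask is what tells the model to ignore the padded positions.
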

### Create TensorFlow Datasets
Build `tf.data.Dataset` objects from the tokenized encodings and the labels, so that each element is a pair of (feature dictionary, label).

In [11]:
import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    y_train
))

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    y_test
))
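
Conceptually, slicing a (features, labels) tuple walks along the first dimension and yields one example at a time. The plain-Python sketch below imitates that behaviour for illustration only; `slice_examples` and the token ids are hypothetical, and the real `tf.data.Dataset` yields tensors rather than lists.

```python
def slice_examples(features, labels):
    """Yield one (feature_dict, label) pair per row, similar in spirit to slicing tensors."""
    for i, label in enumerate(labels):
        yield {name: column[i] for name, column in features.items()}, label


# Hypothetical encodings for two already-tokenized messages.
features = {
    "input_ids": [[101, 2000, 102], [101, 3000, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
labels = [1, 0]

examples = list(slice_examples(features, labels))
print(examples[0])
# ({'input_ids': [101, 2000, 102], 'attention_mask': [1, 1, 1]}, 1)
```
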

### Define Training Arguments
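Define the training arguments for `TFTrainer` from the Hugging Face Transformers library: 2 epochs, a training batch size of 8, 500 warmup steps, and an evaluation every 500 steps.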

In [12]:
from transformers import TFDistilBertForSequenceClassification, TFTrainer, TFTrainingArguments

# Define training arguments
training_args = TFTrainingArguments(
    output_dir='./results',                    # Directory to save model checkpoints and results
    num_train_epochs=2,                         # Number of training epochs
    per_device_train_batch_size=8,              # Batch size for training
    per_device_eval_batch_size=16,              # Batch size for evaluation
    warmup_steps=500,                           # Number of warmup steps for learning rate scheduling
    weight_decay=0.01,                          # Weight decay for regularization
    logging_dir='./logs',                       # Directory for storing logs
    logging_steps=10,                           # Log every specified number of steps
    evaluation_strategy="steps",                # Evaluation strategy ("steps" or "epoch")
    eval_steps=500,                             # Number of steps between evaluations
    save_total_limit=1,                         # Limit the number of checkpoints saved
    metric_for_best_model="eval_accuracy",      # Metric for saving the best model checkpoint
)
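
It can help to translate these arguments into step counts. The sketch below assumes roughly 21,897 training examples (27,372 rows minus a 20% test split rounded up), a single device, and a final partial batch that is kept; `training_schedule` is a hypothetical helper, and the trainer's own step accounting may differ slightly.

```python
import math


def training_schedule(n_train, batch_size, epochs, eval_steps, warmup_steps):
    """Estimate step counts implied by the training arguments (single device assumed)."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "evaluations": total_steps // eval_steps,
        "warmup_fraction": warmup_steps / total_steps,
    }


schedule = training_schedule(
    n_train=21897, batch_size=8, epochs=2, eval_steps=500, warmup_steps=500
)
print(schedule["steps_per_epoch"], schedule["total_steps"])  # 2738 5476
print(schedule["evaluations"])                               # 10
print(round(schedule["warmup_fraction"], 3))                 # 0.091
```

Under these assumptions, warmup covers about 9% of training, and the model is evaluated about ten times.
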

### Initialize and Train Model
Load the pretrained DistilBERT sequence-classification model inside the distribution strategy scope, create the `TFTrainer`, and train the model.

In [15]:
# Create the DistilBERT model within the strategy scope
with training_args.strategy.scope():
    model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Set the eval_steps value
training_args.eval_steps = 500

# Initialize the Trainer for training
trainer = TFTrainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset             # evaluation dataset
)

# Train the model
trainer.train()

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

## Model Evaluation and Inference

### Evaluate Model
Evaluate the trained model on the test dataset. The output reports only the evaluation loss.

In [16]:
trainer.evaluate(test_dataset)

{'eval_loss': 0.007443295970950113}
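
Because the loss alone says little about how often the classifier is right, classification metrics are a natural complement. The sketch below computes accuracy, precision, and recall from lists of true and predicted class ids; `binary_metrics` and the toy values are hypothetical, not results from this model.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels where 1 is the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels and predictions for six messages.
y_true_toy = [1, 0, 1, 1, 0, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1]

metrics = binary_metrics(y_true_toy, y_pred_toy)
print({name: round(value, 3) for name, value in metrics.items()})
# {'accuracy': 0.667, 'precision': 0.667, 'recall': 0.667}
```
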

### Save Trained Model
Save the trained model to the `spam_detection_model` directory.

In [17]:
trainer.save_model('spam_detection_model')

## Interactive Gradio Interface

### Inference on Sample Text
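Run inference on a single sample text with the trained model: tokenize the text, convert the logits to probabilities with softmax, and map the highest-probability class to a readable label.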

In [18]:
# Sample text you want to classify
sample_text = "Hi there, how's it going?"

# Preprocess the sample text using the tokenizer
sample_encodings = tokenizer(sample_text, truncation=True, padding=True, return_tensors="tf")

# Perform inference
with training_args.strategy.scope():
    logits = model(sample_encodings.input_ids).logits

# Convert logits to probabilities using softmax
probabilities = tf.nn.softmax(logits, axis=-1)

# Get the predicted class
predicted_class = tf.argmax(probabilities, axis=-1).numpy()[0]

# Map the predicted class to label
label_mapping = {0: "No need to worry, Not a spam message.", 1: "This message has been identified as spam."}
predicted_label = label_mapping[predicted_class]

print("Sample Message:", sample_text)
print("Predicted Label:", predicted_label)


Sample Message: Hi there, how's it going?
Predicted Label: No need to worry, Not a spam message.
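
The softmax and argmax steps can be reproduced in plain Python to see what they do to a pair of logits. The logits below are hypothetical values chosen for illustration, not output from the trained model.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max keeps exp() numerically stable."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, -1.0]  # hypothetical scores for [not spam, spam]
probs = softmax(toy_logits)

# argmax: index of the largest probability.
predicted = max(range(len(probs)), key=probs.__getitem__)

print(round(probs[0], 4), round(probs[1], 4))  # 0.9526 0.0474
print(predicted)                               # 0
```
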


### Create Gradio Interface
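Build a Gradio interface for interactive spam detection. The `process_input` function returns the class probabilities for a `Label` component and an HTML message for the predicted class.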

In [22]:
import gradio as gr
import tensorflow as tf
import random

def process_input(text):
    # Preprocess the sample text using the tokenizer
    encodings = tokenizer(text, truncation=True, padding=True, return_tensors="tf")

    # Perform inference
    logits = model(encodings.input_ids).logits

    # Convert logits to probabilities using softmax
    probabilities = tf.nn.softmax(logits, axis=-1)

    # Get the predicted class
    predicted_class = tf.argmax(probabilities, axis=-1).numpy()[0]

    # Map the predicted class to label
    label_mapping = {
        0: '<b><div style="font-size:16px; text-align:center;">No need to worry, Not a spam message.</div></b>',
        1: '<b><div style="font-size:16px; color:#ff3b5c; text-align:center;">Warning⚠️: This message has been identified as spam.</div></b>',
    }
    predicted_label = label_mapping[predicted_class]

    return [
        {
            "Spam": float(probabilities.numpy()[0][1]),
            "Not a Spam": float(probabilities.numpy()[0][0]),
        },
        predicted_label,
    ]


# Define the Gradio interface
title = "Spam Detector⚠️"
examples = [
    "Dear Customer, Your account has been compromised. Click the link below to verify your account details immediately or risk suspension. **(Example 1)**",
    "You've been selected as the lucky winner of our international sweepstakes! To claim your prize, reply with your full name, address, and bank details. <font color='blue' style='background-color: lightgray;'>(Example 2)</font>",
    "Congratulations! You've won a free iPhone X. Click the link to claim your prize.",
    "URGENT: Your bank account has been compromised. Click here to reset your password.",
    "Get rich quick! Invest in our exclusive program and earn thousands overnight.",
    "Your prescription refill is ready for pickup at your local pharmacy. Visit us at your convenience",
    "Reminder: Your monthly utility bill is due on August 20th. Please make the payment.",
    "You've been selected as the lucky winner of a million-dollar lottery. Reply to claim.",
    "Limited time offer: Double your money with our amazing investment opportunity.",
    "Hi, just checking in to see how you're doing. Let's catch up soon.",
    "Reminder: Your dentist appointment is scheduled for tomorrow at 2 PM.",
    "Invitation: Join us for a webinar on digital marketing strategies. Register now!",
    "Your application for the scholarship has been reviewed. We're pleased to inform you that you've been selected.",
    "Hi there! Just wanted to check in and see how you're doing.",
    "Reminder: Your friend's birthday is coming up. Don't forget to send them a message.",
    "Thank you for your purchase. Your order has been successfully processed.",
    "Your monthly newsletter is here! Stay updated with the latest news and updates.",
    "Invitation: Join us for a community clean-up event this weekend. Let's make a difference together.",
    "Reminder: Your scheduled appointment is tomorrow. We look forward to seeing you.",
    "Good news! You've earned a reward for your loyalty. Check your account for details.",
    "Your recent transaction has been approved. Please keep this email for your records.",
    "Exciting announcement: Our new store location is now open. Visit us and receive a special discount.",
    "Welcome to our online community! Here's how to get started and connect with others.",
    "Your request has been received and is being processed. We'll update you with the status soon.",
    "Upcoming event: Join us for a free cooking class this Saturday. Learn new recipes and techniques.",
    "Reminder: Don't forget to vote in the upcoming election. Your voice matters.",
    "Join our book club and dive into a world of fascinating stories. Here's how to join.",
]


# Create Gradio components
input_text = gr.Textbox(
    lines=3, label="Enter the SMS/Message/Email you received", autofocus=True
)
output_text = gr.HTML("", label="Output")
probabilities_text = gr.Label("", label="Probabilities")

random.shuffle(examples)

# Initialize the Gradio interface
model_gui = gr.Interface(
    fn=process_input,
    inputs=input_text,
    outputs=[probabilities_text, output_text],
    title=title,
    examples=examples,
    interpretation="default",
    theme="shivi/calm_seafoam",
    css="""*{font-family:'IBM Plex Mono';}""",
    examples_per_page=15,
)

# Launch the Gradio interface
model_gui.launch()

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Note: opening Chrome Inspector may crash demo inside Colab notebooks.

To create a public link, set `share=True` in `launch()`.


In [23]:
model_gui.close()

Closing server running on port: 7862
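
One way to check the interface function's output without launching the UI is to separate the formatting step from the model call. The sketch below is an assumption about how that could be organised, not the notebook's code; `format_outputs` and the two message constants are hypothetical, and the HTML styling is omitted for brevity.

```python
NOT_SPAM_MESSAGE = "No need to worry, Not a spam message."
SPAM_MESSAGE = "Warning: This message has been identified as spam."


def format_outputs(p_not_spam, p_spam):
    """Build the [label scores, message] pair in the shape the interface function returns."""
    scores = {"Spam": float(p_spam), "Not a Spam": float(p_not_spam)}
    # Ties resolve to "not spam", matching argmax over [not spam, spam].
    message = SPAM_MESSAGE if p_spam > p_not_spam else NOT_SPAM_MESSAGE
    return [scores, message]


scores, message = format_outputs(p_not_spam=0.03, p_spam=0.97)
print(scores)   # {'Spam': 0.97, 'Not a Spam': 0.03}
print(message)  # Warning: This message has been identified as spam.
```
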
