File size: 9,444 Bytes
35699f3
 
 
 
 
 
 
 
 
 
 
 
dd6136b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
---
title: Spam Detector
emoji: 🦀
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 3.40.1
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Spam Detection Project

Welcome to the Spam Detection project! This document provides a comprehensive guide to building a robust spam detection model using the DistilBERT model. This project covers the entire pipeline, from data collection and preprocessing to model building, training, evaluation, and the creation of an interactive Gradio interface for users to classify messages as spam or not.

# Table of Contents
- [The Project](#the-project)
  - [Project Dependencies](#project-dependencies)
  - [Datasets](#datasets)
  - [Project Structure](#project-structure)
    - [Data Collection and Preprocessing](#data-collection-and-preprocessing)
    - [Model Building and Training](#model-building-and-training)
    - [Model Evaluation and Inference](#model-evaluation-and-inference)
    - [Interactive Gradio Interface](#interactive-gradio-interface)
- [The Application](#the-application)
  - [App Dependencies](#app-dependencies)
  - [Usage](#usage)
  - [Overview](#overview)
  - [Deployment](#deployment)
---

# The Project [Training and Fine tuning on Pre Trained LLM with Transformers] - [(Jupyter Notebook)](./Spam_detection.ipynb)

## Project Dependencies

To run this project, you'll need the following libraries installed(anyway they are mentioned in the jupyter notebook):

- `datasets` (preferred version: 2.14.4)
- `transformers` (preferred version: 4.31.0)
- `gradio` (preferred version: 3.40.1)
- `pandas` (preferred version: 1.5.3)
- `scikit-learn` (preferred version: 1.2.2)
- `tensorflow` (preferred version: 2.12.0)

If needed, You can create a virtual environment(Kernel):
- In Windows powershell,
```bash
python -m venv myenv
.\myenv\Scripts\Activate
```

- In Linux,
```bash
python3 -m venv myenv
source myenv/bin/activate
```

You can install these libraries using the following command:

```bash
pip install datasets transformers gradio pandas scikit-learn tensorflow
```

## Datasets

1. **Downloaded SMS Spam Collection Dataset:** A dataset comprising 5572 rows was obtained from Kaggle. This dataset can be found [here](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset).

2. **Utilized the `datasets` Module:** Using the `datasets` module, the "Deysi/spam-detection-dataset" was loaded, resulting in a spam detection dataset with 10900 rows. This dataset is available [here](https://huggingface.co/datasets/Deysi/spam-detection-dataset) in HuggingFace.

3. **Downloaded and Uploaded Spam Detector Dataset:** The Spam Detector dataset, downloadable from [here(Hugging Face)](https://huggingface.co/datasets/0x7194633/spam_detector), was acquired. Subsequently, the CSV dataset was uploaded to a drive. By mounting the drive, access to the dataset, containing 10900 rows, was established.

**Ultimately, these datasets were merged:** After converting them into pandas Dataframes, the datasets were combined into a unified DataFrame.

## Project Structure

The Spam Detection project is organized into several key phases:

### Data Collection and Preprocessing

1. **Mounting Google Drive:** We begin by mounting Google Drive to access files and data stored in your drive.

2. **Installing Required Libraries:** We install essential libraries `datasets`, `transformers`, and `gradio` using pip.

3. **Loading and Preparing Data:** Data is sourced from various repositories, including Kaggle, Hugging Face datasets, and CSV(from hugging face) files. The data is then combined into a single DataFrame to facilitate further processing.

4. **Preparing Data Labels:** Data labels are converted into binary values (`0` or `1`) using one-hot encoding.

5. **Splitting Data:** The dataset is divided into training and testing sets using the `train_test_split` function from scikit-learn.

### Model Building and Training

6. **Initializing Tokenizer:** The DistilBERT tokenizer from Hugging Face is initialized to tokenize the text data.

7. **Tokenizing Data:** Both training and testing data are tokenized using the previously initialized tokenizer.

8. **Creating TensorFlow Datasets:** TensorFlow datasets are created using the tokenized encodings and labels.

9. **Defining Training Arguments:** Training arguments for the `TFTrainer` from Hugging Face are defined. These include the number of epochs, batch sizes, evaluation strategy, and more.

10. **Initializing and Training Model:** Within the TensorFlow strategy scope, the DistilBERT model is created and initialized. The model is then trained using the `TFTrainer`, and a Progress Bar callback is added for monitoring.

### Model Evaluation and Inference

11. **Evaluating Model:** The trained model's performance is evaluated using the testing dataset.

12. **Saving Trained Model:** The trained model is saved in .h5 format as [spam_detection_model](./spam_detection_model/)

### Interactive Gradio Interface

13. **Inference on Sample Text:** The trained model is used to perform inference on a sample text.

14. **Creating Gradio Interface:** An interactive Gradio interface is created using the trained model. Users can input messages, and the model predicts whether the message is *spam* or not using the `process_input` core function.

---

# The Application

## App Dependencies

To run this application, you'll need the following libraries installed:

- `transformers` (preferred version: 4.31.0)
- `gradio` (preferred version: 3.40.1)
- `tensorflow` (preferred version: 2.12.0)

If needed, You can create a virtual environment(Kernel):
- In Windows powershell,
```bash
python -m venv spam_detection_virtual_environment
.\spam_detection_virtual_environment\Scripts\Activate
```

- In Linux,
```bash
python3 -m venv spam_detection_virtual_environment
source spam_detection_virtual_environment/bin/activate
```

You can install these libraries using the following command:

```bash
pip install transformers gradio tensorflow
```

## Usage

To utilize the Spam Detection Application:

1. Make sure that specified dependencies are installed
2. Run 
```bash
python app.py
```
3. Navigate to the URL of the Gradio application(will be printed in terminal)
4. Optional: append `'/?__theme=dark'` to the URL for dark theme
5. Input a SMS/Email/Message(you can also enter example by clicking on that) and submit it, You will get the corresponding output in the application
6. After completion, Interrupt the terminal by clicking 'Ctrc + C' to terminate the process from the particular port or close the application.


## Overview

The code follows a structured flow, starting with loading the model and tokenizer, defining the process_input function, creating the Gradio interface, and finally launching the interface for interactive spam detection.

1. **Import Libraries:**
   - Import necessary libraries for building the application, including TensorFlow, Transformers, and Gradio.

2. **Load Model and Tokenizer:**
   - Load the pre-trained DistilBERT model for sequence classification (`TFDistilBertForSequenceClassification`) from Hugging Face's Transformers library.
   - Load the tokenizer (`DistilBertTokenizerFast`) to preprocess input text.

3. **Define `process_input` Function:**
   - Define a function named `process_input` that takes an input text and performs the following steps:
     - Preprocess the text using the tokenizer.
     - Perform inference using the loaded model to obtain logits.
     - Convert logits to probabilities using softmax.
     - Determine the predicted class (spam or not spam) and map it to a label.
     - Return the predicted label and probabilities.

4. **Define Gradio Interface:**
   - Define the Gradio interface:
     - Set the title of the interface.
     - Create a list of example messages for testing the model.
     - Create Gradio components for input (`input_text`) and output (`probabilities_text`, `output_text`).
     - Shuffle the examples to show random samples.

5. **Initialize and Launch Interface:**
   - Initialize the Gradio interface using the `gr.Interface` class:
     - Provide the process_input function (`fn` parameter).
     - Specify the input and output components.
     - Set the title, examples, interpretation mode, theme, CSS, and examples per page.
   - Launch the Gradio interface using the `launch` method.
   - Optionally, you can print a message with instructions to access the dark theme.

## Deployment

- Deployed the Gradio Application in hugging face platform.
- You can interact with my application [here](https://huggingface.co/spaces/lostUchiha/spam-detector)

# Unique Challenges faced

- Searched a lot on google for resources gathered from different platforms(Kaggle, Hugging face) and merged all the datasets to gether using the `read_data()` function.
- Trained the pre-trained model using the this large dataset took upto 1hr.
- 

## Acknowledgments

This project leverages the capabilities of the DistilBERT model from Hugging Face, combined with the user-friendly Gradio library for creating an interactive interface.

For more information or inquiries, please feel free to reach out to the project author, Ravindra Mohith, at ravindramohith@gmail.com.

---

Customize and enhance this README with additional details, examples, and references to make it a comprehensive guide to your Spam Detection project.