|
--- |
|
license: creativeml-openrail-m |
|
datasets: |
|
- prithivMLmods/Spam-Text-Detect-Analysis |
|
language: |
|
- en |
|
base_model: |
|
- google-bert/bert-base-uncased |
|
pipeline_tag: text-classification |
|
library_name: transformers |
|
--- |
|
### **SPAM DETECTION UNCASED [ SPAM / HAM ]** |
|
|
|
## **Overview** |
|
|
|
This project implements a spam detection model using the **BERT (Bidirectional Encoder Representations from Transformers)** architecture and leverages **Weights & Biases (wandb)** for experiment tracking. The model is trained and evaluated using the [prithivMLmods/Spam-Text-Detect-Analysis](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis) dataset from Hugging Face. |
|
|
|
--- |
|
|
|
## **Requirements**
|
|
|
- Python 3.x |
|
- PyTorch |
|
- Transformers |
|
- Datasets |
|
- Weights & Biases |
|
- Scikit-learn |
|
|
|
--- |
|
|
|
### **Install Dependencies** |
|
|
|
You can install the required dependencies with the following: |
|
|
|
```bash |
|
pip install transformers datasets wandb scikit-learn |
|
``` |
|
|
|
--- |
|
|
|
## **Model Training**
|
|
|
### **Model Architecture** |
|
The model uses **BERT for sequence classification** (a minimal loading sketch follows this list):
|
- Pre-trained Model: `bert-base-uncased` |
|
- Task: Binary classification (Spam / Ham) |
|
- Optimization: Cross-entropy loss |
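
As a reference, a minimal sketch of how such a model can be instantiated with the Transformers library; the label mapping below is an assumption, not a value taken from the original training script:

```python
from transformers import BertForSequenceClassification, BertTokenizerFast

# Pre-trained uncased BERT encoder with a freshly initialized 2-way classification head.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,                    # binary task: ham vs. spam
    id2label={0: "ham", 1: "spam"},  # assumed label mapping
    label2id={"ham": 0, "spam": 1},
)
```

When labels are passed to this model, it computes the cross-entropy loss internally, which matches the optimization objective listed above.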
|
|
|
--- |
|
|
|
### **Training Arguments** |
|
- **Learning rate:** `2e-5` |
|
- **Batch size:** 16 |
|
- **Epochs:** 3 |
|
- **Evaluation strategy:** at the end of every epoch (see the configuration sketch below)
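
A hedged sketch of how these hyperparameters map onto `TrainingArguments`; the `output_dir` and the wandb reporting flag are assumptions rather than values from the original script:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="results",              # assumed output directory
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",       # evaluate at the end of every epoch
    logging_strategy="epoch",
    report_to="wandb",                 # stream metrics to Weights & Biases
)
```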
|
|
|
--- |
|
|
|
## **Dataset**
|
|
|
The model is trained on the **Spam Text Detection Dataset**, available on [Hugging Face Datasets](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis).
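
For reference, a minimal sketch of loading and tokenizing the dataset with the `datasets` library; the `"Message"` text column and the tokenization settings are assumptions, since the original preprocessing script is not shown:

```python
from datasets import load_dataset
from transformers import BertTokenizerFast

# Download the spam/ham dataset from the Hugging Face Hub.
dataset = load_dataset("prithivMLmods/Spam-Text-Detect-Analysis")

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # "Message" is an assumed column name; adjust it to the actual schema.
    return tokenizer(batch["Message"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)
```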
|
|
|
--- |
|
|
|
## **Instructions**
|
|
|
### Clone and Set Up |
|
Clone the repository, if applicable: |
|
|
|
```bash |
|
git clone <repository-url> |
|
cd <project-directory> |
|
``` |
|
|
|
Ensure dependencies are installed with: |
|
|
|
```bash |
|
pip install -r requirements.txt |
|
``` |
|
|
|
--- |
|
|
|
### Train the Model |
|
After installing dependencies, you can train the model using: |
|
|
|
```python |
|
from train import main  # assuming the training loop is implemented in `train.py`

main()
|
``` |
|
|
|
Replace `train.py` with your script's entry point. |
|
|
|
--- |
|
|
|
## **Weights & Biases Integration**
|
|
|
We use **Weights & Biases** for: |
|
- Real-time logging of training and evaluation metrics. |
|
- Tracking experiments. |
|
- Monitoring evaluation loss, precision, recall, and accuracy. |
|
|
|
Set up wandb by initializing a run at the start of the script:
|
|
|
```python |
|
import wandb |
|
wandb.init(project="spam-detection") |
|
``` |
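
If training is run through the `Trainer` API, setting `report_to="wandb"` in `TrainingArguments` (as in the configuration sketch above) forwards the same per-epoch metrics to this run automatically.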
|
|
|
--- |
|
|
|
## **Metrics**
|
|
|
The following metrics were logged (a sketch of the metric computation follows the list):
|
|
|
- **Accuracy:** Final validation accuracy. |
|
- **Precision:** Fraction of predicted positive cases that were truly positive. |
|
- **Recall:** Fraction of actual positive cases that were correctly identified.
|
- **F1 Score:** Harmonic mean of precision and recall. |
|
- **Evaluation Loss:** Loss on the validation split.
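
A hedged sketch of a `compute_metrics` function that produces these values with scikit-learn; the exact implementation used during training may differ:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    # eval_pred is the (logits, labels) pair handed over by the Trainer.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```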
|
|
|
--- |
|
|
|
## **Results**
|
|
|
Fine-tuning BERT on the dataset above yielded the following validation metrics (a small inference sketch follows the numbers):
|
|
|
- **Validation Accuracy:** `0.9937` |
|
- **Precision:** `0.9931` |
|
- **Recall:** `0.9597` |
|
- **F1 Score:** `0.9761` |
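
To sanity-check the model on new messages, a minimal inference sketch using the Transformers `pipeline` API; the checkpoint path is hypothetical and should point at a fine-tuned checkpoint (for example one saved under `model/`):

```python
from transformers import pipeline

# "model/checkpoint-final" is a hypothetical path; replace it with a real checkpoint.
classifier = pipeline("text-classification", model="model/checkpoint-final")

print(classifier("Congratulations! You have won a free prize. Call now!"))
# -> e.g. [{"label": "spam", "score": 0.99}]  (label names depend on the saved config)
```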
|
|
|
--- |
|
|
|
## **Files and Directories**
|
|
|
- `model/`: Contains trained model checkpoints. |
|
- `data/`: Scripts for processing datasets. |
|
- `wandb/`: All logged artifacts from Weights & Biases runs. |
|
- `results/`: Training and evaluation results are saved here. |
|
|
|
--- |
|
|
|
## **Acknowledgements**
|
|
|
Dataset Source: [Spam-Text-Detect-Analysis on Hugging Face](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis) |
|
Model: **BERT for sequence classification** from Hugging Face Transformers. |
|
|
|
--- |