---
license: creativeml-openrail-m
datasets:
- prithivMLmods/Spam-Text-Detect-Analysis
language:
- en
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
library_name: transformers
---
# **SPAM DETECTION UNCASED [ SPAM / HAM ]**
## **Overview**
This project implements a spam detection model using the **BERT (Bidirectional Encoder Representations from Transformers)** architecture and leverages **Weights & Biases (wandb)** for experiment tracking. The model is trained and evaluated using the [prithivMLmods/Spam-Text-Detect-Analysis](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis) dataset from Hugging Face.
---
## **πŸ› οΈ Requirements**
- Python 3.x
- PyTorch
- Transformers
- Datasets
- Weights & Biases
- Scikit-learn
---
### **Install Dependencies**
You can install the required dependencies with the following:
```bash
pip install transformers datasets wandb scikit-learn
```
---
## **πŸ“ˆ Model Training**
### **Model Architecture**
The model uses **BERT for sequence classification**:
- Pre-trained Model: `bert-base-uncased`
- Task: Binary classification (Spam / Ham)
- Optimization: Cross-entropy loss
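The setup above can be sketched with Hugging Face Transformers. This is a minimal sketch, assuming the `ham`/`spam` label mapping (0 = ham, 1 = spam); it builds the architecture locally for illustration, while a real run would use `from_pretrained("bert-base-uncased")` to load the pre-trained weights:

```python
from transformers import BertConfig, BertForSequenceClassification

# Binary classification head on top of BERT; label mapping is an assumption.
config = BertConfig(
    num_labels=2,
    id2label={0: "ham", 1: "spam"},
    label2id={"ham": 0, "spam": 1},
)
# Randomly initialized here for illustration; in training, load the
# pre-trained checkpoint instead:
# model = BertForSequenceClassification.from_pretrained("bert-base-uncased", ...)
model = BertForSequenceClassification(config)
```

`BertForSequenceClassification` applies cross-entropy loss automatically when labels are passed to its forward call, matching the optimization objective above.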
---
### **Training Arguments**
- **Learning rate:** `2e-5`
- **Batch size:** 16
- **Epochs:** 3
- **Evaluation:** run at the end of each epoch
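These hyperparameters translate into a `TrainingArguments` configuration like the following sketch (the `output_dir` name is an assumption; note that `eval_strategy` was named `evaluation_strategy` in transformers versions before 4.41):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="results",              # assumed directory name
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",             # evaluate at the end of each epoch
)
```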
---
## **πŸ”— Dataset**
The model is trained on the **Spam Text Detection Dataset**, available on Hugging Face at [prithivMLmods/Spam-Text-Detect-Analysis](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis).
---
## **πŸ–₯️ Instructions**
### Clone and Set Up
Clone the repository, if applicable:
```bash
git clone <repository-url>
cd <project-directory>
```
Ensure dependencies are installed with:
```bash
pip install -r requirements.txt
```
---
### Train the Model
After installing dependencies, you can train the model using:
```python
from train import main  # assuming the training logic lives in a `train.py`

main()
```
Replace `train.py` with your script's entry point.
---
## **✨ Weights & Biases Integration**
We use **Weights & Biases** for:
- Real-time logging of training and evaluation metrics.
- Tracking experiments.
- Monitoring evaluation loss, precision, recall, and accuracy.
Set up wandb by initializing it in your training script:
```python
import wandb
wandb.init(project="spam-detection")
```
---
## **πŸ“Š Metrics**
The following metrics were logged:
- **Accuracy:** Final validation accuracy.
- **Precision:** Fraction of predicted positive cases that were truly positive.
- **Recall:** Fraction of actual positive cases that were correctly identified.
- **F1 Score:** Harmonic mean of precision and recall.
- **Evaluation Loss:** Loss on the validation split.
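The metrics above can be computed with a `compute_metrics` function passed to the Trainer. This is a minimal sketch using scikit-learn; the function name and return keys are conventions, not taken from the original training script:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Compute accuracy, precision, recall, and F1 from Trainer predictions."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # pick the higher-scoring class
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```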
---
## **πŸš€ Results**
Using BERT with the provided dataset:
- **Validation Accuracy:** `0.9937`
- **Precision:** `0.9931`
- **Recall:** `0.9597`
- **F1 Score:** `0.9761`
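The trained checkpoint can be used for inference with the `pipeline` API. A minimal sketch, assuming the model id matches this repository's name (network access is required to download the weights):

```python
from transformers import pipeline

# Model id assumed from this repository's name; adjust if it differs.
classifier = pipeline(
    "text-classification",
    model="prithivMLmods/Spam-Bert-Uncased",
)
print(classifier("Congratulations! You have won a $1000 gift card."))
```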
---
## **πŸ“ Files and Directories**
- `model/`: Contains trained model checkpoints.
- `data/`: Scripts for processing datasets.
- `wandb/`: All logged artifacts from Weights & Biases runs.
- `results/`: Training and evaluation results are saved here.
---
## **πŸ“œ Acknowledgements**
- **Dataset:** [Spam-Text-Detect-Analysis on Hugging Face](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis)
- **Model:** **BERT for sequence classification** from Hugging Face Transformers.
---