---
license: creativeml-openrail-m
datasets:
- prithivMLmods/Spam-Text-Detect-Analysis
language:
- en
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
library_name: transformers
---
# SPAM DETECTION UNCASED [ SPAM / HAM ]

This implementation fine-tunes BERT (Bidirectional Encoder Representations from Transformers) for binary sequence classification (Spam vs. Ham). The model is trained on the prithivMLmods/Spam-Text-Detect-Analysis dataset and integrates Weights & Biases (wandb) for experiment tracking.
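Once fine-tuned, the checkpoint can be exercised through the standard `transformers` pipeline. A minimal sketch; the model path below is a placeholder for wherever your fine-tuned checkpoint lives:

```python
from transformers import pipeline

# Placeholder path: point this at your fine-tuned checkpoint
# (a Hub repository id or a local directory).
classifier = pipeline("text-classification", model="path/to/spam-detection-bert")

print(classifier("Congratulations! You've won a free cruise. Reply WIN to claim."))
# e.g. [{'label': 'SPAM', 'score': 0.99}] (label names depend on the training config)
```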
## Overview

Core Details:
- Model: BERT for sequence classification
- Pre-trained Model: bert-base-uncased
- Task: Spam detection, a binary classification task (Spam vs. Ham)
- Metrics Tracked (computed as in the sketch below):
  - Accuracy
  - Precision
  - Recall
  - F1 Score
  - Evaluation loss
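These metrics can be derived from raw model predictions with scikit-learn. A minimal sketch of a `compute_metrics` callback in the form the Hugging Face `Trainer` expects; the function itself is illustrative, not part of this repository:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Convert raw logits into the tracked classification metrics."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```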
## Key Results
Results were obtained using BERT and the provided training dataset:
- Validation Accuracy: 0.9937
- Precision: 0.9931
- Recall: 0.9597
- F1 Score: 0.9761
## Model Training Details

Model Architecture:
The model uses bert-base-uncased as the pre-trained backbone and is fine-tuned for the sequence classification task.
Training Parameters (a minimal training sketch follows the list):
- Learning Rate: 2e-5
- Batch Size: 16
- Epochs: 3
- Loss: Cross-Entropy
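These hyperparameters map directly onto Hugging Face `TrainingArguments`. A minimal fine-tuning sketch under that assumption; the `text`/`label` column names are assumptions about the dataset schema, and `compute_metrics` is the function sketched in the Overview:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("prithivMLmods/Spam-Text-Detect-Analysis")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # "text" is an assumed column name; adjust to the actual dataset schema.
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # binary: spam vs. ham
)

# Cross-entropy is the default loss for sequence classification,
# matching the loss listed above.
args = TrainingArguments(
    output_dir="results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    report_to="wandb",  # stream metrics to Weights & Biases
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized.get("test"),  # split name may differ
    compute_metrics=compute_metrics,     # sketched in the Overview section
)
trainer.train()
```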
## How to Train the Model

Clone the repository:

```bash
git clone <repository-url>
cd <project-directory>
```

Install dependencies:

```bash
pip install -r requirements.txt
```

or manually:

```bash
pip install transformers datasets wandb scikit-learn
```

Train the model. Assuming you have a script like train.py that exposes a main() entry point, run:

```python
from train import main

main()
```
## Weights & Biases Integration

Why use wandb?
- Monitor experiments in real time through visualizations.
- Log metrics such as loss, accuracy, precision, recall, and F1 score.
- Keep a history of past runs and compare them.

Initialize Weights & Biases
Include this snippet in your training script:
```python
import wandb

wandb.init(project="spam-detection")
```
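If you are not relying on the Trainer's `report_to="wandb"` integration (as in the training sketch above), metrics can also be logged explicitly. A minimal sketch; the values shown are the Key Results figures, used purely as illustration:

```python
# Log evaluation metrics for the current run; values are illustrative.
wandb.log({
    "eval/accuracy": 0.9937,
    "eval/precision": 0.9931,
    "eval/recall": 0.9597,
    "eval/f1": 0.9761,
})
```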
## Directory Structure

The directory is organized to ensure scalability and a clear separation of components:

```
project-directory/
│
├── data/             # Dataset processing scripts
├── wandb/            # Logged artifacts from wandb runs
├── results/          # Training and evaluation results
├── model/            # Trained model checkpoints
├── requirements.txt  # List of dependencies
└── train.py          # Main script for training the model
```
## Dataset Information

The training dataset comes from Spam-Text-Detect-Analysis, available on Hugging Face:
- Dataset Link: [prithivMLmods/Spam-Text-Detect-Analysis](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis)

Dataset size:
- 5.57k entries
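The dataset can be pulled directly with the `datasets` library:

```python
from datasets import load_dataset

# Download the spam/ham dataset from the Hugging Face Hub.
dataset = load_dataset("prithivMLmods/Spam-Text-Detect-Analysis")
print(dataset)  # shows the available splits and column names
```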