prithivMLmods committed on
Commit
4af0eb2
•
1 Parent(s): 3eecabc

Update README.md

Files changed (1)
  1. README.md +73 -98
README.md CHANGED
@@ -11,97 +11,83 @@ library_name: transformers
  ---
  ### **SPAM DETECTION UNCASED [ SPAM / HAM ]**
 
- ## **Overview**
-
- This project implements a spam detection model using the **BERT (Bidirectional Encoder Representations from Transformers)** architecture and leverages **Weights & Biases (wandb)** for experiment tracking. The model is trained and evaluated using the [prithivMLmods/Spam-Text-Detect-Analysis](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis) dataset from Hugging Face.
 
  ---
 
- ## **🛠️ Requirements**
 
- - Python 3.x
- - PyTorch
- - Transformers
- - Datasets
- - Weights & Biases
- - Scikit-learn
 
  ---
 
- ### **Install Dependencies**
-
- You can install the required dependencies with the following:
 
- ```bash
- pip install transformers datasets wandb scikit-learn
- ```
 
  ---
 
- ## **📈 Model Training**
 
- ### **Model Architecture**
- The model uses **BERT for sequence classification**:
- - Pre-trained Model: `bert-base-uncased`
- - Task: Binary classification (Spam / Ham)
- - Optimization: Cross-entropy loss
-
- ---
 
- ### **Training Arguments**
- - **Learning rate:** `2e-5`
- - **Batch size:** 16
- - **Epochs:** 3
- - **Evaluation:** Epoch-based.
 
  ---
 
- ## **🔗 Dataset**
-
- The model uses the **Spam Text Detection Dataset** available at [Hugging Face Datasets](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis).
-
- You can access the dataset [here](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis).
-
- ---
-
- ## **🖥️ Instructions**
-
- ### Clone and Set Up
- Clone the repository, if applicable:
-
- ```bash
- git clone <repository-url>
- cd <project-directory>
- ```
-
- Ensure dependencies are installed with:
-
- ```bash
- pip install -r requirements.txt
- ```
-
- ---
-
- ### Train the Model
- After installing dependencies, you can train the model using:
-
- ```python
- from train import main  # Assuming training is implemented in a `train.py`
- ```
-
- Replace `train.py` with your script's entry point.
 
  ---
 
  ## **✨ Weights & Biases Integration**
 
- We use **Weights & Biases** for:
- - Real-time logging of training and evaluation metrics.
- - Tracking experiments.
- - Monitoring evaluation loss, precision, recall, and accuracy.
-
- Set up wandb by initializing this in the script:
 
  ```python
  import wandb
  wandb.init(project="spam-detection")
  ```
@@ -109,41 +95,30 @@ wandb.init(project="spam-detection")
 
  ---
 
- ## **📊 Metrics**
-
- The following metrics were logged:
-
- - **Accuracy:** Final validation accuracy.
- - **Precision:** Fraction of predicted positive cases that were truly positive.
- - **Recall:** Fraction of actual positive cases that were correctly predicted.
- - **F1 Score:** Harmonic mean of precision and recall.
- - **Evaluation Loss:** Loss during validation on evaluation splits.
-
- ---
-
- ## **🚀 Results**
 
- Using BERT with the provided dataset:
 
- - **Validation Accuracy:** `0.9937`
- - **Precision:** `0.9931`
- - **Recall:** `0.9597`
- - **F1 Score:** `0.9761`
 
  ---
 
- ## **📁 Files and Directories**
 
- - `model/`: Contains trained model checkpoints.
- - `data/`: Scripts for processing datasets.
- - `wandb/`: All logged artifacts from Weights & Biases runs.
- - `results/`: Training and evaluation results are saved here.
 
  ---
 
- ## **📜 Acknowledgements**
-
- Dataset Source: [Spam-Text-Detect-Analysis on Hugging Face](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis)
- Model: **BERT for sequence classification** from Hugging Face Transformers.
-
- ---
 
  ---
  ### **SPAM DETECTION UNCASED [ SPAM / HAM ]**
 
+ This implementation fine-tunes **BERT (Bidirectional Encoder Representations from Transformers)** for sequence classification, treating spam detection as a binary classification task (Spam / Ham). The model is trained on the **`prithivMLmods/Spam-Text-Detect-Analysis`** dataset and integrates **Weights & Biases (wandb)** for comprehensive experiment tracking.
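+ As a quick usage illustration, the fine-tuned model can be queried through a `text-classification` pipeline. The checkpoint path below is a placeholder: point it at this model's Hub ID or at a locally saved checkpoint (e.g. the `model/` directory described later in this card).
+ ```python
+ from transformers import pipeline
+ 
+ # Placeholder path: use this model's Hub ID or a local checkpoint directory.
+ classifier = pipeline("text-classification", model="model/")
+ 
+ print(classifier("Congratulations! You won a free cruise. Reply WIN to claim."))
+ # -> [{'label': ..., 'score': ...}]  (label names depend on the training config)
+ ```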
 
 
  ---
 
+ ## **🛠️ Overview**
 
+ ### **Core Details:**
+ - **Model:** BERT for sequence classification
+   Pre-trained model: `bert-base-uncased`
+ - **Task:** Spam detection, framed as binary classification (Spam vs. Ham).
+ - **Metrics Tracked** (see the `compute_metrics` sketch below):
+   - Accuracy
+   - Precision
+   - Recall
+   - F1 Score
+   - Evaluation loss
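+ A minimal sketch of how these metrics could be computed with scikit-learn in a `Trainer`-based setup (the function and its wiring are illustrative, not taken verbatim from this repo):
+ ```python
+ import numpy as np
+ from sklearn.metrics import accuracy_score, precision_recall_fscore_support
+ 
+ def compute_metrics(eval_pred):
+     """Accuracy / precision / recall / F1 for the binary spam-vs-ham task."""
+     logits, labels = eval_pred
+     preds = np.argmax(logits, axis=-1)
+     precision, recall, f1, _ = precision_recall_fscore_support(
+         labels, preds, average="binary"
+     )
+     return {
+         "accuracy": accuracy_score(labels, preds),
+         "precision": precision,
+         "recall": recall,
+         "f1": f1,
+     }
+ ```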
 
  ---
 
+ ## **📊 Key Results**
+ Results were obtained using BERT and the provided training dataset:
 
+ - **Validation Accuracy:** **0.9937**
+ - **Precision:** **0.9931**
+ - **Recall:** **0.9597**
+ - **F1 Score:** **0.9761**
 
  ---
 
+ ## **📈 Model Training Details**
 
+ ### **Model Architecture:**
+ The model uses `bert-base-uncased` as the pre-trained backbone and is fine-tuned for the sequence classification task.
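+ Loading that backbone with a two-label classification head might look like the following sketch (the head size matches the Spam/Ham task described above):
+ ```python
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+ 
+ tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "bert-base-uncased",
+     num_labels=2,  # binary task: spam vs. ham
+ )
+ ```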
 
 
 
 
 
+ ### **Training Parameters:**
+ - **Learning Rate:** 2e-5
+ - **Batch Size:** 16
+ - **Epochs:** 3
+ - **Loss:** Cross-Entropy
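+ Expressed as `TrainingArguments`, these settings might look like the sketch below (the output directory is illustrative, and `evaluation_strategy="epoch"` mirrors the epoch-based evaluation noted in the earlier revision of this card):
+ ```python
+ from transformers import TrainingArguments
+ 
+ training_args = TrainingArguments(
+     output_dir="results/",        # illustrative; matches the results/ folder below
+     learning_rate=2e-5,
+     per_device_train_batch_size=16,
+     per_device_eval_batch_size=16,
+     num_train_epochs=3,
+     evaluation_strategy="epoch",
+     report_to="wandb",            # see the Weights & Biases section below
+ )
+ ```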
 
  ---
 
+ ## **🚀 How to Train the Model**
+
+ 1. **Clone Repository:**
+ ```bash
+ git clone <repository-url>
+ cd <project-directory>
+ ```
+
+ 2. **Install Dependencies:**
+ Install all necessary dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+ or manually:
+ ```bash
+ pip install transformers datasets wandb scikit-learn
+ ```
+
+ 3. **Train the Model:**
+ Assuming training is implemented in `train.py` with a `main()` entry point, run:
+ ```python
+ from train import main
+
+ main()  # runs fine-tuning with the parameters listed above
+ ```
 
  ---
 
  ## **✨ Weights & Biases Integration**
 
+ ### Why Use wandb?
+ - **Monitor experiments in real time** via live visualizations.
+ - Log metrics such as loss, accuracy, precision, recall, and F1 score.
+ - Keep a history of past runs for comparison.
 
+ ### Initialize Weights & Biases
+ Include this snippet in your training script:
  ```python
  import wandb
  wandb.init(project="spam-detection")
  ```
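+ With `wandb` initialized and `report_to="wandb"` set (as in the `TrainingArguments` sketch above), a `Trainer` logs its metrics automatically. A sketch tying the pieces together; `train_ds` and `eval_ds` are placeholder names for the tokenized dataset splits:
+ ```python
+ from transformers import Trainer
+ 
+ # model, training_args, compute_metrics: from the sketches above.
+ # train_ds / eval_ds: placeholders for tokenized dataset splits.
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     train_dataset=train_ds,
+     eval_dataset=eval_ds,
+     compute_metrics=compute_metrics,
+ )
+ trainer.train()
+ ```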
 
  ---
 
+ ## 📁 **Directory Structure**
 
+ The directory is organized for scalability and a clear separation of components:
 
+ ```
+ project-directory/
+ │
+ ├── data/             # Dataset processing scripts
+ ├── wandb/            # Logged artifacts from wandb runs
+ ├── results/          # Training and evaluation results
+ ├── model/            # Trained model checkpoints
+ ├── requirements.txt  # List of dependencies
+ └── train.py          # Main script for training the model
+ ```
 
  ---
 
+ ## 🔗 Dataset Information
+ The training data comes from **Spam-Text-Detect-Analysis**, available on Hugging Face:
+ - **Dataset Link:** [Spam-Text-Detect-Analysis on Hugging Face](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis)
 
+ Dataset size:
+ - **5.57k entries**
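+ It can be loaded directly with the `datasets` library (split and column names should be checked against the dataset card):
+ ```python
+ from datasets import load_dataset
+ 
+ # Pulls the spam/ham dataset from the Hugging Face Hub.
+ dataset = load_dataset("prithivMLmods/Spam-Text-Detect-Analysis")
+ print(dataset)  # inspect the available splits and columns
+ ```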
 
 
 
  ---