---
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
library_name: transformers
---

# 📊 **Spam Detection with BERT using Hugging Face & Weights & Biases**

## **Overview**

This project implements a spam detection model using the **BERT (Bidirectional Encoder Representations from Transformers)** architecture and leverages **Weights & Biases (wandb)** for experiment tracking. The model is trained and evaluated using the [prithivMLmods/Spam-Text-Detect-Analysis](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis) dataset from Hugging Face.

---

## **🛠️ Requirements**

- Python 3.x
- PyTorch
- Transformers
- Datasets
- Weights & Biases
- Scikit-learn

---

### **Install Dependencies**

You can install the required dependencies with:

```bash
pip install torch transformers datasets wandb scikit-learn
```

---

## **📈 Model Training**

### **Model Architecture**

The model uses **BERT for sequence classification** (a minimal loading sketch follows this list):
- Pre-trained model: `bert-base-uncased`
- Task: Binary classification (Spam / Ham)
- Loss function: Cross-entropy
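
As a reference, here is a minimal sketch of how such a model can be instantiated with the Transformers library; the ham/spam label order in the comment is an assumption, not taken from this repository:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pre-trained BERT encoder with a freshly initialized 2-class classification head
# (e.g. 0 = ham, 1 = spam; adjust to match the dataset's actual label encoding).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
)
```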

---

### **Training Arguments**

- **Learning rate:** `2e-5`
- **Batch size:** 16
- **Epochs:** 3
- **Evaluation:** Run at the end of each epoch (see the `Trainer` sketch below).
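
A minimal sketch of how these hyperparameters could be wired into the Hugging Face `Trainer`; `output_dir`, `train_dataset`, and `eval_dataset` are placeholders, not names taken from this repository:

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

training_args = TrainingArguments(
    output_dir="./results",        # placeholder output directory
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",         # named evaluation_strategy in older transformers releases
    report_to="wandb",             # stream metrics to Weights & Biases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,   # tokenized train split (placeholder)
    eval_dataset=eval_dataset,     # tokenized validation split (placeholder)
)
trainer.train()
```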

---

## **🔗 Dataset**

The model is trained on the **Spam Text Detection dataset**, available on Hugging Face Datasets: [prithivMLmods/Spam-Text-Detect-Analysis](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis). A minimal loading sketch is shown below.
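
This sketch uses the `datasets` library; the split and column names are not guaranteed here, so inspect the dataset card to confirm them:

```python
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub.
dataset = load_dataset("prithivMLmods/Spam-Text-Detect-Analysis")

# Inspect the splits and column names before tokenizing
# (assumes a "train" split exists).
print(dataset)
print(dataset["train"][0])
```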

---

## **🖥️ Instructions**

### Clone and Set Up

Clone the repository, if applicable:

```bash
git clone <repository-url>
cd <project-directory>
```

Ensure the dependencies are installed:

```bash
pip install -r requirements.txt
```

---

### Train the Model

After installing the dependencies, you can start training with:

```python
from train import main  # assuming the training loop is implemented in train.py

main()
```

Replace `train.py` with your script's actual entry point.

---

## **✨ Weights & Biases Integration**

We use **Weights & Biases** for:
- Real-time logging of training and evaluation metrics.
- Tracking experiments.
- Monitoring evaluation loss, precision, recall, and accuracy.

Set up wandb by initializing a run in the training script:

```python
import wandb

wandb.init(project="spam-detection")
```

When training with the Hugging Face `Trainer`, setting `report_to="wandb"` in `TrainingArguments` (as in the sketch above) sends the logged metrics to this run.

---

## **📊 Metrics**

The following metrics were logged (a `compute_metrics` sketch follows this list):

- **Accuracy:** Final validation accuracy.
- **Precision:** Fraction of predicted positive cases that were truly positive.
- **Recall:** Fraction of actual positive cases that were correctly identified.
- **F1 Score:** Harmonic mean of precision and recall.
- **Evaluation Loss:** Loss computed on the evaluation split.
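
One common way to compute these during a Hugging Face `Trainer` run is a `compute_metrics` callback built on scikit-learn; the sketch below illustrates the idea and is not taken from this repository:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Convert raw logits from the Trainer into the metrics listed above."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="binary"
    )
    return {
        "accuracy": accuracy_score(labels, predictions),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hook it up via: Trainer(..., compute_metrics=compute_metrics)
```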

---

## **🚀 Results**

Using BERT with the provided dataset:

- **Validation Accuracy:** `0.9937`
- **Precision:** `0.9931`
- **Recall:** `0.9597`
- **F1 Score:** `0.9761`

---

## **📁 Files and Directories**

- `model/`: Contains trained model checkpoints.
- `data/`: Scripts for processing datasets.
- `wandb/`: All logged artifacts from Weights & Biases runs.
- `results/`: Training and evaluation results are saved here.

---

## **📜 Acknowledgements**

- Dataset Source: [Spam-Text-Detect-Analysis on Hugging Face](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis)
- Model: **BERT for sequence classification** from Hugging Face Transformers.

---