A fine-tuned `bert-base-uncased` model for spam SMS classification.

My second project in Natural Language Processing (NLP): I fine-tuned a `bert-base-uncased` model to classify spam SMS. It is a big improvement over https://github.com/fzn0x/bert-indonesian-english-hate-comments.
## Install requirements

Install the required dependencies:

```sh
pip install --upgrade pip
pip install -r requirements.txt
```
## Add a BERT virtual env

Run the commands below:

```sh
# Create and activate a virtual environment
python -m venv bert-env
source bert-env/bin/activate  # On Windows use: bert-env\Scripts\activate
```
## Install CUDA

Check if your GPU supports CUDA:

```sh
nvidia-smi
```

Then install a CUDA-enabled PyTorch build:

```sh
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

Optionally, tune the CUDA memory allocator via an environment variable:

```sh
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
```
## How to use

- Check your device and CUDA availability:

  ```sh
  python check_device.py
  ```

  > :warning: Using the CPU is not advisable; check your CUDA availability first.
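The repo's `check_device.py` is not shown here, but a minimal sketch of what such a script typically does looks like this (the function name and output format are hypothetical, and PyTorch is treated as optional so the sketch degrades gracefully):

```python
# Hypothetical sketch of a device-check script; the repository's actual
# check_device.py may differ. Reports CUDA availability without crashing
# when torch is not installed.
def describe_device() -> str:
    try:
        import torch  # optional dependency for this sketch
    except ImportError:
        return "torch not installed"
    if torch.cuda.is_available():
        # Name of the first visible CUDA device
        return f"cuda: {torch.cuda.get_device_name(0)}"
    return "cpu"

if __name__ == "__main__":
    print(describe_device())
```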
- Train the model:

  ```sh
  python scripts/train.py
  ```

  > :warning: Remove unneeded checkpoints in `models/pretrained` after training to save storage.
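As a rough illustration of the preprocessing a script like `scripts/train.py` has to do before fine-tuning (label encoding and a held-out split), here is a stdlib-only sketch; the function and label scheme below are assumptions, not the repo's actual API:

```python
import random

# Assumed label scheme for ham/spam SMS data; verify against the repo's code.
LABEL2ID = {"ham": 0, "spam": 1}

def encode_and_split(rows, test_ratio=0.2, seed=42):
    """Map string labels to ids and split into train/eval lists.

    rows: list of (label, text) tuples, e.g. [("ham", "See you soon"), ...]
    """
    data = [(LABEL2ID[label], text) for label, text in rows]
    rng = random.Random(seed)   # fixed seed for a reproducible split
    rng.shuffle(data)
    cut = int(len(data) * (1 - test_ratio))
    return data[:cut], data[cut:]

train_rows, eval_rows = encode_and_split(
    [("ham", "See you soon"), ("spam", "WIN a free prize now!"),
     ("ham", "On my way"), ("spam", "Claim your reward"), ("ham", "ok")]
)
```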
- Run prediction:

  ```sh
  python scripts/predict.py
  ```
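Prediction ultimately turns the model's two logits into a label. The mapping step can be sketched in pure Python; the `{0: "ham", 1: "spam"}` scheme is an assumption here, so check it against the training script's label encoding:

```python
import math

ID2LABEL = {0: "ham", 1: "spam"}  # assumed to mirror the training labels

def logits_to_prediction(logits):
    """Softmax over raw class scores, then pick the most likely label."""
    # Subtract the max logit before exponentiating for numerical stability.
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]

label, confidence = logits_to_prediction([-1.2, 3.4])
```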
## Dataset

Dataset location: `data/spam.csv`. Modify the dataset to enhance the model based on your needs.
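The common Kaggle export of the UCI SMS Spam Collection stores the label in a `v1` column and the message in `v2`, usually latin-1 encoded; the loader below assumes that layout, so verify it against your copy of `data/spam.csv`:

```python
import csv

def load_spam_csv(path, encoding="latin-1"):
    """Read (label, text) pairs from a spam.csv in the common Kaggle layout.

    Assumes columns v1 (ham/spam) and v2 (message body); adjust the column
    names and encoding if your copy of the dataset differs.
    """
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.DictReader(f)
        return [(row["v1"], row["v2"]) for row in reader]
```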
## Citations

If you use this repository or its ideas, please cite the following. See `citations.bib` for full BibTeX entries.

- Wolf et al., "Transformers: State-of-the-Art Natural Language Processing", EMNLP 2020.
- Pedregosa et al., "Scikit-learn: Machine Learning in Python", JMLR 2011.
- Almeida & Gómez Hidalgo, "SMS Spam Collection v.1", UCI Machine Learning Repository (2011).
## Credits and Libraries Used

- Hugging Face Transformers – model, tokenizer, and training utilities
- scikit-learn – metrics and preprocessing
- Logging silencing inspired by Hugging Face GitHub discussions
- Dataset from the UCI SMS Spam Collection
- Inspiration from a Kaggle notebook by Suyash Khare
## License and Usage

Licensed under the MIT License.

Leave a ⭐ if you find this project helpful. Contributions are welcome!