A fine-tuned `bert-base-uncased` model for spam SMS classification.

My second project in Natural Language Processing (NLP): I fine-tuned a `bert-base-uncased` model to classify spam SMS. It is a big improvement over https://github.com/fzn0x/bert-indonesian-english-hate-comments.
## Install requirements

Install the required dependencies:

```sh
pip install --upgrade pip
pip install -r requirements.txt
```
## Add a BERT virtual env

Run the commands below:

```sh
# Create and activate a virtual environment
python -m venv bert-env
source bert-env/bin/activate  # On Windows use: bert-env\Scripts\activate
```
## Install CUDA

Check if your GPU supports CUDA:

```sh
nvidia-smi
```

Then install a CUDA-enabled PyTorch build:

```sh
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

Optionally, tune the CUDA memory allocator via an environment variable:

```sh
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
```
## How to use

- Check your device and CUDA availability:

  ```sh
  python check_device.py
  ```

  > :warning: Using the CPU is not advisable; check your CUDA availability first.
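The repo's `check_device.py` is not shown here, but a minimal sketch of what such a script typically does looks like this (the function name and output format are hypothetical, and PyTorch is treated as optional so the sketch degrades gracefully):

```python
# Hypothetical sketch of a device-check script; the repository's actual
# check_device.py may differ. Reports CUDA availability without crashing
# when torch is not installed.
def describe_device() -> str:
    try:
        import torch  # optional dependency for this sketch
    except ImportError:
        return "torch not installed"
    if torch.cuda.is_available():
        # Name of the first visible CUDA device
        return f"cuda: {torch.cuda.get_device_name(0)}"
    return "cpu"

if __name__ == "__main__":
    print(describe_device())
```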
- Train the model:

  ```sh
  python scripts/train.py
  ```

  > :warning: Remove unneeded checkpoints in `models/pretrained` after training to save storage.
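As a rough illustration of the preprocessing a script like `scripts/train.py` has to do before fine-tuning (label encoding and a held-out split), here is a stdlib-only sketch; the function and label scheme below are assumptions, not the repo's actual API:

```python
import random

# Assumed label scheme for ham/spam SMS data; verify against the repo's code.
LABEL2ID = {"ham": 0, "spam": 1}

def encode_and_split(rows, test_ratio=0.2, seed=42):
    """Map string labels to ids and split into train/eval lists.

    rows: list of (label, text) tuples, e.g. [("ham", "See you soon"), ...]
    """
    data = [(LABEL2ID[label], text) for label, text in rows]
    rng = random.Random(seed)   # fixed seed for a reproducible split
    rng.shuffle(data)
    cut = int(len(data) * (1 - test_ratio))
    return data[:cut], data[cut:]

train_rows, eval_rows = encode_and_split(
    [("ham", "See you soon"), ("spam", "WIN a free prize now!"),
     ("ham", "On my way"), ("spam", "Claim your reward"), ("ham", "ok")]
)
```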
- Run prediction:

  ```sh
  python scripts/predict.py
  ```
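Prediction ultimately turns the model's two logits into a label. The mapping step can be sketched in pure Python; the `{0: "ham", 1: "spam"}` scheme is an assumption here, so check it against the training script's label encoding:

```python
import math

ID2LABEL = {0: "ham", 1: "spam"}  # assumed to mirror the training labels

def logits_to_prediction(logits):
    """Softmax over raw class scores, then pick the most likely label."""
    # Subtract the max logit before exponentiating for numerical stability.
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]

label, confidence = logits_to_prediction([-1.2, 3.4])
```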
## Dataset

Dataset location: `data/spam.csv`. Modify the dataset to enhance the model based on your needs.
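The common Kaggle export of the UCI SMS Spam Collection stores the label in a `v1` column and the message in `v2`, usually latin-1 encoded; the loader below assumes that layout, so verify it against your copy of `data/spam.csv`:

```python
import csv

def load_spam_csv(path, encoding="latin-1"):
    """Read (label, text) pairs from a spam.csv in the common Kaggle layout.

    Assumes columns v1 (ham/spam) and v2 (message body); adjust the column
    names and encoding if your copy of the dataset differs.
    """
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.DictReader(f)
        return [(row["v1"], row["v2"]) for row in reader]
```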
## Citations

If you use this repository or its ideas, please cite the following. See `citations.bib` for full BibTeX entries.

- Wolf et al., "Transformers: State-of-the-Art Natural Language Processing", EMNLP 2020.
- Pedregosa et al., "Scikit-learn: Machine Learning in Python", JMLR 2011.
- Almeida & Gómez Hidalgo, "SMS Spam Collection v.1", UCI Machine Learning Repository (2011).
## Credits and Libraries Used

- Hugging Face Transformers – model, tokenizer, and training utilities
- scikit-learn – metrics and preprocessing
- Logging silencing inspired by Hugging Face GitHub discussions
- Dataset from the UCI SMS Spam Collection
- Inspiration from a Kaggle notebook by Suyash Khare
## License and Usage

Licensed under the MIT License.

Leave a ⭐ if you find this project helpful. Contributions are welcome!