Spam Email Detection with NLP
Project Overview
This project aims to classify email messages as Spam or Ham (Not Spam) using Natural Language Processing (NLP) techniques and Machine Learning algorithms.
The dataset contains labeled email messages that are used to train a classification model capable of identifying unwanted spam content.
Dataset Information
- Dataset: Spam Email Dataset
- Problem Type: Binary Classification
- Classes:
  - Spam
  - Ham (Not Spam)
Data Analysis
Several exploratory data analysis (EDA) steps were performed:
- Dataset structure examination
- Missing value analysis
- Spam vs Ham distribution
- Message length analysis
- Word frequency analysis
- WordCloud visualization
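The EDA steps above can be sketched roughly as follows with pandas. The toy rows and the column names `text` and `label` are assumptions for illustration; the real dataset's schema may differ.

```python
import pandas as pd

# Hypothetical toy data; the real dataset's columns and contents may differ.
df = pd.DataFrame({
    "text": [
        "Win a free prize now",
        "Meeting moved to 3pm",
        "Claim your free reward today",
        "Lunch tomorrow?",
        None,
    ],
    "label": ["spam", "ham", "spam", "ham", "ham"],
})

# Missing value analysis: count nulls per column.
missing = df.isna().sum()

# Drop rows without text before further analysis.
df = df.dropna(subset=["text"]).copy()

# Spam vs Ham distribution.
class_counts = df["label"].value_counts()

# Message length analysis: characters per message, averaged per class.
df["length"] = df["text"].str.len()
mean_length = df.groupby("label")["length"].mean()

# Word frequency analysis for the spam class only.
spam_words = (
    df.loc[df["label"] == "spam", "text"].str.lower().str.split().explode()
)
top_spam_words = spam_words.value_counts().head(5)

print(missing, class_counts, mean_length, top_spam_words, sep="\n\n")
```

The same word counts could feed a word cloud; that step is omitted here to keep the sketch dependency-light.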
Text Preprocessing
The following NLP preprocessing techniques were applied:
- Lowercase conversion
- Punctuation removal
- Stopword removal
- Text cleaning
- Tokenization
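A minimal sketch of the preprocessing steps listed above, using only the standard library. The function name `preprocess` and the short `STOPWORDS` set are hypothetical; the project itself lists NeatText among its libraries and may implement these steps differently.

```python
import re
import string

# Small illustrative stopword list (an assumption); a real pipeline would
# use a fuller list from an NLP library.
STOPWORDS = {"a", "an", "the", "to", "is", "you", "your", "and", "of", "for", "in"}


def preprocess(text: str) -> list[str]:
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    # Lowercase conversion.
    text = text.lower()
    # Punctuation removal: replace punctuation with spaces so that
    # "free,prize" does not collapse into a single token.
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    # Text cleaning: collapse runs of whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    # Tokenization on whitespace, followed by stopword removal.
    return [token for token in text.split() if token not in STOPWORDS]


print(preprocess("Congratulations!!! You WON a FREE prize, claim it now."))
```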
Feature Engineering
Two vectorization approaches were evaluated:
CountVectorizer
Converts each message into a vector of raw word counts over the training vocabulary.
TF-IDF
Weights each word count by its inverse document frequency, so words that appear in most messages contribute less than words that are distinctive to a few.
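To make the difference between the two approaches concrete, here is a small sketch using scikit-learn's `CountVectorizer` and `TfidfVectorizer` with default settings. The three documents are invented for illustration; the settings the project actually used are not stated.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical).
docs = [
    "free prize free entry",
    "meeting at noon",
    "free lunch at noon",
]

# CountVectorizer: one column per vocabulary word, cells hold raw counts.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# TfidfVectorizer: same vocabulary, but counts are reweighted by inverse
# document frequency and each row is L2-normalized.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)

count_vocab = count_vec.vocabulary_
tfidf_vocab = tfidf_vec.vocabulary_

# "free" occurs twice in the first document.
print("count of 'free' in doc 0:", counts[0, count_vocab["free"]])

# In the third document "free" and "lunch" both occur once, but "lunch"
# appears in fewer documents overall, so TF-IDF gives it more weight.
print("tfidf 'lunch' in doc 2:", tfidf[2, tfidf_vocab["lunch"]])
print("tfidf 'free'  in doc 2:", tfidf[2, tfidf_vocab["free"]])
```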
Machine Learning Model
Logistic Regression
A Logistic Regression classifier was trained on the vectorized email messages and used as the final model.
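A training step of this kind might look like the sketch below. The toy messages, the 1 = spam / 0 = ham label encoding, and the use of a scikit-learn pipeline are all assumptions; the project's actual training script, data split, and hyperparameters are not shown in this README.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data; 1 = spam, 0 = ham (assumed encoding).
train_texts = [
    "win free prize now",
    "free cash offer click now",
    "claim your free prize",
    "urgent offer win cash",
    "meeting at noon tomorrow",
    "see you at lunch",
    "project report attached",
    "call me about the meeting",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorize and classify in one pipeline so the same vocabulary is
# applied at training time and at prediction time.
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)

new_messages = ["free prize offer", "lunch meeting tomorrow"]
predictions = clf.predict(new_messages)
# Column 1 of predict_proba is the estimated probability of class 1 (spam).
spam_probability = clf.predict_proba(new_messages)[:, 1]

for message, label, prob in zip(new_messages, predictions, spam_probability):
    print(f"{message!r}: label={label}, p(spam)={prob:.2f}")
```

On real data the model would be fitted on a training split and scored on a held-out test split rather than on hand-written examples.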
Results
- Accuracy: Approximately 98%
- Strong spam detection performance
- Effective separation of spam and legitimate emails
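For reference, metrics like these can be computed with `sklearn.metrics`. The labels below are made up purely to show the calls and are not the project's results; precision and recall are included because accuracy alone can look high when ham messages heavily outnumber spam.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Illustrative labels only (1 = spam, 0 = ham); NOT the project's data.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
# Precision: of the messages flagged as spam, how many really were spam.
precision = precision_score(y_true, y_pred)
# Recall: of the real spam messages, how many were caught.
recall = recall_score(y_true, y_pred)
# Rows are true classes (ham, spam); columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
print(matrix)
```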
Streamlit Application
A Streamlit web application was developed where users can enter email text and instantly receive a spam prediction.
The application:
- Accepts user email text
- Performs preprocessing
- Applies vectorization
- Predicts spam or ham
- Displays prediction results
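The flow above could be wired together roughly as in the following sketch. The helper names `clean_text`, `classify`, and `run_app` are hypothetical, the cleaning step is a simplified stand-in for the project's preprocessing, and two further assumptions are made: that the model files were written with `pickle`, and that the model outputs 1 for spam.

```python
import pickle
import re
import string


def clean_text(text: str) -> str:
    """Simplified stand-in for the project's preprocessing (an assumption)."""
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def classify(text: str, model, vectorizer) -> str:
    """Preprocess, vectorize, and predict; assumes the model outputs 1 for spam."""
    features = vectorizer.transform([clean_text(text)])
    return "Spam" if model.predict(features)[0] == 1 else "Ham"


def run_app() -> None:
    """Streamlit UI; imported lazily so the helpers above work without it."""
    import streamlit as st

    # File names come from the Model Files list; pickle format is assumed.
    with open("spam_model.pkl", "rb") as f:
        model = pickle.load(f)
    with open("count_vectorizer.pkl", "rb") as f:
        vectorizer = pickle.load(f)

    st.title("Spam Email Detection")
    email_text = st.text_area("Email text")
    if st.button("Predict") and email_text.strip():
        st.write(f"Prediction: {classify(email_text, model, vectorizer)}")
```

To serve this, a script would end with a call to `run_app()` and be launched with `streamlit run`, for example `streamlit run app.py` if the file were saved under that (hypothetical) name.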
Project Links
Launch Application
GitHub Repository
Model Files
- spam_model.pkl
- count_vectorizer.pkl
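Both files are needed at prediction time, because the classifier's coefficients only make sense against the vocabulary of the vectorizer it was trained with. A minimal sketch of saving and reloading such a pair is shown below; it assumes `pickle` serialization and uses toy data and a temporary directory rather than the project's real artifacts.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (hypothetical); 1 = spam, 0 = ham.
texts = ["win free prize now", "meeting at noon tomorrow",
         "free cash offer", "see you at lunch"]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

# Save the model and its vectorizer side by side.
out_dir = Path(tempfile.mkdtemp())
for name, obj in [("spam_model.pkl", model), ("count_vectorizer.pkl", vectorizer)]:
    with open(out_dir / name, "wb") as f:
        pickle.dump(obj, f)

# Reload both, as a serving application would.
with open(out_dir / "spam_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
with open(out_dir / "count_vectorizer.pkl", "rb") as f:
    loaded_vectorizer = pickle.load(f)

# The reloaded pair should reproduce the original predictions.
sample = ["free prize offer"]
original = model.predict(vectorizer.transform(sample))
reloaded = loaded_model.predict(loaded_vectorizer.transform(sample))
same = bool((original == reloaded).all())
print("predictions match after reload:", same)
```

Pickle files should only be loaded from trusted sources, and they are generally tied to the scikit-learn version that produced them.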
Libraries Used
- Pandas
- NumPy
- Scikit-Learn
- Streamlit
- Matplotlib
- Seaborn
- NeatText
Conclusion
In this project, spam email detection was successfully implemented using NLP preprocessing techniques and Logistic Regression. The model achieved approximately 98% accuracy and demonstrated strong performance in distinguishing spam messages from legitimate emails. The trained model was deployed through a Streamlit application, allowing users to test predictions interactively.