Spam Email Detection with NLP

Project Overview

This project aims to classify email messages as Spam or Ham (Not Spam) using Natural Language Processing (NLP) techniques and Machine Learning algorithms.

The dataset contains labeled email messages that are used to train a classification model capable of identifying unwanted spam content.

Dataset Information

Dataset: Spam Email Dataset
Problem Type: Binary Classification
Classes:
- Spam
- Ham (Not Spam)

Data Analysis

Several exploratory data analysis (EDA) steps were performed:

- Dataset structure examination
- Missing value analysis
- Spam vs Ham distribution
- Message length analysis
- Word frequency analysis
- WordCloud visualization
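The class-distribution, message-length, and word-frequency checks can be sketched with the standard library alone. The four messages and the `label`/`text` field names here are invented for illustration; the project itself lists Pandas, Matplotlib, and Seaborn for this work, and its actual column names are not shown in this README.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy data, not taken from the project's dataset.
messages = [
    {"label": "spam", "text": "Win a free prize now"},
    {"label": "ham", "text": "Are we still meeting for lunch today"},
    {"label": "spam", "text": "Free entry claim your free prize"},
    {"label": "ham", "text": "See you at the office"},
]

# Spam vs Ham distribution: number of messages per class.
class_counts = Counter(m["label"] for m in messages)

# Message length analysis: average character length per class.
avg_length = {
    label: mean(len(m["text"]) for m in messages if m["label"] == label)
    for label in class_counts
}

# Word frequency analysis: most common lowercase words in spam messages.
spam_words = Counter(
    word
    for m in messages
    if m["label"] == "spam"
    for word in m["text"].lower().split()
)
top_spam_words = spam_words.most_common(2)

print(class_counts, avg_length, top_spam_words)
```

A strongly skewed class distribution at this stage is worth noting, because it affects how the accuracy figure should be read later.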

Text Preprocessing

The following NLP preprocessing techniques were applied:

- Lowercase conversion
- Punctuation removal
- Stopword removal
- Text cleaning
- Tokenization
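A minimal sketch of these steps using only the standard library. The `preprocess` function and the short `STOPWORDS` set are hypothetical; the project lists NeatText for text cleaning, and its exact calls are not shown in this README.

```python
import re
import string

# Tiny illustrative stopword list (an assumption); a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "for", "of", "and", "you", "your"}


def preprocess(text: str) -> list[str]:
    """Lowercase, strip punctuation, collapse whitespace, tokenize, drop stopwords."""
    text = text.lower()
    # Punctuation removal: delete every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Text cleaning: collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Tokenization: split on spaces (an empty message yields no tokens).
    tokens = text.split(" ") if text else []
    # Stopword removal.
    return [t for t in tokens if t not in STOPWORDS]


print(preprocess("WIN a FREE prize!!! Claim your reward, now."))
```

The same function has to be applied at training time and at prediction time, otherwise the vectorizer sees tokens it was never fitted on.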

Feature Engineering

Two vectorization approaches were evaluated:

CountVectorizer

Converts each message into a vector of raw word counts over the dataset vocabulary.

TF-IDF

Reweights word counts so that words frequent in one message but rare across the dataset score higher than words that are common everywhere.

Machine Learning Model

Logistic Regression

A Logistic Regression classifier was trained using vectorized email messages and selected as the final model.
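As a rough sketch of this training step, the snippet below chains a vectorizer and a Logistic Regression classifier in a scikit-learn pipeline. The toy `texts` and `labels` are invented, and pairing the model with `CountVectorizer` is an assumption based on the `count_vectorizer.pkl` file name; the project's train/test split and hyperparameters are not shown in this README.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, not the project's dataset.
texts = [
    "win a free prize now",
    "claim your free cash prize",
    "free entry win cash now",
    "urgent claim your prize today",
    "are we meeting for lunch today",
    "see you at the office tomorrow",
    "can you send the report",
    "lunch at noon sounds good",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# The pipeline vectorizes the text, then fits the classifier on the counts.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

spam_guess = model.predict(["claim your free prize"])[0]
ham_guess = model.predict(["see you at lunch"])[0]
print(spam_guess, ham_guess)
```

Wrapping both steps in one pipeline keeps the vocabulary learned during training attached to the classifier, which avoids mismatched feature columns at prediction time.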

Results

Accuracy: Approximately 98%
Strong spam detection performance
Effective separation of spam and legitimate emails
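Accuracy is the fraction of messages labelled correctly. The sketch below works through that calculation, together with spam precision and recall, on ten invented labels. These numbers are illustrative only and are not the project's results; the `evaluate` helper is hypothetical.

```python
def evaluate(y_true, y_pred, positive="spam"):
    """Return accuracy plus precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham flagged as spam
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed
    correct = sum(t == p for t, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Invented labels: 8 ham and 2 spam, with one spam message missed.
y_true = ["ham", "ham", "ham", "ham", "spam", "spam", "ham", "ham", "ham", "ham"]
y_pred = ["ham", "ham", "ham", "ham", "spam", "ham", "ham", "ham", "ham", "ham"]

scores = evaluate(y_true, y_pred)
print(scores)
```

In this toy case accuracy is 0.9 while spam recall is only 0.5, which is why per-class metrics are usually reported next to accuracy when spam is the minority class.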

Streamlit Application

A Streamlit web application was developed where users can enter email text and instantly receive a spam prediction.

The application:

- Accepts user email text
- Performs preprocessing
- Applies vectorization
- Predicts spam or ham
- Displays prediction results
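The flow above might look roughly like the following. This is a sketch under several assumptions: the helper names `load_artifacts`, `predict_label`, and `run_app` are hypothetical, the lowercase-and-strip step stands in for the project's real preprocessing, and the widget labels are invented. Only the two `.pkl` file names come from this README.

```python
import pickle


def load_artifacts(model_path="spam_model.pkl", vectorizer_path="count_vectorizer.pkl"):
    """Load the pickled classifier and vectorizer from disk."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    with open(vectorizer_path, "rb") as f:
        vectorizer = pickle.load(f)
    return model, vectorizer


def predict_label(text, model, vectorizer):
    """Preprocess, vectorize, and classify one message."""
    cleaned = text.lower().strip()  # placeholder for the project's preprocessing
    features = vectorizer.transform([cleaned])
    # str() covers models trained on either text labels or numeric codes.
    return str(model.predict(features)[0])


def run_app():
    # Imported here so the helpers above can be used without Streamlit installed.
    import streamlit as st

    st.title("Spam Email Detection")
    model, vectorizer = load_artifacts()
    text = st.text_area("Email text")
    if st.button("Predict") and text.strip():
        st.write(f"Prediction: {predict_label(text, model, vectorizer)}")


# To launch: add a call to run_app() as the last line of the script,
# then start it with `streamlit run <script name>`.
```

Keeping the prediction logic in a plain function, separate from the UI code, makes it possible to test it without starting the web app.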

Project Links

🚀 Launch Application

💻 GitHub Repository

Model Files

- spam_model.pkl
- count_vectorizer.pkl
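Files like these are typically produced by pickling the fitted classifier and the fitted vectorizer separately. The sketch below shows one way to do that round trip; the toy data is invented, the files are written to a temporary directory, and the project's actual saving code is not shown in this README.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, for illustration only.
texts = ["free prize now", "win free cash", "lunch at noon", "see you tomorrow"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
model = LogisticRegression(max_iter=1000).fit(vectorizer.fit_transform(texts), labels)

# Save both fitted objects; the vectorizer is needed to reproduce the feature columns.
out_dir = Path(tempfile.mkdtemp())
for name, obj in [("spam_model.pkl", model), ("count_vectorizer.pkl", vectorizer)]:
    with open(out_dir / name, "wb") as f:
        pickle.dump(obj, f)

# Load them back and check that the restored pair predicts like the original.
with open(out_dir / "spam_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
with open(out_dir / "count_vectorizer.pkl", "rb") as f:
    loaded_vectorizer = pickle.load(f)

original = model.predict(vectorizer.transform(["free prize"]))[0]
restored = loaded_model.predict(loaded_vectorizer.transform(["free prize"]))[0]
print(original, restored)
```

Pickled scikit-learn objects should be loaded with the same scikit-learn version that created them, and pickle files from untrusted sources should not be loaded at all.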

Libraries Used

- Pandas
- NumPy
- Scikit-Learn
- Streamlit
- Matplotlib
- Seaborn
- NeatText

Conclusion

This project implemented spam email detection using NLP preprocessing techniques and Logistic Regression. The model achieved approximately 98% accuracy in distinguishing spam messages from legitimate emails. The trained model was deployed through a Streamlit application, allowing users to test predictions interactively.
