Spam Email Detection with NLP

Project Overview

This project aims to classify email messages as Spam or Ham (Not Spam) using Natural Language Processing (NLP) techniques and Machine Learning algorithms.

The dataset contains labeled email messages that are used to train a classification model capable of identifying unwanted spam content.

Dataset Information

  • Dataset: Spam Email Dataset
  • Problem Type: Binary Classification
  • Classes:
    • Spam
    • Ham (Not Spam)

Data Analysis

Several exploratory data analysis (EDA) steps were performed:

  • Dataset structure examination
  • Missing value analysis
  • Spam vs Ham distribution
  • Message length analysis
  • Word frequency analysis
  • WordCloud visualization
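
A few of the EDA steps above can be sketched without any third-party dependencies. The messages below are hypothetical stand-ins for the real dataset, and the variable names are illustrative only; the project itself likely used Pandas for the same checks.

```python
from collections import Counter
from statistics import mean

# Hypothetical labeled messages standing in for the real dataset.
messages = [
    ("spam", "Win a free prize now"),
    ("ham", "Are we still meeting at noon"),
    ("spam", "Free entry claim your free cash"),
    ("ham", "Please send the report"),
    ("ham", "Lunch tomorrow"),
]

# Spam vs ham distribution.
class_counts = Counter(label for label, _ in messages)

# Average message length (in characters) per class.
mean_length = {
    label: mean(len(text) for lbl, text in messages if lbl == label)
    for label in class_counts
}

# Word frequencies among spam messages only.
spam_words = Counter(
    word
    for label, text in messages
    if label == "spam"
    for word in text.lower().split()
)

print(class_counts)
print(mean_length)
print(spam_words.most_common(3))
```

Checking the class distribution first is useful because a heavily imbalanced dataset changes how the accuracy figure should be read.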

Text Preprocessing

The following NLP preprocessing techniques were applied:

  • Lowercase conversion
  • Punctuation removal
  • Stopword removal
  • Text cleaning
  • Tokenization
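
As a minimal sketch of these steps, the function below uses only the Python standard library. The `preprocess` name and the short `STOPWORDS` set are assumptions made for illustration; the project's actual cleaning code (for example, anything done with NeatText) may differ.

```python
import string

# Small illustrative stopword list (an assumption; a real project would use a fuller list).
STOPWORDS = {"a", "an", "the", "to", "is", "you", "your", "and", "of", "for"}


def preprocess(text: str) -> list[str]:
    """Lowercase, strip punctuation, tokenize on whitespace, and drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOPWORDS]


print(preprocess("Congratulations! You have WON a FREE prize."))
```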

Feature Engineering

Two vectorization approaches were evaluated:

CountVectorizer

Converts text into word-frequency vectors.

TF-IDF

Transforms text into weighted numerical features based on word importance.
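
The difference between the two approaches can be seen on a toy corpus. This is a sketch using scikit-learn's `CountVectorizer` and `TfidfVectorizer` with default settings; the three documents are made up, and the project's actual vectorizer parameters are not stated.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents for illustration only.
docs = [
    "win a free prize now",
    "free free entry",
    "meeting at noon",
]

# CountVectorizer: each column holds the raw count of one vocabulary word.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# TfidfVectorizer: counts are reweighted so words found in many documents matter less.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)

free_col = count_vec.vocabulary_["free"]
print("count of 'free' in second document:", counts[1, free_col])

# "entry" appears in one document, "free" in two, so "entry" gets the larger IDF weight.
free_idf = tfidf_vec.idf_[tfidf_vec.vocabulary_["free"]]
entry_idf = tfidf_vec.idf_[tfidf_vec.vocabulary_["entry"]]
print("idf(free) =", free_idf, "idf(entry) =", entry_idf)
```

Note that the default tokenizer ignores single-character tokens, so "a" does not appear in the vocabulary.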

Machine Learning Model

Logistic Regression

A Logistic Regression classifier was trained using vectorized email messages and selected as the final model.
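
A minimal training sketch is shown below. It pairs `CountVectorizer` with `LogisticRegression` on a tiny made-up dataset; the texts, the string labels, and the `max_iter` value are assumptions for illustration and do not reproduce the project's training setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set; the real project trains on the full labeled dataset.
train_texts = [
    "win free prize now",
    "free cash offer claim now",
    "claim your free prize",
    "urgent offer win cash",
    "meeting at noon tomorrow",
    "see you at lunch",
    "project report attached",
    "call me about the meeting",
]
train_labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Fit the vectorizer on the training texts only, then train the classifier.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, train_labels)

# New messages must be transformed with the same fitted vectorizer.
new_texts = ["free prize offer", "lunch meeting tomorrow"]
predictions = model.predict(vectorizer.transform(new_texts))
print(list(predictions))
```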

Results

  • Accuracy: Approximately 98%
  • Strong spam detection performance
  • Effective separation of spam and legitimate emails
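
Accuracy alone can hide mistakes on the smaller class when spam and ham are unevenly represented, so precision and recall for the spam class are a useful complement. The sketch below shows how these metrics are computed; the labels are hypothetical and are not the project's results.

```python
# Hypothetical true labels and predictions, for illustration only.
y_true = ["spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "ham", "spam", "spam", "ham"]

pairs = list(zip(y_true, y_pred))
tp = sum(t == "spam" and p == "spam" for t, p in pairs)  # spam caught
fp = sum(t == "ham" and p == "spam" for t, p in pairs)   # ham flagged as spam
fn = sum(t == "spam" and p == "ham" for t, p in pairs)   # spam missed
tn = sum(t == "ham" and p == "ham" for t, p in pairs)    # ham passed through

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of actual spam, how much was caught

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```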

Streamlit Application

A Streamlit web application was developed where users can enter email text and instantly receive a spam prediction.

The application:

  • Accepts user email text
  • Performs preprocessing
  • Applies vectorization
  • Predicts spam or ham
  • Displays prediction results
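
The flow above might look roughly like the following sketch. The function names `clean_text`, `predict_email`, and `run_app` are hypothetical, the preprocessing is a simplified stand-in, and loading the files with `pickle` is an assumption (they could equally have been saved with joblib). The label returned is whatever the trained model emits, which may be a string or a numeric code.

```python
import pickle
import string


def clean_text(text: str) -> str:
    """Lowercase and strip punctuation (simplified stand-in for the project's preprocessing)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def predict_email(text: str, vectorizer, model) -> str:
    """Preprocess, vectorize, and classify one email; returns the model's label as a string."""
    features = vectorizer.transform([clean_text(text)])
    return str(model.predict(features)[0])


def run_app() -> None:
    """Streamlit UI: email text in, prediction out."""
    import streamlit as st  # imported here so the helpers above work without Streamlit

    # File names come from the Model Files section; paths relative to the script are assumed.
    with open("count_vectorizer.pkl", "rb") as f:
        vectorizer = pickle.load(f)
    with open("spam_model.pkl", "rb") as f:
        model = pickle.load(f)

    st.title("Spam Email Detection")
    email_text = st.text_area("Email text")
    if st.button("Predict") and email_text.strip():
        label = predict_email(email_text, vectorizer, model)
        st.write(f"Prediction: {label}")
```

To try it, add a final `run_app()` call to the script (say, a hypothetical `app.py`) and start it with `streamlit run app.py`.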

Project Links

🚀 Launch Application

💻 GitHub Repository

Model Files

  • spam_model.pkl
  • count_vectorizer.pkl
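
Both files are needed at inference time, because the classifier only understands feature columns produced by the same fitted vocabulary. The sketch below shows one way such files could be written and reloaded with `pickle`; the toy data and the temporary output directory are assumptions, and the project may have used a different serialization tool.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data so the sketch is self-contained.
texts = ["free prize now", "claim free cash", "meeting at noon", "see you at lunch"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

# Save the fitted vectorizer and the model side by side (temporary directory for the demo).
out_dir = Path(tempfile.mkdtemp())
with open(out_dir / "count_vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
with open(out_dir / "spam_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Reload both and confirm they reproduce the original prediction.
with open(out_dir / "count_vectorizer.pkl", "rb") as f:
    loaded_vectorizer = pickle.load(f)
with open(out_dir / "spam_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

sample = ["free cash prize"]
original = model.predict(vectorizer.transform(sample))
restored = loaded_model.predict(loaded_vectorizer.transform(sample))
print(original[0] == restored[0])
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.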

Libraries Used

  • Pandas
  • NumPy
  • Scikit-Learn
  • Streamlit
  • Matplotlib
  • Seaborn
  • NeatText

Conclusion

In this project, spam email detection was successfully implemented using NLP preprocessing techniques and Logistic Regression. The model achieved approximately 98% accuracy and demonstrated strong performance in distinguishing spam messages from legitimate emails. The trained model was deployed through a Streamlit application, allowing users to test predictions interactively.
