Spam Classification Models

Overview

This repository contains two models designed for detecting spam in SMS messages, both trained on the mltrev23/spam-classify dataset. The models include:

Spam Classifier: A machine learning model trained to classify SMS messages as either spam or ham (non-spam).
Count Vectorizer: A vectorization model used to transform SMS text data into numerical feature vectors suitable for classification.

Models

1. Spam Classifier

Filename: spam_classifier.pkl
Type: Multinomial Naive Bayes
Input: Numerical feature vectors (output from the Count Vectorizer)
Output: Binary classification (spam or ham)

2. Count Vectorizer

Filename: count_vectorizer.pkl
Type: Scikit-learn's CountVectorizer
Input: Raw SMS text data
Output: Sparse matrix of token counts

Dataset

Both models were trained on the mltrev23/spam-classify dataset, which consists of SMS messages labeled as either spam or ham. Because the vectorizer's vocabulary and the classifier's parameters were fit on the same data, the two files are meant to be used together.

Installation

To use these models, first clone this repository and install the required Python packages:

git clone https://huggingface.co/mltrev23/spam-classification
cd spam-classification
pip install -r requirements.txt

Requirements

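The models require scikit-learn and NumPy. joblib, which is used below to load the models, is installed automatically as a dependency of scikit-learn. If you are not installing from requirements.txt, install the libraries directly:

pip install scikit-learn numpy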

Usage

Loading the Models

You can load the models using the joblib library:

import joblib

# Load the count vectorizer
vectorizer = joblib.load('count_vectorizer.pkl')

# Load the spam classifier
classifier = joblib.load('spam_classifier.pkl')

Predicting Spam Messages

To classify new SMS messages, follow these steps:

# Sample SMS messages
messages = ["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)", 
            "I'll call you later"]

# Transform the messages into feature vectors
X = vectorizer.transform(messages)

# Predict using the classifier
predictions = classifier.predict(X)

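# Output the predictions
# Note: this assumes the classifier's labels are the strings 'spam' and 'ham';
# check classifier.classes_ if the output looks wrong
for message, prediction in zip(messages, predictions):
    print(f"Message: {message} \nPrediction: {'Spam' if prediction == 'spam' else 'Ham'}\n")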
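
If you need a confidence score rather than a hard label, scikit-learn classifiers such as MultinomialNB expose predict_proba. The sketch below is one possible helper, not part of this repository: `spam_probabilities` is a hypothetical function, the `spam_label="spam"` default assumes string labels, and the `toy_` models are stand-ins trained on invented data so the example runs without the .pkl files.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def spam_probabilities(vectorizer, classifier, messages, spam_label="spam"):
    """Return the predicted probability of the spam class for each message.

    spam_label must match one of the entries in classifier.classes_.
    """
    spam_column = list(classifier.classes_).index(spam_label)
    features = vectorizer.transform(messages)
    return classifier.predict_proba(features)[:, spam_column]


# Stand-in models fit on invented data (assumption: string labels 'spam'/'ham')
toy_texts = ["win free prize now", "free entry win cash",
             "call you later", "see you at lunch"]
toy_labels = ["spam", "spam", "ham", "ham"]
toy_vectorizer = CountVectorizer().fit(toy_texts)
toy_classifier = MultinomialNB().fit(toy_vectorizer.transform(toy_texts), toy_labels)

probs = spam_probabilities(toy_vectorizer, toy_classifier,
                           ["free prize", "lunch later"])
for text, p in zip(["free prize", "lunch later"], probs):
    print(f"{text!r}: P(spam) = {p:.2f}")
```

With the loaded models, the call would be spam_probabilities(vectorizer, classifier, messages), provided the classifier supports predict_proba and uses the label 'spam'.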

Evaluating the Classifier

You can also evaluate the performance of the classifier using a test set from the same dataset:

from sklearn.metrics import accuracy_score, classification_report

# test_messages (list of str) and test_labels (list of labels) are assumed to be
# defined already, for example from a held-out split of the dataset
X_test = vectorizer.transform(test_messages)
y_pred = classifier.predict(X_test)

print("Accuracy:", accuracy_score(test_labels, y_pred))
print(classification_report(test_labels, y_pred))
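
Accuracy can hide which kind of mistake the classifier makes, so a confusion matrix is a useful complement. The sketch below uses invented label lists in place of test_labels and y_pred, and assumes the labels are the strings 'ham' and 'spam'.

```python
from sklearn.metrics import confusion_matrix

# Invented labels standing in for test_labels and y_pred
toy_true = ["ham", "ham", "ham", "spam", "spam", "ham"]
toy_pred = ["ham", "ham", "spam", "spam", "ham", "ham"]

# Rows are true labels, columns are predicted labels, in the order given by `labels`
matrix = confusion_matrix(toy_true, toy_pred, labels=["ham", "spam"])
true_ham, ham_flagged_as_spam, spam_missed, true_spam = matrix.ravel()

print(matrix)
print("Ham wrongly flagged as spam:", ham_flagged_as_spam)
print("Spam that slipped through:", spam_missed)
```

For a spam filter, legitimate messages flagged as spam are often the costlier error, so it is worth tracking that cell separately.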

Model Interpretation

To understand why the classifier makes a given decision, you can inspect its learned parameters, such as the words with the highest probability under each class (spam or ham).

Contributing

If you wish to contribute to this repository by improving the models or expanding the dataset, feel free to submit a pull request. Please ensure that your code is well-documented and adheres to the existing style.

License

This project is licensed under the MIT License - see the LICENSE file for details.

References

If you use these models in your research or project, please cite the dataset and relevant model training methods as follows:

Dataset: mltrev23/spam-classify
Naive Bayes: McCallum, A., & Nigam, K. (1998). A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization (pp. 41-48).