Model Card for RandomForest (NBC vs. Fox News Headline Classifier)

NOTE: This is NOT our final model. It is one of the secondary models we explored while developing our final model; the final model lives in the GBTrees repository on HuggingFace.

Model Details

This model classifies news headlines as either NBC or Fox News.

Model Description

  • Developed by: Jack Bader, Kaiyuan Wang, Pairan Xu
  • Task: Binary classification (NBC News vs. Fox News)
  • Preprocessing: TF-IDF vectorization applied to the headline text (see the training sketch after this list)
    • stop_words = "english"
    • max_features = 1000
  • Model type: Random Forest
  • Framework: scikit-learn
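
For reference, here is a minimal sketch of how a model with this configuration could be trained. The file name train_data.csv and the 'title'/'label' column names are assumptions mirroring the evaluation code later in this card; the actual training run (and whatever search produced best_rf_model.pkl) may have differed.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training data with 'title' and 'label' columns
train_df = pd.read_csv("train_data.csv")  # assumed file name

# TF-IDF vectorization with the settings listed above
tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
X_train = tfidf_vectorizer.fit_transform(train_df["title"])

# Random forest classifier (hyperparameters not documented in this card)
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, train_df["label"])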

Metrics

  • Accuracy Score (see the snippet after this list)
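
Accuracy here is simply the fraction of test headlines whose predicted outlet matches the true label. As a minimal sketch, it can be computed with scikit-learn as follows (y_test and y_pred are the variables produced by the evaluation code below):

from sklearn.metrics import accuracy_score

# Fraction of correctly classified headlines
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")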

Model Evaluation

import pandas as pd
import joblib
from huggingface_hub import hf_hub_download
from sklearn.metrics import classification_report

# Mount Google Drive (the test CSV is stored there)
from google.colab import drive
drive.mount('/content/drive')

# Load test set
test_df = pd.read_csv("/content/drive/MyDrive/test_data_random_subset.csv", encoding="Windows-1252")

# Log in with a Hugging Face access token
# (token can be found in the repo as Token.docx)
!huggingface-cli login

# Download the model checkpoint from the Hub
model_path = hf_hub_download(repo_id="CIS5190FinalProj/RandomForest", filename="best_rf_model.pkl")

# Download the fitted TF-IDF vectorizer from the Hub
vectorizer_path = hf_hub_download(repo_id="CIS5190FinalProj/RandomForest", filename="tfidf_vectorizer.pkl")

# Load the model
rf_model = joblib.load(model_path)

# Load the vectorizer
tfidf_vectorizer = joblib.load(vectorizer_path)

# Extract the headlines from the test set
X_test = test_df['title']

# Transform the headlines into TF-IDF features
X_test_transformed = tfidf_vectorizer.transform(X_test)

# Make predictions with the random forest model
y_pred = rf_model.predict(X_test_transformed)

# Extract the 'label' column as the target
y_test = test_df['label']

# Print the classification report (precision, recall, F1, and accuracy)
print(classification_report(y_test, y_pred))
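
As a quick sanity check, the same artifacts can classify a single new headline. This is a minimal sketch that reuses the rf_model and tfidf_vectorizer objects loaded above; the example headline is made up, and mapping the predicted label back to NBC vs. Fox follows whatever encoding the 'label' column used during training.

# Classify one hypothetical headline with the loaded vectorizer and model
headline = ["Senate passes new budget bill after late-night session"]
features = tfidf_vectorizer.transform(headline)
print(rf_model.predict(features))  # predicted label, per the training encoding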