BBC News Classification Pipeline (Production-Ready)

This repository hosts a production-optimized NLP pipeline that classifies news articles into five distinct categories: Business, Entertainment, Politics, Tech, and Sports.

Unlike workflows that keep text preprocessing separate from inference, this model bundles its custom preprocessing steps and the classifier into a single serialized pipeline. The same cleaning code therefore runs at training time and at serving time, which removes preprocessing mismatches as a source of training-serving skew.

📊 Model Performance & Accuracy

The model was evaluated on a 20% holdout test set. The split was stratified by class, so each of the five news categories appears in the test set in the same proportion as in the full dataset.
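A stratified 80/20 split of this kind can be sketched with scikit-learn's train_test_split. The toy corpus, the class counts, and the random_state value below are illustrative assumptions, not details taken from this repository.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical, deliberately imbalanced toy corpus (not the BBC dataset).
class_sizes = {
    "business": 30,
    "entertainment": 20,
    "politics": 20,
    "tech": 20,
    "sports": 10,
}
labels = [name for name, size in class_sizes.items() for _ in range(size)]
texts = [f"{label} article {i}" for i, label in enumerate(labels)]

# stratify=labels keeps each class's share identical in both partitions.
# random_state is an assumed seed, used here only for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

test_counts = Counter(y_test)
print(test_counts)
```

With 100 documents, the 20-document test set keeps the 30/20/20/20/10 ratio of the full corpus, so no category is over- or under-represented at evaluation time.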

Macro-F1 Score: 0.96
Evaluation Metrics: Macro-Precision and Macro-Recall are closely matched. Because macro averaging weights every class equally, the scores reflect performance on smaller categories as much as on larger ones.
Efficiency: Reaches this score with n-gram count features and a Naive Bayes classifier rather than a deep-learning model, so inference needs far less compute than a transformer model such as BERT.
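For readers unfamiliar with macro averaging, the sketch below computes Macro-Precision, Macro-Recall, and Macro-F1 with scikit-learn on a handful of made-up labels. The labels and predictions are illustrative only and are unrelated to the reported 0.96 score.

```python
from sklearn.metrics import precision_recall_fscore_support

# Made-up ground truth and predictions for three of the five categories.
y_true = ["business", "business", "tech", "tech", "sports", "sports"]
y_pred = ["business", "tech", "tech", "tech", "sports", "sports"]

# average="macro" computes each metric per class, then takes the
# unweighted mean, so every class counts equally regardless of size.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

print(f"macro precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here the per-class F1 values are 0.667 (business), 0.8 (tech), and 1.0 (sports); the macro score is their plain average, so a weak class pulls the result down no matter how few documents it contains.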

⚙️ Model Architecture & Design

The engineering design wraps all dependencies into a unified scikit-learn Pipeline:

Custom NLP Transformer (TextCleaner): Inherits from BaseEstimator and TransformerMixin. It first applies regex-based cleaning (removal of HTML tags, URLs, email addresses, and line breaks), then runs a deterministic spaCy (en_core_web_sm) pass that tokenizes the text, drops stop words and punctuation, and reduces each remaining token to its base form (lemmatization).
Feature Extraction: A CountVectorizer configured to extract both unigrams and bigrams (ngram_range=(1, 2)), so that short multi-word phrases are kept as features alongside single words.
Classifier Layer: A MultinomialNB (Multinomial Naive Bayes) estimator, which is well suited to discrete token-count features.
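The repository's TextCleaner source is not reproduced here, so the following is only a minimal sketch of a transformer matching the description above. The regex patterns, the regex_clean and _load_model helpers, and the model_name parameter are all assumptions made for illustration.

```python
import re
from functools import lru_cache

from sklearn.base import BaseEstimator, TransformerMixin

# Assumed patterns; the original regexes are not shown in this document.
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+|www\.\S+")
EMAIL = re.compile(r"\S+@\S+\.\S+")
WHITESPACE = re.compile(r"\s+")


def regex_clean(text):
    """Remove HTML tags, URLs, and email addresses; collapse line breaks."""
    text = HTML_TAG.sub(" ", text)
    text = URL.sub(" ", text)
    text = EMAIL.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


@lru_cache(maxsize=1)
def _load_model(name):
    # Imported lazily so the module loads even where spaCy is absent.
    import spacy

    # The parser and NER components are not needed for lemmatization.
    return spacy.load(name, disable=["parser", "ner"])


class TextCleaner(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: regex cleaning followed by spaCy lemmatization."""

    def __init__(self, model_name="en_core_web_sm"):
        self.model_name = model_name

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        nlp = _load_model(self.model_name)
        cleaned = (regex_clean(doc) for doc in X)
        return [
            " ".join(
                tok.lemma_.lower()
                for tok in doc
                if not (tok.is_stop or tok.is_punct or tok.is_space)
            )
            for doc in nlp.pipe(cleaned)
        ]
```

Keeping the loaded spaCy model in a module-level cache instead of on the instance is one possible design choice: the transformer then holds only its model name, which keeps a pickled pipeline small, but en_core_web_sm must be installed wherever the pipeline is loaded.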

[Input Text] ──> [TextCleaner (spaCy)] ──> [CountVectorizer (1,2 Grams)] ──> [MultinomialNB] ──> [Output Class]
