BBC News Classification Pipeline (Production-Ready)

This repository hosts a production-optimized NLP pipeline that classifies news articles into five distinct categories: Business, Entertainment, Politics, Tech, and Sports.

Unlike workflows that keep text preprocessing separate from inference, this model packages its custom preprocessing inside a single serialized pipeline. The same cleaning code therefore runs at training and serving time, which removes preprocessing mismatches as a source of training-serving skew.
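The single-artifact idea can be sketched as follows. This is a minimal illustration, not the repository's actual export code: the toy corpus, the step names, and the in-memory buffer are assumptions, and the custom cleaning step is left out to keep the example short.

```python
import io

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; the real training data is not part of this sketch.
texts = [
    "shares rise as bank profits beat forecasts",
    "striker scores twice in cup final win",
    "bank shares fall after profit warning",
    "team wins league title after late goal",
]
labels = ["business", "sport", "business", "sport"]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 2))),
    ("classifier", MultinomialNB()),
])
pipeline.fit(texts, labels)

# Serialize the fitted pipeline as one artifact. An in-memory buffer is used
# here; a file path such as "model.joblib" would work the same way.
buffer = io.BytesIO()
joblib.dump(pipeline, buffer)
buffer.seek(0)

# The serving side loads the artifact and passes raw text straight in,
# so no preprocessing has to be re-implemented outside the pipeline.
restored = joblib.load(buffer)
sample = ["bank profits rise"]
prediction = restored.predict(sample)
```

One caveat with this approach: a pipeline that contains a custom class can only be loaded in an environment where that class is importable under the same module path.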

📊 Model Performance & Accuracy

The model was evaluated on a 20% holdout test set, stratified by class so that each of the five news categories appears in the test set in proportion to its share of the data.

  • Macro-F1 Score: 0.96
  • Evaluation Metrics: Macro-Precision and Macro-Recall were closely balanced, indicating comparable reliability across minority and majority classes.
  • Efficiency: Reaches this accuracy with feature engineering and a Naive Bayes classifier, which keeps inference compute costs well below those of transformer models such as BERT.
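The evaluation protocol described above can be sketched as follows. The toy corpus, the random seed, and the variable names are assumptions made for illustration. The scores it produces say nothing about the reported 0.96.

```python
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 10 short documents per category.
keywords = {
    "business": "market shares profit",
    "entertainment": "film music award",
    "politics": "election minister vote",
    "tech": "software mobile internet",
    "sports": "match goal champion",
}
texts, labels = [], []
for category, words in keywords.items():
    for i in range(10):
        texts.append(f"{words} report number {i}")
        labels.append(category)

# 20% holdout; stratify=labels preserves each category's share in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Macro averaging scores each class separately, then takes the unweighted
# mean, so small classes count as much as large ones.
macro_f1 = f1_score(y_test, y_pred, average="macro")
macro_precision = precision_score(y_test, y_pred, average="macro")
macro_recall = recall_score(y_test, y_pred, average="macro")
test_class_counts = Counter(y_test)
```

With 50 documents and a stratified 20% split, `test_class_counts` holds two test documents for each of the five categories.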

βš™οΈ Model Architecture & Design

The engineering design wraps all dependencies into a unified scikit-learn Pipeline:

  1. Custom NLP Transformer (TextCleaner): Inherits from BaseEstimator and TransformerMixin. It applies regex-based cleaning (removal of HTML tags, URLs, email addresses, and line breaks), then runs a deterministic spaCy (en_core_web_sm) pass that strips stop words and punctuation and reduces the remaining tokens to their base forms (lemmatization).
  2. Feature Extraction: A CountVectorizer configured to extract both unigrams and bigrams (ngram_range=(1, 2)), which preserves local multi-word features.
  3. Classifier Layer: A MultinomialNB (Multinomial Naive Bayes) estimator, which is well suited to discrete count features such as term frequencies.
[Input Text] ──> [TextCleaner (spaCy)] ──> [CountVectorizer (1,2 Grams)] ──> [MultinomialNB] ──> [Output Class]
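The three stages above could be wired together roughly as shown below. This is an illustrative reconstruction, not the repository's code: the regex patterns, the optional `nlp` argument, the `build_pipeline` helper, the step names, and the toy data are all assumptions. When no spaCy model is passed in, the sketch falls back to regex-only cleaning so that it runs without downloading en_core_web_sm.

```python
import re

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Assumed patterns; the original regexes are not shown in this card.
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+|www\.\S+")
EMAIL = re.compile(r"\S+@\S+\.\S+")
WHITESPACE = re.compile(r"\s+")  # also collapses line breaks


# scikit-learn recommends listing mixins before BaseEstimator.
class TextCleaner(TransformerMixin, BaseEstimator):
    """Illustrative stand-in for the card's TextCleaner."""

    def __init__(self, nlp=None):
        # nlp: optional loaded spaCy pipeline, e.g. spacy.load("en_core_web_sm")
        self.nlp = nlp

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def _clean(self, text):
        for pattern in (HTML_TAG, URL, EMAIL):
            text = pattern.sub(" ", text)
        text = WHITESPACE.sub(" ", text).strip()
        if self.nlp is None:
            return text  # regex-only mode when no spaCy model is supplied
        # Drop stop words, punctuation, and whitespace tokens; keep lemmas.
        tokens = [
            tok.lemma_.lower()
            for tok in self.nlp(text)
            if not (tok.is_stop or tok.is_punct or tok.is_space)
        ]
        return " ".join(tokens)

    def transform(self, X):
        return [self._clean(text) for text in X]


def build_pipeline(nlp=None):
    """Hypothetical helper that assembles the three stages."""
    return Pipeline([
        ("cleaner", TextCleaner(nlp=nlp)),
        ("vectorizer", CountVectorizer(ngram_range=(1, 2))),
        ("classifier", MultinomialNB()),
    ])


# Hypothetical raw inputs containing the kinds of noise the cleaner targets.
raw_texts = [
    "<p>Shares rise as bank profits beat forecasts</p>",
    "Striker scores twice in cup final win\nReport: https://example.com/match",
    "Bank shares fall after profit warning, contact desk@example.com",
    "Team wins league title after late goal",
]
labels = ["business", "sport", "business", "sport"]

pipeline = build_pipeline()
pipeline.fit(raw_texts, labels)
cleaned = TextCleaner().transform([raw_texts[1]])
predicted = pipeline.predict(["<b>Bank</b> profits rise"])
```

To include the lemmatization pass, a loaded model can be supplied, for example `build_pipeline(nlp=spacy.load("en_core_web_sm"))`, assuming spaCy and that model are installed.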