Transaction Classifier — SGD Baseline (v1)

A classical machine learning baseline using TF-IDF character n-gram features with a Stochastic Gradient Descent (SGD) classifier. This model classifies raw bank transaction strings into 10 budget categories.

This is version 1 (Phase 1) in a progressive model development series: the initial baseline that all subsequent models improved upon.

Model Details

Property | Value
Architecture | TF-IDF (char_wb 3-5 grams) + SGDClassifier
Task | Multi-class text classification (10 categories)
Training samples | 3,597,859
Max features | 100,000
Loss | Modified Huber
Alpha | 0.0001
Format | Joblib (scikit-learn pipeline)
Trained | 2026-03-28
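
The hyperparameters above map fairly directly onto scikit-learn objects. The following is a minimal sketch of how such a pipeline could be assembled, not the author's training script: the step names, the `build_pipeline` helper, the random seed, and the toy data are all assumptions made for illustration, and any setting not listed in the table is left at the scikit-learn default.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline


def build_pipeline(random_state: int = 0) -> Pipeline:
    """Assemble a TF-IDF + SGD pipeline from the hyperparameters in the table."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(
            analyzer="char_wb",     # character n-grams inside word boundaries
            ngram_range=(3, 5),     # 3- to 5-character n-grams
            max_features=100_000,   # vocabulary cap
        )),
        ("sgd", SGDClassifier(
            loss="modified_huber",  # this loss supports predict_proba
            alpha=1e-4,             # regularization strength
            random_state=random_state,  # assumption: the seed is not documented
        )),
    ])


# Toy data for illustration only; this is not the real training set.
toy_texts = ["STARBUCKS COFFEE 123", "TIM HORTONS #42", "UBER TRIP", "LYFT RIDE TORONTO"]
toy_labels = [0, 0, 1, 1]  # 0 = Food & Dining, 1 = Transportation

pipeline = build_pipeline()
pipeline.fit(toy_texts, toy_labels)
toy_predictions = pipeline.predict(["UBER TRIP HELP"])
```

Modified Huber loss is one of the SGDClassifier losses for which `predict_proba` is available, which is what makes the confidence scores in the Usage section possible.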

Categories

ID | Category
0 | Food & Dining
1 | Transportation
2 | Shopping & Retail
3 | Entertainment & Recreation
4 | Healthcare & Medical
5 | Utilities & Services
6 | Financial Services
7 | Income
8 | Government & Legal
9 | Charity & Donations

Performance

Evaluated on 505 unique real-world RBC transactions (3,113 weighted, 2019-2026).

Metric | Score
Real-world accuracy (weighted) | 53.7%
SGD-only accuracy | 28.4%
Validation accuracy | 98.0%
Flagged (low confidence) | 43.9%

Key finding: 98.0% validation accuracy versus 28.4% SGD-only accuracy on real transactions demonstrates severe domain mismatch between the synthetic training data and real bank statements.
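
This card does not define how the "weighted" figure is computed. One plausible reading, stated here as an assumption, is that each unique transaction string is weighted by how often it occurred in the statements. The sketch below shows how such a weighted accuracy differs from the unweighted one, using made-up labels and counts rather than the real evaluation data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical toy evaluation set: one row per unique transaction string.
y_true = np.array([0, 1, 2, 0])
y_pred = np.array([0, 1, 0, 2])
weights = np.array([10, 5, 1, 4])  # assumed: occurrences of each unique string

# Two of four unique strings are correct.
unweighted = accuracy_score(y_true, y_pred)                       # 0.5
# The correct strings account for 15 of 20 occurrences.
weighted = accuracy_score(y_true, y_pred, sample_weight=weights)  # 0.75
```

Under this reading, frequently repeated merchants dominate the weighted score, so the two numbers can diverge substantially.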

Usage

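import joblib

# Load the serialized scikit-learn pipeline (TF-IDF vectorizer + SGDClassifier).
pipeline = joblib.load("sgd_pipeline.joblib")

texts = ["MCDONALD'S #12345 TORONTO ON", "UBER TRIP HELP.UBER.COM"]
predictions = pipeline.predict(texts)                    # predicted category per text
confidences = pipeline.predict_proba(texts).max(axis=1)  # top class probability per text

print(predictions)
print(confidences)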
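
The Performance section reports a share of predictions flagged as low confidence, but the card does not state the threshold used. A minimal sketch of such flagging follows, assuming a cutoff on the top class probability; the `flag_low_confidence` helper, the 0.5 threshold, and the probability values are hypothetical.

```python
import numpy as np


def flag_low_confidence(probabilities: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Return a boolean mask marking rows whose top class probability is below threshold."""
    return probabilities.max(axis=1) < threshold


# Stand-in for pipeline.predict_proba(texts); values are made up for illustration.
example_probabilities = np.array([
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],  # ambiguous prediction
])
flags = flag_low_confidence(example_probabilities)
flagged_share = flags.mean()  # fraction of rows flagged
```

In practice the array passed in would be the output of `pipeline.predict_proba(texts)`, and flagged rows could be routed to manual review or a fallback rule.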

Dependencies

scikit-learn>=1.3
joblib

Training Data

Why This Model Exists

This baseline established two critical findings:

  1. Domain mismatch is the core challenge: 98% synthetic validation accuracy but only 28.4% on real transactions
  2. Character n-grams are insufficient: The TF-IDF features capture character patterns but not semantic meaning of merchant names

These findings motivated the shift to pre-trained language models (v3 SetFit, v4+ MiniLM).
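
To illustrate the second finding, the sketch below extracts the character n-grams that a `char_wb` analyzer with a 3-5 range produces for two example strings. The strings are invented for illustration; the point is that two transactions which arguably belong to different categories can share most of their most distinctive n-grams.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same analyzer settings as the model's TF-IDF step; no fitting is needed
# just to inspect which n-grams are extracted.
analyzer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)).build_analyzer()

ngrams_a = set(analyzer("UBER TRIP"))  # arguably Transportation
ngrams_b = set(analyzer("UBER EATS"))  # arguably Food & Dining

shared = ngrams_a & ngrams_b      # n-grams common to both strings
only_a = ngrams_a - ngrams_b      # n-grams unique to the first string

print(sorted(shared))
print(sorted(only_a))
```

Every n-gram derived from the token "UBER" appears in both sets, so the classifier has to separate the two strings using only the n-grams of the second token, with no notion of what either merchant name means.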

Part of a Series

See the Transaction Classifier collection for all 7 model versions.

Citation

@misc{zaidi2026txnclassifier,
  title={Transaction Classifier: Multi-Stage Bank Transaction Categorization},
  author={Maaz Zaidi},
  year={2026},
  url={https://huggingface.co/maaz-zaidi/transaction-classifier-sgd}
}