Transaction Classifier — SGD Baseline (v1)
A classical machine learning baseline using TF-IDF character n-gram features with a Stochastic Gradient Descent (SGD) classifier. This model classifies raw bank transaction strings into 10 budget categories.
This is version 1 (Phase 1) in a progressive model development series - the initial baseline that all subsequent models improved upon.
Model Details
| Property | Value |
|---|---|
| Architecture | TF-IDF (char_wb 3-5 grams) + SGDClassifier |
| Task | Multi-class text classification (10 categories) |
| Training samples | 3,597,859 |
| Max features | 100,000 |
| Loss | Modified Huber |
| Alpha | 0.0001 |
| Format | Joblib (scikit-learn pipeline) |
| Trained | 2026-03-28 |
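The settings in the table map fairly directly onto scikit-learn objects. The following is a minimal sketch of how such a pipeline could be assembled, not the author's training script: the step names (`tfidf`, `sgd`), the `random_state`, and the four toy transactions are assumptions for illustration, and all unlisted hyperparameters are left at scikit-learn defaults.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline


def build_pipeline() -> Pipeline:
    """Assemble a TF-IDF + SGD pipeline using the values from the table above."""
    return Pipeline([
        # Character n-grams of length 3-5, built only inside word boundaries.
        ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5),
                                  max_features=100_000)),
        # Modified Huber loss with alpha = 0.0001; random_state is an assumption.
        ("sgd", SGDClassifier(loss="modified_huber", alpha=1e-4, random_state=0)),
    ])


# Hypothetical toy data, only to show that the pipeline fits and predicts.
toy_texts = ["MCDONALDS #123 TORONTO", "TIM HORTONS #45", "UBER TRIP", "PRESTO FARE"]
toy_labels = [0, 0, 1, 1]  # 0 = Food & Dining, 1 = Transportation

toy_pipeline = build_pipeline().fit(toy_texts, toy_labels)
print(toy_pipeline.predict(["UBER TRIP HELP.UBER.COM"]))
```

The modified Huber loss is one of the SGDClassifier losses for which `predict_proba` is available, which is what makes a per-prediction confidence score possible.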
Categories
| ID | Category |
|---|---|
| 0 | Food & Dining |
| 1 | Transportation |
| 2 | Shopping & Retail |
| 3 | Entertainment & Recreation |
| 4 | Healthcare & Medical |
| 5 | Utilities & Services |
| 6 | Financial Services |
| 7 | Income |
| 8 | Government & Legal |
| 9 | Charity & Donations |
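For downstream code it can help to keep the table above as a lookup. This sketch assumes the model returns the integer IDs listed in the table; if the saved pipeline was trained on string labels instead, the mapping is unnecessary. The names `ID_TO_CATEGORY` and `label_name` are hypothetical.

```python
# Category IDs and names copied from the table above.
ID_TO_CATEGORY = {
    0: "Food & Dining",
    1: "Transportation",
    2: "Shopping & Retail",
    3: "Entertainment & Recreation",
    4: "Healthcare & Medical",
    5: "Utilities & Services",
    6: "Financial Services",
    7: "Income",
    8: "Government & Legal",
    9: "Charity & Donations",
}


def label_name(class_id) -> str:
    """Translate a predicted class ID (int or NumPy integer) to its category name."""
    return ID_TO_CATEGORY[int(class_id)]


print(label_name(1))
```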
Performance
Evaluated on 505 unique real-world RBC transactions (3,113 weighted, 2019-2026).
| Metric | Score |
|---|---|
| Real-world accuracy (weighted) | 53.7% |
| SGD-only accuracy | 28.4% |
| Validation accuracy | 98.0% |
| Flagged (low confidence) | 43.9% |
Key finding: the gap between 98.0% validation accuracy and 28.4% SGD-only accuracy on real transactions points to a severe domain mismatch between the synthetic training data and real bank statements.
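The card does not spell out how the weighted figures are computed. One plausible reading, and it is only an assumption, is that each of the 505 unique transaction strings is weighted by how often it occurs, giving 3,113 weighted observations. Under that assumption, weighted accuracy could be sketched as follows; the function name and the toy numbers are hypothetical.

```python
import numpy as np


def weighted_accuracy(y_true, y_pred, weights) -> float:
    """Accuracy where each unique transaction counts `weights[i]` times."""
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    return float(np.average(correct, weights=weights))


# Toy example: three unique strings seen 5, 3 and 2 times (made-up numbers).
y_true = [0, 1, 2]
y_pred = [0, 1, 5]   # the third prediction is wrong
counts = [5, 3, 2]

unweighted = weighted_accuracy(y_true, y_pred, [1, 1, 1])  # 2 of 3 correct
weighted = weighted_accuracy(y_true, y_pred, counts)       # 8 of 10 correct
print(unweighted, weighted)
```

With this kind of weighting, a frequently repeated merchant that is classified correctly raises the score more than a one-off transaction does.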
Usage
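import joblib
# Load the serialized scikit-learn pipeline (TF-IDF vectorizer + SGDClassifier).
# Only load joblib/pickle files from sources you trust.
pipeline = joblib.load("sgd_pipeline.joblib")
texts = ["MCDONALD'S #12345 TORONTO ON", "UBER TRIP HELP.UBER.COM"]
# Predicted category for each raw transaction string.
predictions = pipeline.predict(texts)
# Highest class probability per row, used as a confidence score.
confidences = pipeline.predict_proba(texts).max(axis=1)
print(predictions)
print(confidences)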
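The performance table reports a share of predictions flagged as low confidence, but the card does not state the cutoff. A minimal sketch of such flagging, assuming a simple threshold on the maximum class probability, might look like this; `flag_low_confidence` and the 0.6 threshold are hypothetical, and the probability matrix stands in for the output of `pipeline.predict_proba(texts)`.

```python
import numpy as np


def flag_low_confidence(probabilities, threshold: float = 0.6) -> np.ndarray:
    """Return a boolean mask that is True where the top class probability
    falls below `threshold` (an assumed cutoff, not the card's value)."""
    confidences = np.asarray(probabilities).max(axis=1)
    return confidences < threshold


# Stand-in for pipeline.predict_proba(texts): three rows, two classes.
probs = np.array([
    [0.90, 0.10],  # confident
    [0.55, 0.45],  # near tie, should be flagged
    [0.30, 0.70],  # confident
])
flags = flag_low_confidence(probs, threshold=0.6)
print(flags.tolist())  # [False, True, False]
print(f"flagged: {flags.mean():.1%}")
```

Flagged rows can then be routed to manual review or to a fallback rule rather than trusted as-is.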
Dependencies
scikit-learn>=1.3
joblib
Training Data
- Primary: mitulshah/transaction-categorization - full 3.6M records (gated dataset)
- Evaluation: 505 real-world RBC bank transactions (2019-2026)
Why This Model Exists
This baseline established two critical findings:
- Domain mismatch is the core challenge: 98% synthetic validation accuracy but only 28.4% on real transactions
- Character n-grams are insufficient: The TF-IDF features capture character patterns but not semantic meaning of merchant names
These findings motivated the shift to pre-trained language models (v3 SetFit, v4+ MiniLM).
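To make the second finding concrete, the sketch below lists the character n-grams that a `char_wb` analyzer with the 3-5 range extracts, and measures how many are shared by two strings. The two example strings are illustrative only and are not taken from the evaluation set; they would plausibly belong to different categories (Transportation and Food & Dining) while still overlapping heavily at the character level.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same analyzer settings as the model: char_wb, 3- to 5-grams.
# The analyzer lowercases text and pads each word with a space on both sides.
analyzer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)).build_analyzer()


def char_ngrams(text: str) -> set:
    """Set of character n-grams the vectorizer would see for `text`."""
    return set(analyzer(text))


ride = char_ngrams("UBER TRIP")
food = char_ngrams("UBER EATS")

shared = ride & food
jaccard = len(shared) / len(ride | food)

print(sorted(shared))     # n-grams that come from the common word "uber"
print(round(jaccard, 2))  # fraction of all n-grams that the two strings share
```

Every shared feature here comes from the surface form "uber"; nothing in the representation encodes that one string is a ride and the other a meal, which is the gap that pre-trained language models are meant to close.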
Part of a Series
See the Transaction Classifier collection for all 7 model versions.
Citation
@misc{zaidi2026txnclassifier,
title={Transaction Classifier: Multi-Stage Bank Transaction Categorization},
author={Maaz Zaidi},
year={2026},
url={https://huggingface.co/maaz-zaidi/transaction-classifier-sgd}
}