Transaction Classifier — Augmented MiniLM (v5)

A fine-tuned sentence-transformers/all-MiniLM-L6-v2 model trained on abbreviation-augmented transaction data. The goal was to improve robustness to the abbreviations common in bank transactions by adding synthetic character-level variants of each training sample.

This is version 5 (Phase 6a) in a progressive model development series. It was an experimental model that did not improve over the standard fine-tuned MiniLM (v4) and was not adopted for production use.

Model Details

Property Value
Base model sentence-transformers/all-MiniLM-L6-v2 (22M params)
Task Multi-class text classification (10 categories)
Training samples 173,761 (50K base + 3x augmentation)
Epochs 4
Batch size 64
Learning rate 2e-5
Max sequence length 64 tokens
Loss Cross-entropy
Format SafeTensors
Trained 2026-04-03

Categories

ID Category
0 Food & Dining
1 Transportation
2 Shopping & Retail
3 Entertainment & Recreation
4 Healthcare & Medical
5 Utilities & Services
6 Financial Services
7 Income
8 Government & Legal
9 Charity & Donations

Performance

Metric Score
Validation accuracy 98.5%
Validation confidence 98.3%

Note: Despite the high validation scores, this model regressed on real-world evaluation relative to the standard fine-tuned MiniLM (v4, 86.5% real-world accuracy). The synthetic abbreviations introduced tokenization noise: the WordPiece tokenizer split them into fragments that hurt generalization rather than helping it.
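
The card does not define "validation confidence". One common reading, assumed here, is the mean of the highest softmax probability across validation samples. The sketch below shows that computation in plain Python; the function names `softmax` and `mean_confidence` are illustrative and not taken from the original training code.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_confidence(batch_logits):
    """Average of the top-class probability over a batch of logit rows.

    Assumption: "confidence" means max softmax probability per sample.
    """
    return sum(max(softmax(row)) for row in batch_logits) / len(batch_logits)
```

Under this reading, a high mean confidence on validation data says nothing about calibration on real bank statements, which is consistent with the gap between validation and real-world results noted above.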

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "maaz-zaidi/transaction-classifier-minilm-augmented"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

categories = [
    "Food & Dining", "Transportation", "Shopping & Retail",
    "Entertainment & Recreation", "Healthcare & Medical",
    "Utilities & Services", "Financial Services", "Income",
    "Government & Legal", "Charity & Donations"
]

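# Example raw transaction description
text = "WALMART SUPERCENTER #1234"
# Truncate to the 64-token limit used during training
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)

# Inference only: disable gradient tracking
with torch.no_grad():
    logits = model(**inputs).logits
    # Index of the highest-scoring class (IDs 0-9 map onto `categories`)
    predicted = torch.argmax(logits, dim=-1).item()

print(f"Category: {categories[predicted]}")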

Training Data

  • Primary: mitulshah/transaction-categorization (gated dataset), 50K base samples plus 3x abbreviation augmentation, 173,761 training samples in total
  • Augmentation strategy: three character-level abbreviation variants were generated per sample, mimicking real bank statement patterns (vowel dropping, truncation, etc.)
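
The augmentation code itself is not included in this card. As a rough illustration of the two named patterns (vowel dropping and truncation), a minimal sketch might look like the following. The helper names, the rule-selection logic, and the `keep=4` truncation length are all assumptions made for this example, not the original pipeline.

```python
import random

VOWELS = set("AEIOU")

def drop_vowels(word: str) -> str:
    """Keep the first character and drop later vowels: WALMART -> WLMRT."""
    if len(word) <= 3:
        return word  # very short words are left alone
    return word[0] + "".join(ch for ch in word[1:] if ch.upper() not in VOWELS)

def truncate(word: str, keep: int = 4) -> str:
    """Keep only the first `keep` characters: SUPERCENTER -> SUPE."""
    return word[:keep]

def augment(text: str, n_variants: int = 3, seed: int = 0) -> list[str]:
    """Generate `n_variants` abbreviated copies of a transaction string.

    Hypothetical scheme: each alphabetic word independently gets one of
    {vowel dropping, truncation, unchanged}; other tokens (e.g. "#1234")
    are kept as-is.
    """
    rng = random.Random(seed)
    rules = [drop_vowels, truncate, lambda w: w]
    variants = []
    for _ in range(n_variants):
        words = [
            rng.choice(rules)(w) if w.isalpha() else w
            for w in text.split()
        ]
        variants.append(" ".join(words))
    return variants
```

For example, `drop_vowels("MCDONALDS")` yields `MCDNLDS`. Note that random variants produced this way can coincide with each other or with the original string.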

Why This Experiment

Real bank transactions are heavily abbreviated (MCDNLDS, AMZN MKTP, WLMRT). The hypothesis was that training on abbreviated variants would make the model robust to these patterns, even though the WordPiece vocabulary itself stays fixed during fine-tuning. In practice, the augmented abbreviations created tokenization noise that hurt more than it helped. The later metadata-enrichment approach (v7) proved far more effective.
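
To make "tokenization noise" concrete, the sketch below implements greedy longest-match-first subword splitting, the matching rule used by WordPiece tokenizers, over a tiny made-up vocabulary. `TOY_VOCAB` is purely illustrative and is not the real MiniLM vocabulary; input is assumed to be lowercased already.

```python
def wordpiece(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Split `word` into subwords by greedy longest-match-first lookup.

    Non-initial pieces carry the "##" continuation prefix.
    """
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no prefix matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary: the full word is present, the abbreviation is not
TOY_VOCAB = {"walmart", "w", "##l", "##m", "##mr", "##r", "##t"}

print(wordpiece("walmart", TOY_VOCAB))  # ['walmart']
print(wordpiece("wlmrt", TOY_VOCAB))    # ['w', '##l', '##mr', '##t']
```

The full word maps to one meaningful token, while the vowel-dropped form breaks into short fragments that carry little signal on their own. Synthetic variants multiply such fragment sequences in the training set.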

Part of a Series

See the Transaction Classifier collection for all 7 model versions.

Limitations

  • Regressed on real-world accuracy compared to the standard fine-tune (v4)
  • Abbreviation augmentation proved counterproductive for this WordPiece-based model
  • Superseded by the metadata-enrichment approach (v7)

Citation

@misc{zaidi2026txnclassifier,
  title={Transaction Classifier: Multi-Stage Bank Transaction Categorization},
  author={Maaz Zaidi},
  year={2026},
  url={https://huggingface.co/maaz-zaidi/transaction-classifier-minilm-augmented}
}