Transaction Classifier — Augmented MiniLM (v5)

A fine-tuned sentence-transformers/all-MiniLM-L6-v2 model trained on abbreviation-augmented transaction data. The goal was to improve robustness to the abbreviations common in bank transactions by adding synthetic character-level variants of the training samples.

This is version 5 (Phase 6a) in a progressive model development series. It was an experimental model that did not improve over the standard fine-tuned MiniLM (v4) and was not adopted for production use.

Model Details

Property	Value
Base model	`sentence-transformers/all-MiniLM-L6-v2` (22M params)
Task	Multi-class text classification (10 categories)
Training samples	173,761 (50K base + 3x augmentation)
Epochs	4
Batch size	64
Learning rate	2e-5
Max sequence length	64 tokens
Loss	Cross-entropy
Format	SafeTensors
Trained	2026-04-03
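
As a rough check on the scale of this run, the sample count, batch size, and epoch count in the table imply the number of optimizer steps. This is a back-of-the-envelope sketch that assumes a single device, no gradient accumulation, and that the final partial batch is kept; none of those details are stated in this card.

```python
import math

# Values copied from the Model Details table.
TRAIN_SAMPLES = 173_761
BATCH_SIZE = 64
EPOCHS = 4

# Assumption: the last, partially filled batch is not dropped.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(steps_per_epoch, total_steps)  # 2716 10864
```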

ID	Category
0	Food & Dining
1	Transportation
2	Shopping & Retail
3	Entertainment & Recreation
4	Healthcare & Medical
5	Utilities & Services
6	Financial Services
7	Income
8	Government & Legal
9	Charity & Donations

Performance

Metric	Score
Validation accuracy	98.5%
Validation confidence	98.3%

Note: This model regressed on real-world evaluation compared to the standard fine-tuned MiniLM (v4, 86.5% real-world accuracy). The WordPiece tokenizer's vocabulary is fixed during fine-tuning, so the synthetic abbreviations were split into unfamiliar subword fragments. That added tokenization noise rather than helping the model generalize.
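
The card does not say how the two validation figures were computed. A common convention, assumed here, is that accuracy is the fraction of correct argmax predictions and confidence is the mean softmax probability assigned to the predicted class. The function name `accuracy_and_confidence` and the toy logits are illustrative only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def accuracy_and_confidence(batch_logits, labels):
    """Return (accuracy, mean confidence) under the assumed definitions."""
    correct = 0
    confidences = []
    for logits, label in zip(batch_logits, labels):
        probs = softmax(logits)
        predicted = max(range(len(probs)), key=probs.__getitem__)
        correct += int(predicted == label)
        # Confidence = probability of the predicted class, right or wrong.
        confidences.append(probs[predicted])
    n = len(labels)
    return correct / n, sum(confidences) / n

# Toy two-class example: two of three predictions are correct.
acc, conf = accuracy_and_confidence(
    [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]], [0, 1, 1]
)
print(f"accuracy={acc:.3f} confidence={conf:.3f}")
```

Under this definition, a confidence close to the accuracy on validation data says little about calibration on real bank strings, which is where this model regressed.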

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and the fine-tuned classification head from the Hub
model_name = "maaz-zaidi/transaction-classifier-minilm-augmented"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# List index corresponds to the class ID in the category table
categories = [
    "Food & Dining", "Transportation", "Shopping & Retail",
    "Entertainment & Recreation", "Healthcare & Medical",
    "Utilities & Services", "Financial Services", "Income",
    "Government & Legal", "Charity & Donations"
]

text = "WALMART SUPERCENTER #1234"
# Truncate to the same 64-token limit used during training
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    logits = model(**inputs).logits
    predicted = torch.argmax(logits, dim=-1).item()

print(f"Category: {categories[predicted]}")

Training Data

Primary: mitulshah/transaction-categorization (gated dataset), 50K samples + 3x abbreviation augmentation = 173,761 training samples
Augmentation strategy: 3 character-level abbreviation variants were generated per sample, mimicking real bank statement patterns (vowel dropping, truncation, etc.)
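
The augmentation script itself is not included in this card. The following is a minimal sketch of what vowel dropping and truncation could look like; the function names, the 4-character truncation length, and the choice of a combined third variant are all assumptions, not the author's actual rules.

```python
from typing import List

VOWELS = set("AEIOUaeiou")

def drop_vowels(word: str) -> str:
    """Keep the first character, drop vowels from the rest."""
    # Leave short words and tokens with digits or symbols untouched.
    if not word.isalpha() or len(word) <= 3:
        return word
    return word[0] + "".join(ch for ch in word[1:] if ch not in VOWELS)

def truncate(word: str, keep: int = 4) -> str:
    """Cut purely alphabetic words down to their first `keep` characters."""
    if not word.isalpha() or len(word) <= keep:
        return word
    return word[:keep]

def abbreviation_variants(text: str) -> List[str]:
    """Return up to 3 distinct abbreviated variants of a transaction string."""
    words = text.split()
    candidates = [
        " ".join(drop_vowels(w) for w in words),
        " ".join(truncate(w) for w in words),
        " ".join(truncate(drop_vowels(w)) for w in words),
    ]
    variants = []
    for candidate in candidates:
        # Skip variants that equal the input or repeat an earlier variant.
        if candidate != text and candidate not in variants:
            variants.append(candidate)
    return variants

print(abbreviation_variants("WALMART SUPERCENTER #1234"))
# ['WLMRT SPRCNTR #1234', 'WALM SUPE #1234', 'WLMR SPRC #1234']
```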

Why This Experiment

Real bank transactions use heavy abbreviations (MCDNLDS, AMZN MKTP, WLMRT). The hypothesis was that training on abbreviated variants would make the model robust to these patterns. However, the WordPiece tokenizer's vocabulary is fixed, and the augmented abbreviations created tokenization noise that hurt more than it helped. The later metadata-enrichment approach (v7) proved far more effective.
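
To illustrate the kind of tokenization noise described here, the sketch below implements greedy longest-match WordPiece splitting over a tiny, hypothetical vocabulary. `TOY_VOCAB` is invented for the example; the real all-MiniLM-L6-v2 vocabulary is much larger and will produce different pieces, but the fragmentation effect on out-of-vocabulary abbreviations is the same in kind.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary: one whole word plus a few single-letter pieces.
TOY_VOCAB = {"walmart", "w", "##l", "##m", "##r", "##t"}

print(wordpiece("walmart", TOY_VOCAB))  # ['walmart']
print(wordpiece("wlmrt", TOY_VOCAB))    # ['w', '##l', '##m', '##r', '##t']
```

One meaningful token becomes five near-meaningless fragments, so each abbreviated variant gives the classifier a very different input sequence from the sample it was derived from.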

Part of a Series

See the Transaction Classifier collection for all 7 model versions.

Limitations

Regressed on real-world accuracy compared to the standard fine-tune (v4)
In this experiment, abbreviation augmentation was counterproductive for a WordPiece-based model
Superseded by the metadata-enrichment approach (v7)

Citation

@misc{zaidi2026txnclassifier,
  title={Transaction Classifier: Multi-Stage Bank Transaction Categorization},
  author={Maaz Zaidi},
  year={2026},
  url={https://huggingface.co/maaz-zaidi/transaction-classifier-minilm-augmented}
}

Safetensors

Model size: 22.7M params
Tensor type: F32
