Transaction Classifier — CANINE (v6)

A fine-tuned google/canine-s character-level model that classifies bank transaction strings into 10 budget categories. CANINE operates directly on Unicode characters rather than WordPiece tokens, which in theory suits the abbreviations and non-standard text found in bank transactions.

This is version 6 (Phase 6b) in a progressive model development series. It was an experimental model that did not improve over the MiniLM baseline and was not adopted for production use.

Model Details

| Property | Value |
|---|---|
| Base model | google/canine-s (CANINE variant pre-trained with a subword loss; inputs are still raw characters) |
| Task | Multi-class text classification (10 categories) |
| Training samples | 173,761 (50K base + 3x augmentation) |
| Epochs | 5 |
| Batch size | 32 |
| Learning rate | 5e-5 |
| Max sequence length | 128 characters |
| Loss | Cross-entropy |
| Format | SafeTensors |
| Size | ~504 MB |
| Trained | 2026-04-03 |
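
As a rough sanity check on the training budget, the figures in the table imply the number of optimizer steps computed below. This is only a sketch: it assumes all 173,761 samples were used for training, a single device, no gradient accumulation, and that the final partial batch is kept. None of these are stated in this card, and `training_steps` is a hypothetical helper.

```python
import math

# Values taken from the Model Details table.
TRAIN_SAMPLES = 173_761
BATCH_SIZE = 32
EPOCHS = 5


def training_steps(n_samples: int, batch_size: int, epochs: int) -> tuple[int, int]:
    """Return (steps per epoch, total steps), keeping the last partial batch."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


steps_per_epoch, total_steps = training_steps(TRAIN_SAMPLES, BATCH_SIZE, EPOCHS)
print(steps_per_epoch, total_steps)  # 5431 27155
```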

Categories

| ID | Category |
|---|---|
| 0 | Food & Dining |
| 1 | Transportation |
| 2 | Shopping & Retail |
| 3 | Entertainment & Recreation |
| 4 | Healthcare & Medical |
| 5 | Utilities & Services |
| 6 | Financial Services |
| 7 | Income |
| 8 | Government & Legal |
| 9 | Charity & Donations |

Performance

| Metric | Score |
|---|---|
| Validation accuracy | 98.2% |

Note: This model regressed on real-world evaluation compared to the MiniLM fine-tuned model (v4). While character-level processing is conceptually appealing for noisy bank transaction text, the pre-trained semantic knowledge in MiniLM's sentence embeddings proved more valuable than CANINE's character-level flexibility.

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "maaz-zaidi/transaction-classifier-canine"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

categories = [
    "Food & Dining", "Transportation", "Shopping & Retail",
    "Entertainment & Recreation", "Healthcare & Medical",
    "Utilities & Services", "Financial Services", "Income",
    "Government & Legal", "Charity & Donations"
]

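# Example raw transaction description
text = "MCDONALD'S #12345 TORONTO ON"
# Truncate to the 128-position limit used during training
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

# Inference only, so skip gradient tracking
with torch.no_grad():
    logits = model(**inputs).logits
    # Index of the highest-scoring class
    predicted = torch.argmax(logits, dim=-1).item()

print(f"Category: {categories[predicted]}")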
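
The snippet above reports only the top category. If a confidence score is also useful, the logits can be turned into probabilities with a softmax. The helper below is a minimal sketch in plain Python: `top_k_categories` is a hypothetical name rather than part of the model's API, and the example logits are made up for illustration. With the real model, `logits[0].tolist()` could be passed in instead.

```python
import math

CATEGORIES = [
    "Food & Dining", "Transportation", "Shopping & Retail",
    "Entertainment & Recreation", "Healthcare & Medical",
    "Utilities & Services", "Financial Services", "Income",
    "Government & Legal", "Charity & Donations",
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_categories(logits, k=3):
    """Return the k most probable (category, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(zip(CATEGORIES, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up logits for illustration only (not real model output).
example_logits = [4.0, 0.5, 1.0, 0.0, -1.0, 0.0, -0.5, -2.0, -1.5, -1.0]
for name, prob in top_k_categories(example_logits):
    print(f"{name}: {prob:.3f}")
```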

Training Data

  • Primary: mitulshah/transaction-categorization (gated dataset). 50K base samples plus 3x abbreviation augmentation, for 173,761 training samples in total
  • Augmentation: Three variants were generated per sample using character-level abbreviation patterns common in bank transactions
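
This card does not spell out the exact augmentation rules. Purely as an illustration, the sketch below produces three abbreviation-style variants per description using assumed rules (vowel dropping, word truncation, and whitespace removal). `abbreviation_variants` and its helpers are hypothetical and should not be read as the author's pipeline.

```python
VOWELS = set("AEIOU")


def drop_vowels(word: str) -> str:
    """Keep the first character and drop later vowels (WALMART -> WLMRT)."""
    return word[:1] + "".join(c for c in word[1:] if c.upper() not in VOWELS)


def truncate(word: str, length: int = 4) -> str:
    """Keep only the first `length` characters of a word."""
    return word[:length]


def abbreviation_variants(text: str) -> list[str]:
    """Return three assumed abbreviation-style variants of a description."""
    words = text.upper().split()
    return [
        " ".join(drop_vowels(w) for w in words),  # assumed rule 1: vowel dropping
        " ".join(truncate(w) for w in words),     # assumed rule 2: word truncation
        "".join(words),                           # assumed rule 3: whitespace removal
    ]


print(abbreviation_variants("Walmart Supercenter"))
# ['WLMRT SPRCNTR', 'WALM SUPE', 'WALMARTSUPERCENTER']
```

The vowel-dropping rule happens to reproduce forms such as MCDNLDS and AMZN, but real bank abbreviations follow no single rule, so an actual augmentation pipeline would likely mix several patterns.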

Why This Experiment

Bank transactions contain heavy abbreviations (MCDNLDS, AMZN MKTP, WLMRT) that WordPiece tokenizers split into fragments carrying little meaning. CANINE processes raw characters, so in theory it should handle these better. In practice, the pre-trained world knowledge in MiniLM's sentence embeddings (knowing that "MCDONALD'S" is a restaurant) was more valuable than character-level robustness.
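
To make the character-level idea concrete: CANINE's input IDs are Unicode code points, so an abbreviation and its full form share most of their input symbols regardless of how a subword vocabulary would split them. The sketch below mimics that encoding with Python's built-in `ord`. It omits the special tokens and padding that the real tokenizer adds, and `to_code_points` and `is_subsequence` are hypothetical helper names.

```python
def to_code_points(text: str, max_length: int = 128) -> list[int]:
    """Map each character to its Unicode code point, truncated to max_length."""
    return [ord(ch) for ch in text[:max_length]]


def is_subsequence(short: str, full: str) -> bool:
    """True if the characters of `short` appear in `full` in the same order."""
    remaining = iter(full)
    return all(ch in remaining for ch in short)


print(to_code_points("WLMRT"))                  # [87, 76, 77, 82, 84]
print(is_subsequence("WLMRT", "WALMART"))       # True
print(is_subsequence("MCDNLDS", "MCDONALDS"))   # True
```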

Part of a Series

See the Transaction Classifier collection for all 7 model versions.

Limitations

  • Regressed on real-world accuracy compared to MiniLM (v4)
  • 504 MB model size (~6x larger than MiniLM models)
  • Character-level models require more training data and compute to match subword models with pre-trained knowledge

Citation

@misc{zaidi2026txnclassifier,
  title={Transaction Classifier: Multi-Stage Bank Transaction Categorization},
  author={Maaz Zaidi},
  year={2026},
  url={https://huggingface.co/maaz-zaidi/transaction-classifier-canine}
}