Spanish Bank Transaction Classifier
Fine-tuned dccuchile/bert-base-spanish-wwm-cased to classify Spanish bank transaction descriptions into 14 spending categories.
Test accuracy: 86.4%
Motivation
I built a personal finance app and wanted automatic transaction categorization. The tricky part wasn't the model — it was the data. Real bank statements are messy: truncated merchant names, mixed case, reference codes, and genuinely ambiguous transactions (COMPRA AMAZON could be anything).
Dataset
Rather than scraping real transactions, I generated a synthetic dataset of 1,400 examples designed to be hard to classify. The key constraints:
- Realistic formatting: descriptions follow actual bank statement patterns (COMPRA MERCADONA, CARGO PERIODICO NETFLIX, TPV ZARA MAD)
- Deliberate ambiguity: the same merchant appears under different categories depending on context (NIKE → Ropa or Deporte, CARREFOUR → Alimentacion or Tecnologia)
- Imbalanced classes: reflects real spending distributions, not artificial balance
- Noisy examples: truncated strings, mixed case, embedded dates, Bizum transfers with no useful description
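The noise described above could be injected with something like the following. This is only a rough sketch, not the generation script used for the dataset: the `add_noise` helper, the set of perturbations, and the 6-character minimum are all assumptions made for illustration.

```python
import random

def add_noise(description: str, rng: random.Random) -> str:
    """Apply one randomly chosen bank-statement-style perturbation (hypothetical helper)."""
    choice = rng.choice(["truncate", "mixed_case", "date", "none"])
    if choice == "truncate":
        # Cut the string short, keeping at least 6 characters (assumed minimum).
        cut = rng.randint(6, max(6, len(description) - 1))
        return description[:cut]
    if choice == "mixed_case":
        # "COMPRA MERCADONA" becomes "Compra Mercadona".
        return description.title()
    if choice == "date":
        # Append an embedded DD/MM date, as some statements do.
        day, month = rng.randint(1, 28), rng.randint(1, 12)
        return f"{description} {day:02d}/{month:02d}"
    return description

rng = random.Random(0)  # fixed seed so the sample is reproducible
samples = [add_noise("COMPRA MERCADONA", rng) for _ in range(5)]
print(samples)
```

Passing an explicit `random.Random` instance keeps the perturbations reproducible, which matters when the same synthetic split has to be regenerated later.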
Dataset: marinamen/gastos-bancarios-es
Categories
Alimentacion · Restaurantes · Transporte · Ropa · Salud · Tecnologia · Hogar · Viajes · Educacion · Deporte · Entretenimiento · Energia · Alquiler · Varios
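For fine-tuning, each category name has to map to an integer id. A minimal sketch of that mapping follows; the id order here simply follows the list above and is an assumption, so the published model's config may assign ids differently.

```python
# The 14 categories, in the order listed above (id assignment is assumed).
CATEGORIES = [
    "Alimentacion", "Restaurantes", "Transporte", "Ropa", "Salud",
    "Tecnologia", "Hogar", "Viajes", "Educacion", "Deporte",
    "Entretenimiento", "Energia", "Alquiler", "Varios",
]

# Forward and reverse lookups, the shape a classification head config expects.
label2id = {name: idx for idx, name in enumerate(CATEGORIES)}
id2label = {idx: name for name, idx in label2id.items()}

print(len(label2id), id2label[label2id["Varios"]])
```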
Training
model = "dccuchile/bert-base-spanish-wwm-cased"
epochs = 15
learning_rate = 3e-5
batch_size = 16
max_length = 64
I chose Spanish BERT over multilingual DistilBERT after the multilingual model stalled at 42% accuracy; the Spanish-specific pretraining made a significant difference on short, noisy text.
Results
| Category | Precision | Recall | F1 |
|---|---|---|---|
| Alquiler | 1.000 | 1.000 | 1.000 |
| Educacion | 0.909 | 1.000 | 0.952 |
| Varios | 1.000 | 0.857 | 0.923 |
| Alimentacion | 0.867 | 0.929 | 0.897 |
| Deporte | 0.778 | 0.875 | 0.824 |
| Entretenimiento | 0.857 | 0.667 | 0.750 |
Entretenimiento is the weakest category — it overlaps with Tecnologia on merchants like Apple and Amazon, which is expected given how those companies operate.
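As a sanity check, the F1 column can be recomputed from precision and recall as their harmonic mean, F1 = 2PR / (P + R). The sketch below uses only the rows shown in the table; small differences are expected because the inputs are already rounded to three decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the table above.
rows = {
    "Alquiler": (1.000, 1.000, 1.000),
    "Educacion": (0.909, 1.000, 0.952),
    "Varios": (1.000, 0.857, 0.923),
    "Alimentacion": (0.867, 0.929, 0.897),
    "Deporte": (0.778, 0.875, 0.824),
    "Entretenimiento": (0.857, 0.667, 0.750),
}

# Largest gap between the recomputed and the reported F1 across all rows.
max_deviation = max(abs(f1_score(p, r) - f1) for p, r, f1 in rows.values())
print(f"max deviation from reported F1: {max_deviation:.4f}")
```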
Usage
from transformers import pipeline
classifier = pipeline("text-classification", model="marinamen/gastos-bancarios-classifier")
classifier("CARGO PERIODICO NETFLIX")
# [{'label': 'Entretenimiento', 'score': 0.97}]
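For genuinely ambiguous descriptions, one option is to fall back to a catch-all category when the top score is low. The sketch below works on the list-of-dicts output shown above. The `pick_category` helper, the 0.6 threshold, and the choice of Varios as fallback are assumptions for illustration, not part of the model, and the low-confidence prediction is made up.

```python
def pick_category(predictions, threshold=0.6, fallback="Varios"):
    """Return the top label, or the fallback when confidence is below threshold.

    predictions: list of {'label': str, 'score': float} dicts.
    """
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= threshold else fallback

# Output copied from the usage example above.
confident = [{"label": "Entretenimiento", "score": 0.97}]
# Hypothetical low-confidence output, invented for this illustration.
ambiguous = [{"label": "Tecnologia", "score": 0.41}]

print(pick_category(confident))
print(pick_category(ambiguous))
```

A sensible threshold depends on how costly a wrong category is compared with an uncategorized one, so it would need tuning on held-out data.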
Try it
Interactive demo: gastos-bancarios-space