Spanish Bank Transaction Classifier

Fine-tuned dccuchile/bert-base-spanish-wwm-cased to classify Spanish bank transaction descriptions into 14 spending categories.

Test accuracy: 86.4%

Motivation

I built a personal finance app and wanted automatic transaction categorization. The tricky part wasn't the model — it was the data. Real bank statements are messy: truncated merchant names, mixed case, reference codes, and genuinely ambiguous transactions (COMPRA AMAZON could be anything).

Dataset

Rather than scraping real transactions, I generated a synthetic dataset of 1,400 examples designed to be hard to classify. The key constraints:

  • Realistic formatting — descriptions follow actual bank statement patterns (COMPRA MERCADONA, CARGO PERIODICO NETFLIX, TPV ZARA MAD)
  • Deliberate ambiguity — the same merchant appears under different categories depending on context (NIKE → Ropa or Deporte, CARREFOUR → Alimentacion or Tecnologia)
  • Imbalanced classes — reflects real spending distributions, not artificial balance
  • Noisy examples — truncated strings, mixed case, embedded dates, Bizum transfers with no useful description

Dataset: marinamen/gastos-bancarios-es
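
The kinds of noise listed above (truncated strings, mixed case, embedded dates) can be illustrated with a small sketch. This is not the generator used to build the dataset; the function names, the date format, and the choice of perturbations are all assumptions made for illustration.

```python
import random


def truncate(description: str, length: int) -> str:
    """Cut the description to at most `length` characters, as a bank export might."""
    return description[:length]


def mixed_case(description: str, rng: random.Random) -> str:
    """Randomly switch each character between upper and lower case."""
    return "".join(
        c.lower() if rng.random() < 0.5 else c.upper() for c in description
    )


def embed_date(description: str, day: int, month: int) -> str:
    """Append a DD/MM date; the format is an assumption, not taken from the dataset."""
    return f"{description} {day:02d}/{month:02d}"


def add_noise(description: str, rng: random.Random) -> str:
    """Apply one randomly chosen perturbation to a clean description."""
    choice = rng.choice(["truncate", "mixed_case", "date"])
    if choice == "truncate":
        # Keep at least 4 characters so the result is never empty.
        cut = rng.randint(4, max(4, len(description) - 1))
        return truncate(description, cut)
    if choice == "mixed_case":
        return mixed_case(description, rng)
    return embed_date(description, rng.randint(1, 28), rng.randint(1, 12))


rng = random.Random(0)  # fixed seed so runs are repeatable
for clean in ["COMPRA MERCADONA", "CARGO PERIODICO NETFLIX", "TPV ZARA MAD"]:
    print(clean, "->", add_noise(clean, rng))
```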

Categories

Alimentacion · Restaurantes · Transporte · Ropa · Salud · Tecnologia · Hogar · Viajes · Educacion · Deporte · Entretenimiento · Energia · Alquiler · Varios
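
A sequence classification head needs these names mapped to integer ids. The sketch below builds that mapping in the order the categories are listed here; the order is an assumption, and the published model's own config may number the labels differently.

```python
# The 14 categories, in the order they are listed above (order is assumed).
CATEGORIES = [
    "Alimentacion", "Restaurantes", "Transporte", "Ropa", "Salud",
    "Tecnologia", "Hogar", "Viajes", "Educacion", "Deporte",
    "Entretenimiento", "Energia", "Alquiler", "Varios",
]

# Mappings in the shape a classification model config typically expects.
label2id = {name: idx for idx, name in enumerate(CATEGORIES)}
id2label = {idx: name for name, idx in label2id.items()}

print(len(CATEGORIES), "categories")
print(label2id["Entretenimiento"], id2label[0])
```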

Training

model = "dccuchile/bert-base-spanish-wwm-cased"
epochs = 15
learning_rate = 3e-5
batch_size = 16
max_length = 64
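
One way these settings could map onto the transformers Trainer API is sketched below. It is a guess at the wiring, not the actual training script: the "text" column name and the output directory are assumptions, and transformers is imported lazily so the hyperparameters can be inspected without it installed.

```python
MODEL_NAME = "dccuchile/bert-base-spanish-wwm-cased"
HPARAMS = {
    "epochs": 15,
    "learning_rate": 3e-5,
    "batch_size": 16,
    "max_length": 64,
}


def tokenize(batch, tokenizer):
    """Tokenize a batch of descriptions, truncating to max_length tokens.

    "text" is an assumed column name for the transaction description.
    """
    return tokenizer(
        batch["text"], truncation=True, max_length=HPARAMS["max_length"]
    )


def make_training_args(output_dir="gastos-classifier"):
    """Build TrainingArguments from HPARAMS (output_dir is hypothetical)."""
    from transformers import TrainingArguments  # lazy import

    return TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=HPARAMS["epochs"],
        learning_rate=HPARAMS["learning_rate"],
        per_device_train_batch_size=HPARAMS["batch_size"],
        per_device_eval_batch_size=HPARAMS["batch_size"],
    )
```

Note that max_length is applied at tokenization time rather than in the training arguments, which is why it appears in tokenize and not in make_training_args.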

I chose Spanish BERT over multilingual DistilBERT after the multilingual model stalled at 42%. The Spanish-specific pretraining made a significant difference on short, noisy text.

Results

Category          Precision   Recall   F1
Alquiler          1.000       1.000    1.000
Educacion         0.909       1.000    0.952
Varios            1.000       0.857    0.923
Alimentacion      0.867       0.929    0.897
Deporte           0.778       0.875    0.824
Entretenimiento   0.857       0.667    0.750

Entretenimiento is the weakest category — it overlaps with Tecnologia on merchants like Apple and Amazon, which is expected given how those companies operate.
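
For readers checking the table, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall values; the helper name is illustrative, and the numbers are copied from the table above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in the results table.
REPORTED = {
    "Alquiler": (1.000, 1.000, 1.000),
    "Educacion": (0.909, 1.000, 0.952),
    "Varios": (1.000, 0.857, 0.923),
    "Alimentacion": (0.867, 0.929, 0.897),
    "Deporte": (0.778, 0.875, 0.824),
    "Entretenimiento": (0.857, 0.667, 0.750),
}

for name, (p, r, reported) in REPORTED.items():
    print(f"{name:<16} reported={reported:.3f} recomputed={f1_score(p, r):.3f}")
```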

Usage

from transformers import pipeline

classifier = pipeline("text-classification", model="marinamen/gastos-bancarios-classifier")
classifier("CARGO PERIODICO NETFLIX")
# [{'label': 'Entretenimiento', 'score': 0.97}]
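
Because some descriptions are genuinely ambiguous, it may be useful to flag low-confidence predictions for manual review. The wrapper below is a sketch of that idea: categorize and the 0.6 threshold are hypothetical and not part of the model. It accepts any callable that takes a list of strings and returns one label/score dict per input, which is the shape the pipeline above returns for list input.

```python
def categorize(descriptions, classifier, min_score=0.6):
    """Classify a batch and flag predictions scoring below min_score.

    min_score=0.6 is an arbitrary illustrative threshold, not a tuned value.
    """
    texts = list(descriptions)
    predictions = classifier(texts)  # one {'label', 'score'} dict per text
    return [
        {
            "text": text,
            "label": pred["label"],
            "score": pred["score"],
            "needs_review": pred["score"] < min_score,
        }
        for text, pred in zip(texts, predictions)
    ]
```

With the pipeline defined above, this would be called as categorize(["CARGO PERIODICO NETFLIX", "COMPRA AMAZON"], classifier).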

Try it

Interactive demo: gastos-bancarios-space

Model details

Format: Safetensors · Model size: 0.1B params · Tensor type: F32