ScrubData planner — Qwen3-4B fine-tuned for tabular cleaning plans

A ≤4B planner for hands-off data cleaning: it reads an aggregated column profile (per-value frequency counts) and emits a structured JSON cleaning plan that a deterministic pandas executor applies. Built for the Build Small Hackathon (🏡 Backyard AI · Tiny Titan · Well-Tuned).
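The profile, plan, and execute flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual schema: the `profile_column` and `apply_plan` helpers, the plan layout, and the `map_values` op name are all hypothetical, and plain Python lists of dicts stand in for the pandas executor.

```python
from collections import Counter


def profile_column(values):
    """Aggregate a column into (value, count) pairs, most frequent first."""
    return Counter(values).most_common()


def apply_plan(rows, plan):
    """Apply a plan deterministically. The 'map_values' op is a hypothetical example."""
    cleaned = [dict(row) for row in rows]  # copy so the input rows stay untouched
    for step in plan["steps"]:
        if step["op"] != "map_values":
            raise ValueError(f"unsupported op: {step['op']}")
        column, mapping = step["column"], step["mapping"]
        for row in cleaned:
            row[column] = mapping.get(row[column], row[column])
    return cleaned


rows = [{"city": "Boston"}, {"city": "Bostn"}, {"city": "Boston"}, {"city": "boston"}]

# The planner would see only this aggregated profile, not the raw rows.
profile = profile_column(row["city"] for row in rows)

# Hypothetical plan of the kind a planner might emit for that profile.
plan = {
    "steps": [
        {
            "op": "map_values",
            "column": "city",
            "mapping": {"Bostn": "Boston", "boston": "Boston"},
        }
    ]
}

cleaned = apply_plan(rows, plan)
```

Because the executor only ever applies ops from a fixed vocabulary, the model's output can be validated and replayed without the model in the loop.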

Live demo: https://huggingface.co/spaces/build-small-hackathon/scrubdata · Code/paper: see the Space repo (docs/paper/) · Traces: build-small-hackathon/scrubdata-traces

What's special about the training data

Every training example is execution-verified: a candidate (dirty table, plan) pair is kept only if running the executor on the dirty table with that plan recovers the known-clean table. The mix is synthetic high-cardinality categorical tables (Zipf long-tail distributions with realistic typos) plus 20% real-derived pairs from the Raha benchmarks (cell-aligned, restricted to learnable canonicalizations).
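As a rough sketch of that filter, assuming a toy executor: `run_plan`, `keep_verified`, and the `{column: {bad_value: canonical}}` plan layout below are hypothetical stand-ins, not the project's real executor or plan format.

```python
def run_plan(dirty, plan):
    """Toy executor: plan is a hypothetical {column: {bad_value: canonical}} mapping."""
    return [
        {col: plan.get(col, {}).get(val, val) for col, val in row.items()}
        for row in dirty
    ]


def keep_verified(candidates, executor=run_plan):
    """Keep only candidates whose plan reproduces the known-clean table exactly."""
    return [
        c for c in candidates
        if executor(c["dirty"], c["plan"]) == c["clean"]
    ]


clean = [{"state": "Texas"}, {"state": "Ohio"}]
dirty = [{"state": "Texsa"}, {"state": "Ohio"}]

candidates = [
    # This plan repairs the typo, so the pair survives the filter.
    {"dirty": dirty, "clean": clean, "plan": {"state": {"Texsa": "Texas"}}},
    # This plan maps to the wrong canonical, so the pair is dropped.
    {"dirty": dirty, "clean": clean, "plan": {"state": {"Texsa": "Texan"}}},
]

verified = keep_verified(candidates)
```

The check is exact table equality, so a plan that is only partially right contributes nothing to the training set.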

Measured

  • Canonicalization micro-F1 0.90 (vs 0.45 for a much larger zero-shot generic model, 0.13 for a rule heuristic) on frozen held-out gold.
  • Real hospital typos (Raha, out-of-distribution): repair recall rises from 0.00 to 0.42 once the real-derived 20% is added (synthetic-only training fails to transfer, which is documented honestly).
  • In production the model is wrapped with reference grounding + calibrated abstention (it never free-generates a canonical for a grounded column type).
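One way to picture reference grounding with abstention is the sketch below. It is an assumption-laden illustration only: `ground_value` is a hypothetical helper, string similarity from the standard-library `difflib` stands in for whatever matching the production wrapper uses, and the fixed `cutoff` stands in for a calibrated threshold.

```python
import difflib


def ground_value(raw, reference, cutoff=0.8):
    """Return the closest reference entry, or None to abstain.

    The value is never free-generated: the result is either an entry
    taken verbatim from `reference` or an explicit abstention.
    """
    lowered = {entry.lower(): entry for entry in reference}
    matches = difflib.get_close_matches(
        raw.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


# Hypothetical reference taxonomy for a grounded column type.
reference = ["Cardiology", "Neurology", "Oncology"]

grounded = ground_value("Cardiolgy", reference)   # close to a reference entry
abstained = ground_value("Parking", reference)    # nothing close enough
```

Abstaining leaves the original cell untouched, which trades some recall for not inventing canonicals outside the reference list.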

How to run

Ollama / llama.cpp (recommended): use the non-thinking Modelfile from the Space repo (notebooks/Modelfile) with the Q8_0 GGUF, ricalanis/scrubdata-qwen3-4b-v4-q8. Q4_K_M exports made with Unsloth 2026.6.x corrupt this model, so use Q8_0.

Transformers (bf16 + adapter): suppress the tool-call tokens at decode time; otherwise the base model's tool-calling prior dominates the output:

model.generate(..., suppress_tokens=[151657, 151658])  # <tool_call>, </tool_call>

Limitations

English-centric; plans use a closed op vocabulary; canonicalization quality on entity columns depends on the reference taxonomy's coverage; not a de-identification guarantee.
