ScrubData planner — Qwen3-4B fine-tuned for tabular cleaning plans
A ≤4B planner for hands-off data cleaning: it reads an aggregated column profile (per-value frequency counts) and emits a structured JSON cleaning plan that a deterministic pandas executor applies. Built for the Build Small Hackathon (🏡 Backyard AI · Tiny Titan · Well-Tuned).
Live demo: https://huggingface.co/spaces/build-small-hackathon/scrubdata ·
Code/paper: see the Space repo (docs/paper/) · Traces:
build-small-hackathon/scrubdata-traces
What's special about the training data
Every training example is execution-verified: a candidate (dirty table, plan) pair is kept only if running the executor on it provably recovers the known-clean table. Mix: synthetic high-cardinality categorical tables (Zipf long-tail + realistic typos) + 20% real-derived pairs from the Raha benchmarks (cell-aligned, learnable canonicalizations only).
Shipped composition (WS1 — verified union planner): in the product, every
model-proposed mapping is scored by a deterministic verifier (errors-are-rare frequency
gates, variant similarity, reference agreement; threshold SCRUBDATA_TAU, default 0.5)
and unioned with the grounded heuristic. Measured on hospital's 509 real errors:
0.905 precision @ 0.413 coverage (gated model plan alone: 0.993 @ 0.287 — 146/147
committed changes correct; seed-robust: 0.891 ± 0.012 @ 0.396 ± 0.025 over 3 training
seeds). Dropped merges become review flags, never silent skips.
Measured
- Canonicalization micro-F1 0.90 (vs 0.45 for a much larger zero-shot generic model, 0.13 for a rule heuristic) on frozen held-out gold.
- Real hospital typos (Raha, OOD): repair recall 0.00 → 0.42 from adding the real-derived 20% (synthetic-only fails to transfer — documented honestly).
- In production the model is wrapped with reference grounding + calibrated abstention (it never free-generates a canonical for a grounded column type).
How to run
Ollama / llama.cpp (recommended): use the non-thinking Modelfile from the Space repo
(notebooks/Modelfile). Q8_0 GGUF: ricalanis/scrubdata-qwen3-4b-v4-q8 (Q4_K_M corrupts
this model on Unsloth 2026.6.x exports — use Q8_0).
Transformers (bf16 + adapter): suppress the tool-call tokens at decode time or the base model's tool-calling prior dominates:
model.generate(..., suppress_tokens=[151657, 151658]) # <tool_call>, </tool_call>
Limitations
English-centric; plans use a closed op vocabulary; canonicalization quality on entity columns depends on the reference taxonomy's coverage; not a de-identification guarantee.
Model tree for ricalanis/scrubdata-qwen3-4b
Base model
Qwen/Qwen3-4B-Instruct-2507