Field Notes: shipping two on-device vision apps on a 1.3B model

Published June 15, 2026

Build Small Hackathon · Backyard AI track · model: openbmb/MiniCPM-V-4.6 (~1.3B)

I set out to build practical tools for people I actually know — not a demo that only works on the happy path. Two shipped, both running entirely on-device on a single ~1.3-billion-parameter vision-language model, no cloud APIs:

🚗 Lot Scout: reads a used-car listing screenshot and flags scams, salvage titles, and bad prices before you waste a Saturday.
🥫 SafeBite: scans a food label and flags your allergens, including the hidden ones, in the grocery aisle with no signal.

Here's what I learned getting there.

Verify the model before you build on it

The brief assumed MiniCPM-V used the old model.chat(msgs=...) API. It doesn't. Version 4.6 uses the standard apply_chat_template + model.generate() path (no .chat()), loaded via MiniCPMV4_6ForConditionalGeneration. I also nearly grabbed the wrong checkpoint: MiniCPM-V-4_5 is 8.7B (over the 4B Tiny-Titan bar), while MiniCPM-V-4.6 is ~1.3B. Ten minutes reading the model card and a real run against the official demo Space saved hours of building on a wrong assumption. Phase zero is non-negotiable.

The one insight that shaped both apps: perception vs. judgment

A 1.3B model is genuinely good at perception: reading text out of an image. It is genuinely unreliable at judgment: world knowledge, arithmetic, calibrated estimates.

I learned this the expensive way. Lot Scout's first design asked the model for a fair-price band. It confidently returned $40,000–$60,000 for a 2016 BMW worth ~$20k — anchoring near MSRP and ignoring depreciation. On a $6,900 scam listing, that would have told the buyer it was the deal of the century. Dangerous.
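
For contrast, a deterministic price check is a few lines of arithmetic. The sketch below is a minimal bounded depreciation heuristic, not Lot Scout's actual code: the function names, the 15% yearly rate, the 10% floor, and the ±25% band are all illustrative assumptions.

```python
def estimate_price_band(msrp, age_years, yearly_rate=0.15,
                        floor_fraction=0.10, spread=0.25):
    """Rough fair-price band from MSRP and age. All rates are assumed placeholders."""
    # Compound depreciation, bounded below so very old cars never reach zero value.
    retained = max((1 - yearly_rate) ** age_years, floor_fraction)
    midpoint = msrp * retained
    return midpoint * (1 - spread), midpoint * (1 + spread)


def price_flag(asking, band):
    """Compare an asking price against the estimated band."""
    low, high = band
    if asking < low:
        return "suspiciously low"  # a too-good price is a scam signal, not a bargain
    if asking > high:
        return "above estimate"
    return "within estimate"


# Usage with made-up numbers: a five-year-old car with a $30,000 MSRP.
band = estimate_price_band(msrp=30_000, age_years=5)
print(band, price_flag(6_900, band))
```

The point of the bound and the band is that the output stays an estimate with visible assumptions, which is easier to label honestly than a number the model invented.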

So both apps converged on the same architecture: one vision call for perception, deterministic rules for everything else.
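
A minimal sketch of that split, using a used-car listing as the example. It is not the apps' actual code: the field names, the regexes, and the advice strings are assumptions, and the vision call is passed in as a plain function so the rules can be exercised without a model.

```python
import re
from typing import Callable

# Hypothetical phrase rules; a real list would be longer and tuned on real listings.
SCAM_PATTERNS = {
    "deposit-first": r"\bdeposit\b",
    "will-ship": r"\bwill ship\b",
    "no test drive": r"\bno test drives?\b",
    "manufactured urgency": r"\b(must sell|today only|act fast)\b",
}


def validate(raw: dict) -> dict:
    """Normalize the model's raw strings into typed fields."""
    match = re.search(r"\d[\d,]*", str(raw.get("price", "")))
    return {
        "title": str(raw.get("title", "")).strip(),
        "description": str(raw.get("description", "")).strip(),
        "price": int(match.group().replace(",", "")) if match else None,
    }


def flag(listing: dict) -> list[str]:
    """Deterministic checks only; no model call happens here."""
    text = f"{listing['title']} {listing['description']}".lower()
    flags = []
    if re.search(r"\bsalvage\b", text):
        flags.append("salvage-title language")
    flags += [name for name, rx in SCAM_PATTERNS.items() if re.search(rx, text)]
    return flags


def advise(flags: list[str]) -> str:
    if flags:
        return "Proceed with caution: " + "; ".join(flags)
    return "No rule fired; still verify in person."


def run_pipeline(image, extract: Callable[[object], dict]) -> dict:
    """extract is the single vision call; everything after it is rules."""
    listing = validate(extract(image))
    flags = flag(listing)
    return {"listing": listing, "flags": flags, "advice": advise(flags)}
```

Because extract is injected, a stub such as `lambda img: {"title": "...", "description": "...", "price": "$6,900"}` is enough to unit-test every rule.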

extract (vision model) → validate/normalize (rules) → match/flag (rules) → advise (rules)

Lot Scout: the model extracts facts; deterministic rules detect salvage-title language and scam patterns (deposit-first, will-ship, no test drive, manufactured urgency), and a bounded depreciation heuristic does the price sanity check, clearly labelled an estimate, with KBB/Edmunds links for the real number.

SafeBite: the model OCRs the ingredients; a deterministic alias dictionary matches against your allergen profile (whey/casein → dairy, semolina → gluten, albumin → egg) with word-boundary matching so eggplant doesn't trip egg and peanut butter doesn't trip dairy.

The payoff: the apps never hallucinate a "safe to eat" or a fake market price, the logic is auditable, and each runs in 3–5 seconds.
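
The alias matching can be sketched in a few lines. This is an illustration, not SafeBite's actual dictionary: only whey/casein, semolina, and albumin come from the notes above, and the other entries and the phrase table are assumptions. Word boundaries are enough to keep egg from matching inside eggplant, but butter is a whole word inside peanut butter, so this sketch resolves multi-word phrases first and blanks them out before matching single terms.

```python
import re

# Ingredient term -> allergen family. Mostly hypothetical entries.
ALIASES = {
    "milk": "dairy", "whey": "dairy", "casein": "dairy", "butter": "dairy",
    "semolina": "gluten", "wheat": "gluten",
    "albumin": "egg", "egg": "egg", "eggs": "egg",
    "peanut": "peanut", "peanuts": "peanut",
}

# Multi-word phrases whose parts would otherwise mislead (assumed table).
# None means the phrase maps to no tracked allergen.
PHRASES = {"peanut butter": "peanut", "cocoa butter": None}


def match_allergens(ingredients: str, profile: set[str]) -> dict[str, list[str]]:
    """Return {allergen family: [matched terms]} for families in the profile."""
    text = ingredients.lower()
    hits: dict[str, list[str]] = {}
    for phrase, family in PHRASES.items():
        pattern = rf"\b{re.escape(phrase)}\b"
        if re.search(pattern, text):
            if family in profile:
                hits.setdefault(family, []).append(phrase)
            text = re.sub(pattern, " ", text)  # keep its words from rematching
    for term, family in ALIASES.items():
        if family in profile and re.search(rf"\b{re.escape(term)}\b", text):
            hits.setdefault(family, []).append(term)
    return hits


print(match_allergens("Eggplant, whey powder, peanut butter", {"egg", "dairy"}))
```

Returning the matched terms alongside the family keeps the result auditable: the user can see which word on the label triggered the flag.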

Small models can't be trusted to format, so don't ask them to

Two concrete failures and their fixes:

Nested JSON breaks. When asked for structured JSON with nested arrays, the model truncated mid-array and dropped quotes around values containing apostrophes. Fix: keep schemas flat, and add a lenient parser that repairs the one glitch it reliably makes (a stray . where a , belongs).

It can't tell "Contains" from "May contain." SafeBite needs that distinction (AVOID vs CAUTION), and the model kept misbucketing. Fix: have it copy allergen statements verbatim, then classify contains-vs-may-contain deterministically by trigger phrase ("may contain", "made on shared equipment", "traces of"). Move the judgment out of the model.

General rule: ask the model only for what it perceives, and own the structure yourself.
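
Both fixes are small. The sketch below shows one way they could look; the function names, the repair regex, and the UNKNOWN fallback are assumptions, and only the three trigger phrases and the AVOID/CAUTION labels come from the notes above.

```python
import json
import re

# A period sitting where a comma belongs: right after a value and
# right before the next "key": in a flat object.
_STRAY_PERIOD = re.compile(r'(["\d\]}])\s*\.\s*(?="[^"]*"\s*:)')


def parse_lenient(raw: str) -> dict:
    """Parse strictly first; on failure, repair stray periods and retry once."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # If the repaired text is still invalid, this raises and the caller decides.
        return json.loads(_STRAY_PERIOD.sub(r"\1, ", raw))


MAY_CONTAIN_TRIGGERS = ("may contain", "made on shared equipment", "traces of")


def classify_statement(statement: str) -> str:
    """Bucket a verbatim allergen statement by trigger phrase."""
    text = statement.lower()
    # Check the weaker wording first so it is never promoted to AVOID.
    if any(trigger in text for trigger in MAY_CONTAIN_TRIGGERS):
        return "CAUTION"
    if "contains" in text:
        return "AVOID"
    return "UNKNOWN"


print(parse_lenient('{"make": "BMW". "year": 2016}'))
print(classify_statement("May contain traces of tree nuts"))
```

The repair only runs after strict parsing fails, so well-formed output is never rewritten.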

Make the agent loop visible (and shareable)

Both apps log every pipeline step's input/output/timing to a trace, and expose a downloadable JSON trace per run. I published representative traces as open datasets (lot-scout-traces, safebite-traces) so the reasoning is inspectable, not a black box.
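
A per-run trace of this kind needs very little machinery. The sketch below is a minimal version under assumptions: the Trace class, its field names, and the JSON layout are hypothetical and not the schema of the published datasets.

```python
import json
import time
from typing import Any, Callable


class Trace:
    """Records input, output, and wall-clock time for each pipeline step."""

    def __init__(self) -> None:
        self.steps: list[dict[str, Any]] = []

    def run(self, name: str, fn: Callable[[Any], Any], payload: Any) -> Any:
        """Call fn(payload), log the step, and pass the result through."""
        start = time.perf_counter()
        result = fn(payload)
        self.steps.append({
            "step": name,
            "input": payload,
            "output": result,
            "seconds": round(time.perf_counter() - start, 4),
        })
        return result

    def to_json(self) -> str:
        # default=str keeps the dump from failing on non-JSON values such as images.
        return json.dumps(self.steps, indent=2, default=str)


# Usage: wrap each stage, then offer to_json() as the downloadable file.
trace = Trace()
cleaned = trace.run("normalize", str.strip, "  whey powder  ")
print(trace.to_json())
```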

Off the grid is a feature, not a footnote

Everything runs in the Space on the model itself, with no network round-trip per request. SafeBite's whole premise depends on it: you're in a grocery aisle with bad reception, and your health profile never leaves the device. "Small + local" isn't a constraint to apologize for; for these use cases it's the actual selling point.

What I'd do next

Ground the price band in real market comps (Lot Scout) and add barcode + crowd-sourced corrections (SafeBite). Ship a quantized build for true phone-offline use. Stream traces to the open datasets live.

Two small, honest, on-device tools that do one useful thing well. Try them: Lot Scout · SafeBite.

#BuildSmallHackathon · built with Gradio on Hugging Face Spaces (ZeroGPU).
