Field Notes: shipping two on-device vision apps on a 1.3B model

Community Article
Published June 15, 2026

Build Small Hackathon · Backyard AI track · model: openbmb/MiniCPM-V-4.6 (~1.3B)

I set out to build practical tools for people I actually know, not a demo that only works on the happy path. Two shipped, both running entirely on-device on a single ~1.3-billion-parameter vision-language model, with no cloud APIs:

🚗 Lot Scout: reads a used-car listing screenshot and flags scams, salvage titles, and bad prices before you waste a Saturday.

🥫 SafeBite: scans a food label and flags your allergens, including the hidden ones, right in the grocery aisle with no signal.

Here's what I learned getting there.

  1. Verify the model before you build on it

The brief assumed MiniCPM-V used the old model.chat(msgs=...) API. It doesn't: 4.6 uses the standard apply_chat_template + model.generate() path (no .chat()), loaded via MiniCPMV4_6ForConditionalGeneration. I also nearly grabbed the wrong checkpoint: MiniCPM-V-4_5 is 8.7B (over the 4B Tiny-Titan bar), while MiniCPM-V-4.6 is ~1.3B. Ten minutes reading the model card and one real run against the official demo Space saved hours of building on a wrong assumption. Phase zero is non-negotiable.

  2. The one insight that shaped both apps: perception vs. judgment

A 1.3B model is genuinely good at perception (reading text out of an image). It is genuinely unreliable at judgment (world knowledge, arithmetic, calibrated estimates).

I learned this the expensive way. Lot Scout's first design asked the model for a fair-price band. It confidently returned $40,000–$60,000 for a 2016 BMW worth ~$20k, anchoring near MSRP and ignoring depreciation. On a $6,900 scam listing, that band would have told the buyer it was the deal of the century. Dangerous.

So both apps converged on the same architecture: one vision call for perception, deterministic rules for everything else.

extract (vision model) → validate/normalize (rules) → match/flag (rules) → advise (rules)

Lot Scout: the model extracts facts; deterministic rules detect salvage-title language and scam patterns (deposit-first, will-ship, no test drive, manufactured urgency), and a bounded depreciation heuristic does the price sanity check. The result is clearly labelled an estimate, with KBB/Edmunds links for the real number.

SafeBite: the model OCRs the ingredients; a deterministic alias dictionary matches them against your allergen profile (whey/casein → dairy, semolina → gluten, albumin → egg), with word-boundary matching so eggplant doesn't trip egg and peanut butter doesn't trip dairy.

The payoff: the apps never hallucinate a "safe to eat" verdict or a fake market price, the logic is auditable, and each runs in 3–5 seconds.
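To make the rule layer concrete, here is a minimal Python sketch of what both apps' deterministic side could look like. It is not the apps' actual code: the phrase patterns, the alias lists, the NON_MATCHES exception list, and the depreciation parameters (15% yearly decay, a floor of 10% of MSRP, a ±25% band) are all hypothetical, and it assumes an MSRP baseline comes from some deterministic source, which the article doesn't specify. One wrinkle worth flagging: word boundaries alone stop "egg" from matching "eggplant", but "butter" is still a whole word inside "peanut butter", so the sketch strips known non-dairy compounds before matching.

```python
import re

# --- Lot Scout rules: hypothetical patterns, illustrative only ---
SCAM_PATTERNS = {
    "deposit-first": r"\b(deposit|pay)\b[^.]*\b(first|before|to hold)\b",
    "will-ship": r"\b(will|can) ship\b",
    "no-test-drive": r"\bno test drives?\b",
    "urgency": r"\b(today only|first come,? first served|must sell (today|now))\b",
}
SALVAGE_TITLE = r"\b(salvage|rebuilt|flood|branded)\s+title\b"


def listing_flags(text: str) -> list[str]:
    """Return the names of rules whose pattern appears in the extracted listing text."""
    lowered = text.lower()
    flags = [name for name, pattern in SCAM_PATTERNS.items() if re.search(pattern, lowered)]
    if re.search(SALVAGE_TITLE, lowered):
        flags.append("salvage-title")
    return flags


# --- Lot Scout price check: bounded depreciation heuristic, parameters hypothetical ---
def price_band(msrp: float, age_years: int, rate: float = 0.15,
               floor: float = 0.10, spread: float = 0.25) -> tuple[float, float]:
    """Decay MSRP geometrically per year, clamp at floor * MSRP, widen by +/- spread."""
    estimate = max(msrp * (1 - rate) ** age_years, msrp * floor)
    return estimate * (1 - spread), estimate * (1 + spread)


def price_verdict(asking: float, band: tuple[float, float]) -> str:
    """Compare an asking price to the rough band; the wording stays explicitly hedged."""
    low, high = band
    if asking < low:
        return "suspiciously low (rough estimate; check KBB/Edmunds)"
    if asking > high:
        return "above rough estimate"
    return "within rough estimate"


# --- SafeBite rules: small illustrative alias dictionary ---
ALIASES = {
    "dairy": ["milk", "buttermilk", "whey", "casein", "butter"],
    "gluten": ["wheat", "semolina", "barley", "rye"],
    "egg": ["egg", "eggs", "albumin"],
}
# Word boundaries stop "egg" matching "eggplant", but "butter" is still a whole
# word inside "peanut butter", so this sketch strips assumed non-dairy compounds first.
NON_MATCHES = {"dairy": ["peanut butter", "cocoa butter"]}


def match_allergens(ingredients: str, profile: list[str]) -> dict[str, list[str]]:
    """Map each allergen in the profile to the aliases found in the ingredient text."""
    text = ingredients.lower()
    hits: dict[str, list[str]] = {}
    for allergen in profile:
        scrubbed = text
        for phrase in NON_MATCHES.get(allergen, []):
            scrubbed = re.sub(rf"\b{re.escape(phrase)}\b", " ", scrubbed)
        found = [alias for alias in ALIASES.get(allergen, [])
                 if re.search(rf"\b{re.escape(alias)}\b", scrubbed)]
        if found:
            hits[allergen] = found
    return hits
```

Nothing here asks the model for a number or a verdict: the model's only job upstream is to supply the text these functions read, which is what keeps the output auditable.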

  3. Small models can't be trusted to format, so don't ask them to

Two concrete failures and their fixes:

Nested JSON breaks. When I asked for structured JSON with nested arrays, the model truncated mid-array and dropped the quotes around values containing apostrophes. Fix: keep schemas flat, and add a lenient parser that repairs the one glitch it reliably makes (a stray . where a , belongs).

It can't tell "Contains" from "May contain." SafeBite needs that distinction (AVOID vs. CAUTION), and the model kept misbucketing. Fix: have it copy allergen statements verbatim, then classify contains-vs-may-contain deterministically by trigger phrase ("may contain", "made on shared equipment", "traces of"). Move the judgment out of the model.

General rule: ask the model only for what it perceives, and own the structure yourself.
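Both fixes amount to a few lines of plain code. The sketch below is one way to write them in Python, not the apps' actual parser: lenient_parse and classify_statements are hypothetical names, the repair regex handles only the stray-period glitch described above (a truncated object is rejected, not repaired), and the trigger list is just the three phrases quoted here.

```python
import json
import re
from typing import Optional

# A '.' the model wrote where a ',' belongs: directly before the next "key":
_STRAY_PERIOD = re.compile(r'\s*\.\s*(?="[^"\n]*"\s*:)')


def lenient_parse(raw: str) -> Optional[dict]:
    """Parse a flat JSON object from model output, repairing one known glitch."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        return None  # truncated or missing object: refuse rather than guess
    text = raw[start:end + 1]
    # Try the text as-is first, then with stray periods turned back into commas.
    for candidate in (text, _STRAY_PERIOD.sub(", ", text)):
        try:
            data = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict):
            return data
    return None


# Trigger phrases quoted in the article; a real list would likely be longer.
MAY_CONTAIN_TRIGGERS = ("may contain", "made on shared equipment", "traces of")


def classify_statements(verbatim: str) -> list[tuple[str, str]]:
    """Label each copied allergen sentence AVOID (contains) or CAUTION (may contain)."""
    labelled = []
    for sentence in re.split(r"(?<=[.;])\s+", verbatim.strip()):
        if not sentence:
            continue
        lowered = sentence.lower()
        # Fail safe: anything without a precautionary trigger is treated as "Contains".
        label = "CAUTION" if any(t in lowered for t in MAY_CONTAIN_TRIGGERS) else "AVOID"
        labelled.append((sentence, label))
    return labelled
```

Defaulting to AVOID when no trigger phrase is present is a deliberate choice in this sketch: for an allergy app, a wrongly cautious label costs the shopper one product, while a wrongly relaxed one can cost far more.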

  4. Make the agent loop visible (and shareable)

Both apps log every pipeline step's input, output, and timing to a trace, and expose a downloadable JSON trace for each run. I published representative traces as open datasets (lot-scout-traces, safebite-traces) so the reasoning is inspectable, not a black box.
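A minimal sketch of that kind of step tracing in Python. The names run_traced and trace_to_json and the three toy steps are hypothetical, and fake_extract stands in for the real vision-model call; the article doesn't describe the apps' actual trace schema, so the field names (step, input, output, ms) are assumptions too.

```python
import json
import time
from typing import Any, Callable

Step = tuple[str, Callable[[Any], Any]]


def run_traced(steps: list[Step], payload: Any) -> tuple[Any, list[dict]]:
    """Run steps in order, recording each step's input, output, and wall-clock time."""
    trace = []
    for name, fn in steps:
        start = time.perf_counter()
        output = fn(payload)
        trace.append({
            "step": name,
            "input": payload,
            "output": output,
            "ms": round((time.perf_counter() - start) * 1000, 3),
        })
        payload = output  # each step's output feeds the next step
    return payload, trace


def trace_to_json(trace: list[dict]) -> str:
    # default=str keeps non-JSON values (e.g. an image object) from crashing the dump
    return json.dumps(trace, indent=2, default=str)


# Hypothetical stand-in steps; fake_extract replaces the real vision-model call.
def fake_extract(image_path: str) -> dict:
    return {"ingredients": "Semolina, WHEY protein"}


def normalize(fields: dict) -> dict:
    return {key: value.lower().strip() for key, value in fields.items()}


def flag(fields: dict) -> list[str]:
    return [word for word in ("whey", "semolina") if word in fields["ingredients"]]


result, trace = run_traced(
    [("extract", fake_extract), ("normalize", normalize), ("flag", flag)],
    "label.jpg",
)
```

Because each trace entry holds a reference to the step's input, steps in this sketch return new objects instead of mutating their input; otherwise the recorded input could change after the fact. The string from trace_to_json is the kind of thing a per-run download could serve.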

  5. Off the grid is a feature, not a footnote

Everything runs on the model itself inside the Space, with no round-trip to an external API per request. SafeBite's whole premise depends on it: you're in a grocery aisle with bad reception, and your health profile never leaves the device. "Small + local" isn't a constraint to apologize for; for these use cases it's the actual selling point.

What I'd do next

Ground the price band in real market comps (Lot Scout), and add barcode support plus crowd-sourced corrections (SafeBite). Ship a quantized build for true phone-offline use. Stream traces to the open datasets live.

Two small, honest, on-device tools that each do one useful thing well. Try them: Lot Scout · SafeBite.

#BuildSmallHackathon · built with Gradio on Hugging Face Spaces (ZeroGPU).
