GUI Element Classifier: MobileNetV3-small (15 classes)

Lightweight (6 MB) MobileNetV3-small ONNX classifier for 15 GUI element types. CPU-friendly (about 5 ms per crop with ONNX Runtime) and designed as a deterministic preprocessing step before VLM-based GUI agent pipelines. No GPU required, and no dependencies beyond onnxruntime, numpy, and pillow.

What it's for

You have a screenshot, a list of detected element bounding boxes (from any detector: YOLOv8, OWL-ViT, SAM-then-filter, accessibility tree, or anything else), and you need cheap, deterministic per-element type labels (button vs text_input vs slider vs …) before passing the structured layout to a reasoning LLM. Drop this classifier into the pipeline as the typing layer:

[Screenshot]
   ↓
[Your detector] → list of bboxes
   ↓
[This classifier] → per-bbox type label + confidence
   ↓
[Your reasoning / action LLM] → reasons over typed elements, not pixels

The classifier is not a replacement for VLM captioning; it is the cheap deterministic layer that adds structure to your prompt so the LLM doesn't have to look at every region just to figure out what it is.
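
As a rough sketch of the typing layer, the loop below crops each detected box and attaches a label and confidence. The names TypedElement, type_elements, and classify_crop are hypothetical and not part of this repository: classify_crop stands in for whatever function wraps the ONNX session from Quick start, and boxes are assumed to be (left, top, right, bottom) pixel coordinates.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

from PIL import Image

BBox = Tuple[int, int, int, int]  # assumed (left, top, right, bottom) in pixels


@dataclass
class TypedElement:
    bbox: BBox
    label: str
    confidence: float


def type_elements(
    screenshot: Image.Image,
    bboxes: List[BBox],
    classify_crop: Callable[[Image.Image], Tuple[str, float]],
) -> List[TypedElement]:
    """Crop every box from the screenshot and label it with classify_crop."""
    width, height = screenshot.size
    elements = []
    for left, top, right, bottom in bboxes:
        # Clamp to the screenshot so boxes that overshoot stay inside it.
        left, top = max(0, left), max(0, top)
        right, bottom = min(width, right), min(height, bottom)
        if right <= left or bottom <= top:
            continue  # skip degenerate or fully off-screen boxes
        crop = screenshot.crop((left, top, right, bottom))
        label, confidence = classify_crop(crop)
        elements.append(TypedElement((left, top, right, bottom), label, confidence))
    return elements
```

The resulting list can then be serialized into the prompt for the reasoning LLM in whatever format your agent already uses.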

Classes (15)

button, checkbox, container, dropdown, icon_button, image, label, link, menu_item, scrollbar, slider, tab, text_input, toggle, unknown

The class indices in the model output (0..14) match the alphabetical ordering above. See classes.json for the canonical list.
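
A minimal sketch of the index mapping, assuming only the alphabetical ordering stated above. The names CLASSES, INDEX_TO_NAME, and NAME_TO_INDEX are illustrative, and nothing is assumed here about the internal structure of classes.json.

```python
# Class names exactly as listed above; the output index is the position in this list.
CLASSES = [
    "button", "checkbox", "container", "dropdown", "icon_button",
    "image", "label", "link", "menu_item", "scrollbar",
    "slider", "tab", "text_input", "toggle", "unknown",
]

# Guard against accidental reordering: the list must stay alphabetical.
assert CLASSES == sorted(CLASSES)
assert len(CLASSES) == 15

INDEX_TO_NAME = dict(enumerate(CLASSES))
NAME_TO_INDEX = {name: i for i, name in enumerate(CLASSES)}

print(NAME_TO_INDEX["text_input"])  # prints 12
```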

Files

| File | Purpose |
| --- | --- |
| mobilenetv3_small.onnx | ONNX export (fixed batch=1). Primary inference artifact. |
| mobilenetv3_small.pth | PyTorch state_dict for those who want to fine-tune further or re-export with dynamic axes. |
| classes.json | Class names + ordering. |
| inference_example.py | 100-line self-contained demo. Run pip install onnxruntime numpy pillow, then python inference_example.py crop.png. |

Quick start

from PIL import Image
import numpy as np
import onnxruntime as ort

CLASSES = ['button','checkbox','container','dropdown','icon_button',
           'image','label','link','menu_item','scrollbar',
           'slider','tab','text_input','toggle','unknown']

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD  = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(img: Image.Image) -> np.ndarray:
    img = img.convert("RGB")
    w, h = img.size
    m = max(w, h)
    pad = Image.new("RGB", (m, m), (128, 128, 128))
    pad.paste(img, ((m - w) // 2, (m - h) // 2))
    arr = np.array(pad.resize((224, 224), Image.BILINEAR), dtype=np.float32) / 255.0
    arr = (arr - MEAN) / STD
    return arr.transpose(2, 0, 1)[None, :, :, :].astype(np.float32)

sess = ort.InferenceSession("mobilenetv3_small.onnx",
                            providers=["CPUExecutionProvider"])
crop = Image.open("button.png")  # any cropped GUI element
logits = sess.run(None, {sess.get_inputs()[0].name: preprocess(crop)})[0]
exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
probs = exp / exp.sum()              # softmax over the single-crop batch
idx = int(probs.argmax())
print(CLASSES[idx], float(probs[0, idx]))
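
Because the model returns a full logit vector, it can be useful to look at the runner-up classes as well as the argmax. The helpers below are a sketch with hypothetical names (softmax, top_k), assuming logits of shape (1, num_classes) as produced by the snippet above.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def top_k(logits: np.ndarray, classes: list, k: int = 3) -> list:
    """Return the k most likely (class_name, probability) pairs for one crop.

    Expects logits of shape (1, num_classes), i.e. a batch of one.
    """
    probs = softmax(logits)[0]
    order = np.argsort(probs)[::-1][:k]  # indices sorted by descending probability
    return [(classes[i], float(probs[i])) for i in order]
```

With the Quick start variables in scope, top_k(logits, CLASSES) would list the three highest-scoring labels for the crop.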

Preprocessing (must-match-or-quality-degrades)

  1. PadToSquare with gray (128, 128, 128) on the shorter axis.
  2. Resize to 224x224 with Image.BILINEAR.
  3. array / 255.0 → float32 in [0, 1].
  4. ImageNet normalize: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225].
  5. Transpose HWC → CHW. Add batch dim.

The model is sensitive to all five steps: wrong pad colour, BICUBIC instead of BILINEAR, or skipping the ImageNet stats will degrade accuracy noticeably.

Performance

Aggregate metrics from the training-time held-out evaluation:

| Metric | Value |
| --- | --- |
| Test set | 1449 examples (sampled from the training distribution, not strictly out-of-distribution; see Limitations) |
| Accuracy | ~0.72 |
| Macro F1 | ~0.58 |
| Weighted F1 | ~0.76 |

Weighted F1 is dominated by the icon_button class (~50% of test support, F1=0.78). Per-class F1s vary widely: tab and text_input clear 0.95+, while container and slider lag at 0.15-0.25. If your domain skews heavily to specific classes, expect class-imbalance effects.
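
To see how a dominant class pulls weighted F1 above macro F1, and to run the same check on your own labelled crops, a dependency-free sketch follows. The function name f1_report and the toy labels are illustrative only; they are not the evaluation code or data behind the numbers above.

```python
from collections import Counter


def f1_report(y_true, y_pred):
    """Return (per_class_f1, macro_f1, weighted_f1) for two equal-length label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true examples per class
    per_class = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
    # Macro: every class counts equally. Weighted: classes count by support.
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[l] * support[l] for l in labels) / len(y_true)
    return per_class, macro, weighted


# Made-up toy data: one frequent class, one rare class that is half wrong.
y_true = ["icon_button"] * 8 + ["slider"] * 2
y_pred = ["icon_button"] * 8 + ["icon_button", "slider"]
per_class, macro, weighted = f1_report(y_true, y_pred)
print(per_class, macro, weighted)
```

In this toy case the rare class scores lower, so the macro average sits below the weighted one, which is the same pattern as the headline metrics.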

Latency: ~5 ms per crop on a modern x86 laptop CPU (12th-gen Intel i7, single thread). The shipped ONNX is fixed batch_size=1 for the simplest possible drop-in; if you're processing >100 crops per screenshot, re-export with dynamic axes from the included .pth for batching.
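
If you do re-export with a dynamic batch axis, the per-crop tensors need to be stacked before inference. The helper below is a sketch with a hypothetical name (make_batches) and an arbitrary default batch size; it assumes each input is a (1, 3, 224, 224) float32 array as returned by preprocess in Quick start. The shipped fixed-batch ONNX file is not expected to accept these batches.

```python
import numpy as np


def make_batches(tensors, batch_size=32):
    """Yield float32 arrays of shape (B, 3, 224, 224) from per-crop (1, 3, 224, 224) tensors."""
    for start in range(0, len(tensors), batch_size):
        chunk = tensors[start:start + batch_size]
        # Concatenate along the existing batch axis; the last batch may be smaller.
        yield np.concatenate(chunk, axis=0).astype(np.float32)
```

Each yielded array would then be passed to a session built from the re-exported model in place of the single-crop input.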

Limitations & honest scope

  • Training-time test-set metrics are not a tight estimate of accuracy on your domain. The aggregate numbers above (acc ~0.72, weighted F1 ~0.76) come from a single held-out split sampled from the training distribution, so this is not a strictly out-of-distribution evaluation. On a domain that differs from the training mix (web vs desktop, different OS look-and-feel, dark mode vs light mode, custom design systems), expect a meaningful gap from these numbers. Validate on your own labelled crops before depending on the figures.
  • Per-class F1 varies widely. tab and text_input clear 0.95+, icon_button sits around 0.78 (and dominates weighted F1 because of class imbalance, at ~50% of test support), while container and slider lag at 0.15-0.25. The aggregate numbers (acc / macro / weighted F1) hide this variance, so read the per-class story above before trusting the headline.
  • Linux desktop bias. Training data skews toward Linux desktop UIs (XFCE / GTK toolkits / Firefox / Mousepad / Thunar / terminal). Web pages, macOS, Windows 11, and mobile UIs are likely under-represented and may need domain adaptation.
  • unknown is a real class. When the classifier produces unknown with high confidence, the input is genuinely ambiguous (small icon with no clear visual identity); don't paper over it with argmax-but-skip-unknown logic.
  • container and slider underperform at training-time evaluation. Consider using bbox geometry as a sanity check (containers are large, sliders are wide-and-thin) alongside the model rather than trusting it alone for those two classes.
  • Single-label argmax. No multi-class output. If a region could legitimately be both tab and button (some apps style tabs as buttons), the model picks one.
  • Fixed batch_size=1 in the shipped ONNX. For high-throughput scenarios, re-export from the included .pth with dynamic axes (torch.onnx.export(..., dynamic_axes={'input': {0: 'batch'}})).
  • ImageNet preprocessing is assumed. The model was trained against the standard ImageNet mean/std + PadToSquare-with-gray. Substituting different normalization will silently degrade results.
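
The geometry sanity check suggested above for container and slider could look roughly like the sketch below. The function name geometry_flags and both thresholds are illustrative guesses rather than values from this model card, and the slider rule only covers horizontal sliders (wide and thin); tune or extend them for your own UI.

```python
def geometry_flags(label, bbox, screen_size,
                   min_container_area_frac=0.05, min_slider_aspect=4.0):
    """Return warnings when a predicted label disagrees with the box geometry.

    bbox is (left, top, right, bottom) in pixels; screen_size is (width, height).
    Both thresholds are hypothetical defaults, not tuned values.
    """
    left, top, right, bottom = bbox
    w, h = max(1, right - left), max(1, bottom - top)
    screen_w, screen_h = screen_size
    flags = []
    # Containers are expected to cover a noticeable share of the screen.
    if label == "container" and (w * h) / (screen_w * screen_h) < min_container_area_frac:
        flags.append("container prediction on a small box")
    # Horizontal sliders are expected to be much wider than they are tall.
    if label == "slider" and w / h < min_slider_aspect:
        flags.append("slider prediction on a box that is not wide and thin")
    return flags
```

A non-empty result is a hint to lower trust in the label or defer to another signal, not proof that the prediction is wrong.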

License

Apache-2.0. The MobileNetV3-small architecture itself was originally introduced by Google Research; this export uses the open architecture and re-trained weights.

Citation

@misc{gui_element_classifier_mobilenetv3_2026,
  author = {Diogo Neno},
  title  = {15-class MobileNetV3-small GUI Element Classifier},
  year   = {2026},
  url    = {https://huggingface.co/diogoneno/gui-element-classifier},
}

Changelog

  • 2026-05-02: Initial public release.