GoClick Model

This repository contains the on-device GUI element detection model for the GoClick Android app.

Model Description

GoClick is a Florence-2-based model that locates GUI elements on screen from natural language descriptions. The model runs entirely on-device using ONNX Runtime.

Model Variants

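This repository provides each model component (vision encoder, text encoder, decoder) at three precisions, INT8, FP16, and full-precision float32, so you can trade accuracy against size and speed: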

Variant	Size	Use Case
`vision_encoder_int8.onnx`	91MB	Best balance of speed and accuracy
`encoder_model_int8.onnx`	80MB	Best balance of speed and accuracy
`decoder_model_int8.onnx`	132MB	Best balance of speed and accuracy
`vision_encoder_fp16.onnx`	176MB	Better accuracy, larger size
`encoder_model_fp16.onnx`	158MB	Better accuracy, larger size
`decoder_model_fp16.onnx`	261MB	Better accuracy, larger size
`vision_encoder.onnx`	350MB	Full precision (float32)
`encoder_model.onnx`	316MB	Full precision (float32)
`decoder_model.onnx`	521MB	Full precision (float32)
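
To compare the total footprint of each precision, the per-file sizes in the table can be summed. The following is a minimal Python sketch; the `SIZES_MB` dictionary and the `total_size_mb` helper are illustrative names and not part of the repository.

```python
# File sizes in MB, copied from the table above.
SIZES_MB = {
    "int8": {
        "vision_encoder_int8.onnx": 91,
        "encoder_model_int8.onnx": 80,
        "decoder_model_int8.onnx": 132,
    },
    "fp16": {
        "vision_encoder_fp16.onnx": 176,
        "encoder_model_fp16.onnx": 158,
        "decoder_model_fp16.onnx": 261,
    },
    "fp32": {
        "vision_encoder.onnx": 350,
        "encoder_model.onnx": 316,
        "decoder_model.onnx": 521,
    },
}


def total_size_mb(precision):
    """Return the combined size of the three ONNX files for one precision."""
    return sum(SIZES_MB[precision].values())


for precision in SIZES_MB:
    print(f"{precision}: {total_size_mb(precision)} MB")
```

By these figures the INT8 set totals 303 MB, about a quarter of the float32 set (1187 MB), which matters when the files are bundled into the app's assets.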

Usage with GoClick App

1. Download the model files (we recommend starting with the INT8 versions).
2. Place them in the app/src/main/assets/ directory of the GoClick Android project.
3. Build and run the app.

Required Files for App

Minimum required files to run the app:

vision_encoder_int8.onnx
encoder_model_int8.onnx
decoder_model_int8.onnx
vocab.json
tokenizer.json
tokenizer_config.json
special_tokens_map.json
mask.jpg
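
Before building, it can help to confirm that every file in the list above is actually in the assets directory. The script below is a small sketch of such a check; the `missing_files` helper is a hypothetical name, and the script assumes it is run from the root of the GoClick Android project.

```python
from pathlib import Path

# Minimum file set, as listed above.
REQUIRED_FILES = [
    "vision_encoder_int8.onnx",
    "encoder_model_int8.onnx",
    "decoder_model_int8.onnx",
    "vocab.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "mask.jpg",
]


def missing_files(assets_dir):
    """Return the required file names that are absent from assets_dir."""
    root = Path(assets_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumed location: the assets directory relative to the project root.
    missing = missing_files("app/src/main/assets")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All required files are present.")
```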

How It Works

1. Screen Capture: The app captures the device screen using MediaProjection
2. Vision Encoder: Processes the screenshot into visual features
3. Text Encoder: Encodes the natural language query
4. Decoder: Generates coordinate tokens for the target element location
5. Post-processing: Converts tokens to actual screen coordinates
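
The post-processing step can be illustrated with a short Python sketch. Florence-2 expresses positions as location tokens such as `<loc_500>`, quantizing each axis into 1000 bins. This example assumes GoClick keeps that scheme and emits an (x, y) point as two consecutive location tokens; the exact output format is not stated here, and the function name and bin-center rounding are illustrative choices.

```python
import re

# Assumption: Florence-2 style quantization, 1000 bins per axis.
NUM_BINS = 1000
LOC_PATTERN = re.compile(r"<loc_(\d+)>")


def loc_tokens_to_point(decoded, screen_w, screen_h):
    """Map the first two <loc_N> tokens in decoded text to pixel (x, y)."""
    bins = [int(n) for n in LOC_PATTERN.findall(decoded)]
    if len(bins) < 2:
        raise ValueError("expected at least two location tokens")
    x_bin, y_bin = bins[0], bins[1]
    # Use the center of each bin, then scale to the screen resolution.
    x = (x_bin + 0.5) * screen_w / NUM_BINS
    y = (y_bin + 0.5) * screen_h / NUM_BINS
    return round(x), round(y)


# Example: a 1080x2400 screen.
print(loc_tokens_to_point("<loc_500><loc_250>", 1080, 2400))  # (541, 601)
```

Because the tokens are resolution-independent, the same decoded output maps to the correct pixel on any screen size once the actual width and height are supplied.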

Model Architecture

Based on Florence-2, adapted for mobile deployment:

Vision encoder for screen understanding
Text encoder for natural language queries
Auto-regressive decoder for coordinate generation
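
Auto-regressive generation means the decoder produces one token at a time, feeding each new token back in as context for the next. The sketch below shows greedy decoding in plain Python. The `next_token_logits` callable is a hypothetical stand-in for one run of the decoder model, and the token ids in the toy example are made up; this is not the app's actual decoding code.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_token_logits: Callable[[List[int]], Sequence[float]],
    bos_id: int,
    eos_id: int,
    max_new_tokens: int = 16,
) -> List[int]:
    """Generate tokens one at a time, always picking the highest-scoring id."""
    tokens = [bos_id]
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)
        # Greedy choice: index of the largest logit.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens


def toy_decoder(tokens: List[int]) -> List[float]:
    """Toy stand-in: emits ids 3 and 4 (pretend coordinate tokens), then EOS (1)."""
    script = [3, 4, 1]
    target = script[min(len(tokens) - 1, len(script) - 1)]
    return [1.0 if i == target else 0.0 for i in range(5)]


print(greedy_decode(toy_decoder, bos_id=0, eos_id=1))  # [0, 3, 4, 1]
```

Since a coordinate output is only a few tokens long, a small `max_new_tokens` cap keeps the loop short, which helps latency on a phone.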

License

MIT License
