GoClick Model

This is the on-device GUI element detection model for the GoClick Android App.

Model Description

GoClick is a Florence-2-based model that locates GUI elements on screen from natural language descriptions. It runs entirely on-device using ONNX Runtime.

Model Variants

This repository provides each of the three model components (vision encoder, text encoder, decoder) at three precision levels: INT8, FP16, and full-precision float32.

Variant                     Size     Use Case
vision_encoder_int8.onnx    91 MB    Best balance of speed and accuracy
encoder_model_int8.onnx     80 MB    Best balance of speed and accuracy
decoder_model_int8.onnx     132 MB   Best balance of speed and accuracy
vision_encoder_fp16.onnx    176 MB   Better accuracy, larger size
encoder_model_fp16.onnx     158 MB   Better accuracy, larger size
decoder_model_fp16.onnx     261 MB   Better accuracy, larger size
vision_encoder.onnx         350 MB   Full precision (float32)
encoder_model.onnx          316 MB   Full precision (float32)
decoder_model.onnx          521 MB   Full precision (float32)
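
Since the app loads one file per component, the footprint of a complete set is the sum of three rows. The short sketch below totals the sizes listed above for each precision level. The dictionary layout and function name are illustrative only and are not part of the repository.

```python
# File sizes in MB, copied from the variant table above.
SIZES_MB = {
    "int8": {
        "vision_encoder_int8.onnx": 91,
        "encoder_model_int8.onnx": 80,
        "decoder_model_int8.onnx": 132,
    },
    "fp16": {
        "vision_encoder_fp16.onnx": 176,
        "encoder_model_fp16.onnx": 158,
        "decoder_model_fp16.onnx": 261,
    },
    "fp32": {
        "vision_encoder.onnx": 350,
        "encoder_model.onnx": 316,
        "decoder_model.onnx": 521,
    },
}


def total_size_mb(precision: str) -> int:
    """Return the combined size of the three ONNX files for one precision."""
    return sum(SIZES_MB[precision].values())


for precision in SIZES_MB:
    print(f"{precision}: {total_size_mb(precision)} MB")
```

By these figures a full INT8 set is 303 MB, FP16 is 595 MB, and float32 is 1187 MB, not counting the tokenizer files and mask.jpg.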

Usage with GoClick App

  1. Download the model files (we recommend starting with the INT8 versions)
  2. Place them in the app/src/main/assets/ directory of the GoClick Android project
  3. Build and run the app

Required Files for App

Minimum required files to run the app:

  • vision_encoder_int8.onnx
  • encoder_model_int8.onnx
  • decoder_model_int8.onnx
  • vocab.json
  • tokenizer.json
  • tokenizer_config.json
  • special_tokens_map.json
  • mask.jpg
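
Before building, it can help to confirm that nothing from the list above is missing. The following is a minimal sketch of such a check, assuming it is run from the root of the Android project. The helper name `missing_assets` is hypothetical and not part of the GoClick code base.

```python
from pathlib import Path

# Minimum file set for the INT8 configuration, as listed above.
REQUIRED_FILES = [
    "vision_encoder_int8.onnx",
    "encoder_model_int8.onnx",
    "decoder_model_int8.onnx",
    "vocab.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "mask.jpg",
]


def missing_assets(assets_dir: str) -> list:
    """Return the required file names that are not present in assets_dir."""
    root = Path(assets_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Assumed location, taken from the usage steps; adjust if your layout differs.
missing = missing_assets("app/src/main/assets")
print("All required files present" if not missing else f"Missing: {missing}")
```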

How It Works

  1. Screen Capture: The app captures the device screen using MediaProjection
  2. Vision Encoder: Processes the screenshot into visual features
  3. Text Encoder: Encodes the natural language query
  4. Decoder: Generates coordinate tokens for the target element location
  5. Post-processing: Converts tokens to actual screen coordinates
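
The post-processing step can be sketched as follows. This is an illustration under stated assumptions, not the app's actual code: it assumes the decoder emits Florence-2-style location tokens of the form `<loc_N>` with coordinates quantized into 1000 bins, and that the coordinates are relative to the full screenshot. Whether GoClick outputs a point (two tokens) or a box (four tokens) is not stated here, so the sketch handles both and returns a tap point. The function name `decode_tap_point` is hypothetical.

```python
import re

NUM_BINS = 1000  # assumption: Florence-2-style quantization into 1000 bins
LOC_TOKEN = re.compile(r"<loc_(\d+)>")


def decode_tap_point(text: str, screen_width: int, screen_height: int):
    """Convert decoded location tokens into an (x, y) pixel position."""
    bins = [int(value) for value in LOC_TOKEN.findall(text)]
    if len(bins) < 2:
        raise ValueError(f"expected at least two location tokens, got {len(bins)}")

    def to_px(bin_index: int, size: int) -> float:
        # Map the bin to the centre of its pixel range.
        return (bin_index + 0.5) * size / NUM_BINS

    if len(bins) >= 4:
        # Treat the first four tokens as a box (x1, y1, x2, y2); tap its centre.
        x = (to_px(bins[0], screen_width) + to_px(bins[2], screen_width)) / 2
        y = (to_px(bins[1], screen_height) + to_px(bins[3], screen_height)) / 2
    else:
        # Treat the first two tokens as a single point (x, y).
        x = to_px(bins[0], screen_width)
        y = to_px(bins[1], screen_height)
    return round(x), round(y)


# Example on a 1080x2400 screen.
print(decode_tap_point("<loc_500><loc_250>", 1080, 2400))
```

If the screenshot is resized or padded before it reaches the vision encoder, the same scaling would have to be undone here as well.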

Model Architecture

Based on Florence-2, adapted for mobile deployment:

  • Vision encoder for screen understanding
  • Text encoder for natural language queries
  • Auto-regressive decoder for coordinate generation
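
To illustrate what auto-regressive generation means for the decoder, here is a minimal greedy decoding loop. It is a sketch only: the decoding strategy used by the app (greedy, beam search, or otherwise) is not documented here, and `next_token_logits` is a hypothetical callable standing in for a decoder forward pass. In a real pipeline it would wrap an ONNX Runtime session for the decoder model and also receive the encoder output; the tensor names involved are not given in this README, so they are left abstract.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_token_logits: Callable[[List[int]], Sequence[float]],
    start_token_id: int,
    eos_token_id: int,
    max_new_tokens: int = 16,
) -> List[int]:
    """Generate tokens one at a time, always picking the highest-scoring one."""
    tokens = [start_token_id]
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)  # scores over the vocabulary
        next_id = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_id)
        if next_id == eos_token_id:
            break  # the model signalled the end of the sequence
    return tokens[1:]  # drop the start token


# Toy stand-in for the decoder: replays a fixed script of token ids.
SCRIPT = [5, 7, 2]


def toy_logits(tokens: List[int]) -> List[float]:
    target = SCRIPT[min(len(tokens) - 1, len(SCRIPT) - 1)]
    return [1.0 if i == target else 0.0 for i in range(10)]


print(greedy_decode(toy_logits, start_token_id=0, eos_token_id=2))
```

Each step feeds the tokens generated so far back into the decoder, which is why coordinate output takes several decoder passes rather than one.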

License

MIT License
