CMGUI Screen-Grounded Summarizer

This repository contains the latest exported checkpoint for the CMGUI Chinese mobile screenshot summarization project.

The model reads a real mobile screenshot together with its OCR/UI elements, their bounding boxes, and optional task context. It generates a natural-language Chinese summary of the screen and predicts two structured outputs: UI function entries and the ids of the key visual evidence elements.

Checkpoint

Source checkpoint:

runs/rich_cmgui_20260512_titlefix_s1e2/stage3_vision_adapter/checkpoint-best

Export date: 2026-05-19

Architecture:

SigLIP2 visual encoder
+ OCR/UI/layout element memory
+ task/context memory
+ mT5-large decoder
+ element-level structured heads for UI functions and evidence

This is a custom PyTorch checkpoint, so it cannot be loaded with a plain AutoModel.from_pretrained call. The code snapshot in code/ contains the loader and inference path used by the project.

Files

| File | Purpose |
| --- | --- |
| pytorch_model.bin | Custom model state dict |
| rich_config.json | Model, data, decoding, and structured-head config |
| decoder_tokenizer/ | mT5 tokenizer |
| image_processor/ | SigLIP2 image processor |
| checkpoint_metrics.json | Best-checkpoint validation metrics saved during training |
| checkpoint_manifest.json | Export metadata and recommended runtime settings |
| reports/eval_report_20260512_titlefix_s1e2.md | Full valid/test evaluation report |
| code/ | Loader, CLI inference, and GUI code snapshot |
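Before running the loader, it can help to confirm that a local copy of the repository is complete. The sketch below checks for the entries in the file list above; the helper name `missing_entries` and the directory you pass in are illustrative assumptions, not part of the project code.

```python
from pathlib import Path

# Names taken from the file list above. The helper itself is hypothetical
# and is not part of the project's code/ snapshot.
EXPECTED_FILES = [
    "pytorch_model.bin",
    "rich_config.json",
    "checkpoint_metrics.json",
    "checkpoint_manifest.json",
]
EXPECTED_DIRS = ["decoder_tokenizer", "image_processor", "code"]


def missing_entries(repo_dir):
    """Return the expected files and directories that are absent from repo_dir."""
    root = Path(repo_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing
```

An empty list means every expected entry was found; anything else names what still needs to be downloaded.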

Recommended Inference Settings

num_beams=1
max_new_tokens=384
generation_no_repeat_ngram_size=3
generation_repetition_penalty=1.1
generation_block_extra_ids=true
generation_block_title_prefix=true
generation_force_json_start=false
structured_function_mode=heads
structured_function_threshold=0.20
structured_search_threshold=0.20
structured_evidence_mode=heads
structured_evidence_threshold=0.50
structured_max_functions=8
structured_max_evidence=8
structured_evidence_fallback_top1=false
allow_template_fallback=false

Evaluation

Full evaluation used 500 validation samples and 500 test samples from the qwen8000 processed split.

| Split | Rows | Grounded | ROUGE-L char | Evidence precision | Function F1 |
| --- | --- | --- | --- | --- | --- |
| valid | 500 | 0.4434 | 0.4157 | 0.5542 | 0.3588 |
| test | 500 | 0.4398 | 0.4107 | 0.5831 | 0.3819 |
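For readers unfamiliar with these metrics, here is a minimal sketch of how a character-level ROUGE-L score and a set-based precision/recall/F1 are commonly computed. It assumes a balanced F1 for ROUGE-L and exact id matching for evidence and function entries; the project's evaluation script may differ in details, so treat the full report as authoritative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings, per character."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_char(prediction, reference):
    """Character-level ROUGE-L as a balanced F1 (assumed weighting)."""
    lcs = lcs_length(prediction, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(prediction)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


def set_precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 over two collections of ids (exact match)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Character-level matching avoids the need for a Chinese word segmenter, which is one reason it is a common choice for Chinese summaries.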

Generation health checks:

title_prefix_rate=0.0
extra_id_rate=0.0
json_valid_rate=1.0
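The rates above can be read as fractions of generated outputs that show a given problem or property. The sketch below illustrates one plausible way to compute them; the function names, the sentinel-token pattern, and the caller-supplied prefix list are assumptions, since the project's exact checks are not described here.

```python
import json
import re

# mT5 sentinel tokens have the form <extra_id_0>, <extra_id_1>, ...
EXTRA_ID_PATTERN = re.compile(r"<extra_id_\d+>")


def rate(flags):
    """Fraction of True values in a list of booleans (0.0 for an empty list)."""
    return sum(flags) / len(flags) if flags else 0.0


def extra_id_rate(outputs):
    """Share of outputs that leak at least one sentinel token."""
    return rate([bool(EXTRA_ID_PATTERN.search(text)) for text in outputs])


def title_prefix_rate(outputs, prefixes):
    """Share of outputs starting with any of the given prefixes.

    The prefixes are supplied by the caller; the ones the project
    actually blocks are not listed in this document.
    """
    return rate([text.startswith(tuple(prefixes)) for text in outputs])


def is_valid_json(text):
    """True if text parses as JSON."""
    try:
        json.loads(text)
    except (TypeError, ValueError):
        return False
    return True


def json_valid_rate(serialized_outputs):
    """Share of serialized structured outputs that parse as JSON."""
    return rate([is_valid_json(text) for text in serialized_outputs])
```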

Override the function and search thresholds to 0.20 at inference time. The checkpoint config stores 0.50, but the full threshold sweep found that 0.20 gives the best function-count balance and a higher Function F1.
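To see why the threshold matters, consider how a per-element score is turned into a list of predicted functions. The sketch below is a simplified illustration with made-up scores and a hypothetical helper; the real post-processing lives in code/ and may apply additional rules.

```python
def select_by_threshold(scores, threshold, max_items):
    """Keep element ids scoring at or above threshold, best first, capped at max_items."""
    kept = [(element_id, score) for element_id, score in scores.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [element_id for element_id, _ in kept[:max_items]]


# Made-up head probabilities for four UI elements (illustration only).
example_scores = {"e1": 0.82, "e2": 0.35, "e3": 0.22, "e4": 0.07}

stored_default = select_by_threshold(example_scores, threshold=0.50, max_items=8)
recommended = select_by_threshold(example_scores, threshold=0.20, max_items=8)
print(stored_default)  # ['e1']
print(recommended)     # ['e1', 'e2', 'e3']
```

With the stored 0.50 threshold only the most confident element survives, while 0.20 also admits moderately confident ones, up to the structured_max_functions cap of 8.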

Usage Notes

The model is intended for research and demo use on mobile screenshot summarization. It is useful for:

  • Chinese mobile screen summarization
  • screenshot-grounded UI evidence selection
  • function-entry detection from OCR/UI layout
  • comparing a deployable student model against Qwen teacher models

Known limitations:

  • Page intent can still be confused on ecommerce and search-result pages.
  • Function heads under-predict dense icon grids and category/navigation pages.
  • Evidence ids are useful but not perfectly localized (evidence precision is 0.5831 on the test split).
  • The qwen8000 valid/test split has no positive search-function examples, so search-function behavior needs a separate held-out search-positive test set.

Loading

Use the project code snapshot in code/ together with the original project data preprocessing format. In the original workspace, batch inference is run from Windows PowerShell (backticks are line continuations) as:

python scripts\infer_rich.py `
  --checkpoint runs\rich_cmgui_20260512_titlefix_s1e2\stage3_vision_adapter\checkpoint-best `
  --input_file data\rich_cmgui\processed\test_rich_teacher500_natural_qwen8000.jsonl `
  --batch_size 2 `
  --num_beams 1 `
  --max_new_tokens 384 `
  --structured_function_mode heads `
  --structured_function_threshold 0.20 `
  --structured_search_threshold 0.20 `
  --structured_evidence_mode heads `
  --structured_evidence_threshold 0.50 `
  --allow_template_fallback false
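
The command above uses PowerShell line continuations and Windows path separators. On other platforms, one option is to assemble the same argument list in Python. The sketch below uses only the flags shown above; the helper name `build_infer_command` is hypothetical, and it assumes scripts/infer_rich.py is reachable from the working directory.

```python
import sys
from pathlib import Path


def build_infer_command(checkpoint, input_file, python=sys.executable):
    """Build the argv list for scripts/infer_rich.py with the recommended settings."""
    return [
        python,
        str(Path("scripts") / "infer_rich.py"),
        "--checkpoint", str(checkpoint),
        "--input_file", str(input_file),
        "--batch_size", "2",
        "--num_beams", "1",
        "--max_new_tokens", "384",
        "--structured_function_mode", "heads",
        "--structured_function_threshold", "0.20",
        "--structured_search_threshold", "0.20",
        "--structured_evidence_mode", "heads",
        "--structured_evidence_threshold", "0.50",
        "--allow_template_fallback", "false",
    ]


# Example (not executed here):
#   import subprocess
#   subprocess.run(build_infer_command(checkpoint_dir, input_jsonl), check=True)
```

Passing the arguments as a list avoids shell quoting differences between PowerShell and POSIX shells.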