CMGUI Screen-Grounded Summarizer

This repository contains the latest exported checkpoint for the CMGUI Chinese mobile screenshot summarization project.

Given a real mobile screenshot together with its OCR/UI elements, their bounding boxes, and optional task context, the model generates a natural-language Chinese screen summary and predicts structured UI function entries and the IDs of the key visual evidence elements.

Checkpoint

Source checkpoint:

runs/rich_cmgui_20260512_titlefix_s1e2/stage3_vision_adapter/checkpoint-best

Export date: 2026-05-19

Architecture:

SigLIP2 visual encoder
+ OCR/UI/layout element memory
+ task/context memory
+ mT5-large decoder
+ element-level structured heads for UI functions and evidence

This is a custom PyTorch checkpoint and cannot be loaded with a plain AutoModel.from_pretrained call. The code snapshot in code/ contains the loader and inference path used by the project.

Files

| File | Purpose |
| --- | --- |
| `pytorch_model.bin` | Custom model state dict |
| `rich_config.json` | Model, data, decoding, and structured-head config |
| `decoder_tokenizer/` | mT5 tokenizer |
| `image_processor/` | SigLIP2 image processor |
| `checkpoint_metrics.json` | Best-checkpoint validation metrics saved during training |
| `checkpoint_manifest.json` | Export metadata and recommended runtime settings |
| `reports/eval_report_20260512_titlefix_s1e2.md` | Full valid/test evaluation report |
| `code/` | Loader, CLI inference, and GUI code snapshot |
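Before running the loader, it can help to confirm that a downloaded copy of the repository is complete. The following is a minimal sketch using only the standard library. The file and directory names come from the table above, while the helper name `missing_artifacts` and the assumption that everything sits directly under one root directory are illustrative, not part of the project code.

```python
from pathlib import Path

# Names copied from the Files table; the flat layout under one root is an assumption.
EXPECTED_FILES = [
    "pytorch_model.bin",
    "rich_config.json",
    "checkpoint_metrics.json",
    "checkpoint_manifest.json",
    "reports/eval_report_20260512_titlefix_s1e2.md",
]
EXPECTED_DIRS = ["decoder_tokenizer", "image_processor", "code"]


def missing_artifacts(root):
    """Return the expected files and directories that are absent under root."""
    root = Path(root)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing


if __name__ == "__main__":
    # Hypothetical usage: check the current directory and list what is missing.
    for name in missing_artifacts("."):
        print("missing:", name)
```

An empty return value means every listed artifact was found. The check does not validate file contents.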

Recommended Inference Settings

num_beams=1
max_new_tokens=384
generation_no_repeat_ngram_size=3
generation_repetition_penalty=1.1
generation_block_extra_ids=true
generation_block_title_prefix=true
generation_force_json_start=false
structured_function_mode=heads
structured_function_threshold=0.20
structured_search_threshold=0.20
structured_evidence_mode=heads
structured_evidence_threshold=0.50
structured_max_functions=8
structured_max_evidence=8
structured_evidence_fallback_top1=false
allow_template_fallback=false
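One way to keep these settings in a single place is a small dictionary that is converted into command-line arguments. The sketch below includes only the settings that also appear as flags in the batch inference command documented in the Loading section. Whether the remaining settings (for example the `generation_*` options) are accepted as flags is not stated in this card, so they are left out. The helper name `to_cli_args` is hypothetical.

```python
# Values copied from the recommended settings; only keys that appear as
# flags in the documented infer_rich.py command are listed here.
RECOMMENDED = {
    "num_beams": 1,
    "max_new_tokens": 384,
    "structured_function_mode": "heads",
    "structured_function_threshold": 0.20,
    "structured_search_threshold": 0.20,
    "structured_evidence_mode": "heads",
    "structured_evidence_threshold": 0.50,
    "allow_template_fallback": False,
}


def to_cli_args(settings):
    """Flatten a settings dict into ['--key', 'value', ...] strings."""
    args = []
    for key, value in settings.items():
        if isinstance(value, bool):  # check bool before other types: bool is an int subclass
            text = "true" if value else "false"
        elif isinstance(value, float):
            text = f"{value:.2f}"
        else:
            text = str(value)
        args += [f"--{key}", text]
    return args


if __name__ == "__main__":
    print(" ".join(to_cli_args(RECOMMENDED)))
```

The resulting list can be appended to a `subprocess.run` argument list together with the checkpoint and input-file arguments.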

Evaluation

Full evaluation used 500 validation samples and 500 test samples from the qwen8000 processed split.

| Split | Rows | Grounded | ROUGE-L char | Evidence precision | Function F1 |
| --- | --- | --- | --- | --- | --- |
| valid | 500 | 0.4434 | 0.4157 | 0.5542 | 0.3588 |
| test | 500 | 0.4398 | 0.4107 | 0.5831 | 0.3819 |
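For readers unfamiliar with the "ROUGE-L char" column, the sketch below shows one common way to compute a character-level ROUGE-L score: the longest common subsequence (LCS) of the two strings, combined into a balanced F-measure. This card does not state the exact variant used in the evaluation report (for example, recall-only or a weighted F-measure), so treat this as an illustrative assumption rather than the project's scoring code.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_char(reference, candidate):
    """Character-level ROUGE-L as a balanced F1 of LCS precision and recall."""
    if not reference or not candidate:
        return 0.0
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Three of four characters match in order, so the score is 0.75.
    print(rouge_l_char("设置页面", "设置界面"))
```

Working on characters rather than whitespace-separated tokens avoids the need for a Chinese word segmenter.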

Generation health checks:

title_prefix_rate=0.0
extra_id_rate=0.0
json_valid_rate=1.0
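Rates like these can be computed as the fraction of outputs that satisfy a simple predicate. The sketch below is an assumption about what each check means, not the project's evaluation code: `extra_id_rate` looks for mT5 sentinel tokens of the form `<extra_id_N>` in the generated text, `json_valid_rate` checks that a serialized structured output parses as JSON, and `title_prefix_rate` takes the prefixes to test as a caller-supplied argument because this card does not specify them.

```python
import json
import re

# mT5 sentinel tokens look like <extra_id_0>, <extra_id_1>, ...
EXTRA_ID = re.compile(r"<extra_id_\d+>")


def rate(items, predicate):
    """Fraction of items for which predicate(item) is true (0.0 for an empty list)."""
    if not items:
        return 0.0
    return sum(1 for item in items if predicate(item)) / len(items)


def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def extra_id_rate(outputs):
    return rate(outputs, lambda text: EXTRA_ID.search(text) is not None)


def json_valid_rate(payloads):
    return rate(payloads, is_valid_json)


def title_prefix_rate(outputs, prefixes):
    # `prefixes` is hypothetical: the actual blocked title prefix is not given in this card.
    return rate(outputs, lambda text: text.startswith(tuple(prefixes)))
```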

Set the function and search thresholds to 0.20 explicitly at inference time. The checkpoint config stores 0.50, but the full threshold sweep found that 0.20 gives the best function-count balance and a higher Function F1.
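To illustrate why the threshold matters, the sketch below applies a score threshold and a maximum count to a list of per-element head scores. It assumes the heads emit one score in [0, 1] per element and that selection keeps scores at or above the threshold, highest first. The function name, the comparison direction, and the example scores are all illustrative and not taken from the project code.

```python
def select_by_threshold(scores, threshold, max_items):
    """Return indices of elements scoring >= threshold, best first, capped at max_items."""
    kept = [(score, idx) for idx, score in enumerate(scores) if score >= threshold]
    # Sort by descending score; break ties by the lower element index.
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in kept[:max_items]]


if __name__ == "__main__":
    scores = [0.10, 0.25, 0.60, 0.45, 0.19]  # made-up scores for five elements
    print(select_by_threshold(scores, 0.20, 8))  # three elements pass
    print(select_by_threshold(scores, 0.50, 8))  # only one element passes
```

With these made-up scores, the stored 0.50 threshold keeps a single element, while 0.20 keeps three, which mirrors the under-prediction that a high threshold can cause.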

Usage Notes

The model is intended for research and demo use on mobile screenshot summarization. It is useful for:

Chinese mobile screen summarization
screenshot-grounded UI evidence selection
function-entry detection from OCR/UI layout
comparing a deployable student model against Qwen teacher models

Known limitations:

Page intent can still be confused on ecommerce and search-result pages.
Function heads under-predict dense icon grids and category/navigation pages.
Evidence ids are useful but not perfectly localized.
The qwen8000 valid/test split has no positive search-function examples, so search-function behavior is not covered by the metrics above and needs a separate held-out search-positive test set.

Loading

Use the project code snapshot in code/ together with the original project data preprocessing format. In the original workspace, batch inference is run from PowerShell (the trailing backticks are line continuations) as:

python scripts\infer_rich.py `
  --checkpoint runs\rich_cmgui_20260512_titlefix_s1e2\stage3_vision_adapter\checkpoint-best `
  --input_file data\rich_cmgui\processed\test_rich_teacher500_natural_qwen8000.jsonl `
  --batch_size 2 `
  --num_beams 1 `
  --max_new_tokens 384 `
  --structured_function_mode heads `
  --structured_function_threshold 0.20 `
  --structured_search_threshold 0.20 `
  --structured_evidence_mode heads `
  --structured_evidence_threshold 0.50 `
  --allow_template_fallback false

