Commit d50a3c9 (verified) by yuhangzang, parent 1475c13: Update README.md
---
license: apache-2.0
datasets:
- internlm/VC-RewardBench
base_model:
- Qwen/Qwen3-VL-8B-Instruct
---

<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/63859cf3b2906edaf83af9f0/gcuIXKMoDd-nQoPrynVQF.png" width="30%">
</p>

# Visual-ERM

Visual-ERM is a **multimodal generative reward model** for **vision-to-code** tasks.
It evaluates outputs directly in the **rendered visual space** and produces **fine-grained**, **interpretable**, and **task-agnostic** discrepancy feedback for structured visual reconstruction.

<p align="center">
  <a href="https://arxiv.org/abs/2603.13224">📄 Paper</a> |
  <a href="https://github.com/InternLM/Visual-ERM">💻 GitHub</a> |
  <a href="https://huggingface.co/datasets/internlm/VC-RewardBench">📊 VC-RewardBench</a>
</p>

## Model Overview

Existing rewards for vision-to-code usually fall into two categories:

1. **Text-based rewards**, such as edit distance or TEDS, which ignore important visual cues like layout, spacing, alignment, and style.
2. **Vision-embedding rewards**, such as DINO similarity, which are often coarse-grained and can be vulnerable to reward hacking.

Visual-ERM addresses this by directly comparing:

- the **ground-truth image**, and
- the **rendered image** produced from a model prediction,

and then generating **structured discrepancy annotations** that can be converted into reward signals or used for reflection-based refinement.

## What this model does

Visual-ERM is designed to judge whether a predicted result is **visually equivalent** to the target.

Given a pair of images, it identifies discrepancies and reports, for each one:

- its **category**
- its **severity**
- its **location**
- a **description**

This makes Visual-ERM useful not only as a reward model for RL, but also as a **visual critic** for test-time reflection and revision.
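As an illustration only, the four fields above can be modeled as a small record type. The field names follow this card; the authoritative schema is defined in the official repository.

```python
from dataclasses import dataclass

@dataclass
class Discrepancy:
    """One fine-grained visual discrepancy reported by Visual-ERM.

    Field names mirror the model card; the severity scale (1-3) is an
    assumption based on the example output shown later in this card.
    """
    category: str      # e.g. "structure_error", "style_error"
    severity: int      # higher means more severe
    location: str      # free-text region, e.g. "legend area"
    description: str   # human-readable explanation

# Example record, matching the sample output format on this card
d = Discrepancy(
    category="style_error",
    severity=2,
    location="bar colors",
    description="The colors differ from those in the reference image.",
)
```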

## Supported Tasks

Visual-ERM is designed for structured visual reconstruction tasks, including:

- **Chart-to-Code**
- **Table-to-Markdown**
- **SVG-to-Code**

## Key Features

- **Visual-space reward modeling**
  Evaluates predictions in rendered visual space instead of relying only on text matching or coarse embedding similarity.

- **Fine-grained and interpretable feedback**
  Produces structured discrepancy annotations rather than a single black-box score.

- **Task-agnostic reward supervision**
  A unified reward model that generalizes across multiple vision-to-code tasks.

- **Useful for both training and inference**
  Can serve as a reward model in RL and as a visual critic during test-time refinement.

## VC-RewardBench

We also release **VisualCritic-RewardBench (VC-RewardBench)**, a benchmark for evaluating fine-grained image-to-image discrepancy judgment on structured visual data.

### Benchmark Features

- Covers **charts**, **tables**, and **SVGs**
- Contains **1,335** carefully curated instances
- Each instance includes:
  - a ground-truth image
  - a corrupted / rendered counterpart
  - fine-grained discrepancy annotations

Dataset link: https://huggingface.co/datasets/internlm/VC-RewardBench

## How to Use

Visual-ERM is fine-tuned from **Qwen/Qwen3-VL-8B-Instruct** and follows the same multimodal interface.

### Input

Visual-ERM takes as input:

- a **reference / ground-truth image**
- a **rendered prediction image**
- a **prompt** asking the model to identify fine-grained visual discrepancies

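Since the model follows the Qwen3-VL chat interface, the two images and the instruction can be packaged as a message list. This is a minimal sketch: the file paths and prompt wording are placeholders, and the official prompt templates live in the repository.

```python
def build_messages(ref_image_path: str, pred_image_path: str, prompt: str):
    """Package the reference image, the rendered prediction, and the
    instruction into a Qwen-style chat message list. Illustration only;
    use the prompt templates from the official repository in practice."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": ref_image_path},
                {"type": "image", "image": pred_image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]

messages = build_messages(
    "reference.png",   # ground-truth rendering (placeholder path)
    "prediction.png",  # rendering of the predicted code (placeholder path)
    "Identify fine-grained visual discrepancies between the two images.",
)
```

The resulting `messages` list can then be passed through the processor's chat template as in standard Qwen3-VL inference.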
### Output

The model outputs structured discrepancy annotations, which can then be:

- converted into a scalar reward for RL
- used as feedback for reflection-and-revision
- evaluated directly on VC-RewardBench

A typical output format is:

```json
{
  "errors": [
    {
      "category": "structure_error",
      "severity": 3,
      "location": "legend area",
      "description": "The legend is placed outside the plot area in the prediction."
    },
    {
      "category": "style_error",
      "severity": 2,
      "location": "bar colors",
      "description": "The colors differ from those in the reference image."
    }
  ]
}
```
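One simple way to collapse such annotations into a scalar reward is to penalize total severity. This is a sketch under assumed conventions (a fixed penalty budget, reward clipped to [0, 1]); the actual reward shaping used for RL is defined in the official repository.

```python
import json

def reward_from_annotations(output_json: str, max_penalty: float = 10.0) -> float:
    """Map Visual-ERM's structured discrepancy annotations to a scalar
    reward in [0, 1]: fewer and milder errors yield a higher reward.
    The severity weighting and clipping here are illustrative assumptions."""
    errors = json.loads(output_json).get("errors", [])
    penalty = sum(e.get("severity", 1) for e in errors)
    return max(0.0, 1.0 - penalty / max_penalty)

# With the example output above (severities 3 and 2): 1 - 5/10 = 0.5
example = '{"errors": [{"severity": 3}, {"severity": 2}]}'
print(reward_from_annotations(example))  # 0.5
```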

### Inference / Evaluation / RL

For full inference scripts, RL training pipelines, evaluation code, and prompt templates, please refer to the official repository:
https://github.com/InternLM/Visual-ERM

## Intended Use

Visual-ERM is intended for:

- **reward modeling** in vision-to-code RL pipelines
- **visual discrepancy judgment** between target and predicted renderings
- **reflection-based refinement** at inference time
- **research on visual reward modeling** and multimodal RL

## Citation

If you find this model useful, please consider citing:

```bibtex
TBD
```

## Contact

If you are interested in **visual reward modeling**, **vision-to-code**, or **reinforcement learning for multimodal models**, feel free to reach out.