Improve model card: Add abstract, update paper link & license, enable images

#1
by nielsr (HF Staff) - opened
Files changed (1)
1. README.md (+20 -20)
README.md CHANGED
@@ -1,22 +1,19 @@
  ---
- pipeline_tag: robotics
  library_name: transformers
- license: cc-by-nc-sa-4.0
+ license: mit
+ pipeline_tag: robotics
  tags:
- - vision-language-model
- - manipulation
- - robotics
+ - vision-language-model
+ - manipulation
+ - robotics
  ---

-
  # VLAC: A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning
- <div align="center">

- [[paper]](https://github.com/InternRobotics/VLAC/blob/main/data/VLAC_EAI.pdf)
- [[code]](https://github.com/InternRobotics/VLAC)
- [[model]](https://huggingface.co/InternRobotics/VLAC)
+ [[Paper](https://huggingface.co/papers/2509.15937)] [[Code](https://github.com/InternRobotics/VLAC)] [[Project Page](https://vlac.intern-ai.org.cn/)] [[Model](https://huggingface.co/InternRobotics/VLAC)]

- </div>
+ ## Abstract
+
+ Robotic real-world reinforcement learning (RL) with vision-language-action (VLA) models is bottlenecked by sparse, handcrafted rewards and inefficient exploration. We introduce VLAC, a general process reward model built upon InternVL and trained on large-scale heterogeneous datasets. Given pairwise observations and a language goal, it outputs a dense progress delta and a done signal, eliminating task-specific reward engineering, and supports one-shot in-context transfer to unseen tasks and environments. VLAC is trained on vision-language datasets to strengthen perception, dialogue, and reasoning capabilities, together with robot and human trajectory data that grounds action generation and progress estimation; it is additionally strengthened to reject irrelevant prompts and to detect regression or stagnation by constructing large numbers of negative and semantically mismatched samples. With prompt control, a single VLAC model alternately generates reward and action tokens, unifying critic and policy. We deploy VLAC inside an asynchronous real-world RL loop and layer on a graded human-in-the-loop protocol (offline demonstration replay, return-and-explore, human-guided explore) that accelerates exploration and stabilizes early learning. Across four distinct real-world manipulation tasks, VLAC lifts success rates from about 30% to about 90% within 200 real-world interaction episodes; incorporating human-in-the-loop interventions yields a further 50% improvement in sample efficiency and achieves up to 100% final success.

  ## 🚀 Interactive Demo & Homepage

22
 
@@ -27,9 +24,9 @@ tags:

  </div>

- <!-- <div align="center">
- <img src="https://github.com/InternRobotics/VLAC/tree/main/data/title_banner-2.gif" alt="VLAC banner" width="800"></img>
- </div> -->
+ <div align="center">
+ <img src="https://huggingface.co/InternRobotics/VLAC/resolve/main/data/title_banner-2.gif" alt="VLAC banner" width="800"></img>
+ </div>

  ## VLAC-2B

@@ -53,15 +50,15 @@ VLAC-8B is coming soon! Now the 8B model can be used on Homepage.

  • **Trajectory quality screening** - VLAC evaluates collected trajectories, filters out low-scoring ones based on their VOC value, and masks actions with negative pair-wise scores (that is, data with low fluency and quality), improving the effectiveness and efficiency of imitation learning.

- <!-- ## Framework
+ ## Framework

  <div align="center">
- <img src="https://github.com/InternRobotics/VLAC/blob/main/data/framework.png" alt="VLAC Framework" width="800"/>
+ <img src="https://huggingface.co/InternRobotics/VLAC/resolve/main/data/framework.png" alt="VLAC Framework" width="800"/>
  </div>

- *The VLAC model is trained on a combination of comprehensive public robotic manipulation datasets, human demonstration data, self-collected manipulation data, and various image understanding datasets. Video data is processed into pair-wise samples to learn the different task progress between any two frames, supplemented with task descriptions and task completion evaluation to enable task progress understanding and action generation, as illustrated in the bottom-left corner. As shown in the diagram on the right, the model demonstrates strong generalization capabilities to new robots, scenarios, and tasks not covered in the training dataset. It can predict task progress and distinguish failure action or trajectory, providing dense reward feedback for real-world reinforcement learning and offering guidance for data refinement. Additionally, the model can directly perform manipulation tasks, exhibiting zero-shot capabilities to handle different scenarios.* -->
+ *The VLAC model is trained on a combination of comprehensive public robotic manipulation datasets, human demonstration data, self-collected manipulation data, and various image-understanding datasets. Video data is processed into pair-wise samples to learn the difference in task progress between any two frames, supplemented with task descriptions and task-completion evaluation to enable task-progress understanding and action generation, as illustrated in the bottom-left corner. As shown in the diagram on the right, the model generalizes strongly to new robots, scenarios, and tasks not covered in the training dataset. It can predict task progress and distinguish failed actions or trajectories, providing dense reward feedback for real-world reinforcement learning and offering guidance for data refinement. Additionally, the model can directly perform manipulation tasks, exhibiting zero-shot capability across different scenarios.*

- ## Framework & Performance
+ ## Performance

  Details about the model's performance and evaluation metrics can be found on the [Homepage](https://vlac.intern-ai.org.cn/).

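The trajectory-screening bullet in the hunk above combines two filters: trajectory-level screening on the VOC value and step-level masking of negative pair-wise scores. Here is a small sketch of how the two could compose; the threshold and the dict field names are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of the two-level screening described above: drop whole
# trajectories with a low VOC value, and within kept trajectories mask out
# steps whose pair-wise score is negative. The threshold and the dict field
# names are illustrative assumptions, not values from the paper.
from typing import Dict, List

def screen_trajectories(trajectories: List[Dict], voc_threshold: float = 0.5) -> List[Dict]:
    """Each trajectory dict carries a scalar 'voc' score and per-step
    'pair_scores' aligned one-to-one with its 'actions'."""
    kept = []
    for traj in trajectories:
        if traj["voc"] < voc_threshold:
            continue  # low fluency/quality: exclude from imitation learning
        # Mask actions whose pair-wise score is negative (regression or stagnation).
        traj["action_mask"] = [score >= 0.0 for score in traj["pair_scores"]]
        kept.append(traj)
    return kept
```

Masked steps can then be dropped from the imitation-learning loss while the rest of the trajectory is kept, matching the bullet's description of improving both the effectiveness and the efficiency of imitation learning.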
 
@@ -185,8 +182,11 @@ If you find our work helpful, please cite:
  }
  ```

+ ## 📄 License
+
+ This project is licensed under the MIT License.
+
  ## 🙏 Acknowledgments

  - [SWIFT](https://github.com/modelscope/ms-swift)
- - [InternVL](https://github.com/OpenGVLab/InternVL)
-
+ - [InternVL](https://github.com/OpenGVLab/InternVL)
 