---
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: robotics
tags:
- robotics
- vision-language-action
- vla
- contrastive-reinforcement-learning
- goal-conditioned-rl
- qwen3-vl
- prts
- custom_code
language:
- en
---

<h1 align="center">PRTS-4B &mdash; Primitive Reasoning and Tasking System</h1>

<p align="center">
  <a href="https://arxiv.org/abs/2604.27472"><img src="https://img.shields.io/badge/arXiv-2604.27472-b31b1b.svg" alt="arXiv"></a>
  &nbsp;
  <a href="https://github.com/TeleHuman/PRTS"><img src="https://img.shields.io/badge/GitHub-PRTS-181717.svg" alt="GitHub"></a>
  &nbsp;
  <a href="https://rhodes-team-prts.github.io/"><img src="https://img.shields.io/badge/Project-Page-1f6feb.svg" alt="Project Page"></a>
</p>

**PRTS-4B** is a **Vision&ndash;Language&ndash;Action (VLA) foundation model** that, for the first time, scales **reward-label-free contrastive RL** into VLA pre-training itself. By treating language instructions as goals and supervising a contrastive value head co-trained inside the same forward pass as behavior cloning, PRTS equips a **Qwen3-VL-4B** backbone with a quantitative, language-grounded sense of *how close the current state is to satisfying the instruction*.

The released checkpoint is the result of pre-training on **~167&nbsp;B tokens** of action-labeled and embodied-reasoning data on 64 &times; H100 GPUs.

📄 Paper: [arXiv:2604.27472](https://arxiv.org/abs/2604.27472) &middot;
💻 Code: [github.com/TeleHuman/PRTS](https://github.com/TeleHuman/PRTS) &middot;
🌐 Project: [rhodes-team-prts.github.io](https://rhodes-team-prts.github.io/)

## Highlights

- **Goal-reachability awareness, end-to-end.** &nbsp; The contrastive value head is co-trained inside the policy backbone &mdash; no separate value network, no curated reward dataset, no offline-RL post-training loop.
- **Reward-label-free.** &nbsp; Supervision comes purely from the temporal structure of demonstrations (see the sketch below).
- **Out-of-distribution gains grow with the shift.** &nbsp; On 5 simulation suites and 14 real-world tasks, PRTS matches or exceeds the strongest prior VLAs at &frac14;&ndash;&frac18; of the post-training compute, with the gap **widening** off-distribution &mdash; novel-instruction following (`+38.8` over &pi;<sub>0.5</sub>), long-horizon execution, and recovery under human intervention.

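For intuition, here is a minimal sketch of the kind of reward-label-free contrastive objective described above: positives are (state, goal) pairs drawn from the same demonstration, and all cross-pairs in the batch act as negatives, so no reward annotations are needed. The function name, embedding layout, and loss details are illustrative assumptions, not the PRTS implementation; the actual value head is co-trained inside the VLA backbone.

```python
import torch
import torch.nn.functional as F

def contrastive_value_loss(state_emb: torch.Tensor, goal_emb: torch.Tensor) -> torch.Tensor:
    """Illustrative InfoNCE-style objective (hypothetical, not the PRTS code).

    state_emb[i] and goal_emb[i] come from the same trajectory (positive pair);
    every cross-pair in the batch serves as a negative, so no reward labels are needed.
    """
    logits = state_emb @ goal_emb.T / state_emb.shape[-1] ** 0.5   # (B, B) similarity matrix
    labels = torch.arange(logits.shape[0], device=logits.device)   # diagonal entries are positives
    # Symmetric InfoNCE: match each state to its goal and each goal to its state.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```
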
## Loading the checkpoint

The released model ships its own `modeling_*.py`, `configuration_*.py`, and `processing_*.py` next to the weights, so it can be loaded directly via `transformers` with `trust_remote_code=True`. **No need to clone the GitHub repo for a smoke test.**

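As a quick sanity check that the remote-code files ship with the weights, you can list the repository contents with `huggingface_hub` (nothing is downloaded). The snippet below is illustrative and only assumes the repo id used elsewhere in this card.

```python
from huggingface_hub import HfApi

# List the files shipped with the checkpoint; the custom modeling/configuration/
# processing modules should show up alongside the weight shards.
files = HfApi().list_repo_files("TeleEmbodied/PRTS-4B")
print([f for f in files if f.endswith(".py")])
```
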
### Recommended environment

| Component | Note |
| :--- | :--- |
| Python | 3.10+ (3.11+ recommended) |
| `transformers` | `== 4.57.3` |
| PyTorch | recent CUDA build from [pytorch.org](https://pytorch.org) |

```bash
pip install "transformers==4.57.3" torch safetensors huggingface_hub \
    numpy pillow sentencepiece protobuf colorama tokenizers
pip install accelerate  # recommended for device_map="auto"
```

### From the Hub

```python
import torch
from transformers import AutoConfig, AutoModel, AutoProcessor

REPO_ID = "TeleEmbodied/PRTS-4B"

config = AutoConfig.from_pretrained(REPO_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    REPO_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(REPO_ID, trust_remote_code=True)

print(config.model_type)          # prts_qwen3_vl
print(type(model).__name__)
print(type(processor).__name__)
```

## Prompt format

PRTS expects a **single user turn** containing camera images, a discretized proprioceptive state, and a language instruction, followed by an assistant turn that emits the action chunk. The full prompt is built from these constants (declared in `prts/constants.py` of the open-source repo):

| Token | Meaning |
| :--- | :--- |
| `<\|im_start\|>` `<\|im_end\|>` | Qwen-style turn delimiters |
| `<\|vision_start\|>` `<\|image_pad\|>` `<\|vision_end\|>` | One image placeholder block per camera |
| `<\|goal_repr\|>` | CRL value-head anchor tokens |
| `<\|action_start\|>` `<\|action_pad\|>` `<\|action_end\|>` | Slot the action expert fills with the predicted action-chunk tokens |

### Layout of one rollout step

```text
<|im_start|>system
You are a helpful physical assistant.<|im_end|>
<|im_start|>user
{cam_1_name}: <|vision_start|><|image_pad|><|vision_end|>
{cam_2_name}: <|vision_start|><|image_pad|><|vision_end|>
...
Proprioception (normalized to 0-1000 scale): {s_1} {s_2} ... {s_D}
Instruction: {language instruction}
Predict the next action chunk in low-level robotics action format.<|im_end|>
<|im_start|>assistant
<|action_start|><|action_token_1|>...<|action_token_999|><|action_end|><|im_end|>
```

### Field-by-field spec

- **System message:** fixed to `You are a helpful physical assistant.`
- **Image block:** One `{cam_name}: <|vision_start|><|image_pad|><|vision_end|>` line per camera.
- **Proprioceptive state:** The robot state is **q01/q99-normalized to `[-1, 1]`** per dimension (using stats from `compute_stats.py`), then linearly remapped to integers in `[0, 1000]` and rendered as a space-separated list (see the sketch after this list). The line is prefixed by `Proprioception (normalized to 0-1000 scale): `. Omit the line entirely if the embodiment has no proprioception channel. Out-of-range values are clipped into the range, i.e. to 0 or 1000 after remapping.
- **Instruction:** Free-form English natural-language goal (e.g. `Left gripper sequentially grasps two shoes and places them in the shoebox. Right gripper closes the shoebox.`).
- **Suffix:** Always end the user turn with `Predict the next action chunk in low-level robotics action format.` if you want PRTS to generate actions.

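To make the layout and the discretization above concrete, here is a minimal, self-contained sketch that remaps a proprioceptive state and assembles one user turn. The camera names, q01/q99 statistics, and helper names are illustrative assumptions; for actual training or rollout, use the processor and utilities from the GitHub repo.

```python
import numpy as np

# Hypothetical per-dimension q01/q99 stats; in practice these come from compute_stats.py.
Q01 = np.array([-0.9, -1.2, 0.0])
Q99 = np.array([0.9, 1.2, 1.5])

def discretize_state(state: np.ndarray) -> str:
    """q01/q99-normalize to [-1, 1], remap to integers in [0, 1000], clip, and render."""
    norm = 2.0 * (state - Q01) / (Q99 - Q01) - 1.0                         # roughly [-1, 1]
    ints = np.clip(np.round((norm + 1.0) / 2.0 * 1000), 0, 1000).astype(int)
    return " ".join(str(v) for v in ints)

def build_user_turn(cam_names: list[str], state: np.ndarray, instruction: str) -> str:
    """Assemble one rollout-step prompt following the layout documented above."""
    image_lines = "\n".join(
        f"{name}: <|vision_start|><|image_pad|><|vision_end|>" for name in cam_names
    )
    return (
        "<|im_start|>system\nYou are a helpful physical assistant.<|im_end|>\n"
        "<|im_start|>user\n"
        f"{image_lines}\n"
        f"Proprioception (normalized to 0-1000 scale): {discretize_state(state)}\n"
        f"Instruction: {instruction}\n"
        "Predict the next action chunk in low-level robotics action format.<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(build_user_turn(
    ["cam_high", "cam_left_wrist"],          # illustrative camera names
    np.array([0.1, -0.4, 1.2]),              # illustrative 3-DoF state
    "Place the red block into the box.",
))
```
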
## License

This model is released under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/). Free for academic and non-commercial research; commercial use is **not** permitted under this license.

---

## Citation

If you find PRTS useful, please cite:

```bibtex
@article{zhang2026prts,
  title   = {PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations},
  author  = {Yang Zhang and Jiangyuan Zhao and Chenyou Fan and Fangzheng Yan and Tian Li and Haitong Tang and Sen Fu and Xuan'er Wu and Qizhen Weng and Weinan Zhang and Xiu Li and Chi Zhang and Chenjia Bai and Xuelong Li},
  journal = {arXiv preprint arXiv:2604.27472},
  year    = {2026},
}
```

---

## Acknowledgements

PRTS builds on [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL), [FlashAttention](https://github.com/Dao-AILab/flash-attention), [LeRobot](https://github.com/huggingface/lerobot), and [OpenPI](https://github.com/openpilab/openpi). We thank the authors of [Contrastive RL](https://github.com/google-research/google-research/tree/master/contrastive_rl) for the ideas behind the contrastive value formulation.