---
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: robotics
tags:
- robotics
- vision-language-action
- vla
- contrastive-reinforcement-learning
- goal-conditioned-rl
- qwen3-vl
- prts
- custom_code
language:
- en
---

<h1 align="center">PRTS-4B &mdash; Primitive Reasoning and Tasking System</h1>

<p align="center">
  <a href="https://arxiv.org/abs/2604.27472"><img src="https://img.shields.io/badge/arXiv-2604.27472-b31b1b.svg" alt="arXiv"></a>
  &nbsp;
  <a href="https://github.com/TeleHuman/PRTS"><img src="https://img.shields.io/badge/GitHub-PRTS-181717.svg" alt="GitHub"></a>
  &nbsp;
  <a href="https://rhodes-team-prts.github.io/"><img src="https://img.shields.io/badge/Project-Page-1f6feb.svg" alt="Project Page"></a>
</p>

**PRTS-4B** is a **Vision&ndash;Language&ndash;Action (VLA) foundation model** that, for the first time, scales **reward-label-free contrastive RL** into VLA pre-training itself. By treating language instructions as goals and supervising a contrastive value head co-trained inside the same forward pass as behavior cloning, PRTS equips a **Qwen3-VL-4B** backbone with a quantitative, language-grounded sense of *how close the current state is to satisfying the instruction*.

The released checkpoint is the result of pre-training on **~167&nbsp;B tokens** of action-labeled and embodied-reasoning data on 64 &times; H100 GPUs.

📄 Paper: [arXiv:2604.27472](https://arxiv.org/abs/2604.27472) &middot;
💻 Code: [github.com/TeleHuman/PRTS](https://github.com/TeleHuman/PRTS) &middot;
🌐 Project: [rhodes-team-prts.github.io](https://rhodes-team-prts.github.io/)

## Highlights

- **Goal-reachability awareness, end-to-end.** &nbsp; The contrastive value head is co-trained inside the policy backbone &mdash; no separate value network, no curated reward dataset, no offline-RL post-training loop.
- **Reward-label-free.** &nbsp; Supervision comes purely from the temporal structure of demonstrations (see the sketch below).
- **Out-of-distribution gains grow with the shift.** &nbsp; On 5 simulation suites and 14 real-world tasks, PRTS matches or exceeds the strongest prior VLAs at &frac14;&ndash;&frac18; of the post-training compute, with the gap **widening** off-distribution &mdash; novel-instruction following (`+38.8` over &pi;<sub>0.5</sub>), long-horizon execution, and recovery under human intervention.

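For intuition, here is a minimal sketch of the kind of reward-label-free contrastive objective described above: positives are (state, goal) pairs drawn from the same demonstration, and all cross-pairs in the batch act as negatives, so no reward annotations are needed. The function name, embedding layout, and loss details are illustrative assumptions, not the PRTS implementation; the actual value head is co-trained inside the VLA backbone.

```python
import torch
import torch.nn.functional as F

def contrastive_value_loss(state_emb: torch.Tensor, goal_emb: torch.Tensor) -> torch.Tensor:
    """Illustrative InfoNCE-style objective (hypothetical, not the PRTS code).

    state_emb[i] and goal_emb[i] come from the same trajectory (positive pair);
    every cross-pair in the batch serves as a negative, so no reward labels are needed.
    """
    logits = state_emb @ goal_emb.T / state_emb.shape[-1] ** 0.5   # (B, B) similarity matrix
    labels = torch.arange(logits.shape[0], device=logits.device)   # diagonal entries are positives
    # Symmetric InfoNCE: match each state to its goal and each goal to its state.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```
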
## Loading the checkpoint

The released model ships its own `modeling_*.py`, `configuration_*.py`, and `processing_*.py` next to the weights, so it can be loaded directly via `transformers` with `trust_remote_code=True`. **No need to clone the GitHub repo for a smoke test.**

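As a quick sanity check that the remote-code files ship with the weights, you can list the repository contents with `huggingface_hub` (nothing is downloaded). The snippet below is illustrative and only assumes the repo id used elsewhere in this card.

```python
from huggingface_hub import HfApi

# List the files shipped with the checkpoint; the custom modeling/configuration/
# processing modules should show up alongside the weight shards.
files = HfApi().list_repo_files("TeleEmbodied/PRTS-4B")
print([f for f in files if f.endswith(".py")])
```
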
### Recommended environment

| Component | Note |
| :--- | :--- |
| Python | 3.10+ (3.11+ recommended) |
| `transformers` | `== 4.57.3` |
| PyTorch | recent CUDA build from [pytorch.org](https://pytorch.org) |

```bash
pip install "transformers==4.57.3" torch safetensors huggingface_hub \
    numpy pillow sentencepiece protobuf colorama tokenizers
pip install accelerate  # recommended for device_map="auto"
```

### From the Hub

```python
import torch
from transformers import AutoConfig, AutoModel, AutoProcessor

REPO_ID = "TeleEmbodied/PRTS-4B"

config = AutoConfig.from_pretrained(REPO_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    REPO_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(REPO_ID, trust_remote_code=True)

print(config.model_type)          # prts_qwen3_vl
print(type(model).__name__)
print(type(processor).__name__)
```

## Prompt format

PRTS expects a **single user turn** containing camera images, a discretized proprioceptive state, and a language instruction, followed by an assistant turn that emits the action chunk. The full prompt is built from these constants (declared in `prts/constants.py` of the open-source repo):

| Token | Meaning |
| :--- | :--- |
| `<\|im_start\|>` `<\|im_end\|>` | Qwen-style turn delimiters |
| `<\|vision_start\|>` `<\|image_pad\|>` `<\|vision_end\|>` | One image placeholder block per camera |
| `<\|goal_repr\|>` | CRL value-head anchor tokens |
| `<\|action_start\|>` `<\|action_pad\|>` `<\|action_end\|>` | Slot the action expert fills with the predicted action-chunk tokens |

### Layout of one rollout step

```text
<|im_start|>system
You are a helpful physical assistant.<|im_end|>
<|im_start|>user
{cam_1_name}: <|vision_start|><|image_pad|><|vision_end|>
{cam_2_name}: <|vision_start|><|image_pad|><|vision_end|>
...
Proprioception (normalized to 0-1000 scale): {s_1} {s_2} ... {s_D}
Instruction: {language instruction}
Predict the next action chunk in low-level robotics action format.<|im_end|>
<|im_start|>assistant
<|action_start|><|action_token_1|>...<|action_token_999|><|action_end|><|im_end|>
```

### Field-by-field spec

- **System message:** fixed to `You are a helpful physical assistant.`
- **Image block:** One `{cam_name}: <|vision_start|><|image_pad|><|vision_end|>` line per camera.
- **Proprioceptive state:** The robot state is **q01/q99-normalized to `[-1, 1]`** per dimension (using stats from `compute_stats.py`), then linearly remapped to integers in `[0, 1000]` and rendered as a space-separated list (see the sketch after this list). The line is prefixed by `Proprioception (normalized to 0-1000 scale): `. Omit the line entirely if the embodiment has no proprioception channel. Out-of-range values are clipped into the range, i.e. to 0 or 1000 after remapping.
- **Instruction:** Free-form English natural-language goal (e.g. `Left gripper sequentially grasps two shoes and places them in the shoebox. Right gripper closes the shoebox.`).
- **Suffix:** Always end the user turn with `Predict the next action chunk in low-level robotics action format.` if you want PRTS to generate actions.

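To make the layout and the discretization above concrete, here is a minimal, self-contained sketch that remaps a proprioceptive state and assembles one user turn. The camera names, q01/q99 statistics, and helper names are illustrative assumptions; for actual training or rollout, use the processor and utilities from the GitHub repo.

```python
import numpy as np

# Hypothetical per-dimension q01/q99 stats; in practice these come from compute_stats.py.
Q01 = np.array([-0.9, -1.2, 0.0])
Q99 = np.array([0.9, 1.2, 1.5])

def discretize_state(state: np.ndarray) -> str:
    """q01/q99-normalize to [-1, 1], remap to integers in [0, 1000], clip, and render."""
    norm = 2.0 * (state - Q01) / (Q99 - Q01) - 1.0                         # roughly [-1, 1]
    ints = np.clip(np.round((norm + 1.0) / 2.0 * 1000), 0, 1000).astype(int)
    return " ".join(str(v) for v in ints)

def build_user_turn(cam_names: list[str], state: np.ndarray, instruction: str) -> str:
    """Assemble one rollout-step prompt following the layout documented above."""
    image_lines = "\n".join(
        f"{name}: <|vision_start|><|image_pad|><|vision_end|>" for name in cam_names
    )
    return (
        "<|im_start|>system\nYou are a helpful physical assistant.<|im_end|>\n"
        "<|im_start|>user\n"
        f"{image_lines}\n"
        f"Proprioception (normalized to 0-1000 scale): {discretize_state(state)}\n"
        f"Instruction: {instruction}\n"
        "Predict the next action chunk in low-level robotics action format.<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(build_user_turn(
    ["cam_high", "cam_left_wrist"],          # illustrative camera names
    np.array([0.1, -0.4, 1.2]),              # illustrative 3-DoF state
    "Place the red block into the box.",
))
```
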
## License

This model is released under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/). Free for academic and non-commercial research; commercial use is **not** permitted under this license.

---

## Citation

If you find PRTS useful, please cite:

```bibtex
@article{zhang2026prts,
  title   = {PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations},
  author  = {Yang Zhang and Jiangyuan Zhao and Chenyou Fan and Fangzheng Yan and Tian Li and Haitong Tang and Sen Fu and Xuan'er Wu and Qizhen Weng and Weinan Zhang and Xiu Li and Chi Zhang and Chenjia Bai and Xuelong Li},
  journal = {arXiv preprint arXiv:2604.27472},
  year    = {2026},
}
```

---

## Acknowledgements

PRTS builds on [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL), [FlashAttention](https://github.com/Dao-AILab/flash-attention), [LeRobot](https://github.com/huggingface/lerobot), and [OpenPI](https://github.com/openpilab/openpi). We thank the authors of [Contrastive RL](https://github.com/google-research/google-research/tree/master/contrastive_rl) for the ideas behind the contrastive value formulation.