nielsr HF Staff committed on
Commit 4d11dc0 · verified · 1 Parent(s): 7a0afa7

Update model card metadata and usage information
Hi, I'm Niels from the community science team at Hugging Face. I've updated the model card for OneVL to improve its discoverability and usability.

Changes include:
- Added `library_name: transformers` to the YAML metadata to enable the "Use in Transformers" button.
- Updated the `pipeline_tag` to `image-to-image` as requested.
- Included links to the [research paper](https://arxiv.org/abs/2604.18486), [GitHub repository](https://github.com/xiaomi-research/onevl), and [project page](https://Xiaomi-Embodied-Intelligence.github.io/OneVL/).
- Added a sample usage section with the inference commands provided in the official repository.
- Summarized the architecture and key benchmark results.

These updates ensure the model card follows Hugging Face Hub best practices.
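The YAML front-matter keys mentioned above are what drive the Hub's widgets (e.g. `library_name` enables the "Use in Transformers" button). As a quick illustration of how those keys sit in the README header, here is a minimal stdlib-only sketch that extracts the scalar keys from the updated front matter; it is not the Hub's actual metadata validator, and `parse_scalar_keys` is a hypothetical helper name:

```python
# Minimal sketch (stdlib only): pull top-level `key: value` scalars out of
# the README's YAML front matter. Illustrative only -- not the Hub's parser.

FRONT_MATTER = """\
---
base_model:
- Qwen/Qwen3-VL-4B-Instruct
language:
- en
license: apache-2.0
pipeline_tag: image-to-image
library_name: transformers
---
"""

def parse_scalar_keys(text: str) -> dict:
    """Extract top-level scalar entries from a YAML front-matter block."""
    block = text.split("---")[1]          # content between the --- fences
    meta = {}
    for line in block.splitlines():
        if ":" in line and not line.startswith(("-", " ")):
            key, _, value = line.partition(":")
            if value.strip():             # skip list headers like `tags:`
                meta[key.strip()] = value.strip()
    return meta

meta = parse_scalar_keys(FRONT_MATTER)
print(meta["library_name"])   # transformers
print(meta["pipeline_tag"])   # image-to-image
```

List-valued keys such as `base_model` and `tags` are deliberately skipped here; a real consumer would use a full YAML parser.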

Files changed (1):
1. README.md (+24 / -133)
README.md CHANGED
````diff
@@ -1,105 +1,35 @@
 ---
-license: apache-2.0
+base_model:
+- Qwen/Qwen3-VL-4B-Instruct
 language:
 - en
+license: apache-2.0
+pipeline_tag: image-to-image
+library_name: transformers
 tags:
 - autonomous-driving
 - vision-language-action
 - chain-of-thought
 - trajectory-prediction
 - VLA
-base_model:
-- Qwen/Qwen3-VL-4B-Instruct
-pipeline_tag: image-text-to-text
 ---
 
 # OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
 
 **[📄 Paper (arXiv)](https://arxiv.org/abs/2604.18486)** | **[💻 GitHub](https://github.com/xiaomi-research/onevl)** | **[🌐 Project Page](https://Xiaomi-Embodied-Intelligence.github.io/OneVL/)**
 
-*Xiaomi Embodied Intelligence Team*
-
----
+OneVL is a Vision-Language-Action (VLA) framework for autonomous driving that achieves state-of-the-art trajectory prediction accuracy while matching the inference latency of answer-only autoregressive models.
 
 ## Overview
 
-**OneVL** is a Vision-Language-Action (VLA) framework for autonomous driving that achieves **state-of-the-art trajectory prediction accuracy** while matching the inference latency of answer-only autoregressive models.
-
-Prior latent Chain-of-Thought (CoT) methods compress reasoning into opaque hidden states — fast, but consistently underperform explicit CoT on driving tasks. OneVL identifies the root cause: purely linguistic latents encode abstract semantic labels rather than the spatiotemporal causal dynamics that govern real driving scenes. OneVL addresses this with **dual-modal auxiliary decoders** that force compact latent tokens to encode both human-readable reasoning *and* future scene dynamics simultaneously.
-
-At inference, both decoders are discarded and all latents are **prefilled** into the prompt context in a single parallel pass — matching answer-only AR prediction speed while recovering the interpretability of explicit CoT in both vision and language.
-
-OneVL is the **first latent CoT method to surpass explicit autoregressive CoT** across all four driving benchmarks.
-
----
-
-## Architecture
-
-OneVL augments **Qwen3-VL-4B-Instruct** with three components:
-
-**Latent Token Interface** — 4 visual latent tokens + 2 language latent tokens are inserted in the assistant response before the answer, using existing vocabulary tokens (no new special tokens added).
-
-**Visual Auxiliary Decoder** — Predicts future-frame visual tokens at t+0.5s and t+1.0s from visual latent hidden states (using the Emu3.5 IBQ 131k codebook). Acts as a **world model** supervision signal that forces the latent space to encode genuine physical scene dynamics — agent trajectories, road geometry, and environmental change — rather than abstract descriptions.
-
-**Language Auxiliary Decoder** — Reconstructs explicit CoT reasoning text from language latent hidden states, conditioned on ViT visual features. Recovers 97% of explicit CoT text quality while running at answer-only speed.
-
-**Prefill Inference** — Both decoders are discarded at inference time. All latent tokens are processed in a single parallel prefill pass; only the trajectory answer is generated autoregressively. This achieves **1.5× speedup over explicit CoT on NAVSIM** and **2.3× on ROADWork**.
-
-### Three-Stage Training Pipeline
-
-Training proceeds in three stages to ensure stable joint optimization:
-- **Stage 0**: Main model warmup (trajectory prediction)
-- **Stage 1**: Auxiliary decoder warmup (language + visual decoders independently)
-- **Stage 2**: Joint end-to-end fine-tuning (all components together)
-
-Staged training is essential — ablation shows that skipping it collapses PDM-score from 88.84 to 67.13.
-
----
-
-## Results
-
-### NAVSIM
-
-| Method | Model Size | PDM-score ↑ | Latency (s) ↓ | Interpretability |
-|---|:---:|:---:|:---:|:---:|
-| AR Answer | 4B | 87.47 | 4.49 | — |
-| AR CoT+Answer | 4B | 88.29 | 6.58 | Language |
-| COCONUT | 4B | 84.84 | 5.93 | — |
-| CODI | 4B | 83.92 | 8.62 | — |
-| SIM-CoT | 4B | 84.21 | 10.86 | Language |
-| **OneVL** | **4B** | **88.84** | **4.46** | **Vision + Language** |
-
-### ROADWork
-
-| Method | ADE (px) ↓ | FDE (px) ↓ | Latency (s) ↓ |
-|---|:---:|:---:|:---:|
-| AR CoT+Answer | 13.18 | 29.98 | 10.74 |
-| **OneVL** | **12.49** | **28.80** | **4.71** |
-
-### Impromptu
-
-| Method | ADE (m) ↓ | FDE (m) ↓ | Latency (s) ↓ |
-|---|:---:|:---:|:---:|
-| AR CoT+Answer | 1.42 | 3.96 | 6.84 |
-| **OneVL** | **1.34** | **3.70** | **4.02** |
-
-### APR1 (Alpamayo-R1)
-
-| Method | ADE (m) ↓ | FDE (m) ↓ | Latency (s) ↓ |
-|---|:---:|:---:|:---:|
-| AR CoT+Answer | 2.99 | 8.54 | 3.51 |
-| **OneVL** | **2.62** | 7.53 | **3.26** |
-
-### CoT Text Quality (NAVSIM)
-
-| Method | Meta Action Acc. ↑ | STS Score ↑ | LLM Judge ↑ | Latency (s) ↓ |
-|---|:---:|:---:|:---:|:---:|
-| AR CoT+Answer | 73.20 | 79.75 | 81.86 | 6.58 |
-| **OneVL** | 71.00 | 78.26 | 79.13 | **4.46** |
-
-OneVL's language auxiliary decoder recovers 97% of explicit CoT quality at answer-only inference speed.
-
----
+OneVL addresses the limitations of prior latent Chain-of-Thought (CoT) methods by introducing **dual-modal auxiliary decoders**. These decoders force compact latent tokens to encode both human-readable reasoning and future scene dynamics. During inference, these decoders are discarded, and the latent tokens are prefilled into the context in a single parallel pass, achieving high performance at answer-only speeds.
+
+### Key Architecture Components
+
+- **Latent Token Interface**: 4 visual and 2 language latent tokens inserted before the response.
+- **Visual Auxiliary Decoder**: Acts as a world model, predicting future-frame visual tokens (at t+0.5s and t+1.0s).
+- **Language Auxiliary Decoder**: Reconstructs explicit CoT reasoning text from language latent hidden states.
+- **Prefill Inference**: Enables 1.5× to 2.3× speedup over explicit autoregressive CoT.
 
 ## Usage
 
@@ -109,6 +39,7 @@ OneVL's language auxiliary decoder recovers 97% of explicit CoT quality at answe
 - `transformers >= 4.57.0` (required for `Qwen3VLForConditionalGeneration`)
 
 ```bash
+# Environment Setup
 uv venv venv/onevl --python 3.12
 source venv/onevl/bin/activate
 pip install -r requirements.txt
@@ -127,55 +58,18 @@ python infer_onevl.py \
 --max_new_tokens 1024 --answer_prefix "[" --prefix_k 0
 ```
 
-### Inference with Language + Visual Explanation
-
-```bash
-python infer_onevl.py \
-  --model_path /path/to/OneVL-checkpoint \
-  --test_set_path test_data/navsim_test.json \
-  --image_base_path "" \
-  --output_path output/navsim/results_explain.json \
-  --device cuda:0 \
-  --num_latent 2 --num_latent_vis 4 \
-  --max_new_tokens 1024 --answer_prefix "[" --prefix_k 0 \
-  --decoder_explain --aux_visual_condition \
-  --c_thought 2 --max_explain_tokens 1024 \
-  --visual_decoder_explain --visual_aux_visual_condition \
-  --c_thought_visual 4 --max_visual_tokens 2560
-```
-
-### Multi-GPU Inference
-
-```bash
-export MODEL_PATH=/path/to/OneVL-checkpoint
-export TEST_SET_PATH=test_data/navsim_test.json
-export OUTPUT_PATH=output/navsim/navsim_results.json
-bash run_infer.sh
-```
-
-Per-benchmark scripts are available in `scripts/`:
-
-```bash
-bash scripts/infer_navsim.sh
-bash scripts/infer_ar1.sh
-bash scripts/infer_roadwork.sh
-bash scripts/infer_impromptu.sh
-```
-
-For full documentation, evaluation scripts, and data format details, see the [GitHub repository](https://github.com/xiaomi-research/onevl).
-
----
-
-## Open-Source Status
-
-| Component | Status |
-|---|:---:|
-| Technical Report | ✅ Released |
-| Model Weights | ✅ Released |
-| Inference Code | ✅ Released |
-| Training Code | 🔜 Coming Soon |
-
----
+For full inference options, including language and visual explanations, please refer to the [GitHub repository](https://github.com/xiaomi-research/onevl).
+
+## Results
+
+OneVL is the first latent CoT method to surpass explicit autoregressive CoT across all major autonomous driving benchmarks.
+
+| Benchmark | Metric | AR CoT+Answer | OneVL |
+|---|:---:|:---:|:---:|
+| **NAVSIM** | PDM-score ↑ | 88.29 | **88.84** |
+| **ROADWork** | ADE (px) ↓ | 13.18 | **12.49** |
+| **Impromptu** | ADE (m) ↓ | 1.42 | **1.34** |
+| **APR1** | ADE (m) ↓ | 2.99 | **2.62** |
 
 ## Citation
 
@@ -189,10 +83,7 @@ For full documentation, evaluation scripts, and data format details, see the [Gi
 }
 ```
 
----
-
 ## License
 
-Released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
-
-Model weights are built on [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) and the visual tokenizer is from [Emu3.5-VisionTokenizer](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer); please refer to their respective licenses as well.
+This project is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
+Model weights are built on [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) and the visual tokenizer is from [Emu3.5-VisionTokenizer](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer).
````
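The ADE/FDE columns in the benchmark tables above are the standard average and final displacement errors between a predicted trajectory and the ground truth. As an illustrative sketch only (this is not the repository's evaluation code, and `ade_fde` is a name chosen here), the two metrics can be computed as:

```python
import math

def ade_fde(pred, gt):
    """Average and final displacement error between two equal-length
    trajectories given as lists of (x, y) waypoints."""
    assert len(pred) == len(gt) and pred, "trajectories must match in length"
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    ade = sum(dists) / len(dists)   # mean L2 error over all waypoints
    fde = dists[-1]                 # L2 error at the final waypoint
    return ade, fde

pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
gt   = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = ade_fde(pred, gt)
print(f"ADE={ade:.3f}  FDE={fde:.3f}")  # ADE=1.000  FDE=2.000
```

The speedup figures quoted in the card are consistent with the latency columns it removed: 6.58 s / 4.46 s ≈ 1.5× on NAVSIM and 10.74 s / 4.71 s ≈ 2.3× on ROADWork.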