---
license: apache-2.0
language:
- en
tags:
- autonomous-driving
- vision-language-action
- chain-of-thought
- trajectory-prediction
- VLA
base_model:
- Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: image-text-to-text
---

# OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

**[📄 Paper (arXiv)](https://arxiv.org/abs/2604.18486)** | **[💻 GitHub](https://github.com/xiaomi-research/onevl)** | **[🌐 Project Page](https://Xiaomi-Embodied-Intelligence.github.io/OneVL/)**

*Xiaomi Embodied Intelligence Team*

---

## Overview

**OneVL** is a Vision-Language-Action (VLA) framework for autonomous driving that achieves **state-of-the-art trajectory prediction accuracy** while matching the inference latency of answer-only autoregressive models.

Prior latent Chain-of-Thought (CoT) methods compress reasoning into opaque hidden states; they are fast, but they consistently underperform explicit CoT on driving tasks. OneVL identifies the root cause: purely linguistic latents encode abstract semantic labels rather than the spatiotemporal causal dynamics that govern real driving scenes. It addresses this with **dual-modal auxiliary decoders** that force compact latent tokens to encode both human-readable reasoning *and* future scene dynamics simultaneously.

At inference, both decoders are discarded and all latents are **prefilled** into the prompt context in a single parallel pass, matching answer-only AR prediction speed while recovering the interpretability of explicit CoT in both vision and language.

OneVL is the **first latent CoT method to surpass explicit autoregressive CoT** across all four evaluated driving benchmarks (NAVSIM, ROADWork, Impromptu, and APR1).

---

## Architecture

OneVL augments **Qwen3-VL-4B-Instruct** with three components and a prefill-based inference scheme:

**Latent Token Interface**: 4 visual latent tokens + 2 language latent tokens are inserted into the assistant response before the answer, using existing vocabulary tokens (no new special tokens are added).
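
For intuition, here is a minimal sketch of that layout; the placeholder word, helper name, and exact ordering are illustrative assumptions, not the tokens actually chosen by the released code.

```python
# Illustrative only: OneVL repurposes existing vocabulary tokens as latent slots,
# so a neutral placeholder word stands in for them here.
NUM_VISUAL_LATENTS = 4    # corresponds to --num_latent_vis 4 in infer_onevl.py
NUM_LANGUAGE_LATENTS = 2  # corresponds to --num_latent 2

def build_assistant_prefix(latent_word: str = "think", answer_prefix: str = "[") -> str:
    """Hypothetical assistant-turn layout: visual latents, language latents, then the answer."""
    slots = [latent_word] * (NUM_VISUAL_LATENTS + NUM_LANGUAGE_LATENTS)
    # Everything up to and including answer_prefix can be prefilled in one parallel pass;
    # only the trajectory text after answer_prefix is decoded autoregressively.
    return " ".join(slots) + " " + answer_prefix

print(build_assistant_prefix())  # -> "think think think think think think ["
```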

**Visual Auxiliary Decoder**: predicts future-frame visual tokens at t+0.5s and t+1.0s from the visual latent hidden states (using the Emu3.5 IBQ 131k codebook). It acts as a **world model** supervision signal that forces the latent space to encode genuine physical scene dynamics (agent trajectories, road geometry, and environmental change) rather than abstract descriptions.

**Language Auxiliary Decoder**: reconstructs the explicit CoT reasoning text from the language latent hidden states, conditioned on ViT visual features. It recovers 97% of explicit CoT text quality while running at answer-only speed.
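
A rough training-time sketch of this dual-modal supervision (covering both auxiliary decoders above), written against generic PyTorch modules; the decoder architecture, dimensions, vocabulary sizes, and loss weighting are assumptions rather than the released implementation.

```python
import torch.nn as nn

def small_decoder(d_model: int) -> nn.TransformerDecoder:
    """Toy stand-in for an auxiliary decoder that cross-attends to latent hidden states."""
    layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
    return nn.TransformerDecoder(layer, num_layers=2)

class DualModalAuxLoss(nn.Module):
    def __init__(self, d_model: int = 2048, ibq_vocab: int = 131_072, text_vocab: int = 151_936):
        super().__init__()  # all sizes here are placeholders
        self.visual_decoder = small_decoder(d_model)
        self.language_decoder = small_decoder(d_model)
        self.visual_head = nn.Linear(d_model, ibq_vocab)     # future-frame IBQ codes (world-model signal)
        self.language_head = nn.Linear(d_model, text_vocab)  # explicit CoT token reconstruction
        self.ce = nn.CrossEntropyLoss()

    def forward(self, vis_latent_h, lang_latent_h, future_emb, future_ids, cot_emb, cot_ids):
        # Visual branch: predict the t+0.5s / t+1.0s frame tokens while cross-attending
        # to the 4 visual latent hidden states (causal masking omitted for brevity).
        vis_logits = self.visual_head(self.visual_decoder(future_emb, memory=vis_latent_h))
        # Language branch: reconstruct the explicit CoT text from the 2 language latents
        # (the released decoder is additionally conditioned on ViT features; omitted here).
        lang_logits = self.language_head(self.language_decoder(cot_emb, memory=lang_latent_h))
        loss_vis = self.ce(vis_logits.flatten(0, 1), future_ids.flatten())
        loss_lang = self.ce(lang_logits.flatten(0, 1), cot_ids.flatten())
        # Both terms are added to the main trajectory loss during training;
        # the decoders are dropped entirely at inference time.
        return loss_vis + loss_lang
```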

**Prefill Inference**: both decoders are discarded at inference time. All latent tokens are processed in a single parallel prefill pass; only the trajectory answer is generated autoregressively. This yields a **1.5× speedup over explicit CoT on NAVSIM** and **2.3× on ROADWork**.
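
In control-flow terms, inference is one prefill pass over the whole context (latent slots included) followed by a short autoregressive decode of the trajectory. The sketch below assumes a Hugging Face-style causal-LM interface and greedy decoding; stopping criteria and sampling are omitted.

```python
import torch

@torch.no_grad()
def prefill_then_decode(model, input_ids, max_answer_tokens: int = 128):
    """Sketch: one parallel prefill over prompt + latent slots, then decode only the answer."""
    # 1) Prefill: a single forward pass populates the KV cache for the full context,
    #    so the latent tokens never require a token-by-token reasoning loop.
    out = model(input_ids=input_ids, use_cache=True)
    past = out.past_key_values
    next_token = out.logits[:, -1:].argmax(dim=-1)

    # 2) Decode: only the trajectory answer is generated autoregressively.
    answer_tokens = [next_token]
    for _ in range(max_answer_tokens - 1):
        out = model(input_ids=next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1:].argmax(dim=-1)
        answer_tokens.append(next_token)
    return torch.cat(answer_tokens, dim=1)
```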

### Three-Stage Training Pipeline

Training proceeds in three stages to ensure stable joint optimization:
- **Stage 0**: Main model warmup (trajectory prediction)
- **Stage 1**: Auxiliary decoder warmup (language and visual decoders trained independently)
- **Stage 2**: Joint end-to-end fine-tuning (all components together)

Staged training is essential: ablation shows that skipping it collapses the NAVSIM PDM-score from 88.84 to 67.13.
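
One way to read the stage split is as a plain module map; the module names and the exact freeze/train boundaries below are assumptions inferred from the list above, not the released training configuration.

```python
# Hypothetical reading of the three-stage pipeline: which parts receive gradients
# at each stage (optimizers, data mixes, and schedules are intentionally omitted).
TRAINING_STAGES = {
    "stage0_main_warmup": {
        "trainable": ["vlm_backbone"],
        "objective": "trajectory prediction only",
    },
    "stage1_decoder_warmup": {
        "trainable": ["language_aux_decoder", "visual_aux_decoder"],
        "objective": "auxiliary CoT-text and future-frame reconstruction",
    },
    "stage2_joint_finetune": {
        "trainable": ["vlm_backbone", "language_aux_decoder", "visual_aux_decoder"],
        "objective": "trajectory plus both auxiliary losses, end to end",
    },
}
```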

---

## Results

### NAVSIM

| Method | Model Size | PDM-score ↑ | Latency (s) ↓ | Interpretability |
|---|:---:|:---:|:---:|:---:|
| AR Answer | 4B | 87.47 | 4.49 | None |
| AR CoT+Answer | 4B | 88.29 | 6.58 | Language |
| COCONUT | 4B | 84.84 | 5.93 | None |
| CODI | 4B | 83.92 | 8.62 | None |
| SIM-CoT | 4B | 84.21 | 10.86 | Language |
| **OneVL** | **4B** | **88.84** | **4.46** | **Vision + Language** |

### ROADWork

| Method | ADE (px) ↓ | FDE (px) ↓ | Latency (s) ↓ |
|---|:---:|:---:|:---:|
| AR CoT+Answer | 13.18 | 29.98 | 10.74 |
| **OneVL** | **12.49** | **28.80** | **4.71** |

### Impromptu

| Method | ADE (m) ↓ | FDE (m) ↓ | Latency (s) ↓ |
|---|:---:|:---:|:---:|
| AR CoT+Answer | 1.42 | 3.96 | 6.84 |
| **OneVL** | **1.34** | **3.70** | **4.02** |

### APR1 (Alpamayo-R1)

| Method | ADE (m) ↓ | FDE (m) ↓ | Latency (s) ↓ |
|---|:---:|:---:|:---:|
| AR CoT+Answer | 2.99 | 8.54 | 3.51 |
| **OneVL** | **2.62** | **7.53** | **3.26** |

### CoT Text Quality (NAVSIM)

| Method | Meta Action Acc. ↑ | STS Score ↑ | LLM Judge ↑ | Latency (s) ↓ |
|---|:---:|:---:|:---:|:---:|
| AR CoT+Answer | 73.20 | 79.75 | 81.86 | 6.58 |
| **OneVL** | 71.00 | 78.26 | 79.13 | **4.46** |

OneVL's language auxiliary decoder recovers 97% of explicit CoT quality at answer-only inference speed.

---

## Usage

### Requirements

- Python 3.10+, CUDA GPU (≥16 GB VRAM recommended)
- `transformers >= 4.57.0` (required for `Qwen3VLForConditionalGeneration`)

```bash
uv venv venv/onevl --python 3.12
source venv/onevl/bin/activate
pip install -r requirements.txt
```
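
To sanity-check that the environment and checkpoint load correctly, here is a minimal sketch using the standard `transformers` API. The local paths, image, and prompt are placeholders, and the latent-token prompt construction used for actual trajectory prediction is handled by `infer_onevl.py`, not reproduced here.

```python
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model_path = "/path/to/OneVL-checkpoint"  # placeholder, same as --model_path below
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto",        # use torch_dtype="auto" on older transformers releases
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path)

# A generic single-image chat request; real driving prompts and latent slots
# are built by infer_onevl.py and differ from this toy example.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/front_camera.jpg"},  # placeholder image
        {"type": "text", "text": "Describe the driving scene."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```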

### Inference (Trajectory Prediction Only)

```bash
python infer_onevl.py \
    --model_path /path/to/OneVL-checkpoint \
    --test_set_path test_data/navsim_test.json \
    --image_base_path "" \
    --output_path output/navsim/results.json \
    --device cuda:0 \
    --num_latent 2 --num_latent_vis 4 \
    --max_new_tokens 1024 --answer_prefix "[" --prefix_k 0
```

### Inference with Language + Visual Explanation

```bash
python infer_onevl.py \
    --model_path /path/to/OneVL-checkpoint \
    --test_set_path test_data/navsim_test.json \
    --image_base_path "" \
    --output_path output/navsim/results_explain.json \
    --device cuda:0 \
    --num_latent 2 --num_latent_vis 4 \
    --max_new_tokens 1024 --answer_prefix "[" --prefix_k 0 \
    --decoder_explain --aux_visual_condition \
    --c_thought 2 --max_explain_tokens 1024 \
    --visual_decoder_explain --visual_aux_visual_condition \
    --c_thought_visual 4 --max_visual_tokens 2560
```

### Multi-GPU Inference

```bash
export MODEL_PATH=/path/to/OneVL-checkpoint
export TEST_SET_PATH=test_data/navsim_test.json
export OUTPUT_PATH=output/navsim/navsim_results.json
bash run_infer.sh
```

Per-benchmark scripts are available in `scripts/`:

```bash
bash scripts/infer_navsim.sh
bash scripts/infer_ar1.sh
bash scripts/infer_roadwork.sh
bash scripts/infer_impromptu.sh
```

For full documentation, evaluation scripts, and data format details, see the [GitHub repository](https://github.com/xiaomi-research/onevl).

---

## Open-Source Status

| Component | Status |
|---|:---:|
| Technical Report | ✅ Released |
| Model Weights | ✅ Released |
| Inference Code | ✅ Released |
| Training Code | 🔜 Coming Soon |

---

## Citation

```bibtex
@article{lu2026onevl,
  title={OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation},
  author={Lu, Jinghui and Guan, Jiayi and Huang, Zhijian and Li, Jinlong and Li, Guang and Kong, Lingdong and Li, Yingyan and Wang, Han and Xu, Shaoqing and Luo, Yuechen and others},
  journal={arXiv preprint arXiv:2604.18486},
  year={2026},
  url={https://arxiv.org/abs/2604.18486}
}
```

---

## License

Released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

Model weights are built on [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) and the visual tokenizer is from [Emu3.5-VisionTokenizer](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer); please refer to their respective licenses as well.