Update model card metadata and usage information
Hi, I'm Niels from the community science team at Hugging Face. I've updated the model card for OneVL to improve its discoverability and usability.
Changes include:
- Added `library_name: transformers` to the YAML metadata to enable the "Use in Transformers" button.
- Updated the `pipeline_tag` to `image-to-image` as requested.
- Included links to the [research paper](https://arxiv.org/abs/2604.18486), [GitHub repository](https://github.com/xiaomi-research/onevl), and [project page](https://Xiaomi-Embodied-Intelligence.github.io/OneVL/).
- Added a sample usage section with the inference commands provided in the official repository.
- Summarized the architecture and key benchmark results.
These updates ensure the model follows Hugging Face Hub best practices.
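For reference, the metadata block at the top of the card now reads (copied from the diff below):

```yaml
---
base_model:
- Qwen/Qwen3-VL-4B-Instruct
language:
- en
license: apache-2.0
pipeline_tag: image-to-image
library_name: transformers
tags:
- autonomous-driving
- vision-language-action
- chain-of-thought
- trajectory-prediction
- VLA
---
```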
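With `library_name: transformers` set, the checkpoint should also be loadable programmatically with the standard Qwen3-VL classes that the card already names. The sketch below is illustrative only: the checkpoint path is a placeholder, the image and prompt are made up, and it does not reproduce OneVL's latent-token prefill, for which the repository's `infer_onevl.py` remains the reference entry point:

```python
# Illustrative sketch: load the checkpoint with the generic Qwen3-VL classes
# from transformers >= 4.57. This is NOT OneVL's latent-CoT inference path;
# infer_onevl.py in the GitHub repo implements the latent-token prefill.
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model_path = "/path/to/OneVL-checkpoint"  # placeholder, as in the CLI examples

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path, dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

# A single front-camera frame plus a trajectory query (made-up example).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "front_camera_frame.jpg"},
            {"type": "text", "text": "Predict the ego-vehicle trajectory."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```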
@@ -1,105 +1,35 @@
 ---
-
+base_model:
+- Qwen/Qwen3-VL-4B-Instruct
 language:
 - en
+license: apache-2.0
+pipeline_tag: image-to-image
+library_name: transformers
 tags:
 - autonomous-driving
 - vision-language-action
 - chain-of-thought
 - trajectory-prediction
 - VLA
-base_model:
-- Qwen/Qwen3-VL-4B-Instruct
-pipeline_tag: image-text-to-text
 ---
 
 # OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
 
 **[📄 Paper (arXiv)](https://arxiv.org/abs/2604.18486)** | **[💻 GitHub](https://github.com/xiaomi-research/onevl)** | **[🌐 Project Page](https://Xiaomi-Embodied-Intelligence.github.io/OneVL/)**
 
-
-
----
+OneVL is a Vision-Language-Action (VLA) framework for autonomous driving that achieves state-of-the-art trajectory prediction accuracy while matching the inference latency of answer-only autoregressive models.
 
 ## Overview
 
-
-
-Prior latent Chain-of-Thought (CoT) methods compress reasoning into opaque hidden states — fast, but consistently underperform explicit CoT on driving tasks. OneVL identifies the root cause: purely linguistic latents encode abstract semantic labels rather than the spatiotemporal causal dynamics that govern real driving scenes. OneVL addresses this with **dual-modal auxiliary decoders** that force compact latent tokens to encode both human-readable reasoning *and* future scene dynamics simultaneously.
-
-At inference, both decoders are discarded and all latents are **prefilled** into the prompt context in a single parallel pass — matching answer-only AR prediction speed while recovering the interpretability of explicit CoT in both vision and language.
-
-OneVL is the **first latent CoT method to surpass explicit autoregressive CoT** across all four driving benchmarks.
-
----
-
-## Architecture
-
-OneVL augments **Qwen3-VL-4B-Instruct** with three components:
-
-**Latent Token Interface** — 4 visual latent tokens + 2 language latent tokens are inserted in the assistant response before the answer, using existing vocabulary tokens (no new special tokens added).
-
-**Visual Auxiliary Decoder** — Predicts future-frame visual tokens at t+0.5s and t+1.0s from visual latent hidden states (using the Emu3.5 IBQ 131k codebook). Acts as a **world model** supervision signal that forces the latent space to encode genuine physical scene dynamics — agent trajectories, road geometry, and environmental change — rather than abstract descriptions.
-
-**Language Auxiliary Decoder** — Reconstructs explicit CoT reasoning text from language latent hidden states, conditioned on ViT visual features. Recovers 97% of explicit CoT text quality while running at answer-only speed.
-
-**Prefill Inference** — Both decoders are discarded at inference time. All latent tokens are processed in a single parallel prefill pass; only the trajectory answer is generated autoregressively. This achieves **1.5× speedup over explicit CoT on NAVSIM** and **2.3× on ROADWork**.
-
-### Three-Stage Training Pipeline
-
-Training proceeds in three stages to ensure stable joint optimization:
-- **Stage 0**: Main model warmup (trajectory prediction)
-- **Stage 1**: Auxiliary decoder warmup (language + visual decoders independently)
-- **Stage 2**: Joint end-to-end fine-tuning (all components together)
-
-Staged training is essential — ablation shows that skipping it collapses PDM-score from 88.84 to 67.13.
-
----
-
-## Results
-
-### NAVSIM
-
-| Method | Model Size | PDM-score ↑ | Latency (s) ↓ | Interpretability |
-|---|:---:|:---:|:---:|:---:|
-| AR Answer | 4B | 87.47 | 4.49 | — |
-| AR CoT+Answer | 4B | 88.29 | 6.58 | Language |
-| COCONUT | 4B | 84.84 | 5.93 | — |
-| CODI | 4B | 83.92 | 8.62 | — |
-| SIM-CoT | 4B | 84.21 | 10.86 | Language |
-| **OneVL** | **4B** | **88.84** | **4.46** | **Vision + Language** |
-
-### ROADWork
-
-| Method | ADE (px) ↓ | FDE (px) ↓ | Latency (s) ↓ |
-|---|:---:|:---:|:---:|
-| AR CoT+Answer | 13.18 | 29.98 | 10.74 |
-| **OneVL** | **12.49** | **28.80** | **4.71** |
-
-### Impromptu
-
-| Method | ADE (m) ↓ | FDE (m) ↓ | Latency (s) ↓ |
-|---|:---:|:---:|:---:|
-| AR CoT+Answer | 1.42 | 3.96 | 6.84 |
-| **OneVL** | **1.34** | **3.70** | **4.02** |
-
-### APR1 (Alpamayo-R1)
-
-| Method | ADE (m) ↓ | FDE (m) ↓ | Latency (s) ↓ |
-|---|:---:|:---:|:---:|
-| AR CoT+Answer | 2.99 | 8.54 | 3.51 |
-| **OneVL** | **2.62** | 7.53 | **3.26** |
-
-### CoT Text Quality (NAVSIM)
+OneVL addresses the limitations of prior latent Chain-of-Thought (CoT) methods by introducing **dual-modal auxiliary decoders**. These decoders force compact latent tokens to encode both human-readable reasoning and future scene dynamics. During inference, these decoders are discarded, and the latent tokens are prefilled into the context in a single parallel pass, achieving high performance at answer-only speeds.
 
-
-|---|:---:|:---:|:---:|:---:|
-| AR CoT+Answer | 73.20 | 79.75 | 81.86 | 6.58 |
-| **OneVL** | 71.00 | 78.26 | 79.13 | **4.46** |
+### Key Architecture Components
 
-
-
-
+- **Latent Token Interface**: 4 visual and 2 language latent tokens inserted before the response.
+- **Visual Auxiliary Decoder**: Acts as a world model, predicting future-frame visual tokens (at t+0.5s and t+1.0s).
+- **Language Auxiliary Decoder**: Reconstructs explicit CoT reasoning text from language latent hidden states.
+- **Prefill Inference**: Enables 1.5× to 2.3× speedup over explicit autoregressive CoT.
 
 ## Usage
 
@@ -109,6 +39,7 @@ OneVL's language auxiliary decoder recovers 97% of explicit CoT quality at answe
 - `transformers >= 4.57.0` (required for `Qwen3VLForConditionalGeneration`)
 
 ```bash
+# Environment Setup
 uv venv venv/onevl --python 3.12
 source venv/onevl/bin/activate
 pip install -r requirements.txt
@@ -127,55 +58,18 @@ python infer_onevl.py \
 --max_new_tokens 1024 --answer_prefix "[" --prefix_k 0
 ```
 
-
-
-```bash
-python infer_onevl.py \
---model_path /path/to/OneVL-checkpoint \
---test_set_path test_data/navsim_test.json \
---image_base_path "" \
---output_path output/navsim/results_explain.json \
---device cuda:0 \
---num_latent 2 --num_latent_vis 4 \
---max_new_tokens 1024 --answer_prefix "[" --prefix_k 0 \
---decoder_explain --aux_visual_condition \
---c_thought 2 --max_explain_tokens 1024 \
---visual_decoder_explain --visual_aux_visual_condition \
---c_thought_visual 4 --max_visual_tokens 2560
-```
-
-### Multi-GPU Inference
+For full inference options, including language and visual explanations, please refer to the [GitHub repository](https://github.com/xiaomi-research/onevl).
 
-```bash
-export MODEL_PATH=/path/to/OneVL-checkpoint
-export TEST_SET_PATH=test_data/navsim_test.json
-export OUTPUT_PATH=output/navsim/navsim_results.json
-bash run_infer.sh
-```
-
-Per-benchmark scripts are available in `scripts/`:
-
-```bash
-bash scripts/infer_navsim.sh
-bash scripts/infer_ar1.sh
-bash scripts/infer_roadwork.sh
-bash scripts/infer_impromptu.sh
-```
-
-For full documentation, evaluation scripts, and data format details, see the [GitHub repository](https://github.com/xiaomi-research/onevl).
-
----
-
-## Open-Source Status
+## Results
 
-
-|---|:---:|
-| Technical Report | ✅ Released |
-| Model Weights | ✅ Released |
-| Inference Code | ✅ Released |
-| Training Code | 🔜 Coming Soon |
+OneVL is the first latent CoT method to surpass explicit autoregressive CoT across all major autonomous driving benchmarks.
 
-
+| Benchmark | Metric | AR CoT+Answer | OneVL |
+|---|:---:|:---:|:---:|
+| **NAVSIM** | PDM-score ↑ | 88.29 | **88.84** |
+| **ROADWork** | ADE (px) ↓ | 13.18 | **12.49** |
+| **Impromptu** | ADE (m) ↓ | 1.42 | **1.34** |
+| **APR1** | ADE (m) ↓ | 2.99 | **2.62** |
 
 ## Citation
 
@@ -189,10 +83,7 @@ For full documentation, evaluation scripts, and data format details, see the [Gi
 }
 ```
 
----
-
 ## License
 
-
-
-Model weights are built on [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) and the visual tokenizer is from [Emu3.5-VisionTokenizer](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer); please refer to their respective licenses as well.
+This project is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
+Model weights are built on [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) and the visual tokenizer is from [Emu3.5-VisionTokenizer](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer).