---
base_model:
- Qwen/Qwen3-VL-4B-Instruct
language:
- en
license: apache-2.0
pipeline_tag: image-to-image
library_name: transformers
tags:
- autonomous-driving
- vision-language-action
- chain-of-thought
- trajectory-prediction
- VLA
---
# OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Paper (arXiv) | GitHub | Project Page
OneVL is a Vision-Language-Action (VLA) framework for autonomous driving that achieves state-of-the-art trajectory prediction accuracy while matching the inference latency of answer-only autoregressive models.
## Overview
OneVL addresses the limitations of prior latent Chain-of-Thought (CoT) methods by introducing dual-modal auxiliary decoders. These decoders force compact latent tokens to encode both human-readable reasoning and future scene dynamics. During inference, these decoders are discarded, and the latent tokens are prefilled into the context in a single parallel pass, achieving high performance at answer-only speeds.
### Key Architecture Components
- Latent Token Interface: 4 visual and 2 language latent tokens inserted before the response.
- Visual Auxiliary Decoder: Acts as a world model, predicting future-frame visual tokens (at t+0.5s and t+1.0s).
- Language Auxiliary Decoder: Reconstructs explicit CoT reasoning text from language latent hidden states.
- Prefill Inference: Enables a 1.5×–2.3× speedup over explicit autoregressive CoT.
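The latent-token layout described above can be sketched with toy arrays. This is an illustrative sketch only, not the repository's code: the hidden size, prompt length, and zero-initialized latents are all hypothetical placeholders.

```python
import numpy as np

# Toy dimensions (hypothetical; the real model uses its own hidden size).
d_model = 8
prompt = np.random.randn(20, d_model)   # encoded prompt + image tokens
vis_latents = np.zeros((4, d_model))    # 4 visual latent tokens
lang_latents = np.zeros((2, d_model))   # 2 language latent tokens

# The latent tokens sit between the prompt and the response. At inference
# the whole prefix is processed in one parallel prefill pass, so answer
# decoding starts immediately -- no autoregressive CoT text is generated.
prefix = np.concatenate([prompt, vis_latents, lang_latents], axis=0)
print(prefix.shape)  # -> (26, 8)
```

During training, the auxiliary decoders read the hidden states at the latent positions; at inference those decoders are dropped and only the prefix above is kept.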
## Usage

### Requirements
- Python 3.10+, CUDA GPU (≥16 GB VRAM recommended)
- `transformers >= 4.57.0` (required for `Qwen3VLForConditionalGeneration`)
### Environment Setup

```bash
uv venv venv/onevl --python 3.12
source venv/onevl/bin/activate
pip install -r requirements.txt
```
### Inference (Trajectory Prediction Only)

```bash
python infer_onevl.py \
    --model_path /path/to/OneVL-checkpoint \
    --test_set_path test_data/navsim_test.json \
    --image_base_path "" \
    --output_path output/navsim/results.json \
    --device cuda:0 \
    --num_latent 2 --num_latent_vis 4 \
    --max_new_tokens 1024 --answer_prefix "[" --prefix_k 0
```
For full inference options, including language and visual explanations, please refer to the GitHub repository.
## Results
OneVL is the first latent CoT method to surpass explicit autoregressive CoT across all major autonomous driving benchmarks.
| Benchmark | Metric | AR CoT+Answer | OneVL |
|---|---|---|---|
| NAVSIM | PDM-score ↑ | 88.29 | 88.84 |
| ROADWork | ADE (px) ↓ | 13.18 | 12.49 |
| Impromptu | ADE (m) ↓ | 1.42 | 1.34 |
| APR1 | ADE (m) ↓ | 2.99 | 2.62 |
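ADE in the table is the standard Average Displacement Error: the mean L2 distance between predicted and ground-truth waypoints. A minimal reference implementation follows; the three-waypoint trajectory is made-up illustrative data, not benchmark output.

```python
import numpy as np

def ade(pred, gt):
    """Average Displacement Error: mean Euclidean distance between
    predicted and ground-truth waypoints of the same length."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy 3-waypoint trajectory in metres (illustrative values only).
pred = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
gt   = [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
print(ade(pred, gt))  # per-waypoint errors (0, 1, 2) -> mean 1.0
```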
## Citation

```bibtex
@article{lu2026onevl,
  title={OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation},
  author={Lu, Jinghui and Guan, Jiayi and Huang, Zhijian and Li, Jinlong and Li, Guang and Kong, Lingdong and Li, Yingyan and Wang, Han and Xu, Shaoqing and Luo, Yuechen and others},
  journal={arXiv preprint arXiv:2604.18486},
  year={2026},
  url={https://arxiv.org/abs/2604.18486}
}
```
## License

This project is released under the Apache 2.0 License. Model weights are built on Qwen3-VL-4B-Instruct, and the visual tokenizer is from Emu3.5-VisionTokenizer.