StarVLA-VLNCE-Qwen3VL-4B
StarVLA-VLNCE-Qwen3VL-4B is a Qwen3-VL-based vision-language model fine-tuned for VLN-CE-style vision-language navigation tasks in the StarVLA framework.
The model is intended to be used as the QwenVL VLM server checkpoint for VLN-CE evaluation.
Model Details
- Model type: Vision-Language Model
- Base model: Qwen3-VL-4B-Instruct
- Framework: StarVLA
- Task: Vision-and-Language Navigation in Continuous Environments (VLN-CE)
- Training datasets: R2R and RxR formatted for QwenVL-style conversations
- Model repo: Ricky06662/StarVLA-VLNCE-Qwen3VL-4B
Intended Use
This model is designed for VLN-CE evaluation with StarVLA. It can be launched as a standalone websocket VLM server and queried by the VLN-CE evaluator.
Example usage:
bash examples/VLN-CE/eval_files/run_qwenvl_vlm_server.sh Ricky06662/StarVLA-VLNCE-Qwen3VL-4B
You can also specify the checkpoint through environment variables:
CKPT=Ricky06662/StarVLA-VLNCE-Qwen3VL-4B \
GPU_ID=0 \
SERVER_HOST=0.0.0.0 \
PORT=6694 \
bash examples/VLN-CE/eval_files/run_qwenvl_vlm_server.sh
For multi-GPU evaluation, start one independent server per GPU:
GPU_IDS=0,1,2,3 \
PORT=6694 \
bash examples/VLN-CE/eval_files/run_qwenvl_vlm_server.sh Ricky06662/StarVLA-VLNCE-Qwen3VL-4B
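With one independent server per GPU, the evaluator side needs some way to split episodes across the servers. The following is a minimal sketch of round-robin sharding, not the StarVLA evaluator's actual logic. The function name `shard_round_robin` and the endpoint URLs are hypothetical; in particular, this card does not state which port each per-GPU server listens on, so take the real addresses from the launch script's output.

```python
from typing import Dict, Iterable, List, Sequence


def shard_round_robin(episode_ids: Iterable[int],
                      endpoints: Sequence[str]) -> Dict[str, List[int]]:
    """Assign episodes to server endpoints in round-robin order."""
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    shards: Dict[str, List[int]] = {endpoint: [] for endpoint in endpoints}
    for index, episode_id in enumerate(episode_ids):
        # Episode i goes to server i mod N, so shard sizes differ by at most one.
        shards[endpoints[index % len(endpoints)]].append(episode_id)
    return shards


# Hypothetical endpoints: the real hosts and ports depend on how the
# launch script starts each per-GPU server.
endpoints = ["ws://127.0.0.1:6694", "ws://127.0.0.1:6695"]
shards = shard_round_robin(range(5), endpoints)
for endpoint, episodes in shards.items():
    print(endpoint, episodes)
```

Round-robin keeps the shards balanced by episode count. It does not account for episodes of different lengths, so a work queue may balance wall-clock time better.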
Training Data
The model was fine-tuned on VLN-CE navigation instruction data, including R2R and RxR-style trajectory annotations reformatted into the QwenVL conversation format.
Each training sample pairs a sequence of visual observations with a multi-turn instruction/response conversation.
Example format:
{
  "image": ["path/to/images/001.jpg", "...", "path/to/images/008.jpg"],
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nNavigation instruction or question."
    },
    {
      "from": "gpt",
      "value": "Model response."
    }
  ]
}
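A quick structural check can catch malformed samples before training or debugging. The sketch below validates one sample against the format shown above. The helper name `validate_sample` is hypothetical, and the rule that turns alternate between `human` and `gpt` starting with `human` is an assumption inferred from the example, not a documented requirement.

```python
from typing import List

# Assumed turn order, inferred from the example format: human first, then gpt.
EXPECTED_ROLES = ("human", "gpt")


def validate_sample(sample: dict) -> List[str]:
    """Return a list of problems found in one sample (empty if none)."""
    problems: List[str] = []

    images = sample.get("image")
    if not isinstance(images, list) or not all(isinstance(p, str) for p in images):
        problems.append("'image' must be a list of path strings")

    turns = sample.get("conversations")
    if not isinstance(turns, list) or not turns:
        problems.append("'conversations' must be a non-empty list")
        return problems

    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i} must be an object")
            continue
        expected = EXPECTED_ROLES[i % 2]
        if turn.get("from") != expected:
            problems.append(f"turn {i}: expected 'from' == '{expected}'")
        if not isinstance(turn.get("value"), str):
            problems.append(f"turn {i}: 'value' must be a string")
    return problems


sample = {
    "image": ["path/to/images/001.jpg", "path/to/images/008.jpg"],
    "conversations": [
        {"from": "human", "value": "<image>\nNavigation instruction or question."},
        {"from": "gpt", "value": "Model response."},
    ],
}
print(validate_sample(sample))
```

The check is deliberately limited to structure. It does not verify that image paths exist or how `<image>` placeholders map to the listed images, since the card does not specify that mapping.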
Evaluation
Evaluation should be performed using the VLN-CE evaluation pipeline in StarVLA together with the external VLN-CE simulator environment.
The recommended setup separates:
- The StarVLA / QwenVL model server environment.
- The VLN-CE simulator environment.
The two environments communicate over a websocket connection during evaluation.
For the VLN-CE simulator setup, please refer to:
Limitations
- This model is specialized for VLN-CE-style vision-language navigation tasks.
- It is not intended as a general-purpose chatbot.
- Performance may degrade outside indoor navigation environments or when the input format differs from the training format.
- The model relies on the StarVLA evaluation pipeline for intended usage.
Citation
If you use this model, please cite the StarVLA project and the original datasets used for VLN-CE training/evaluation.
@misc{starvla,
title = {StarVLA: Vision-Language-Action Framework},
author = {StarVLA Contributors},
year = {2025},
url = {https://github.com/liyaxuanliyaxuan/StarVLA}
}