StarVLA-VLNCE-Qwen3VL-4B

StarVLA-VLNCE-Qwen3VL-4B is a Qwen3-VL-based vision-language model fine-tuned within the StarVLA framework for VLN-CE-style vision-language navigation tasks.

The model is intended to be used as the QwenVL VLM server checkpoint for VLN-CE evaluation.

Model Details

Model type: Vision-Language Model
Base model: Qwen3-VL-4B-Instruct
Framework: StarVLA
Task: Vision-and-Language Navigation in Continuous Environments (VLN-CE)
Training datasets: R2R and RxR formatted for QwenVL-style conversations
Model repo: Ricky06662/StarVLA-VLNCE-Qwen3VL-4B

Intended Use

This model is designed for VLN-CE evaluation with StarVLA. It can be launched as a standalone websocket VLM server and queried by the VLN-CE evaluator.

Example usage:

bash examples/VLN-CE/eval_files/run_qwenvl_vlm_server.sh Ricky06662/StarVLA-VLNCE-Qwen3VL-4B

You can also specify the checkpoint through environment variables:

CKPT=Ricky06662/StarVLA-VLNCE-Qwen3VL-4B \
GPU_ID=0 \
SERVER_HOST=0.0.0.0 \
PORT=6694 \
bash examples/VLN-CE/eval_files/run_qwenvl_vlm_server.sh

For multi-GPU evaluation, start one independent server per GPU:

GPU_IDS=0,1,2,3 \
PORT=6694 \
bash examples/VLN-CE/eval_files/run_qwenvl_vlm_server.sh Ricky06662/StarVLA-VLNCE-Qwen3VL-4B
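Before starting the evaluator, it can help to confirm that every server is accepting connections. The following is a minimal sketch and not part of StarVLA: `wait_for_servers` is a hypothetical helper that only checks whether a TCP connection can be opened, so it does not verify the websocket handshake or that the model has finished loading. The source does not state which ports the per-GPU servers listen on, so the endpoint list is something you supply after checking the launch script's output.

```python
import socket
import time


def wait_for_servers(endpoints, timeout_s=60.0, poll_s=1.0):
    """Poll (host, port) pairs until each accepts a TCP connection.

    Returns the endpoints that were still unreachable at the deadline
    (an empty list means every server answered).
    """
    deadline = time.monotonic() + timeout_s
    pending = list(endpoints)
    while pending:
        still_down = []
        for host, port in pending:
            try:
                # Open and immediately close a connection; success means
                # something is listening on that port.
                with socket.create_connection((host, port), timeout=poll_s):
                    pass
            except OSError:
                still_down.append((host, port))
        pending = still_down
        if not pending or time.monotonic() >= deadline:
            break
        time.sleep(poll_s)
    return pending
```

For the single-server example above, a call such as `wait_for_servers([("127.0.0.1", 6694)])` would be a reasonable check from the same machine. Note that `0.0.0.0` is a bind address for the server; a client should connect to `127.0.0.1` or to the machine's reachable address instead.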

Training Data

The model was fine-tuned on VLN-CE navigation instruction data, including R2R and RxR-style trajectory annotations reformatted into the QwenVL conversation format.

Each training sample pairs a sequence of visual observations with a multi-turn conversation of instructions and responses.

Example format:

{
  "image": ["path/to/images/001.jpg", "...", "path/to/images/008.jpg"],
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nNavigation instruction or question."
    },
    {
      "from": "gpt",
      "value": "Model response."
    }
  ]
}
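When preparing data in this format, a quick structural check can catch malformed samples early. The sketch below is illustrative only: `validate_sample` is a hypothetical helper, and the rules it enforces (a list of image paths, turns that alternate `human` and `gpt` starting with `human`, and a final `gpt` turn) are assumptions inferred from the example above rather than a specification taken from StarVLA.

```python
def validate_sample(sample):
    """Return a list of problems found in one training sample (empty if none)."""
    problems = []

    images = sample.get("image")
    if not isinstance(images, list) or not all(isinstance(p, str) for p in images):
        problems.append("'image' must be a list of path strings")

    turns = sample.get("conversations")
    if not isinstance(turns, list) or not turns:
        problems.append("'conversations' must be a non-empty list")
        return problems

    for i, turn in enumerate(turns):
        # Assumed convention: even turns come from the human, odd turns from the model.
        expected = "human" if i % 2 == 0 else "gpt"
        if not isinstance(turn, dict):
            problems.append(f"turn {i} is not an object")
            continue
        if turn.get("from") != expected:
            problems.append(f"turn {i}: expected 'from' == {expected!r}")
        if not isinstance(turn.get("value"), str):
            problems.append(f"turn {i}: 'value' must be a string")

    if len(turns) % 2 != 0:
        problems.append("conversation should end with a 'gpt' turn")
    return problems


sample = {
    "image": ["path/to/images/001.jpg", "path/to/images/008.jpg"],
    "conversations": [
        {"from": "human", "value": "<image>\nNavigation instruction or question."},
        {"from": "gpt", "value": "Model response."},
    ],
}
print(validate_sample(sample))  # prints []
```

The check deliberately does not compare the number of `<image>` placeholders with the number of image paths, because the source example does not make that relationship explicit.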

Evaluation

Evaluation should be performed using the VLN-CE evaluation pipeline in StarVLA together with the external VLN-CE simulator environment.

The recommended setup separates:

The StarVLA / QwenVL model server environment.
The VLN-CE simulator environment.

The two environments communicate over a websocket connection during evaluation.

For the VLN-CE simulator setup, please refer to:

StarVLA-VLN-CE-Evaluation

Limitations

This model is specialized for VLN-CE-style vision-language navigation tasks.
It is not intended as a general-purpose chatbot.
Performance may degrade outside indoor navigation environments or when the input format differs from the training format.
Intended usage depends on the StarVLA evaluation pipeline.

Citation

If you use this model, please cite the StarVLA project and the original datasets used for VLN-CE training/evaluation.

@misc{starvla,
  title = {StarVLA: Vision-Language-Action Framework},
  author = {StarVLA Contributors},
  year = {2025},
  url = {https://github.com/liyaxuanliyaxuan/StarVLA}
}
