StarVLA-VLNCE-Qwen3VL-4B

StarVLA-VLNCE-Qwen3VL-4B is a Qwen3-VL-based vision-language model fine-tuned for VLN-CE-style vision-language navigation tasks in the StarVLA framework.

It is intended to serve as the QwenVL VLM server checkpoint for VLN-CE evaluation.

Model Details

  • Model type: Vision-Language Model
  • Base model: Qwen3-VL-4B-Instruct
  • Framework: StarVLA
  • Task: Vision-and-Language Navigation in Continuous Environments (VLN-CE)
  • Training datasets: R2R and RxR formatted for QwenVL-style conversations
  • Model repo: Ricky06662/StarVLA-VLNCE-Qwen3VL-4B

Intended Use

This model is designed for VLN-CE evaluation with StarVLA. It can be launched as a standalone WebSocket VLM server, which the VLN-CE evaluator then queries.

Example usage:

bash examples/VLN-CE/eval_files/run_qwenvl_vlm_server.sh Ricky06662/StarVLA-VLNCE-Qwen3VL-4B

You can also specify the checkpoint through environment variables:

CKPT=Ricky06662/StarVLA-VLNCE-Qwen3VL-4B \
GPU_ID=0 \
SERVER_HOST=0.0.0.0 \
PORT=6694 \
bash examples/VLN-CE/eval_files/run_qwenvl_vlm_server.sh

For multi-GPU evaluation, start one independent server per GPU:

GPU_IDS=0,1,2,3 \
PORT=6694 \
bash examples/VLN-CE/eval_files/run_qwenvl_vlm_server.sh Ricky06662/StarVLA-VLNCE-Qwen3VL-4B
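
To make the "one server per GPU" layout concrete, here is a minimal sketch that only prints a GPU-to-port plan and launches nothing. The helper name `plan_servers` is hypothetical, and the port rule (base port plus the GPU's position in the list) is an assumption made for illustration; this card does not state how `run_qwenvl_vlm_server.sh` actually assigns ports, so check the script before relying on it.

```bash
#!/usr/bin/env bash
# Sketch only: print one "gpu=<id> port=<port>" line per GPU.
# ASSUMPTION: each server listens on base_port + index; the real
# launcher script may assign ports differently.

plan_servers() (
  # Runs in a subshell, so the IFS change below does not leak out.
  gpu_ids="$1"
  base_port="$2"
  i=0
  IFS=','
  for gpu in $gpu_ids; do
    echo "gpu=$gpu port=$((base_port + i))"
    i=$((i + 1))
  done
)

# Defaults mirror the values used in the example commands on this card.
plan_servers "${GPU_IDS:-0,1,2,3}" "${PORT:-6694}"
```

A plan like this can help when configuring the evaluator side, which needs to know which port each server instance listens on.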

Training Data

The model was fine-tuned on VLN-CE navigation instruction data, including R2R and RxR-style trajectory annotations reformatted into the QwenVL conversation format.

Each training sample pairs a sequence of visual observations with a multi-turn conversation of instructions and responses.

Example format:

{
  "image": ["path/to/images/001.jpg", "...", "path/to/images/008.jpg"],
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nNavigation instruction or question."
    },
    {
      "from": "gpt",
      "value": "Model response."
    }
  ]
}

Evaluation

Evaluation should be performed using the VLN-CE evaluation pipeline in StarVLA together with the external VLN-CE simulator environment.

The recommended setup separates:

  1. The StarVLA / QwenVL model server environment.
  2. The VLN-CE simulator environment.

The two environments communicate over WebSocket during evaluation.
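
Because the model server and the simulator run as separate processes, it can help to confirm that the server port is accepting connections before starting the evaluator. The sketch below is one way to do that with bash's built-in `/dev/tcp` redirection. The function names `port_open` and `wait_for_server` are hypothetical helpers and are not part of StarVLA. A successful TCP connect only shows that something is listening; it does not prove the model has finished loading.

```bash
#!/usr/bin/env bash
# Sketch: poll a TCP port until it accepts connections.
# Requires bash (uses /dev/tcp); plain sh does not support this.

port_open() {
  local host="$1" port="$2"
  # Open a connection in a subshell so the descriptor closes on exit.
  # Fails (non-zero status) if nothing is listening on host:port.
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

wait_for_server() {
  local host="$1" port="$2" retries="${3:-30}"
  local attempt
  for ((attempt = 1; attempt <= retries; attempt++)); do
    if port_open "$host" "$port"; then
      return 0
    fi
    sleep 1  # wait one second between attempts
  done
  return 1
}
```

For example, `wait_for_server 127.0.0.1 6694 60 && echo "server is up"` waits up to roughly a minute for a local server on the port used in the commands above. Note that `0.0.0.0` is a bind address for the server; a client should connect to `127.0.0.1` or the server machine's actual address.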

For the VLN-CE simulator setup, please refer to:

StarVLA-VLN-CE-Evaluation

Limitations

  • This model is specialized for VLN-CE-style vision-language navigation tasks.
  • It is not intended as a general-purpose chatbot.
  • Performance may degrade outside indoor navigation environments or when the input format differs from the training format.
  • It is meant to be run through the StarVLA evaluation pipeline.

Citation

If you use this model, please cite the StarVLA project and the original datasets used for VLN-CE training/evaluation.

@misc{starvla,
  title = {StarVLA: Vision-Language-Action Framework},
  author = {StarVLA Contributors},
  year = {2025},
  url = {https://github.com/liyaxuanliyaxuan/StarVLA}
}