ReasonFlow VLA — Stage 1: Robot Grounding SFT

This is the Stage 1 checkpoint of ReasonFlow VLA, a multi-stage Vision-Language-Action system developed as a Final Year Project at Universiti Teknikal Malaysia Melaka (UTeM).

It is a Qwen3.5-4B (natively multimodal) model fine-tuned via supervised instruction tuning across eight robot-domain datasets to establish foundational robotic knowledge before any RL or distillation is applied in later stages.

This checkpoint is the shared initialisation point for both the Teacher and Student models in Stage 2 (GRPO Teacher-Student Distillation).


Model Details

Property Value
Base Model Qwen/Qwen3.5-4B
Modality Vision + Language (natively multimodal)
Training Method Supervised Fine-Tuning (SFT) via Unsloth
Training Samples ~560K
Training Steps 750K (1 epoch)
Learning Rate 1e-5
Batch Size 1 (gradient accumulation = 8, effective batch = 8)
Image Resolution 448 × 448

Training Data

The model was trained on eight curated datasets spanning trajectory prediction, affordance grounding, task planning, video QA, and general visual captioning:

Dataset Task Samples Used
MolmoAct Trajectory 2D end-effector trajectory prediction ~200K (10%)
RoboVQA Robot visual question answering ~100K (10%)
RoboFAC Failure analysis & correction QA ~64K (100%)
ShareRobot Affordance Affordance bounding box prediction ~6.5K (100%)
ShareRobot Planning Multi-step task planning QA ~100K (10%)
Pixmo Cap Dense image captioning ~50K (10%)
Pixmo Cap-QA Caption-grounded QA ~50K (10%)
Pixmo AMA Open-ended visual QA ~50K (10%)

A compact pre-materialized subset (~51K samples) used for cloud training is available at shreethar/FYP-Stage2-dataset.

Sampling policy: datasets with more than 100K samples are sampled at ~10%; datasets smaller than 100K are kept in full.


Task Format

All samples follow a two-turn chat format. Trajectory tasks output normalised waypoint lists; QA tasks output free-form text.

Trajectory example (MolmoAct):

User:   [image] You are a robot manipulation assistant. Given an observation image and a
        task instruction, predict the end-effector's 2D trajectory as 5 waypoints.
        Output ONLY the coordinate list: [[x1,y1],[x2,y2],[x3,y3],[x4,y4],[x5,y5]]
        Task: Pick up the red cup.

Model:  [[142,308],[198,275],[241,233],[280,195],[310,162]]

QA example (RoboFAC):

User:   [video frames] You are a robot manipulation assistant. Answer questions about
        robot tasks, object affordances, and manipulation strategies.
        Why did the robot fail to grasp the object?

Model:  The gripper approached from the wrong angle — the contact point missed the
        graspable region of the handle. The robot should adjust its approach trajectory
        to align with the object's principal axis.

Hardware

Component Spec
GPU NVIDIA RTX A4000 (16 GB VRAM)
RAM 128 GB @ 4400 MT/s
CPU Intel Xeon w3-2425
Training Time ~10 days

Usage

from transformers import AutoProcessor, AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "shreethar/stage1_unsloth",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("shreethar/stage1_unsloth")

Project Context

This checkpoint is Stage 1 of the ReasonFlow VLA pipeline:

Stage Description Status
1 Robot Grounding SFT ← this model ✅ Done
2 GRPO Teacher · Student Distillation 🔄 In Progress
3 Action Expert — CFM Adapter 📋 Planned
4 Partial VLM Coupling · Spatial Forcing 📋 Planned
5 LIBERO Evaluation · RL Fine-Tuning ⚗️ Optional

Full project repository: ReasonFlow VLA on GitHub


Citation

If you use this checkpoint, please cite:

@misc{shreethar2025reasonflow,
  title   = {ReasonFlow VLA: A Multi-Stage Vision-Language-Action System with
             Latent Reasoning and Conditional Flow Matching},
  author  = {Shreethar},
  year    = {2025},
  note    = {Final Year Project, Universiti Teknikal Malaysia Melaka (UTeM)},
  url     = {https://huggingface.co/shreethar/stage1_unsloth}
}
Downloads last month
1,017
Safetensors
Model size
5B params
Tensor type
BF16
·
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for shreethar/stage1_unsloth

Finetuned
Qwen/Qwen3.5-4B
Finetuned
(307)
this model