ReasonFlow VLA — Stage 1: Robot Grounding SFT

This is the Stage 1 checkpoint of ReasonFlow VLA, a multi-stage Vision-Language-Action system developed as a Final Year Project at Universiti Teknikal Malaysia Melaka (UTeM).

It is a Qwen3.5-4B (natively multimodal) model fine-tuned via supervised instruction tuning across eight robot-domain datasets to establish foundational robotic knowledge before any RL or distillation is applied in later stages.

This checkpoint is the shared initialisation point for both the Teacher and Student models in Stage 2 (GRPO Teacher-Student Distillation).

Model Details

Property	Value
Base Model	Qwen/Qwen3.5-4B
Modality	Vision + Language (natively multimodal)
Training Method	Supervised Fine-Tuning (SFT) via Unsloth
Training Samples	~560K
Training Steps	~~750K (~~1 epoch)
Learning Rate	1e-5
Batch Size	1 (gradient accumulation = 8, effective batch = 8)
Image Resolution	448 × 448

Training Data

The model was trained on eight curated datasets spanning trajectory prediction, affordance grounding, task planning, video QA, and general visual captioning:

Dataset	Task	Samples Used
MolmoAct Trajectory	2D end-effector trajectory prediction	~200K (10%)
RoboVQA	Robot visual question answering	~100K (10%)
RoboFAC	Failure analysis & correction QA	~64K (100%)
ShareRobot Affordance	Affordance bounding box prediction	~6.5K (100%)
ShareRobot Planning	Multi-step task planning QA	~100K (10%)
Pixmo Cap	Dense image captioning	~50K (10%)
Pixmo Cap-QA	Caption-grounded QA	~50K (10%)
Pixmo AMA	Open-ended visual QA	~50K (10%)

A compact pre-materialized subset (~51K samples) used for cloud training is available at shreethar/FYP-Stage2-dataset.

Sampling policy: datasets with more than 100K samples are sampled at ~10%; datasets smaller than 100K are kept in full.

Task Format

All samples follow a two-turn chat format. Trajectory tasks output normalised waypoint lists; QA tasks output free-form text.

Trajectory example (MolmoAct):

User:   [image] You are a robot manipulation assistant. Given an observation image and a
        task instruction, predict the end-effector's 2D trajectory as 5 waypoints.
        Output ONLY the coordinate list: [[x1,y1],[x2,y2],[x3,y3],[x4,y4],[x5,y5]]
        Task: Pick up the red cup.

Model:  [[142,308],[198,275],[241,233],[280,195],[310,162]]

QA example (RoboFAC):

User:   [video frames] You are a robot manipulation assistant. Answer questions about
        robot tasks, object affordances, and manipulation strategies.
        Why did the robot fail to grasp the object?

Model:  The gripper approached from the wrong angle — the contact point missed the
        graspable region of the handle. The robot should adjust its approach trajectory
        to align with the object's principal axis.

Hardware

Component	Spec
GPU	NVIDIA RTX A4000 (16 GB VRAM)
RAM	128 GB @ 4400 MT/s
CPU	Intel Xeon w3-2425
Training Time	~10 days

Usage

from transformers import AutoProcessor, AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "shreethar/stage1_unsloth",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("shreethar/stage1_unsloth")

Project Context

This checkpoint is Stage 1 of the ReasonFlow VLA pipeline:

Stage	Description	Status
1	Robot Grounding SFT ← this model	✅ Done
2	GRPO Teacher · Student Distillation	🔄 In Progress
3	Action Expert — CFM Adapter	📋 Planned
4	Partial VLM Coupling · Spatial Forcing	📋 Planned
5	LIBERO Evaluation · RL Fine-Tuning	⚗️ Optional

Full project repository: ReasonFlow VLA on GitHub

Citation

If you use this checkpoint, please cite:

@misc{shreethar2025reasonflow,
  title   = {ReasonFlow VLA: A Multi-Stage Vision-Language-Action System with
             Latent Reasoning and Conditional Flow Matching},
  author  = {Shreethar},
  year    = {2025},
  note    = {Final Year Project, Universiti Teknikal Malaysia Melaka (UTeM)},
  url     = {https://huggingface.co/shreethar/stage1_unsloth}
}

Downloads last month: 1,017

Safetensors

Model size

5B params

Tensor type

BF16

F32

Model tree for shreethar/stage1_unsloth

Base model

Qwen/Qwen3.5-4B-Base

Finetuned

Qwen/Qwen3.5-4B

Finetuned

(307)

this model