Instructions to use shreethar/stage1_unsloth with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Local Apps Settings
- Unsloth Studio
How to use shreethar/stage1_unsloth with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for shreethar/stage1_unsloth to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for shreethar/stage1_unsloth to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for shreethar/stage1_unsloth to start chatting
Load model with FastModel
pip install unsloth from unsloth import FastModel model, tokenizer = FastModel.from_pretrained( model_name="shreethar/stage1_unsloth", max_seq_length=2048, )
ReasonFlow VLA — Stage 1: Robot Grounding SFT
This is the Stage 1 checkpoint of ReasonFlow VLA, a multi-stage Vision-Language-Action system developed as a Final Year Project at Universiti Teknikal Malaysia Melaka (UTeM).
It is a Qwen3.5-4B (natively multimodal) model fine-tuned via supervised instruction tuning across eight robot-domain datasets to establish foundational robotic knowledge before any RL or distillation is applied in later stages.
This checkpoint is the shared initialisation point for both the Teacher and Student models in Stage 2 (GRPO Teacher-Student Distillation).
Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-4B |
| Modality | Vision + Language (natively multimodal) |
| Training Method | Supervised Fine-Tuning (SFT) via Unsloth |
| Training Samples | ~560K |
| Training Steps | |
| Learning Rate | 1e-5 |
| Batch Size | 1 (gradient accumulation = 8, effective batch = 8) |
| Image Resolution | 448 × 448 |
Training Data
The model was trained on eight curated datasets spanning trajectory prediction, affordance grounding, task planning, video QA, and general visual captioning:
| Dataset | Task | Samples Used |
|---|---|---|
| MolmoAct Trajectory | 2D end-effector trajectory prediction | ~200K (10%) |
| RoboVQA | Robot visual question answering | ~100K (10%) |
| RoboFAC | Failure analysis & correction QA | ~64K (100%) |
| ShareRobot Affordance | Affordance bounding box prediction | ~6.5K (100%) |
| ShareRobot Planning | Multi-step task planning QA | ~100K (10%) |
| Pixmo Cap | Dense image captioning | ~50K (10%) |
| Pixmo Cap-QA | Caption-grounded QA | ~50K (10%) |
| Pixmo AMA | Open-ended visual QA | ~50K (10%) |
A compact pre-materialized subset (~51K samples) used for cloud training is available at
shreethar/FYP-Stage2-dataset.
Sampling policy: datasets with more than 100K samples are sampled at ~10%; datasets smaller than 100K are kept in full.
Task Format
All samples follow a two-turn chat format. Trajectory tasks output normalised waypoint lists; QA tasks output free-form text.
Trajectory example (MolmoAct):
User: [image] You are a robot manipulation assistant. Given an observation image and a
task instruction, predict the end-effector's 2D trajectory as 5 waypoints.
Output ONLY the coordinate list: [[x1,y1],[x2,y2],[x3,y3],[x4,y4],[x5,y5]]
Task: Pick up the red cup.
Model: [[142,308],[198,275],[241,233],[280,195],[310,162]]
QA example (RoboFAC):
User: [video frames] You are a robot manipulation assistant. Answer questions about
robot tasks, object affordances, and manipulation strategies.
Why did the robot fail to grasp the object?
Model: The gripper approached from the wrong angle — the contact point missed the
graspable region of the handle. The robot should adjust its approach trajectory
to align with the object's principal axis.
Hardware
| Component | Spec |
|---|---|
| GPU | NVIDIA RTX A4000 (16 GB VRAM) |
| RAM | 128 GB @ 4400 MT/s |
| CPU | Intel Xeon w3-2425 |
| Training Time | ~10 days |
Usage
from transformers import AutoProcessor, AutoModelForImageTextToText
model = AutoModelForImageTextToText.from_pretrained(
"shreethar/stage1_unsloth",
torch_dtype="auto",
device_map="auto",
)
processor = AutoProcessor.from_pretrained("shreethar/stage1_unsloth")
Project Context
This checkpoint is Stage 1 of the ReasonFlow VLA pipeline:
| Stage | Description | Status |
|---|---|---|
| 1 | Robot Grounding SFT ← this model | ✅ Done |
| 2 | GRPO Teacher · Student Distillation | 🔄 In Progress |
| 3 | Action Expert — CFM Adapter | 📋 Planned |
| 4 | Partial VLM Coupling · Spatial Forcing | 📋 Planned |
| 5 | LIBERO Evaluation · RL Fine-Tuning | ⚗️ Optional |
Full project repository: ReasonFlow VLA on GitHub
Citation
If you use this checkpoint, please cite:
@misc{shreethar2025reasonflow,
title = {ReasonFlow VLA: A Multi-Stage Vision-Language-Action System with
Latent Reasoning and Conditional Flow Matching},
author = {Shreethar},
year = {2025},
note = {Final Year Project, Universiti Teknikal Malaysia Melaka (UTeM)},
url = {https://huggingface.co/shreethar/stage1_unsloth}
}
- Downloads last month
- 1,017