Qwen2.5-VL-3B-Instruct-RVQ-Human-Motion-CoT-PoC
This model is a motion-language variant of Qwen/Qwen2.5-VL-3B-Instruct.
Given a natural-language action prompt, it produces a first person chain of thought about the movement as well as tokens that can be decoded by RVQ.
This VLA model was not trained on image -> text -> motion, but only text -> motion. It was an experiment if it will be able to create motion related to the image anyway.
The model proven somewhat capable of that, seeing an image of a man walking, it would produce a walking motion, for example. But the capability is very limited.
This model is just a proof of conecept that usually can generate basic moves but usually fails on more complex ones.
RVQ used: https://huggingface.co/Wojtekb30/Motion-RVQ-263d-reconstructor-humanML
Those tokens are decoded by the included RVQ decoder into a 3D human motion sequence (animation).
In order for the model to generate moves, you must use system prompt:
You are an embodied AI. You reason about your physical state and output precise motor actions inside <move></move> tags.
Recommended temperature of 0.5.
What makes this model different
- It is trained to emit discrete movement tokens directly in chat output in between reasoning text.
- It uses explicit motion token vocabulary:
<move>,</move><m_{level}_{value}>wherelevel in [0..3]andvalue in [0..1023]
- The generated response can contain both language and motion tokens in one assistant turn.
- It takes only 3 movement tokens for RVQ to decode a coarse 0.5 seconds of motion, and 10 tokens for detailed one. A robot or avatar can keep up even if LLM inference is slow.
Output format
Typical response pattern:
I lean forward and begin stepping with a steady pace.
<move><m_0_123><m_1_54><m_2_901><m_3_77>...</move>
Motion decoding expects 4 RVQ levels per frame. In practice:
- one frame = 4 tokens
- token order is grouped frame-by-frame
- each token is tagged by its RVQ level in the token text itself
Quick start
Please look into RunVLA.py file.
Training notes
The included training script performs full fine-tuning with:
- Base model (fully trained except vision encoder)
- Added special tokens for motion vocabulary (4 x 1024 RVQ bins + move delimiters)
- Chat-formatted supervised fine-tuning (SFT)
- Loss masking to train on assistant completion only
- This model was brought down to loss 1.0
Trained on: https://huggingface.co/Wojtekb30/language-action-RVQ-CoT-humanML
Limitations
- Motion quality depends on RVQ decoder fit and token correctness.
- Invalid or incomplete token sequences can fail to decode cleanly.
- The provided visualizer is a simple skeleton renderer for quick inspection.
- This model is intended for research and prototyping.
- This model is not a true VLA, but an experiment if it will be able to use images in motion generation without being directly trained to do that.
Again, this model is just a proof of conecept that usually can generate basic moves but usually fails on more complex ones.
- Downloads last month
- 10
Model tree for Wojtekb30/Qwen2.5-VL-3B-Instruct-RVQ-Human-Motion-CoT-PoC
Base model
Qwen/Qwen2.5-VL-3B-Instruct