Qwen2.5-VL-3B-Instruct-RVQ-Human-Motion-CoT-PoC

This model is a motion-language variant of Qwen/Qwen2.5-VL-3B-Instruct.

Given a natural-language action prompt, it produces a first person chain of thought about the movement as well as tokens that can be decoded by RVQ.

This VLA model was not trained on image -> text -> motion, but only text -> motion. It was an experiment if it will be able to create motion related to the image anyway.

The model proven somewhat capable of that, seeing an image of a man walking, it would produce a walking motion, for example. But the capability is very limited.

This model is just a proof of conecept that usually can generate basic moves but usually fails on more complex ones.

RVQ used: https://huggingface.co/Wojtekb30/Motion-RVQ-263d-reconstructor-humanML

Those tokens are decoded by the included RVQ decoder into a 3D human motion sequence (animation).

In order for the model to generate moves, you must use system prompt:

You are an embodied AI. You reason about your physical state and output precise motor actions inside <move></move> tags.

Recommended temperature of 0.5.

What makes this model different

It is trained to emit discrete movement tokens directly in chat output in between reasoning text.
It uses explicit motion token vocabulary:
- <move>, </move>
- <m_{level}_{value}> where level in [0..3] and value in [0..1023]
The generated response can contain both language and motion tokens in one assistant turn.
It takes only 3 movement tokens for RVQ to decode a coarse 0.5 seconds of motion, and 10 tokens for detailed one. A robot or avatar can keep up even if LLM inference is slow.

Output format

Typical response pattern:

I lean forward and begin stepping with a steady pace.
<move><m_0_123><m_1_54><m_2_901><m_3_77>...</move>

Motion decoding expects 4 RVQ levels per frame. In practice:

one frame = 4 tokens
token order is grouped frame-by-frame
each token is tagged by its RVQ level in the token text itself

Quick start

Please look into RunVLA.py file.

Training notes

The included training script performs full fine-tuning with:

Base model (fully trained except vision encoder)
Added special tokens for motion vocabulary (4 x 1024 RVQ bins + move delimiters)
Chat-formatted supervised fine-tuning (SFT)
Loss masking to train on assistant completion only
This model was brought down to loss 1.0

Trained on: https://huggingface.co/Wojtekb30/language-action-RVQ-CoT-humanML

Limitations

Motion quality depends on RVQ decoder fit and token correctness.
Invalid or incomplete token sequences can fail to decode cleanly.
The provided visualizer is a simple skeleton renderer for quick inspection.
This model is intended for research and prototyping.
This model is not a true VLA, but an experiment if it will be able to use images in motion generation without being directly trained to do that.

Again, this model is just a proof of conecept that usually can generate basic moves but usually fails on more complex ones.

Downloads last month: 10

Safetensors

Model size

4B params

Tensor type

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Wojtekb30/Qwen2.5-VL-3B-Instruct-RVQ-Human-Motion-CoT-PoC

Base model

Qwen/Qwen2.5-VL-3B-Instruct

Finetuned

(793)

this model

Dataset used to train Wojtekb30/Qwen2.5-VL-3B-Instruct-RVQ-Human-Motion-CoT-PoC

Collection including Wojtekb30/Qwen2.5-VL-3B-Instruct-RVQ-Human-Motion-CoT-PoC

Human motion generator proof of concept

Collection

My experiment on human motion generation using RVQ and Qwen2.5 1.5B LLM. Split into 2 LLMs trained in different ways. 1st seems better than 2nd. • 8 items • Updated 1 day ago