Qwen2.5-VL-3B-Instruct-RVQ-Human-Motion-CoT-PoC

This model is a motion-language variant of Qwen/Qwen2.5-VL-3B-Instruct.

Given a natural-language action prompt, it produces a first person chain of thought about the movement as well as tokens that can be decoded by RVQ.

This VLA model was not trained on image -> text -> motion, but only text -> motion. It was an experiment if it will be able to create motion related to the image anyway.

The model proven somewhat capable of that, seeing an image of a man walking, it would produce a walking motion, for example. But the capability is very limited.

This model is just a proof of conecept that usually can generate basic moves but usually fails on more complex ones.

RVQ used: https://huggingface.co/Wojtekb30/Motion-RVQ-263d-reconstructor-humanML

Those tokens are decoded by the included RVQ decoder into a 3D human motion sequence (animation).

In order for the model to generate moves, you must use system prompt:

You are an embodied AI. You reason about your physical state and output precise motor actions inside <move></move> tags.

Recommended temperature of 0.5.

What makes this model different

  • It is trained to emit discrete movement tokens directly in chat output in between reasoning text.
  • It uses explicit motion token vocabulary:
    • <move>, </move>
    • <m_{level}_{value}> where level in [0..3] and value in [0..1023]
  • The generated response can contain both language and motion tokens in one assistant turn.
  • It takes only 3 movement tokens for RVQ to decode a coarse 0.5 seconds of motion, and 10 tokens for detailed one. A robot or avatar can keep up even if LLM inference is slow.

Output format

Typical response pattern:

I lean forward and begin stepping with a steady pace.
<move><m_0_123><m_1_54><m_2_901><m_3_77>...</move>

Motion decoding expects 4 RVQ levels per frame. In practice:

  • one frame = 4 tokens
  • token order is grouped frame-by-frame
  • each token is tagged by its RVQ level in the token text itself

Quick start

Please look into RunVLA.py file.

Training notes

The included training script performs full fine-tuning with:

  • Base model (fully trained except vision encoder)
  • Added special tokens for motion vocabulary (4 x 1024 RVQ bins + move delimiters)
  • Chat-formatted supervised fine-tuning (SFT)
  • Loss masking to train on assistant completion only
  • This model was brought down to loss 1.0

Trained on: https://huggingface.co/Wojtekb30/language-action-RVQ-CoT-humanML

Limitations

  • Motion quality depends on RVQ decoder fit and token correctness.
  • Invalid or incomplete token sequences can fail to decode cleanly.
  • The provided visualizer is a simple skeleton renderer for quick inspection.
  • This model is intended for research and prototyping.
  • This model is not a true VLA, but an experiment if it will be able to use images in motion generation without being directly trained to do that.

Again, this model is just a proof of conecept that usually can generate basic moves but usually fails on more complex ones.

Downloads last month
10
Safetensors
Model size
4B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Wojtekb30/Qwen2.5-VL-3B-Instruct-RVQ-Human-Motion-CoT-PoC

Finetuned
(793)
this model

Dataset used to train Wojtekb30/Qwen2.5-VL-3B-Instruct-RVQ-Human-Motion-CoT-PoC

Collection including Wojtekb30/Qwen2.5-VL-3B-Instruct-RVQ-Human-Motion-CoT-PoC