Qwen2.5-1.5B-Instruct-RVQ-Human-Motion-CoT-PoC
This model is a motion-language variant of Qwen/Qwen2.5-1.5B-Instruct.
Given a natural-language action prompt, it produces a first person chain of thought about the movement as well as tokens that can be decoded by RVQ.
This model is just a proof of conecept that usually can generate basic moves but usually fails on more complex ones.
RVQ used: https://huggingface.co/Wojtekb30/Motion-RVQ-263d-reconstructor-humanML
Those tokens are decoded by the included RVQ decoder into a 3D human motion sequence (animation).
This repository contains one of 2 variants I trained, the 2nd is under:
https://huggingface.co/Wojtekb30/Qwen2.5-1.5B-Instruct-RVQ-Human-Motion-CoT-PoC-2
The variants do better on some animations, but do worse on others.
In order for the model to generate moves, you must use system prompt:
You are an embodied AI. You reason about your physical state and output precise motor actions inside <move></move> tags.
Recommended temperature of 0.5.
What makes this model different
- It is trained to emit discrete movement tokens directly in chat output in between reasoning text.
- It uses explicit motion token vocabulary:
<move>,</move><m_{level}_{value}>wherelevel in [0..3]andvalue in [0..1023]
- The generated response can contain both language and motion tokens in one assistant turn.
- It takes only 3 movement tokens for RVQ to decode a coarse 0.5 seconds of motion, and 10 tokens for detailed one. A robot or avatar can keep up even if LLM inference is slow.
Output format
Typical response pattern:
I lean forward and begin stepping with a steady pace.
<move><m_0_123><m_1_54><m_2_901><m_3_77>...</move>
Motion decoding expects 4 RVQ levels per frame. In practice:
- one frame = 4 tokens
- token order is grouped frame-by-frame
- each token is tagged by its RVQ level in the token text itself
Repository contents
model.safetensors: fine-tuned Qwen weightstokenizer.json,added_tokens.json,tokenizer_config.json: tokenizer with motion tokensrvq_model/rvq_model.py: RVQ VAE architecturervq_model/motion_rvq_weights.safetensors: RVQ decoder/quantizer weightsrvq_model/Mean.npy,rvq_model/Std.npy: normalization stats for motion reconstructionRunQwen.py: end-to-end generation + token parsing + RVQ decoding + animationTrainQwen.py: fine-tuning script used to train the model
Quick start
Install minimal dependencies:
pip install torch transformers safetensors numpy matplotlib
Run inference and visualize animation:
python RunQwen.py
Edit PROMPT_TEXT in RunQwen.py to try your own instruction.
Inference pipeline
- Load tokenizer + fine-tuned Qwen model.
- Prompt in chat format.
- Generate assistant response.
- Extract all
<m_level_value>tokens from<move>...</move>. - Rebuild RVQ token matrix with shape
[4, T]. - Sum quantizer embeddings across 4 levels.
- Decode latent sequence with the RVQ decoder.
- De-normalize with
Std.npyandMean.npy. - Render skeleton animation.
Minimal parsing example
import re
response = """... assistant text ..."""
move_blocks = re.findall(r'<move>(.*?)</move>', response, re.DOTALL)
tokens = []
for block in move_blocks:
tokens.extend(re.findall(r'<m_(\d+)_(\d+)>', block))
# tokens -> list of (level, value) pairs
Training notes
The included training script (TrainQwen.py) performs full fine-tuning with:
- Base model:
Qwen/Qwen2.5-1.5B-Instruct - Added special tokens for motion vocabulary (4 x 1024 RVQ bins + move delimiters)
- Chat-formatted supervised fine-tuning (SFT)
- Loss masking to train on assistant completion only
- This model was brought down to loss 1.0
Trained on: https://huggingface.co/Wojtekb30/language-action-RVQ-CoT-humanML
Limitations
- Motion quality depends on RVQ decoder fit and token correctness.
- Invalid or incomplete token sequences can fail to decode cleanly.
- The provided visualizer is a simple skeleton renderer for quick inspection.
- This model is intended for research and prototyping.
Again, this model is just a proof of conecept that usually can generate basic moves but usually fails on more complex ones.
- Downloads last month
- 56