Qwen2.5-1.5B-Instruct-RVQ-Human-Motion-CoT-PoC

This model is a motion-language variant of Qwen/Qwen2.5-1.5B-Instruct.

Given a natural-language action prompt, it produces a first person chain of thought about the movement as well as tokens that can be decoded by RVQ.

This model is just a proof of conecept that usually can generate basic moves but usually fails on more complex ones.

RVQ used: https://huggingface.co/Wojtekb30/Motion-RVQ-263d-reconstructor-humanML

Those tokens are decoded by the included RVQ decoder into a 3D human motion sequence (animation).

This repository contains one of 2 variants I trained, the 2nd is under:

https://huggingface.co/Wojtekb30/Qwen2.5-1.5B-Instruct-RVQ-Human-Motion-CoT-PoC-2

The variants do better on some animations, but do worse on others.

In order for the model to generate moves, you must use system prompt:

You are an embodied AI. You reason about your physical state and output precise motor actions inside <move></move> tags.

Recommended temperature of 0.5.

What makes this model different

It is trained to emit discrete movement tokens directly in chat output in between reasoning text.
It uses explicit motion token vocabulary:
- <move>, </move>
- <m_{level}_{value}> where level in [0..3] and value in [0..1023]
The generated response can contain both language and motion tokens in one assistant turn.
It takes only 3 movement tokens for RVQ to decode a coarse 0.5 seconds of motion, and 10 tokens for detailed one. A robot or avatar can keep up even if LLM inference is slow.

Output format

Typical response pattern:

I lean forward and begin stepping with a steady pace.
<move><m_0_123><m_1_54><m_2_901><m_3_77>...</move>

Motion decoding expects 4 RVQ levels per frame. In practice:

one frame = 4 tokens
token order is grouped frame-by-frame
each token is tagged by its RVQ level in the token text itself

Repository contents

model.safetensors: fine-tuned Qwen weights
tokenizer.json, added_tokens.json, tokenizer_config.json: tokenizer with motion tokens
rvq_model/rvq_model.py: RVQ VAE architecture
rvq_model/motion_rvq_weights.safetensors: RVQ decoder/quantizer weights
rvq_model/Mean.npy, rvq_model/Std.npy: normalization stats for motion reconstruction
RunQwen.py: end-to-end generation + token parsing + RVQ decoding + animation
TrainQwen.py: fine-tuning script used to train the model

Quick start

Install minimal dependencies:

pip install torch transformers safetensors numpy matplotlib

Run inference and visualize animation:

python RunQwen.py

Edit PROMPT_TEXT in RunQwen.py to try your own instruction.

Inference pipeline

Load tokenizer + fine-tuned Qwen model.
Prompt in chat format.
Generate assistant response.
Extract all <m_level_value> tokens from <move>...</move>.
Rebuild RVQ token matrix with shape [4, T].
Sum quantizer embeddings across 4 levels.
Decode latent sequence with the RVQ decoder.
De-normalize with Std.npy and Mean.npy.
Render skeleton animation.

Minimal parsing example

import re

response = """... assistant text ..."""
move_blocks = re.findall(r'<move>(.*?)</move>', response, re.DOTALL)
tokens = []
for block in move_blocks:
    tokens.extend(re.findall(r'<m_(\d+)_(\d+)>', block))

# tokens -> list of (level, value) pairs

Training notes

The included training script (TrainQwen.py) performs full fine-tuning with:

Base model: Qwen/Qwen2.5-1.5B-Instruct
Added special tokens for motion vocabulary (4 x 1024 RVQ bins + move delimiters)
Chat-formatted supervised fine-tuning (SFT)
Loss masking to train on assistant completion only
This model was brought down to loss 1.0

Trained on: https://huggingface.co/Wojtekb30/language-action-RVQ-CoT-humanML