Qwen2.5-1.5B-Instruct-RVQ-Human-Motion-CoT-PoC

This model is a motion-language variant of Qwen/Qwen2.5-1.5B-Instruct.

Given a natural-language action prompt, it produces a first person chain of thought about the movement as well as tokens that can be decoded by RVQ.

This model is just a proof of conecept that usually can generate basic moves but usually fails on more complex ones.

RVQ used: https://huggingface.co/Wojtekb30/Motion-RVQ-263d-reconstructor-humanML

Those tokens are decoded by the included RVQ decoder into a 3D human motion sequence (animation).

This repository contains one of 2 variants I trained, the 2nd is under:

https://huggingface.co/Wojtekb30/Qwen2.5-1.5B-Instruct-RVQ-Human-Motion-CoT-PoC-2

The variants do better on some animations, but do worse on others.

In order for the model to generate moves, you must use system prompt:

You are an embodied AI. You reason about your physical state and output precise motor actions inside <move></move> tags.

Recommended temperature of 0.5.

What makes this model different

  • It is trained to emit discrete movement tokens directly in chat output in between reasoning text.
  • It uses explicit motion token vocabulary:
    • <move>, </move>
    • <m_{level}_{value}> where level in [0..3] and value in [0..1023]
  • The generated response can contain both language and motion tokens in one assistant turn.
  • It takes only 3 movement tokens for RVQ to decode a coarse 0.5 seconds of motion, and 10 tokens for detailed one. A robot or avatar can keep up even if LLM inference is slow.

Output format

Typical response pattern:

I lean forward and begin stepping with a steady pace.
<move><m_0_123><m_1_54><m_2_901><m_3_77>...</move>

Motion decoding expects 4 RVQ levels per frame. In practice:

  • one frame = 4 tokens
  • token order is grouped frame-by-frame
  • each token is tagged by its RVQ level in the token text itself

Repository contents

  • model.safetensors: fine-tuned Qwen weights
  • tokenizer.json, added_tokens.json, tokenizer_config.json: tokenizer with motion tokens
  • rvq_model/rvq_model.py: RVQ VAE architecture
  • rvq_model/motion_rvq_weights.safetensors: RVQ decoder/quantizer weights
  • rvq_model/Mean.npy, rvq_model/Std.npy: normalization stats for motion reconstruction
  • RunQwen.py: end-to-end generation + token parsing + RVQ decoding + animation
  • TrainQwen.py: fine-tuning script used to train the model

Quick start

Install minimal dependencies:

pip install torch transformers safetensors numpy matplotlib

Run inference and visualize animation:

python RunQwen.py

Edit PROMPT_TEXT in RunQwen.py to try your own instruction.

Inference pipeline

  1. Load tokenizer + fine-tuned Qwen model.
  2. Prompt in chat format.
  3. Generate assistant response.
  4. Extract all <m_level_value> tokens from <move>...</move>.
  5. Rebuild RVQ token matrix with shape [4, T].
  6. Sum quantizer embeddings across 4 levels.
  7. Decode latent sequence with the RVQ decoder.
  8. De-normalize with Std.npy and Mean.npy.
  9. Render skeleton animation.

Minimal parsing example

import re

response = """... assistant text ..."""
move_blocks = re.findall(r'<move>(.*?)</move>', response, re.DOTALL)
tokens = []
for block in move_blocks:
    tokens.extend(re.findall(r'<m_(\d+)_(\d+)>', block))

# tokens -> list of (level, value) pairs

Training notes

The included training script (TrainQwen.py) performs full fine-tuning with:

  • Base model: Qwen/Qwen2.5-1.5B-Instruct
  • Added special tokens for motion vocabulary (4 x 1024 RVQ bins + move delimiters)
  • Chat-formatted supervised fine-tuning (SFT)
  • Loss masking to train on assistant completion only
  • This model was brought down to loss 1.0

Trained on: https://huggingface.co/Wojtekb30/language-action-RVQ-CoT-humanML

Limitations

  • Motion quality depends on RVQ decoder fit and token correctness.
  • Invalid or incomplete token sequences can fail to decode cleanly.
  • The provided visualizer is a simple skeleton renderer for quick inspection.
  • This model is intended for research and prototyping.

Again, this model is just a proof of conecept that usually can generate basic moves but usually fails on more complex ones.

Downloads last month
56
Safetensors
Model size
2B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Wojtekb30/Qwen2.5-1.5B-Instruct-RVQ-Human-Motion-CoT-PoC

Finetuned
(1592)
this model

Dataset used to train Wojtekb30/Qwen2.5-1.5B-Instruct-RVQ-Human-Motion-CoT-PoC

Space using Wojtekb30/Qwen2.5-1.5B-Instruct-RVQ-Human-Motion-CoT-PoC 1

Collection including Wojtekb30/Qwen2.5-1.5B-Instruct-RVQ-Human-Motion-CoT-PoC