Kimodo Qwen3 Projection Layer

A pre-trained linear projection layer that maps Qwen3-8B text embeddings into the LLM2Vec embedding space used by the Kimodo motion diffusion model.

Why is this needed?

Kimodo's denoiser was trained on LLM2Vec (Llama-3-8B-based) embeddings. Qwen3-8B produces embeddings of the same dimension (4096), but they lie in a completely different semantic space, so swapping in Qwen3-8B directly yields poor motion quality. This projection layer bridges the gap with a learned linear map (weight matrix plus bias) from the Qwen3-8B space to the LLM2Vec space.

Architecture

  • Type: nn.Linear(4096, 4096, bias=True)
  • Parameters: ~16.8M (4096 × 4096 weights plus 4096 biases)
  • File size: ~64 MB (float32)
  • Input: Qwen3-8B mean-pooled text embeddings (float32)
  • Output: LLM2Vec-compatible embeddings (float32)
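The parameter count and file size listed above follow directly from the layer shape. A quick arithmetic check, assuming the checkpoint stores only the float32 weight matrix and bias with negligible serialization overhead (the constant names are illustrative):

```python
# Size of nn.Linear(4096, 4096, bias=True) stored as float32.
DIM = 4096
BYTES_PER_FLOAT32 = 4

n_weights = DIM * DIM          # weight matrix entries
n_biases = DIM                 # bias vector entries
n_params = n_weights + n_biases

size_bytes = n_params * BYTES_PER_FLOAT32
size_mib = size_bytes / 2**20  # mebibytes

print(n_params)    # 16781312, i.e. ~16.8M
print(size_bytes)  # 67125248
print(size_mib)    # 64.015625, i.e. ~64 MiB
```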

Training

The projection was trained by encoding ~1000 diverse motion descriptions with both LLM2Vec (teacher) and Qwen3-8B (student), then minimizing the MSE loss between the projected Qwen3-8B embeddings and the LLM2Vec target embeddings.

  • Teacher: LLM2Vec (Meta-Llama-3-8B-Instruct + MNTP + Supervised LoRA)
  • Student: Qwen3-8B (mean pooling)
  • Training texts: ~1000 motion descriptions covering walking, running, jumping, dancing, sports, combat, daily activities, sci-fi scenarios, horror/survival, and more
  • Optimizer: Adam, lr=1e-3, cosine annealing
  • Epochs: 200
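The recipe above can be sketched in PyTorch as follows. This is a minimal illustration, not the actual training script: it assumes the student and teacher embeddings have already been computed and stacked into two tensors, uses full-batch updates, and sets the cosine schedule's T_max to the number of epochs. The function name train_projection and these choices are assumptions.

```python
import torch
from torch import nn


def train_projection(student_emb, teacher_emb, epochs=200, lr=1e-3):
    """Fit a linear map from student embeddings to teacher embeddings.

    student_emb: (N, D_in) tensor, e.g. mean-pooled Qwen3-8B embeddings.
    teacher_emb: (N, D_out) tensor, e.g. LLM2Vec embeddings of the same texts.
    Full-batch training and T_max=epochs are assumptions of this sketch.
    """
    proj = nn.Linear(student_emb.shape[1], teacher_emb.shape[1], bias=True)
    optimizer = torch.optim.Adam(proj.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    loss_fn = nn.MSELoss()

    history = []
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(proj(student_emb), teacher_emb)  # MSE to teacher targets
        loss.backward()
        optimizer.step()
        scheduler.step()  # cosine-annealed learning rate, one step per epoch
        history.append(loss.item())
    return proj, history
```

With real data, student_emb and teacher_emb would each have shape (~1000, 4096), and the returned module's state could then be saved with torch.save.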

Usage

With Kimodo (automatic download)

The projection layer is downloaded automatically when TEXT_ENCODER=qwen3 is set:

# Linux / macOS
TEXT_ENCODER=qwen3 TEXT_ENCODER_MODE=local TEXT_ENCODER_DEVICE=cpu kimodo_gen "A person walks forward" --bvh
# Windows PowerShell
& {
  $env:TEXT_ENCODER="qwen3"
  $env:TEXT_ENCODER_MODE="local"
  $env:TEXT_ENCODER_DEVICE="cpu"
  kimodo_gen "A person walks forward" --bvh
}
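To run the same command from a Python script on either platform, one option is to pass the environment variables through subprocess. The helper names below (kimodo_env, kimodo_command, generate) are hypothetical; only the variable names, the kimodo_gen command, and the --bvh flag come from the commands above.

```python
import os
import subprocess


def kimodo_env(encoder="qwen3", mode="local", device="cpu"):
    """Copy the current environment and add the text-encoder settings."""
    env = dict(os.environ)
    env.update({
        "TEXT_ENCODER": encoder,
        "TEXT_ENCODER_MODE": mode,
        "TEXT_ENCODER_DEVICE": device,
    })
    return env


def kimodo_command(prompt):
    """Build the argument list for the kimodo_gen CLI shown above."""
    return ["kimodo_gen", prompt, "--bvh"]


def generate(prompt):
    """Run kimodo_gen; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(kimodo_command(prompt), env=kimodo_env(), check=True)


# Example (requires Kimodo to be installed):
# generate("A person walks forward")
```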

Manual download

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Qian2501/kimodo-qwen3-projection",
    filename="qwen3_8b_projection.pt",
)
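Once downloaded, the file could be applied roughly as follows. This sketch assumes the checkpoint is a plain state_dict for nn.Linear(4096, 4096) with weight and bias keys, and that pooling is a mask-aware mean over token positions; neither detail is confirmed by this card, so inspect the file and the Kimodo source before relying on it. The helper names load_projection and mean_pool are hypothetical.

```python
import torch
from torch import nn


def load_projection(path, dim=4096):
    """Load the projection, ASSUMING the file is an nn.Linear state_dict."""
    proj = nn.Linear(dim, dim, bias=True)
    state = torch.load(path, map_location="cpu", weights_only=True)
    proj.load_state_dict(state)  # fails loudly if the assumed format is wrong
    return proj.eval()


def mean_pool(hidden, mask):
    """Average token states over non-padding positions.

    hidden: (batch, tokens, dim) last-layer hidden states.
    mask:   (batch, tokens) attention mask, 1 for real tokens, 0 for padding.
    """
    mask = mask.unsqueeze(-1).to(hidden.dtype)           # (batch, tokens, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)


# Example, where `path` comes from hf_hub_download above and `hidden`, `mask`
# come from a Qwen3-8B forward pass:
# proj = load_projection(path)
# with torch.no_grad():
#     llm2vec_like = proj(mean_pool(hidden, mask).float())
```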

Train your own

python -m kimodo.scripts.train_text_projection \
  --base-model /path/to/Meta-Llama-3-8B-Instruct \
  --mntp-adapter /path/to/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp \
  --sup-adapter /path/to/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised \
  --output qwen3_8b_projection.pt --device cpu

See the Kimodo README for full details.

Related

  • Kimodo - Kinematic Motion Diffusion Model
  • Qwen3-8B - Source text encoder
  • LLM2Vec - Target embedding space