Kimodo Qwen3 Projection Layer / Kimodo Qwen3 投影层
Pre-trained linear projection layer that maps Qwen3-8B text embeddings into the LLM2Vec embedding space used by the Kimodo motion diffusion model.
预训练的线性投影层,将 Qwen3-8B 的文本 embedding 映射到 Kimodo 运动扩散模型所使用的 LLM2Vec embedding 空间。
Why is this needed? / 为什么需要这个?
Kimodo's denoiser was trained on LLM2Vec embeddings (based on Llama-3-8B). Qwen3-8B produces embeddings of the same dimension (4096), but they lie in a different semantic space, so substituting the encoder directly produces poor motion quality. This projection layer bridges the gap with a learned linear transformation from the Qwen3-8B space to the LLM2Vec space.
Kimodo 的 denoiser 是用 LLM2Vec(基于 Llama-3-8B)的 embedding 训练的。Qwen3-8B 虽然输出相同维度(4096)的 embedding,但语义空间完全不同。直接替换编码器会导致生成质量很差。此投影层通过学习两个空间之间的线性变换来弥合这一差距。
Architecture / 架构
- Type: nn.Linear(4096, 4096, bias=True)
- Parameters: ~16.8M
- File size: ~64 MB
- Input: Qwen3-8B mean-pooled text embeddings (float32)
- Output: LLM2Vec-compatible embeddings (float32)
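The parameter count and file size above follow from the layer shape. The arithmetic below is a quick check, assuming the weights are stored as float32 (4 bytes each), which matches the stated float32 input and output but is not confirmed for the checkpoint itself. The variable names are illustrative only.

```python
# Shape of nn.Linear(4096, 4096, bias=True) as listed above.
in_features = 4096
out_features = 4096

weight_params = in_features * out_features   # one weight per input/output pair
bias_params = out_features                   # one bias per output dimension
total_params = weight_params + bias_params

# ASSUMPTION: weights are stored as float32, i.e. 4 bytes per parameter.
bytes_fp32 = total_params * 4
size_mib = bytes_fp32 / 2**20

print(f"parameters: {total_params:,}")       # 16,781,312, i.e. ~16.8M
print(f"size: {size_mib:.2f} MiB")           # 64.02 MiB, i.e. ~64 MB
```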
Training / 训练方法
The projection was trained by encoding ~1000 diverse motion descriptions with both LLM2Vec (teacher) and Qwen3-8B (student), then minimizing the MSE loss between the projected Qwen3 embeddings and the LLM2Vec target embeddings.
投影层的训练方法:用 LLM2Vec(teacher)和 Qwen3-8B(student)分别对约 1000 条多样化的动作描述编码,然后最小化投影后的 Qwen3 embedding 与 LLM2Vec 目标 embedding 之间的 MSE 损失。
- Teacher: LLM2Vec (Meta-Llama-3-8B-Instruct + MNTP + Supervised LoRA)
- Student: Qwen3-8B (mean pooling)
- Training texts: ~1000 motion descriptions covering walking, running, jumping, dancing, sports, combat, daily activities, sci-fi scenarios, horror/survival, and more
- Optimizer: Adam, lr=1e-3, cosine annealing
- Epochs: 200
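The recipe above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the actual training script: it assumes the teacher and student embeddings have already been computed, uses full-batch updates (the batch size is not stated above), and runs on small random tensors in place of real 4096-dimensional embeddings. The function name train_projection is hypothetical.

```python
import torch
from torch import nn


def train_projection(student_emb, teacher_emb, epochs=200, lr=1e-3):
    """Fit a linear map from student to teacher embeddings with MSE loss."""
    proj = nn.Linear(student_emb.shape[1], teacher_emb.shape[1], bias=True)
    optimizer = torch.optim.Adam(proj.parameters(), lr=lr)
    # Cosine annealing of the learning rate over the whole run.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    loss_fn = nn.MSELoss()

    losses = []
    for _ in range(epochs):
        optimizer.zero_grad()
        # ASSUMPTION: full-batch update; the real script may use mini-batches.
        loss = loss_fn(proj(student_emb), teacher_emb)
        loss.backward()
        optimizer.step()
        scheduler.step()
        losses.append(loss.item())
    return proj, losses


# Toy stand-ins for the real embeddings (64 texts, 16 dimensions).
torch.manual_seed(0)
student = torch.randn(64, 16)
teacher = student @ torch.randn(16, 16) + 0.1

proj, losses = train_projection(student, teacher)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

With real data, student_emb and teacher_emb would be the Qwen3-8B and LLM2Vec embeddings of the same texts, each of shape (num_texts, 4096).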
Usage / 使用方法
With Kimodo (automatic download) / 使用 Kimodo(自动下载)
The projection layer is downloaded automatically when using TEXT_ENCODER=qwen3:
使用 TEXT_ENCODER=qwen3 时,投影层会自动下载:
# Linux / macOS
TEXT_ENCODER=qwen3 TEXT_ENCODER_MODE=local TEXT_ENCODER_DEVICE=cpu kimodo_gen "A person walks forward" --bvh
# Windows PowerShell
& {
$env:TEXT_ENCODER="qwen3"
$env:TEXT_ENCODER_MODE="local"
$env:TEXT_ENCODER_DEVICE="cpu"
kimodo_gen "A person walks forward" --bvh
}
Manual download / 手动下载
from huggingface_hub import hf_hub_download
path = hf_hub_download(
repo_id="Qian2501/kimodo-qwen3-projection",
filename="qwen3_8b_projection.pt",
)
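Once downloaded, the file can presumably be loaded with PyTorch and applied to mean-pooled Qwen3-8B embeddings. The sketch below assumes the checkpoint is a plain state_dict with weight and bias keys for an nn.Linear layer, which this document does not confirm; inspect the file if loading fails. The helper names load_projection and project are hypothetical.

```python
import torch
from torch import nn


def load_projection(path, dim=4096):
    """Build the linear layer and load weights from a checkpoint file."""
    # ASSUMPTION: the file stores a state_dict with "weight" and "bias" keys.
    state = torch.load(path, map_location="cpu")
    proj = nn.Linear(dim, dim, bias=True)
    proj.load_state_dict(state)
    proj.eval()
    return proj


def project(proj, qwen_emb):
    """Map mean-pooled Qwen3 embeddings of shape (batch, dim) to the target space."""
    with torch.no_grad():
        return proj(qwen_emb.to(torch.float32))
```

Under these assumptions, `proj = load_projection(path)` followed by `project(proj, embeddings)` would return LLM2Vec-compatible embeddings of the same shape as the input.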
Train your own / 自行训练
python -m kimodo.scripts.train_text_projection \
--base-model /path/to/Meta-Llama-3-8B-Instruct \
--mntp-adapter /path/to/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp \
--sup-adapter /path/to/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised \
--output qwen3_8b_projection.pt --device cpu
See the Kimodo README for full details.
详细说明请参阅 Kimodo README。