Part of the LongCat-Video-Avatar 1.5 (MLX) collection.
LongCat-Video-Avatar-1.5-bf16 (MLX)
Apple MLX bf16 weights for LongCat-Video-Avatar-1.5 (Meituan's audio-driven video diffusion model), with the DMD step-distillation LoRA published as a separate file for runtime merging. Use this variant if you want to switch between the 50-step base inference path and the 8-step DMD distilled path at runtime, or to experiment with custom LoRA strengths.
For the simpler default (DMD pre-merged, 8-step only) see mlx-community/LongCat-Video-Avatar-1.5-bf16-dmd-merged.
TL;DR
| | |
|---|---|
| Architecture | Wan 2.1 VAE + umT5-XXL + Whisper-Large-v3 + 48-block Avatar DiT + separate DMD LoRA |
| Params | ~13.6 B DiT + ~11 B umT5 + ~0.6 B Whisper encoder + 0.5 B VAE + 0.6 B LoRA |
| Format | bf16, sharded safetensors (HF-style per-component subdirs) |
| Disk | ~46 GB (43 GB base + 2.5 GB LoRA) |
| Hardware | Apple Silicon M-series, 64 GB+ unified memory recommended for 480p |
| Inference | 50-step base OR 8-step with on-demand DMD LoRA merge |
| License | MIT (matches upstream Meituan) |
Quick start
```bash
# 1. Pull weights (~46 GB)
hf download mlx-community/LongCat-Video-Avatar-1.5-bf16 \
  --local-dir ./weights

# 2. Set up inference (Python 3.12)
git clone https://github.com/xocialize/longcat-avatar-mlx
cd longcat-avatar-mlx
python3.12 -m venv .venv
.venv/bin/pip install -e ".[parity]"
.venv/bin/pip install librosa Pillow imageio imageio-ffmpeg

# 3. Run end-to-end (base variant: pipeline merges LoRA on load)
.venv/bin/python scripts/run_inference.py \
  --weights ./weights/.. \
  --variant base \
  --num-frames 93 \
  --out output.mp4
```
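The download step can also be scripted from Python. The sketch below assumes the huggingface_hub package and its snapshot_download function; download_kwargs, download, and the components argument are hypothetical names introduced for this example, and the subdirectory patterns follow the Layout section.

```python
REPO_ID = "mlx-community/LongCat-Video-Avatar-1.5-bf16"


def download_kwargs(local_dir="./weights", components=None):
    """Build keyword arguments for huggingface_hub.snapshot_download.

    components is a hypothetical convenience argument: a list of top-level
    subdirectories (e.g. ["lora"]) to fetch instead of the whole repo.
    """
    kwargs = {"repo_id": REPO_ID, "local_dir": local_dir}
    if components:
        kwargs["allow_patterns"] = [f"{name}/*" for name in components]
    return kwargs


def download(local_dir="./weights", components=None):
    """Download the repo (or a subset) and return the local path."""
    # Imported lazily so download_kwargs works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(local_dir, components))
```

Calling download() fetches the full ~46 GB; download(components=["lora"]) would re-pull only the LoRA subdirectory.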
Programmatic LoRA merge:
```python
from longcat_video_avatar.pipeline_mlx import LongCatAvatarPipeline

pipeline = LongCatAvatarPipeline(...)  # standard 4-component load

# Optionally merge DMD LoRA at any strength
from safetensors import safe_open
import mlx.core as mx

lora_sd = {}
with safe_open("weights/lora/dmd_lora.safetensors", framework="numpy") as f:
    for k in f.keys():
        lora_sd[k] = mx.array(f.get_tensor(k))

result = pipeline.merge_dmd_lora(lora_sd, multiplier=1.0)
print(f"merged {len(result['applied'])} target modules")
# Now pipeline.dit is DMD-ready; use 8 sampling steps.
```
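For intuition about what merging at a given strength means, the sketch below applies the standard LoRA update W' = W + multiplier · scale · (B A) to a single weight matrix with NumPy. It illustrates the general technique only: whether merge_dmd_lora uses this exact scaling or tensor layout is an assumption, and merge_lora_weight, lora_down, lora_up, and scale are names chosen for this example.

```python
import numpy as np


def merge_lora_weight(weight, lora_down, lora_up, multiplier=1.0, scale=1.0):
    """Return weight + multiplier * scale * (lora_up @ lora_down).

    weight:    (out_features, in_features) base matrix
    lora_down: (rank, in_features), often called A
    lora_up:   (out_features, rank), often called B
    scale:     alpha / rank in the usual LoRA formulation (assumed 1.0 here)
    """
    delta = lora_up @ lora_down
    if delta.shape != weight.shape:
        raise ValueError(f"delta {delta.shape} does not match weight {weight.shape}")
    return weight + multiplier * scale * delta


rng = np.random.default_rng(0)
out_features, in_features, rank = 8, 6, 2
W = rng.standard_normal((out_features, in_features)).astype(np.float32)
A = rng.standard_normal((rank, in_features)).astype(np.float32)
B = rng.standard_normal((out_features, rank)).astype(np.float32)

full = merge_lora_weight(W, A, B, multiplier=1.0)
half = merge_lora_weight(W, A, B, multiplier=0.5)

# multiplier=0.5 lands halfway between the base and the fully merged weights.
print(np.allclose(half, (W + full) / 2, atol=1e-6))
```

Under this formulation, multiplier=0.0 leaves the base weights untouched, which is why keeping the LoRA separate allows switching back to the 50-step path.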
Variants
| Variant | DiT dtype | Disk | Sampling | Best for |
|---|---|---|---|---|
| bf16 (this card) | bf16 (+ separate LoRA) | 46 GB | 8-step or 50-step | runtime-merge / multi-strength experiments |
| bf16-dmd-merged | bf16 | 43 GB | 8-step | 64 GB+ Macs, recommended baseline |
| q4-dmd-merged | 4-bit quantized | 24 GB | 8-step | 32–48 GB Macs, comparable speed to bf16 |
| q8-dmd-merged | 8-bit quantized | 31 GB | 8-step | middle ground RAM / quality |
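The table can be read as a simple decision rule on unified memory. The sketch below encodes one such reading; the exact cut-offs (in particular treating the range between 48 and 64 GB as the q8 case) are assumptions rather than guidance from the card, and pick_variant is a hypothetical helper.

```python
def pick_variant(unified_memory_gb, want_runtime_lora=False):
    """Suggest a repo suffix from the Variants table.

    Thresholds are an assumed reading of the 'Best for' column.
    """
    if unified_memory_gb < 32:
        raise ValueError("the table lists no variant for less than 32 GB")
    if unified_memory_gb >= 64:
        # Only the bf16 card ships the DMD LoRA as a separate file.
        return "bf16" if want_runtime_lora else "bf16-dmd-merged"
    if unified_memory_gb > 48:
        return "q8-dmd-merged"  # assumed: "middle ground" means 48-64 GB
    return "q4-dmd-merged"      # 32-48 GB per the table


for ram in (36, 56, 128):
    print(ram, "GB:", pick_variant(ram))
```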
Performance
Tested on Apple M5 Max (128 GB unified memory):
| Mode | Sampling steps | Resolution | Frames | Wall clock |
|---|---|---|---|---|
| Without LoRA merge | 50 | 256 × 432 | 29 | ~5–10 min (est.) |
| With LoRA merge | 8 | 256 × 432 | 29 | ~105 s |
LoRA merge cost (one-time at load): ~2–3 seconds for 336 module updates.
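One way to picture the difference between 8 and 50 sampling steps is to look at the noise levels each path visits. The sketch below builds a shifted flow-matching sigma schedule with shift=7.0 (the value noted for scheduler/ in the Layout section) and defines a single Euler update. The shift formula is the one commonly used by flow-matching Euler schedulers; that this port uses exactly this spacing and sign convention is an assumption, and shifted_sigmas and euler_step are hypothetical names.

```python
def shifted_sigmas(num_steps, shift=7.0):
    """Return num_steps + 1 noise levels from 1.0 (pure noise) down to 0.0."""
    sigmas = []
    for i in range(num_steps + 1):
        s = 1.0 - i / num_steps  # linear spacing in [0, 1]
        sigmas.append(shift * s / (1.0 + (shift - 1.0) * s))
    return sigmas


def euler_step(x, velocity, sigma, sigma_next):
    """One Euler update along the predicted velocity.

    Sign conventions vary between implementations (the card mentions a
    velocity flip), so treat this as illustrative only.
    """
    return x + (sigma_next - sigma) * velocity


for steps in (8, 50):
    sig = shifted_sigmas(steps)
    print(steps, "steps, first sigmas:", [round(s, 3) for s in sig[:3]])
```

With shift greater than 1 the schedule spends more of its steps at high noise levels, so an 8-step run takes large jumps near the end of sampling.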
Layout
```
LongCat-Video-Avatar-1.5-bf16/
├── README.md                  # this file
├── pipeline_config.json
├── vae/                       # ~254 MB
├── text_encoder/              # ~11 GB total, 3 shards
├── audio_encoder/             # ~1.3 GB
├── dit/                       # ~33 GB total, 7 shards (BASE, no LoRA)
├── lora/
│   └── dmd_lora.safetensors   # ~2.5 GB (336 target modules)
├── scheduler/                 # FlowMatchEuler, shift=7.0
└── tokenizer/                 # umT5 tokenizer files
```
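After downloading, it can be useful to confirm that the local directory matches this layout before loading the weights. The sketch below checks only for the entries named in the listing; missing_entries and total_size_gb are hypothetical helpers, not part of longcat-avatar-mlx.

```python
from pathlib import Path

# Entries taken from the layout listing.
EXPECTED_DIRS = ["vae", "text_encoder", "audio_encoder", "dit", "lora",
                 "scheduler", "tokenizer"]
EXPECTED_FILES = ["pipeline_config.json", "lora/dmd_lora.safetensors"]


def missing_entries(root):
    """Return the expected paths that are absent under root."""
    root = Path(root)
    missing = [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


def total_size_gb(root):
    """Sum file sizes under root in decimal gigabytes."""
    files = (p for p in Path(root).rglob("*") if p.is_file())
    return sum(p.stat().st_size for p in files) / 1e9
```

For a complete download, missing_entries("./weights") should return an empty list and total_size_gb("./weights") should land near the ~46 GB quoted in the TL;DR.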
Source weights
Provenance, in case you want to verify or re-derive these weights:
| Subdir | Source | Conversion |
|---|---|---|
| vae/ | meituan-longcat/LongCat-Video/vae/ | Conv3d weight transpose (O,I,T,H,W)→(O,T,H,W,I); dtype passthrough |
| text_encoder/ | meituan-longcat/LongCat-Video/text_encoder/ | HF verbose → mlx-compact key rename; dtype cast to bf16 |
| audio_encoder/ | meituan-longcat/LongCat-Video-Avatar-1.5/whisper-large-v3/model.safetensors | model.encoder. prefix strip; Conv1d weight transpose; encoder-only |
| dit/ | meituan-longcat/LongCat-Video-Avatar-1.5/base_model/ | passthrough names; NO LoRA merge (base weights); adaLN_modulation weights kept at fp32 |
| lora/ | meituan-longcat/LongCat-Video-Avatar-1.5/lora/dmd_lora.safetensors | passthrough; Meituan's encoded names preserved; runtime loader decodes via lora.decode_module_name |
| scheduler/, tokenizer/ | upstream | verbatim copy |
Conversion recipe: recipes/convert_longcat_avatar.py. Run it with --variant base --out <dir> to reproduce these weights from Meituan's PyTorch sources.
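Two of the conversions listed in the table are simple enough to show directly. The sketch below uses NumPy to illustrate the Conv3d layout change (O,I,T,H,W)→(O,T,H,W,I) and the model.encoder. prefix strip. It is not the code in recipes/convert_longcat_avatar.py; the function names and sample keys are made up for the example.

```python
import numpy as np


def conv3d_to_channels_last(weight):
    """Move the input-channel axis last: (O, I, T, H, W) -> (O, T, H, W, I)."""
    if weight.ndim != 5:
        raise ValueError("expected a 5-D Conv3d weight")
    return np.transpose(weight, (0, 2, 3, 4, 1))


def strip_prefix(state_dict, prefix="model.encoder."):
    """Keep only keys under prefix and drop the prefix (encoder-only export)."""
    return {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}


w = np.zeros((16, 8, 3, 5, 5), dtype=np.float32)  # O=16, I=8, T=3, H=W=5
print(conv3d_to_channels_last(w).shape)  # (16, 3, 5, 5, 8)

# Sample keys are illustrative, not taken from the real checkpoint.
sd = {"model.encoder.conv1.weight": 1, "model.decoder.embed.weight": 2}
print(strip_prefix(sd))  # {'conv1.weight': 1}
```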
Numerical conventions preserved from upstream
These are the same as in the merged variant. See the dmd-merged card for the full list (_FP32 norms, velocity flip, disentangled CFG combiner, scheduler sentinel sigma fix).
License
MIT. Matches upstream Meituan LongCat-Video license. Full attribution in LICENSE.
Citation
@misc{longcat-avatar-mlx,
title = {longcat-avatar-mlx: Apple MLX port of LongCat-Video-Avatar-1.5},
author = {xocialize},
year = {2026},
url = {https://github.com/xocialize/longcat-avatar-mlx},
}
@techreport{meituan2026longcat,
title = {LongCat-Video-Avatar 1.5 Technical Report},
author = {Meituan LongCat Team},
institution = {Meituan},
year = {2026},
url = {https://github.com/meituan-longcat/LongCat-Video},
}