Part of the LongCat-Video-Avatar 1.5 - MLX collection.

LongCat-Video-Avatar-1.5-bf16 (MLX)

Apple MLX bf16 weights for LongCat-Video-Avatar-1.5, Meituan's audio-driven video diffusion model, with the DMD step-distillation LoRA published as a separate file for runtime merging. Use this variant if you want to switch between the 50-step base inference path and the 8-step DMD-distilled path at runtime, or to experiment with custom LoRA strengths.

For the simpler default (DMD pre-merged, 8-step only) see mlx-community/LongCat-Video-Avatar-1.5-bf16-dmd-merged.

TL;DR

Architecture: Wan 2.1 VAE + umT5-XXL + Whisper-Large-v3 + 48-block Avatar DiT + separate DMD LoRA
Params: ~13.6 B DiT + ~11 B umT5 + ~0.6 B Whisper encoder + 0.5 B VAE + 0.6 B LoRA
Format: bf16, sharded safetensors (HF-style per-component subdirs)
Disk: ~46 GB (43 GB base + 2.5 GB LoRA)
Hardware: Apple Silicon M-series, 64 GB+ unified memory recommended for 480p
Inference: 50-step base OR 8-step with on-demand DMD LoRA merge
License: MIT (matches upstream Meituan)
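
Because the full download is about 46 GB, a free-space check before pulling can save a failed transfer. The following is a minimal sketch using only the Python standard library; the helper name `has_free_space` is hypothetical and not part of the inference repository, and using the card's 46 GB figure as the threshold leaves no headroom for temporary files.

```python
import shutil


def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least
    `required_gb` decimal gigabytes free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1e9


# 46 GB is the approximate size quoted on this card.
print("enough room for the bf16 weights:", has_free_space(".", 46))
```

In practice it is probably worth asking for somewhat more than the nominal size.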

Quick start

# 1. Pull weights (~46 GB)
hf download mlx-community/LongCat-Video-Avatar-1.5-bf16 \
    --local-dir ./weights

# 2. Set up inference (Python 3.12)
git clone https://github.com/xocialize/longcat-avatar-mlx
cd longcat-avatar-mlx
python3.12 -m venv .venv
.venv/bin/pip install -e ".[parity]"
.venv/bin/pip install librosa Pillow imageio imageio-ffmpeg

# 3. Run end-to-end (base variant: pipeline merges LoRA on load)
.venv/bin/python scripts/run_inference.py \
    --weights ./weights/.. \
    --variant base \
    --num-frames 93 \
    --out output.mp4
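
Both frame counts that appear on this card (93 here, 29 in the benchmarks) have the form 4k + 1. That is consistent with the 4x temporal compression of the Wan 2.1 VAE, where F input frames map to (F - 1) / 4 + 1 latent frames. The sketch below checks a candidate `--num-frames` value under that assumption; whether `run_inference.py` rejects, rounds, or pads other values is not stated here, and `latent_frames` is a hypothetical helper.

```python
TEMPORAL_STRIDE = 4  # assumption: Wan 2.1 VAE compresses time by 4x


def latent_frames(num_frames: int, stride: int = TEMPORAL_STRIDE) -> int:
    """Map a pixel-space frame count to a latent frame count.

    Raises ValueError if num_frames is not of the form stride * k + 1.
    """
    if num_frames < 1 or (num_frames - 1) % stride != 0:
        raise ValueError(
            f"num_frames={num_frames} is not of the form {stride}*k + 1"
        )
    return (num_frames - 1) // stride + 1


for n in (29, 93):
    print(n, "frames ->", latent_frames(n), "latent frames")
```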

Programmatic LoRA merge:

import mlx.core as mx

from longcat_video_avatar.pipeline_mlx import LongCatAvatarPipeline

pipeline = LongCatAvatarPipeline(...)   # standard 4-component load

# Optionally merge the DMD LoRA at any strength.
# mx.load reads a .safetensors file directly into a dict of mx.array,
# which avoids a NumPy round trip (NumPy has no native bfloat16 dtype).
lora_sd = mx.load("weights/lora/dmd_lora.safetensors")
result = pipeline.merge_dmd_lora(lora_sd, multiplier=1.0)
print(f"merged {len(result['applied'])} target modules")
# pipeline.dit is now DMD-ready; use 8 sampling steps.
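
For intuition about what the `multiplier` argument controls: a LoRA merge conventionally adds a scaled low-rank product to each target weight, W' = W + multiplier * (up @ down). The toy sketch below shows that arithmetic on plain Python lists. It is not the code behind `merge_dmd_lora`, whose internals (including any alpha/rank scaling and the key decoding) are not shown on this card; `matmul` and `merge_lora` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def merge_lora(weight, lora_down, lora_up, multiplier=1.0):
    """Return weight + multiplier * (lora_up @ lora_down).

    Shapes: weight (out, in), lora_down (rank, in), lora_up (out, rank).
    """
    delta = matmul(lora_up, lora_down)
    return [[w + multiplier * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy rank-1 example on a 2x2 identity weight.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_up = [[1.0], [2.0]]      # (out=2, rank=1)
lora_down = [[3.0, 4.0]]      # (rank=1, in=2)

merged = merge_lora(weight, lora_down, lora_up, multiplier=0.5)
print(merged)
```

With `multiplier=0.0` the weights are returned unchanged, which corresponds to the 50-step base path; intermediate values give the "custom LoRA strength" experiments this variant is meant for.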

Variants

| Variant | DiT dtype | Disk | Sampling | Best for |
| --- | --- | --- | --- | --- |
| bf16 (this card) | bf16 (+ separate LoRA) | 46 GB | 8-step or 50-step | runtime-merge / multi-strength experiments |
| bf16-dmd-merged | bf16 | 43 GB | 8-step | 64 GB+ Macs, recommended baseline |
| q4-dmd-merged | 4-bit quantized | 24 GB | 8-step | 32–48 GB Macs, comparable speed to bf16 |
| q8-dmd-merged | 8-bit quantized | 31 GB | 8-step | middle ground RAM / quality |
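
The guidance in the variants table can be turned into a small lookup. This is only a sketch: the table gives explicit memory ranges for the bf16 and q4 pre-merged variants, but none for q8, so placing q8 between 48 GB and 64 GB is an assumption made here for illustration. `suggest_variant` is a hypothetical helper and covers only the pre-merged variants.

```python
from typing import Optional


def suggest_variant(unified_memory_gb: int) -> Optional[str]:
    """Suggest a pre-merged variant name for a given unified memory size.

    Returns None below 32 GB, where the table offers no recommendation.
    """
    if unified_memory_gb >= 64:
        return "bf16-dmd-merged"   # table: 64 GB+ Macs
    if unified_memory_gb > 48:
        return "q8-dmd-merged"     # assumption: "middle ground" slot
    if unified_memory_gb >= 32:
        return "q4-dmd-merged"     # table: 32-48 GB Macs
    return None


for mem in (32, 48, 64, 128):
    print(mem, "GB ->", suggest_variant(mem))
```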

Performance

Tested on Apple M5 Max (128 GB unified memory):

| Mode | Sampling steps | Resolution | Frames | Wall clock |
| --- | --- | --- | --- | --- |
| Without LoRA merge | 50 | 256 × 432 | 29 | ~5–10 min (est.) |
| With LoRA merge | 8 | 256 × 432 | 29 | ~105 s |

LoRA merge cost (one-time at load): ~2–3 seconds for 336 module updates.
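
As a rough sanity check on the 50-step estimate, the measured 8-step run can be extrapolated linearly. The sketch below does only that arithmetic. It folds fixed costs (model load, text/audio encoding, VAE decode) into the per-step figure and assumes both paths cost the same per step, neither of which is confirmed by this card, so treat the result as a ballpark.

```python
distilled_steps = 8
distilled_wall_s = 105.0   # measured 8-step wall clock from this card
base_steps = 50

per_step_s = distilled_wall_s / distilled_steps   # 13.125 s per step
step_ratio = base_steps / distilled_steps         # 6.25x more steps
naive_base_s = per_step_s * base_steps            # 656.25 s

print(f"per-step cost:   {per_step_s:.3f} s")
print(f"step ratio:      {step_ratio:.2f}x")
print(f"naive 50-step:   {naive_base_s:.2f} s (~{naive_base_s / 60:.1f} min)")
```

The naive figure of about 10.9 minutes sits just above the top of the estimated range, which is plausible given that the per-step cost used here includes fixed overhead.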

Layout

LongCat-Video-Avatar-1.5-bf16/
├── README.md                           # this file
├── pipeline_config.json
├── vae/                                # ~254 MB
├── text_encoder/                       # ~11 GB total, 3 shards
├── audio_encoder/                      # ~1.3 GB
├── dit/                                # ~33 GB total, 7 shards (BASE, no LoRA)
├── lora/
│   └── dmd_lora.safetensors            # ~2.5 GB (336 target modules)
├── scheduler/                          # FlowMatchEuler, shift=7.0
└── tokenizer/                          # umT5 tokenizer files
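
After downloading, the layout above can be checked before loading anything heavy. This is a minimal sketch using only the standard library; `missing_entries` is a hypothetical helper, and it verifies only the top-level names shown in the tree, not shard counts or file contents.

```python
from pathlib import Path

# Names taken from the layout tree on this card.
EXPECTED_SUBDIRS = (
    "vae", "text_encoder", "audio_encoder", "dit",
    "lora", "scheduler", "tokenizer",
)
EXPECTED_FILES = ("pipeline_config.json", "lora/dmd_lora.safetensors")


def missing_entries(root) -> list:
    """Return the expected directories and files that are absent under root."""
    root = Path(root)
    missing = [d + "/" for d in EXPECTED_SUBDIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


# An empty list means every expected entry was found.
print(missing_entries("./weights"))
```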

Source weights

Provenance, in case you want to verify or re-derive these weights:

| Subdir | Source | Conversion |
| --- | --- | --- |
| vae/ | meituan-longcat/LongCat-Video/vae/ | Conv3d weight transpose (O,I,T,H,W)→(O,T,H,W,I); dtype passthrough |
| text_encoder/ | meituan-longcat/LongCat-Video/text_encoder/ | HF verbose → mlx-compact key rename; dtype cast to bf16 |
| audio_encoder/ | meituan-longcat/LongCat-Video-Avatar-1.5/whisper-large-v3/model.safetensors | model.encoder. prefix strip; Conv1d weight transpose; encoder-only |
| dit/ | meituan-longcat/LongCat-Video-Avatar-1.5/base_model/ | passthrough names; NO LoRA merge (base weights); adaLN_modulation weights kept at fp32 |
| lora/ | meituan-longcat/LongCat-Video-Avatar-1.5/lora/dmd_lora.safetensors | passthrough; Meituan's encoded names preserved; runtime loader decodes via lora.decode_module_name |
| scheduler/, tokenizer/ | upstream | verbatim copy |
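
Two of the conversions in the table above are easy to illustrate without loading any weights: the axis permutation for conv kernels and the `model.encoder.` prefix strip. The sketch below is not taken from the conversion recipe. The Conv3d axis order comes from the table; the Conv1d order (O, I, K) to (O, K, I) is an assumption based on MLX's channels-last weight layout; the example shapes and key names are illustrative only.

```python
# PyTorch -> MLX axis orders for conv weights.
CONV3D_AXES = (0, 2, 3, 4, 1)  # (O, I, T, H, W) -> (O, T, H, W, I), per the table
CONV1D_AXES = (0, 2, 1)        # (O, I, K) -> (O, K, I), assumed


def permuted_shape(shape, axes):
    """Shape that results from transposing an array of `shape` by `axes`."""
    return tuple(shape[a] for a in axes)


def strip_encoder_prefix(keys, prefix="model.encoder."):
    """Keep only keys under `prefix` and drop the prefix (encoder-only)."""
    return [k[len(prefix):] for k in keys if k.startswith(prefix)]


print(permuted_shape((8, 3, 2, 5, 7), CONV3D_AXES))
print(permuted_shape((16, 4, 3), CONV1D_AXES))
print(strip_encoder_prefix([
    "model.encoder.conv1.weight",
    "model.decoder.embed_tokens.weight",
]))
```

With an array library, the same axis tuples would be passed to its transpose function (for example `mx.transpose(w, CONV3D_AXES)` in MLX).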

Conversion recipe: recipes/convert_longcat_avatar.py. Run with --variant base --out <dir> to reproduce these weights from Meituan's PT sources.

Numerical conventions preserved from upstream

Same as the merged variant. See the dmd-merged card for the full list (_FP32 norms, velocity flip, disentangled CFG combiner, scheduler sentinel sigma fix).

License

MIT. Matches upstream Meituan LongCat-Video license. Full attribution in LICENSE.

Citation

@misc{longcat-avatar-mlx,
  title  = {longcat-avatar-mlx: Apple MLX port of LongCat-Video-Avatar-1.5},
  author = {xocialize},
  year   = {2026},
  url    = {https://github.com/xocialize/longcat-avatar-mlx},
}

@techreport{meituan2026longcat,
  title       = {LongCat-Video-Avatar 1.5 Technical Report},
  author      = {Meituan LongCat Team},
  institution = {Meituan},
  year        = {2026},
  url         = {https://github.com/meituan-longcat/LongCat-Video},
}