LongCat-Video-Avatar-1.5-bf16 (MLX)

Apple MLX bf16 weights for LongCat-Video-Avatar-1.5 — Meituan's audio-driven video diffusion model — with the DMD step-distillation LoRA published as a separate file for runtime merging. Use this variant if you want to switch between the 50-step base inference path and the 8-step DMD distilled path at runtime, or to experiment with custom LoRA strengths.

For the simpler default (DMD pre-merged, 8-step only), see mlx-community/LongCat-Video-Avatar-1.5-bf16-dmd-merged.

TL;DR


Architecture	Wan 2.1 VAE + umT5-XXL + Whisper-Large-v3 + 48-block Avatar DiT + separate DMD LoRA
Params	~13.6 B DiT + ~11 B umT5 + ~0.6 B Whisper encoder + 0.5 B VAE + 0.6 B LoRA
Format	bf16, sharded safetensors (HF-style per-component subdirs)
Disk	~46 GB (43 GB base + 2.5 GB LoRA)
Hardware	Apple Silicon M-series, 64 GB+ unified memory recommended for 480p
Inference	50-step base OR 8-step with on-demand DMD LoRA merge
License	MIT (matches upstream Meituan)

Quick start

# 1. Pull weights (~46 GB)
hf download mlx-community/LongCat-Video-Avatar-1.5-bf16 \
    --local-dir ./weights

# 2. Set up inference (Python 3.12)
git clone https://github.com/xocialize/longcat-avatar-mlx
cd longcat-avatar-mlx
python3.12 -m venv .venv
.venv/bin/pip install -e ".[parity]"
.venv/bin/pip install librosa Pillow imageio imageio-ffmpeg

# 3. Run end-to-end (base variant: pipeline merges LoRA on load)
.venv/bin/python scripts/run_inference.py \
    --weights ./weights/.. \
    --variant base \
    --num-frames 93 \
    --out output.mp4

Programmatic LoRA merge:

import mlx.core as mx
from safetensors import safe_open

from longcat_video_avatar.pipeline_mlx import LongCatAvatarPipeline

pipeline = LongCatAvatarPipeline(...)   # standard 4-component load

# Optionally merge the DMD LoRA at any strength.
# Read every tensor from the LoRA file and convert it to an MLX array.
lora_sd = {}
with safe_open("weights/lora/dmd_lora.safetensors", framework="numpy") as f:
    for k in f.keys():
        lora_sd[k] = mx.array(f.get_tensor(k))

result = pipeline.merge_dmd_lora(lora_sd, multiplier=1.0)
print(f"merged {len(result['applied'])} target modules")
# pipeline.dit is now DMD-ready; use 8 sampling steps.
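
For intuition about what the `multiplier` argument controls, here is a minimal sketch of a generic LoRA merge on plain Python lists. It assumes the textbook formulation `W' = W + multiplier * (up @ down)` with no extra alpha/rank scaling. `matmul` and `merge_lora_weight` are hypothetical helpers written for this illustration, and the actual `merge_dmd_lora` implementation may differ.

```python
# Generic LoRA merge on nested lists; illustrative only, not the pipeline's code.

def matmul(a, b):
    """Multiply an (m x r) matrix by an (r x n) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora_weight(weight, lora_up, lora_down, multiplier=1.0):
    """Return weight + multiplier * (lora_up @ lora_down).

    weight:    (out x in) base matrix
    lora_up:   (out x rank) low-rank factor
    lora_down: (rank x in) low-rank factor
    """
    delta = matmul(lora_up, lora_down)
    return [
        [w + multiplier * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


base = [[1.0, 0.0], [0.0, 1.0]]
up = [[1.0], [2.0]]      # rank-1 factor, shape (2 x 1)
down = [[0.5, 0.5]]      # rank-1 factor, shape (1 x 2)

full = merge_lora_weight(base, up, down, multiplier=1.0)
half = merge_lora_weight(base, up, down, multiplier=0.5)
```

In this formulation a multiplier of 0.0 leaves the base weights unchanged, and intermediate values scale the low-rank update linearly.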

Variants

Variant	DiT dtype	Disk	Sampling	Best for
bf16 (this card)	bf16 (+ separate LoRA)	46 GB	8-step or 50-step	runtime-merge / multi-strength experiments
bf16-dmd-merged	bf16	43 GB	8-step	64 GB+ Macs, recommended baseline
q4-dmd-merged	4-bit quantized	24 GB	8-step	32–48 GB Macs, comparable speed to bf16
q8-dmd-merged	8-bit quantized	31 GB	8-step	middle ground RAM / quality

Performance

Tested on Apple M5 Max (128 GB unified memory):

Mode	Sampling steps	Resolution	Frames	Wall clock
Without LoRA merge	50	256 × 432	29	~5–10 min (est.)
With LoRA merge	8	256 × 432	29	~105 s

LoRA merge cost (one-time at load): ~2–3 seconds for 336 module updates.
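
As a rough cross-check of the estimated 50-step figure, the measured 8-step run can be scaled linearly. This sketch assumes wall clock is proportional to the number of sampling steps. It ignores fixed costs (model load, encoding, VAE decode) and any per-step cost difference between the two paths, so treat the result as a ballpark only. `scale_wall_clock` is a hypothetical helper, not part of the repository.

```python
def scale_wall_clock(measured_seconds, measured_steps, target_steps):
    """Linearly scale a measured run time to a different step count."""
    return measured_seconds / measured_steps * target_steps


# Measured row from the table: 8 steps in ~105 s.
per_step_s = 105 / 8
estimate_s = scale_wall_clock(105, 8, 50)
print(f"{per_step_s:.1f} s/step, {estimate_s / 60:.1f} min for 50 steps")
```

Under these assumptions the 50-step path comes out at roughly 11 minutes, close to the upper end of the estimated range in the table.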

Layout

LongCat-Video-Avatar-1.5-bf16/
├── README.md                           # this file
├── pipeline_config.json
├── vae/                                # ~254 MB
├── text_encoder/                       # ~11 GB total, 3 shards
├── audio_encoder/                      # ~1.3 GB
├── dit/                                # ~33 GB total, 7 shards (BASE, no LoRA)
├── lora/
│   └── dmd_lora.safetensors            # ~2.5 GB (336 target modules)
├── scheduler/                          # FlowMatchEuler, shift=7.0
└── tokenizer/                          # umT5 tokenizer files
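
After downloading, a quick sanity check is to confirm that the expected entries exist on disk. The sketch below only tests for the directories and files named in the listing above. It does not validate shard counts or sizes, and `missing_entries` is a hypothetical helper rather than something the repository ships.

```python
from pathlib import Path

# Directory and file names taken from the layout listing above.
EXPECTED_DIRS = [
    "vae", "text_encoder", "audio_encoder", "dit",
    "lora", "scheduler", "tokenizer",
]
EXPECTED_FILES = ["pipeline_config.json", "lora/dmd_lora.safetensors"]


def missing_entries(root):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


gaps = missing_entries("./weights")
print("layout OK" if not gaps else f"missing: {gaps}")
```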

Source weights

Provenance, in case you want to verify or re-derive these weights:

Subdir	Source	Conversion
`vae/`	`meituan-longcat/LongCat-Video/vae/`	Conv3d weight transpose `(O,I,T,H,W)→(O,T,H,W,I)`; dtype passthrough
`text_encoder/`	`meituan-longcat/LongCat-Video/text_encoder/`	HF verbose → mlx-compact key rename; dtype cast to bf16
`audio_encoder/`	`meituan-longcat/LongCat-Video-Avatar-1.5/whisper-large-v3/model.safetensors`	`model.encoder.` prefix strip; Conv1d weight transpose; encoder-only
`dit/`	`meituan-longcat/LongCat-Video-Avatar-1.5/base_model/`	passthrough names; NO LoRA merge (base weights); adaLN_modulation weights kept at fp32
`lora/`	`meituan-longcat/LongCat-Video-Avatar-1.5/lora/dmd_lora.safetensors`	passthrough — Meituan's encoded names preserved; runtime loader decodes via `lora.decode_module_name`
`scheduler/`, `tokenizer/`	upstream	verbatim copy

Conversion recipe: recipes/convert_longcat_avatar.py. Run it with --variant base --out <dir> to reproduce these weights from Meituan's PyTorch checkpoints.
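
Two of the conversions in the table can be illustrated without loading any tensors: the `model.encoder.` prefix strip and the convolution weight transposes. The sketch below works on key names and shape tuples only. The toy checkpoint, the toy shapes, and the helper names are hypothetical. The Conv1d permutation `(O,I,K)→(O,K,I)` is not spelled out in the table and is assumed here from MLX's channels-last convolution layout.

```python
def strip_encoder_prefix(state_dict, prefix="model.encoder."):
    """Keep only keys under `prefix` and drop the prefix from their names."""
    return {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}


def permute_shape(shape, axes):
    """Return the shape a tensor would have after transposing with `axes`."""
    return tuple(shape[a] for a in axes)


# PyTorch Conv3d (O, I, T, H, W) -> channels-last (O, T, H, W, I)
CONV3D_AXES = (0, 2, 3, 4, 1)
# PyTorch Conv1d (O, I, K) -> channels-last (O, K, I); assumed, see lead-in
CONV1D_AXES = (0, 2, 1)

# Toy checkpoint: the values are placeholder strings standing in for tensors.
toy = {
    "model.encoder.conv1.weight": "enc-conv1",
    "model.encoder.layers.0.fc1.weight": "enc-fc1",
    "model.decoder.embed_tokens.weight": "dec-embed",
}
encoder_only = strip_encoder_prefix(toy)

# Toy shapes chosen so that every axis is distinguishable.
conv3d_shape = permute_shape((8, 4, 3, 5, 7), CONV3D_AXES)
conv1d_shape = permute_shape((1280, 128, 3), CONV1D_AXES)
```

With real arrays, the same axis tuples can be passed to a transpose call such as `mx.transpose(weight, CONV3D_AXES)`.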

Numerical conventions preserved from upstream

These match the merged variant. See the mlx-community/LongCat-Video-Avatar-1.5-bf16-dmd-merged card for the full list (_FP32 norms, velocity flip, disentangled CFG combiner, scheduler sentinel sigma fix).

License

MIT. Matches upstream Meituan LongCat-Video license. Full attribution in LICENSE.

Citation

@misc{longcat-avatar-mlx,
  title  = {longcat-avatar-mlx: Apple MLX port of LongCat-Video-Avatar-1.5},
  author = {xocialize},
  year   = {2026},
  url    = {https://github.com/xocialize/longcat-avatar-mlx},
}

@techreport{meituan2026longcat,
  title       = {LongCat-Video-Avatar 1.5 Technical Report},
  author      = {Meituan LongCat Team},
  institution = {Meituan},
  year        = {2026},
  url         = {https://github.com/meituan-longcat/LongCat-Video},
}
