📂 Part of the Lance MLX collection on mlx-community.

Wan2.2-VAE-Lance-bf16 (MLX)

MLX port of the 48-channel Wan2.2 3D causal VAE bundled with ByteDance's Lance unified multimodal model. Converted to bf16 for Apple Silicon. ~705 M parameters, encoder + decoder in a single safetensors file.

⚠️ This is NOT the public wan2.2_vae.safetensors (which is 16-channel and incompatible with Lance). This is the re-trained 48-channel variant Lance ships with, required by both Lance image and video pipelines.

Status

🟢 Production-ready as of 2026-05-21. Roundtrip MAD ≈ 7/255 in u8 domain on real photographs at 768².

| Component | Status |
| --- | --- |
| Encoder (Wan22VAEEncoder) | ✅ Loads cleanly, 86 keys mapped |
| Decoder (Wan22VAEDecoder) | ✅ Loads cleanly, 110 keys mapped |
| Streaming feature cache (1+4+4+… chunked encode) | ✅ Per-conv feat_cache works across temporal chunks |
| Per-channel latent normalization (VAE22_MEAN, VAE22_STD) | ✅ 48-channel mean/std applied after encode, reversed before decode |
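
The 1+4+4+… schedule for the streaming encode can be illustrated with a small sketch. It assumes the first frame is encoded on its own and every later chunk holds up to four frames; `temporal_chunks` is a hypothetical helper written for this note, not part of mlx-video.

```python
def temporal_chunks(num_frames, first=1, step=4):
    """Return (start, end) frame ranges for a 1+4+4+... schedule (hypothetical helper)."""
    if num_frames <= 0:
        return []
    # The first chunk holds a single frame, matching the causal design.
    chunks = [(0, min(first, num_frames))]
    start = chunks[0][1]
    # Every following chunk holds up to `step` frames.
    while start < num_frames:
        end = min(start + step, num_frames)
        chunks.append((start, end))
        start = end
    return chunks

print(temporal_chunks(9))
```

In a streaming encode, each range would be passed to the encoder in order while the per-conv feature cache carries state from one chunk to the next.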

Quickstart

With the mlx-video module installed (it provides the Wan22VAEEncoder / Wan22VAEDecoder classes), download the weights and load the encoder and decoder:

from huggingface_hub import snapshot_download
import mlx.core as mx
from mlx_video.models.wan_2.vae22 import (
    Wan22VAEEncoder, Wan22VAEDecoder, denormalize_latents,
)

repo = snapshot_download("mlx-community/Wan2.2-VAE-Lance-bf16")
weights = mx.load(f"{repo}/vae.safetensors")

enc = Wan22VAEEncoder(z_dim=48, dim=160)
enc.load_weights([
    (k, v) for k, v in weights.items()
    if k.startswith("encoder.") or k.startswith("conv1.")
])
mx.eval(enc.parameters())

dec = Wan22VAEDecoder(z_dim=48, dim=160, dec_dim=256)
dec.load_weights([
    (k, v) for k, v in weights.items()
    if k.startswith("decoder.") or k.startswith("conv2.")
])
mx.eval(dec.parameters())
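
Before loading, it can help to confirm that the two prefix filters together claim every key in the checkpoint. The sketch below uses only the prefixes shown above; `split_vae_keys` and the toy key names are illustrative assumptions, not part of mlx-video.

```python
def split_vae_keys(keys):
    """Partition checkpoint keys into encoder-side, decoder-side, and unclaimed."""
    keys = list(keys)
    enc_prefixes = ("encoder.", "conv1.")
    dec_prefixes = ("decoder.", "conv2.")
    enc = [k for k in keys if k.startswith(enc_prefixes)]
    dec = [k for k in keys if k.startswith(dec_prefixes)]
    other = [k for k in keys if not k.startswith(enc_prefixes + dec_prefixes)]
    return enc, dec, other

# Toy key names for illustration only; real names come from vae.safetensors.
demo = ["encoder.a.weight", "conv1.weight", "decoder.b.weight", "conv2.bias", "stray.key"]
enc_keys, dec_keys, other_keys = split_vae_keys(demo)
print(len(enc_keys), len(dec_keys), other_keys)
```

With the real checkpoint, `split_vae_keys(weights.keys())` should give counts that match the 86 and 110 keys listed under Status, with nothing left unclaimed.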

Encode an image

import numpy as np
from PIL import Image

img = Image.open("photo.jpg").convert("RGB").resize((768, 768))
arr = np.asarray(img, dtype=np.float32) / 127.5 - 1.0     # [-1, 1]
x = mx.array(arr[None, None, ...])                        # (1, 1, H, W, 3)
z = enc(x)                                                # (1, 1, 48, 48, 48)
print("latent shape:", z.shape)
# mean ≈ -0.07, std ≈ 0.60 (per-channel normalized)

Decode a latent

z_denorm = denormalize_latents(z)                          # apply per-channel std/mean
decoded = dec(z_denorm)                                    # (1, T'>=1, H', W', 3) in [-1, 1]
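# Cast to float32 first: NumPy has no bfloat16 dtype, so a bf16 output would not convert.
out_img = ((np.array(decoded[0, 0].astype(mx.float32)) + 1.0) * 127.5).clip(0, 255).astype(np.uint8)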
Image.fromarray(out_img).save("roundtrip.png")
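
The per-channel normalization that denormalize_latents reverses can be sketched in NumPy. This assumes the common convention of normalizing as (z - mean) / std on a channels-last latent. The mean and std values below are random placeholders, not the real VAE22_MEAN / VAE22_STD, and the function names are hypothetical.

```python
import numpy as np

def normalize(z, mean, std):
    # z is channels-last (..., C); mean and std have shape (C,) and broadcast.
    return (z - mean) / std

def denormalize(z, mean, std):
    # Exact inverse of normalize, applied before decoding.
    return z * std + mean

rng = np.random.default_rng(0)
mean = rng.normal(size=48).astype(np.float32)            # placeholder statistics
std = rng.uniform(0.5, 2.0, size=48).astype(np.float32)  # placeholder statistics
raw = rng.normal(size=(1, 1, 48, 48, 48)).astype(np.float32)

z = normalize(raw, mean, std)
restored = denormalize(z, mean, std)
print(z.shape, np.allclose(restored, raw, atol=1e-5))
```

The point of the sketch is only that the two steps are inverses; the real statistics ship with the mlx-video module.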

Architecture

  • Spatial downsample: 16× per axis (H, W ÷ 16 → latent grid)
  • Temporal downsample: 4× (T → 1 + (T − 1)/4 for T = 4k + 1; the first frame is encoded on its own, causal padding)
  • Latent channels (z_dim): 48 (vs. 16 in the public Wan2.2)
  • Encoder feature dim: 160; Decoder feature dim: 256
  • Encoder topology: dim_mult=(1,2,4,4), num_res_blocks=2, temperal_downsample=(False, True, True)
  • Decoder topology: dim_mult=(1,2,4,4), num_res_blocks=2, temperal_upsample=(True, True, False)
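
The downsampling factors above determine the latent grid for a given input. The sketch below assumes the 1+4+4+… chunking from the Status section, so that T = 4k + 1 frames map to k + 1 latent frames, and it assumes H and W are multiples of 16. `latent_shape` is a hypothetical helper, not an mlx-video function.

```python
def latent_shape(frames, height, width, z_dim=48, spatial=16, temporal=4):
    """Return the channels-last latent shape (T', H', W', C) for one clip."""
    if frames < 1 or (frames - 1) % temporal != 0:
        raise ValueError("expected frames = 4k + 1 (1, 5, 9, ...)")
    if height % spatial or width % spatial:
        raise ValueError("height and width must be multiples of 16")
    return (1 + (frames - 1) // temporal, height // spatial, width // spatial, z_dim)

print(latent_shape(1, 768, 768))    # (1, 48, 48, 48)
print(latent_shape(17, 480, 832))   # (5, 30, 52, 48)
```

The first call matches the (1, 1, 48, 48, 48) latent from the Quickstart once the batch axis is added.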

Roundtrip quality

| Input | Output dims | Per-pixel MAD ([0, 255] u8) | Max abs error |
| --- | --- | --- | --- |
| 768² photo (edit_img.jpg, painting) | 768² | 7.36 / 255 | 0.82 / 1.0 |

Loaded in ~0.3 s on M5 Max 128 GB; encode 0.55 s, decode 1.65 s.
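
A roundtrip figure like the one above can be computed along these lines. The sketch assumes MAD means the mean absolute difference over all pixels and channels of the two uint8 images. It reports the maximum error in the same u8 units, which is not the scale used in the table's last column. `roundtrip_errors` is a hypothetical helper.

```python
import numpy as np

def roundtrip_errors(original_u8, decoded_u8):
    """Return (mean abs difference, max abs difference) in [0, 255] units."""
    # Widen to int16 so the subtraction cannot wrap around in uint8.
    a = original_u8.astype(np.int16)
    b = decoded_u8.astype(np.int16)
    diff = np.abs(a - b)
    return float(diff.mean()), int(diff.max())

# Toy images: every pixel differs by exactly 7.
original = np.zeros((4, 4, 3), dtype=np.uint8)
decoded = np.full((4, 4, 3), 7, dtype=np.uint8)
mad, max_err = roundtrip_errors(original, decoded)
print(mad, max_err)
```

With the Quickstart variables, the inputs would be the resized source image as a uint8 array and `out_img`.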

Files in this repo

| File | Size | Notes |
| --- | --- | --- |
| vae.safetensors | 1.41 GB | Encoder + decoder, bf16 (197 keys including conv1, conv2, encoder.*, decoder.*) |
| vae_conversion_report.json | | PyTorch → MLX conversion provenance: 62 conv3d + 10 conv2d + 50 RMS gamma + 2 attn-norm gamma + 170 renamed + 72 other |

Provenance

Source: bytedance-research/Lance/Wan2.2_VAE.pth (PyTorch, 704.7 M params, 196 tensors after splitting nested modules). Converted via scripts/06_convert_wan_vae.py which:

  • Transposes PyTorch conv3d weights (out, in, T, H, W) → (out, T, H, W, in) for the MLX channels-last convention.
  • Transposes conv2d weights likewise, (out, in, H, W) → (out, H, W, in).
  • Casts norm gamma to F32 (kept high-precision for stability), other tensors to bf16.
  • Strips the nested vae.* prefix in the original.
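
The layout change and prefix stripping can be sketched in NumPy. This is not the conversion script itself: `to_channels_last` and `strip_prefix` are hypothetical helpers, the key name is a toy example, and the dtype casts and other renames done by the real script are left out.

```python
import numpy as np

def to_channels_last(w):
    """Move the input-channel axis of a PyTorch conv weight to the end."""
    if w.ndim == 5:   # conv3d: (out, in, T, H, W) -> (out, T, H, W, in)
        return np.transpose(w, (0, 2, 3, 4, 1))
    if w.ndim == 4:   # conv2d: (out, in, H, W) -> (out, H, W, in)
        return np.transpose(w, (0, 2, 3, 1))
    return w          # biases and norm parameters keep their layout

def strip_prefix(key, prefix="vae."):
    """Drop a leading prefix from a checkpoint key if it is present."""
    return key[len(prefix):] if key.startswith(prefix) else key

w3d = np.zeros((8, 4, 3, 5, 7), dtype=np.float32)   # toy (out, in, T, H, W)
w2d = np.zeros((8, 4, 5, 7), dtype=np.float32)      # toy (out, in, H, W)
print(to_channels_last(w3d).shape, to_channels_last(w2d).shape)
print(strip_prefix("vae.encoder.conv1.weight"))
```

The axes have to be permuted rather than reshaped: a plain reshape to the target shape would keep the original memory order and scramble the kernel values.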

Why this is separate from the LLM

Both Lance's image and video pipelines need this VAE. Publishing it once decouples versioning: a fix or upgrade to the VAE doesn't force a re-download of either ~12 GB LLM. The companion repos (mlx-community/Lance-3B-bf16 and mlx-community/Lance-3B-Video-bf16) bundle a copy for convenience, but power users should pin this one and use it across both.

License

Apache 2.0. The original Wan2.2 VAE weights are © Alibaba; this MLX port is © the lance-mlx contributors. See NOTICE in the lance-mlx repo for attribution.
