📂 Part of the Lance MLX collection on mlx-community.

Wan2.2-VAE-Lance-bf16 (MLX)

MLX port of the 48-channel Wan2.2 3D causal VAE bundled with ByteDance's Lance unified multimodal model. Converted to bf16 for Apple Silicon. ~705 M parameters, encoder + decoder in a single safetensors file.

⚠️ This is NOT the public wan2.2_vae.safetensors (which is 16-channel and incompatible with Lance). This is the re-trained 48-channel variant Lance ships with, required by both Lance image and video pipelines.

Status

🟢 Production-ready as of 2026-05-21. Roundtrip MAD ≈ 7/255 in u8 domain on real photographs at 768².

Component	Status
Encoder (`Wan22VAEEncoder`)	✅ Loads cleanly, 86 keys mapped
Decoder (`Wan22VAEDecoder`)	✅ Loads cleanly, 110 keys mapped
Streaming feature cache (1+4+4+… chunked encode)	✅ Per-conv `feat_cache` works across temporal chunks
Per-channel latent normalization (`VAE22_MEAN`, `VAE22_STD`)	✅ 48-channel mean/std applied after encode, reversed before decode
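
The "1+4+4+…" schedule in the table can be read as one frame in the first chunk and up to four frames in each later chunk. A minimal sketch of that split, assuming this reading is right (the helper name `chunk_bounds` is hypothetical and not part of mlx-video):

```python
def chunk_bounds(num_frames, first=1, step=4):
    """Return (start, end) frame ranges for a 1+4+4+... chunked encode."""
    if num_frames < 1:
        raise ValueError("num_frames must be >= 1")
    # The first chunk holds a single frame.
    bounds = [(0, min(first, num_frames))]
    start = bounds[0][1]
    # Every later chunk holds up to `step` frames.
    while start < num_frames:
        end = min(start + step, num_frames)
        bounds.append((start, end))
        start = end
    return bounds

print(chunk_bounds(9))  # [(0, 1), (1, 5), (5, 9)]
```

Each range would be encoded in order, with the per-conv `feat_cache` presumably carrying context from one chunk into the next so the causal convolutions behave as they would on the full clip.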

Quickstart

With the mlx-video module installed (it provides the Wan22VAEEncoder / Wan22VAEDecoder classes), load the encoder and decoder:

from huggingface_hub import snapshot_download
import mlx.core as mx
from mlx_video.models.wan_2.vae22 import (
    Wan22VAEEncoder, Wan22VAEDecoder, denormalize_latents,
)

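# Fetch the repo snapshot and load every tensor from the single safetensors file.
repo = snapshot_download("mlx-community/Wan2.2-VAE-Lance-bf16")
weights = mx.load(f"{repo}/vae.safetensors")

# Encoder side: keys under the "encoder." and "conv1." prefixes.
enc = Wan22VAEEncoder(z_dim=48, dim=160)
enc.load_weights([
    (k, v) for k, v in weights.items()
    if k.startswith("encoder.") or k.startswith("conv1.")
])
mx.eval(enc.parameters())  # MLX is lazy; force the weights to materialize

# Decoder side: keys under the "decoder." and "conv2." prefixes.
dec = Wan22VAEDecoder(z_dim=48, dim=160, dec_dim=256)
dec.load_weights([
    (k, v) for k, v in weights.items()
    if k.startswith("decoder.") or k.startswith("conv2.")
])
mx.eval(dec.parameters())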
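
The two prefix filters above should between them cover every key in the file. A small sketch of a check for that, using a stand-in dictionary (the key names and the helper `split_by_prefix` are hypothetical; the real checkpoint's key names are not listed in this card):

```python
def split_by_prefix(weights, prefixes):
    """Return (key, value) pairs whose key starts with any of the prefixes."""
    return [(k, v) for k, v in weights.items() if k.startswith(tuple(prefixes))]

# Hypothetical key names standing in for the real checkpoint contents.
weights = {
    "encoder.conv_in.weight": 0,
    "conv1.weight": 1,
    "decoder.conv_in.weight": 2,
    "conv2.weight": 3,
}
enc_items = split_by_prefix(weights, ("encoder.", "conv1."))
dec_items = split_by_prefix(weights, ("decoder.", "conv2."))

# Every key should be claimed by exactly one of the two halves.
claimed = [k for k, _ in enc_items + dec_items]
assert sorted(claimed) == sorted(weights), "unclaimed or duplicated keys"
print(len(enc_items), len(dec_items))  # 2 2
```

Run against the real `weights` dictionary, the two counts would be expected to match the 86 and 110 keys reported in the status table.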

Encode an image

import numpy as np
from PIL import Image

img = Image.open("photo.jpg").convert("RGB").resize((768, 768))
arr = np.asarray(img, dtype=np.float32) / 127.5 - 1.0     # [-1, 1]
x = mx.array(arr[None, None, ...])                        # (1, 1, H, W, 3)
z = enc(x)                                                # (1, 1, 48, 48, 48)
print("latent shape:", z.shape)
# mean ≈ -0.07, std ≈ 0.60 (per-channel normalized)

Decode a latent

z_denorm = denormalize_latents(z)                          # apply per-channel std/mean
decoded = dec(z_denorm)                                    # (1, T'>=1, H', W', 3) in [-1, 1]
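# Cast to float32 before leaving MLX: NumPy has no bfloat16 dtype.
frame = np.array(decoded[0, 0].astype(mx.float32))
out_img = ((frame + 1.0) * 127.5).clip(0, 255).astype(np.uint8)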
Image.fromarray(out_img).save("roundtrip.png")
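
The per-channel normalization that `denormalize_latents` reverses can be sketched in NumPy. This is an illustration only: the statistics below are random placeholders rather than the real `VAE22_MEAN` / `VAE22_STD`, the function names are hypothetical, and a channels-last latent layout is assumed.

```python
import numpy as np

def normalize_sketch(z, mean, std):
    """Per-channel (z - mean) / std on a channels-last latent."""
    return (z - mean) / std

def denormalize_sketch(z_norm, mean, std):
    """Inverse of normalize_sketch: z_norm * std + mean."""
    return z_norm * std + mean

# Placeholder statistics, NOT the real VAE22_MEAN / VAE22_STD values.
rng = np.random.default_rng(0)
mean = rng.normal(size=48).astype(np.float32)
std = rng.uniform(0.5, 2.0, size=48).astype(np.float32)

# Assumed layout: (batch, T, H, W, channels); the 48-vector broadcasts over the last axis.
z = rng.normal(size=(1, 1, 48, 48, 48)).astype(np.float32)
z_back = denormalize_sketch(normalize_sketch(z, mean, std), mean, std)
print(np.allclose(z, z_back, atol=1e-5))  # True
```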

Architecture

Spatial downsample: 16× per axis (H, W ÷ 16 → latent grid)
Temporal downsample: 4× (T = 4k + 1 frames → k + 1 latent frames, causal padding)
Latent channels (z_dim): 48 (vs. 16 in the public Wan2.2)
Encoder feature dim: 160; Decoder feature dim: 256
Encoder topology: dim_mult=(1,2,4,4), num_res_blocks=2, temperal_downsample=(False, True, True)
Decoder topology: dim_mult=(1,2,4,4), num_res_blocks=2, temperal_upsample=(True, True, False)
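
The downsampling factors above determine the latent grid for a given input. A minimal sketch, assuming the common causal layout in which the first frame stands alone and each further group of four frames adds one latent frame (this matches the single-frame quickstart shape, but the multi-frame rule is an assumption; `latent_shape` is a hypothetical helper):

```python
def latent_shape(frames, height, width, z_dim=48, spatial=16, temporal=4):
    """Latent (T', H', W', C) for an input clip of (frames, height, width)."""
    if height % spatial or width % spatial:
        raise ValueError("height and width should be multiples of 16")
    # Assumed causal convention: the first frame stands alone, then groups of 4.
    t_lat = 1 + (frames - 1) // temporal
    return (t_lat, height // spatial, width // spatial, z_dim)

print(latent_shape(1, 768, 768))    # (1, 48, 48, 48)
print(latent_shape(17, 480, 832))   # (5, 30, 52, 48)
```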

Roundtrip quality

Input	Output dims	Per-pixel MAD ([0, 255] u8)	Max abs error
768² photo (`edit_img.jpg`, painting)	768²	7.36 / 255	0.82 / 1.0

Loaded in ~0.3 s on M5 Max 128 GB; encode 0.55 s, decode 1.65 s.
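
A roundtrip error of this kind can be measured with a few lines of NumPy. The sketch below assumes MAD means mean absolute difference between the input and decoded u8 images; the scale of the table's max-error column is not stated, so both numbers here are simply reported in u8 units. The helper name `roundtrip_errors` is hypothetical.

```python
import numpy as np

def roundtrip_errors(original_u8, decoded_u8):
    """Return (mean absolute difference, max absolute difference) in u8 units."""
    a = original_u8.astype(np.int16)   # widen so the subtraction cannot wrap
    b = decoded_u8.astype(np.int16)
    diff = np.abs(a - b)
    return float(diff.mean()), int(diff.max())

# Tiny synthetic example; real use would pass the 768x768x3 input and output.
orig = np.array([[[10, 20, 30]], [[200, 100, 0]]], dtype=np.uint8)
dec = np.array([[[12, 20, 26]], [[190, 100, 8]]], dtype=np.uint8)
print(roundtrip_errors(orig, dec))  # (4.0, 10)
```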

Files in this repo

File	Size	Notes
`vae.safetensors`	1.41 GB	Encoder + decoder, bf16 (196 keys under the `conv1.`, `conv2.`, `encoder.`, and `decoder.` prefixes)
`vae_conversion_report.json`	–	PyTorch → MLX conversion provenance: 62 conv3d + 10 conv2d + 50 RMS gamma + 2 attn-norm gamma + 170 renamed + 72 other
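
The file size is consistent with the parameter count at two bytes per value. A rough check, ignoring the safetensors header and the small number of norm gammas kept in F32:

```python
params = 704.7e6          # parameter count quoted under Provenance
bytes_per_param = 2       # bf16 stores each value in 16 bits

size_gb = params * bytes_per_param / 1e9   # decimal gigabytes
print(f"{size_gb:.2f} GB")  # 1.41 GB
```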

Provenance

Source: bytedance-research/Lance/Wan2.2_VAE.pth (PyTorch, 704.7 M params, 196 tensors after splitting nested modules). Converted via scripts/06_convert_wan_vae.py which:

Transposes PyTorch conv3d weights from (out, in, T, H, W) to (out, T, H, W, in) for the MLX channels-last convention.
Transposes conv2d weights the same way, from (out, in, H, W) to (out, H, W, in).
Casts norm gamma to F32 (kept in high precision for stability) and all other tensors to bf16.
Strips the nested vae.* prefix in the original.
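
The layout change in the first two steps is an axis permutation, not a reshape: the values are reordered so the input-channel axis moves to the end. A NumPy sketch of that step (this is not the actual conversion script; the function names and the example weight shapes are hypothetical):

```python
import numpy as np

def conv3d_to_channels_last(w):
    """(out, in, T, H, W) -> (out, T, H, W, in)."""
    return np.transpose(w, (0, 2, 3, 4, 1))

def conv2d_to_channels_last(w):
    """(out, in, H, W) -> (out, H, W, in)."""
    return np.transpose(w, (0, 2, 3, 1))

w3 = np.zeros((160, 12, 3, 3, 3), dtype=np.float32)  # hypothetical conv3d weight
w2 = np.zeros((160, 160, 3, 3), dtype=np.float32)    # hypothetical conv2d weight
print(conv3d_to_channels_last(w3).shape)  # (160, 3, 3, 3, 12)
print(conv2d_to_channels_last(w2).shape)  # (160, 3, 3, 160)
```

Using `reshape` here instead would keep the element order and silently scramble the kernels, which is why the permutation matters.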

Why this is separate from the LLM

Both Lance's image and video pipelines need this VAE. Publishing it once decouples versioning: a fix or upgrade to the VAE doesn't force a re-download of either ~12 GB LLM. The companion repos (mlx-community/Lance-3B-bf16 and mlx-community/Lance-3B-Video-bf16) bundle a copy for convenience, but power users should pin this one and use it across both.

License

Apache 2.0. The original Wan2.2 VAE weights are © Alibaba; this MLX port is © the lance-mlx contributors. See NOTICE in the lance-mlx repo for attribution.
