📂 Part of the Lance MLX collection on mlx-community.
Wan2.2-VAE-Lance-bf16 (MLX)
MLX port of the 48-channel Wan2.2 3D causal VAE bundled with ByteDance's Lance unified multimodal model. Converted to bf16 for Apple Silicon. ~705 M parameters, encoder + decoder in a single safetensors file.
⚠️ This is NOT the public wan2.2_vae.safetensors (which is 16-channel and incompatible with Lance). This is the re-trained 48-channel variant Lance ships with, required by both the Lance image and video pipelines.
Status
🟢 Production-ready as of 2026-05-21. Roundtrip MAD ≈ 7/255 in u8 domain on real photographs at 768².
| Component | Status |
|---|---|
| Encoder (Wan22VAEEncoder) | ✅ Loads cleanly, 86 keys mapped |
| Decoder (Wan22VAEDecoder) | ✅ Loads cleanly, 110 keys mapped |
| Streaming feature cache (1+4+4+… chunked encode) | ✅ Per-conv feat_cache works across temporal chunks |
| Per-channel latent normalization (VAE22_MEAN, VAE22_STD) | ✅ 48-channel mean/std applied after encode, reversed before decode |
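The 1+4+4+… chunking named in the table can be pictured with a small stdlib-only sketch. It is an illustration, not the mlx-video implementation: the function name `chunk_sizes` is hypothetical, and the handling of a short trailing chunk is an assumption.

```python
def chunk_sizes(num_frames: int, first: int = 1, step: int = 4) -> list[int]:
    """Split a clip into a leading chunk of `first` frames, then chunks of `step`."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    sizes = [min(first, num_frames)]
    remaining = num_frames - sizes[0]
    while remaining > 0:
        size = min(step, remaining)  # assumption: a short tail chunk is kept as-is
        sizes.append(size)
        remaining -= size
    return sizes


print(chunk_sizes(1))  # [1]
print(chunk_sizes(9))  # [1, 4, 4]
```

A single image is the degenerate case with one chunk of one frame, which is why the quickstart below passes a tensor with T = 1.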
Quickstart
Install the mlx-video module, which provides the Wan22VAEEncoder / Wan22VAEDecoder classes, then load the weights:
from huggingface_hub import snapshot_download
import mlx.core as mx
from mlx_video.models.wan_2.vae22 import (
Wan22VAEEncoder, Wan22VAEDecoder, denormalize_latents,
)
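# Fetch the repo snapshot (cached after the first call) and load all tensors.
repo = snapshot_download("mlx-community/Wan2.2-VAE-Lance-bf16")
weights = mx.load(f"{repo}/vae.safetensors")
# Encoder: receives the encoder.* and conv1.* tensors.
enc = Wan22VAEEncoder(z_dim=48, dim=160)
enc.load_weights([
    (k, v) for k, v in weights.items()
    if k.startswith("encoder.") or k.startswith("conv1.")
])
mx.eval(enc.parameters())  # force evaluation of the lazily loaded parameters
# Decoder: receives the decoder.* and conv2.* tensors.
dec = Wan22VAEDecoder(z_dim=48, dim=160, dec_dim=256)
dec.load_weights([
    (k, v) for k, v in weights.items()
    if k.startswith("decoder.") or k.startswith("conv2.")
])
mx.eval(dec.parameters())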
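The quickstart routes tensors to each module by key prefix. Before calling load_weights, it can help to check that routing and see which keys match neither module. The sketch below is stdlib-only; `split_by_prefix` and the example key names are hypothetical, and only the four prefixes come from the snippet above.

```python
def split_by_prefix(keys, prefixes_by_module):
    """Group key names by the first matching prefix; collect the rest as unmatched."""
    groups = {name: [] for name in prefixes_by_module}
    unmatched = []
    for key in keys:
        for name, prefixes in prefixes_by_module.items():
            if key.startswith(tuple(prefixes)):
                groups[name].append(key)
                break
        else:
            unmatched.append(key)
    return groups, unmatched


# Hypothetical key names, for illustration only; real names come from vae.safetensors.
example_keys = ["encoder.a.weight", "conv1.weight", "decoder.b.weight", "conv2.bias", "other.x"]
groups, unmatched = split_by_prefix(
    example_keys,
    {"encoder": ("encoder.", "conv1."), "decoder": ("decoder.", "conv2.")},
)
print({name: len(ks) for name, ks in groups.items()}, unmatched)
```

With the real checkpoint, passing `weights.keys()` instead of `example_keys` gives per-module counts that can be compared against the Status table.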
Encode an image
import numpy as np
from PIL import Image
img = Image.open("photo.jpg").convert("RGB").resize((768, 768))
arr = np.asarray(img, dtype=np.float32) / 127.5 - 1.0 # [-1, 1]
x = mx.array(arr[None, None, ...]) # (1, 1, H, W, 3)
z = enc(x) # (1, 1, 48, 48, 48)
print("latent shape:", z.shape)
# mean ≈ -0.07, std ≈ 0.60 (per-channel normalized)
Decode a latent
z_denorm = denormalize_latents(z) # apply per-channel std/mean
decoded = dec(z_denorm) # (1, T'>=1, H', W', 3) in [-1, 1]
out_img = ((np.array(decoded[0, 0]) + 1.0) * 127.5).clip(0, 255).astype(np.uint8)
Image.fromarray(out_img).save("roundtrip.png")
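The encode and decode snippets map pixels between uint8 and [-1, 1]. The decode snippet casts with astype(np.uint8), which truncates toward zero. Rounding before the cast is an optional variation, not part of the original snippet, and the helper names below are hypothetical. A scalar sketch of the mapping:

```python
def u8_to_signed(value: int) -> float:
    """Map a uint8 pixel value in [0, 255] to [-1, 1], as in the encode step."""
    return value / 127.5 - 1.0


def signed_to_u8(x: float) -> int:
    """Map [-1, 1] back to uint8, rounding to nearest and clamping to [0, 255]."""
    return max(0, min(255, round((x + 1.0) * 127.5)))


# Every uint8 value survives the round trip when rounding is used.
assert all(signed_to_u8(u8_to_signed(v)) == v for v in range(256))
print(signed_to_u8(-1.0), signed_to_u8(1.0), signed_to_u8(1.7))  # 0 255 255
```

In the NumPy version this would correspond to calling `.round()` before `.clip(0, 255).astype(np.uint8)`.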
Architecture
- Spatial downsample: 16× per axis (H, W ÷ 16 → latent grid)
- Temporal downsample: 4× (T → ⌊(T − 1)/4⌋ + 1, causal padding; a single frame maps to a single latent frame)
- Latent channels (z_dim): 48 (vs. 16 in the public Wan2.2)
- Encoder feature dim: 160; Decoder feature dim: 256
- Encoder topology: dim_mult=(1,2,4,4), num_res_blocks=2, temperal_downsample=(False, True, True)
- Decoder topology: dim_mult=(1,2,4,4), num_res_blocks=2, temperal_upsample=(True, True, False)
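The downsampling factors above determine the latent grid for a given input. A minimal sketch of that arithmetic, assuming channels-last layout, input sides that are multiples of 16, and the causal 1+4+4+… frame grouping from the Status table (so a single image yields one latent frame, matching the quickstart output shape). The function name `latent_shape` is hypothetical.

```python
def latent_shape(frames: int, height: int, width: int, z_dim: int = 48) -> tuple[int, int, int, int]:
    """Return (T', H', W', C) for a channels-last latent."""
    if height % 16 or width % 16:
        raise ValueError("height and width are assumed to be multiples of 16")
    latent_frames = 1 + (frames - 1) // 4  # assumption: causal 1+4+4+... grouping
    return latent_frames, height // 16, width // 16, z_dim


print(latent_shape(1, 768, 768))   # (1, 48, 48, 48)
print(latent_shape(17, 480, 832))  # (5, 30, 52, 48)
```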
Roundtrip quality
| Input | Output dims | Per-pixel MAD ([0, 255] u8) | Max abs error |
|---|---|---|---|
| 768² photo (edit_img.jpg, painting) | 768² | 7.36 / 255 | 0.82 / 1.0 |
Loaded in ~0.3 s on M5 Max 128 GB; encode 0.55 s, decode 1.65 s.
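For readers who want to compute comparable figures on their own images, here is a stdlib-only sketch. It assumes MAD means the mean absolute difference between original and decoded pixels in uint8 units. The maximum error in the table appears to use a different scale, which this sketch does not reproduce. The function name and the pixel values are made up for illustration.

```python
def roundtrip_errors(original, decoded):
    """Return (mean abs difference, max abs difference) over two equal-length uint8 sequences."""
    if len(original) != len(decoded) or not original:
        raise ValueError("inputs must be non-empty and the same length")
    diffs = [abs(a - b) for a, b in zip(original, decoded)]
    return sum(diffs) / len(diffs), max(diffs)


# Tiny made-up pixel values, only to show the arithmetic.
mad, worst = roundtrip_errors([0, 128, 255, 64], [2, 120, 255, 70])
print(mad, worst)  # 4.0 8
```

For real images, the two sequences would be the flattened uint8 arrays of the input and of the decoded output.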
Files in this repo
| File | Size | Notes |
|---|---|---|
| vae.safetensors | 1.41 GB | Encoder + decoder, bf16 (197 keys including conv1, conv2, encoder.*, decoder.*) |
| vae_conversion_report.json | – | PyTorch → MLX conversion provenance: 62 conv3d + 10 conv2d + 50 RMS gamma + 2 attn-norm gamma + 170 renamed + 72 other |
Provenance
Source: bytedance-research/Lance/Wan2.2_VAE.pth (PyTorch, 704.7 M params, 196 tensors after splitting nested modules).
Converted via scripts/06_convert_wan_vae.py, which:
- Transposes PyTorch conv3d weights (out, in, T, H, W) → (out, T, H, W, in) for the MLX channels-last convention.
- Transposes conv2d weights to channels-last in the same way.
- Casts norm gamma to F32 (kept high-precision for stability) and all other tensors to bf16.
- Strips the nested vae.* prefix used in the original checkpoint.
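The conversion steps above can be sketched as a per-tensor plan. This is not the actual script: `plan_tensor` is a hypothetical helper, the example key names are invented, and identifying conv weights by a `.weight` suffix and norm scales by a `.gamma` suffix is an assumption. Only the axis orders, the dtype policy, and the `vae.` prefix come from the list.

```python
CONV3D_PERM = (0, 2, 3, 4, 1)  # (out, in, T, H, W) -> (out, T, H, W, in)
CONV2D_PERM = (0, 2, 3, 1)     # (out, in, H, W)    -> (out, H, W, in)


def plan_tensor(name: str, shape: tuple[int, ...]):
    """Return (new_name, new_shape, dtype) for one checkpoint tensor."""
    new_name = name[len("vae."):] if name.startswith("vae.") else name
    is_weight = new_name.endswith(".weight")  # assumption: conv kernels end in ".weight"
    if is_weight and len(shape) == 5:
        shape = tuple(shape[i] for i in CONV3D_PERM)
    elif is_weight and len(shape) == 4:
        shape = tuple(shape[i] for i in CONV2D_PERM)
    # Assumption: norm scales are identified by a ".gamma" suffix.
    dtype = "float32" if new_name.endswith(".gamma") else "bfloat16"
    return new_name, shape, dtype


# Invented key names and shapes, for illustration only.
print(plan_tensor("vae.encoder.conv.weight", (160, 12, 3, 3, 3)))
print(plan_tensor("vae.decoder.norm.gamma", (256, 1, 1, 1)))
```

The permutation tuples are the axis orders one would pass to a transpose call on the actual arrays. A plain reshape would scramble the kernel values rather than reorder them.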
Why this is separate from the LLM
Both Lance's image and video pipelines need this VAE. Publishing it once decouples versioning: a fix or upgrade to the VAE doesn't force a re-download of either ~12 GB LLM. The companion repos (mlx-community/Lance-3B-bf16 and mlx-community/Lance-3B-Video-bf16) bundle a copy for convenience, but power users should pin this one and use it across both.
License
Apache 2.0. The original Wan2.2 VAE weights are © Alibaba; this MLX port is © the lance-mlx contributors. See NOTICE in the lance-mlx repo for attribution.
Links
- MLX port code: github.com/xocialize/lance-mlx
- Original PyTorch checkpoint: bytedance-research/Lance/Wan2.2_VAE.pth
- Image specialist (uses this VAE): mlx-community/Lance-3B-bf16
- Video specialist (uses this VAE): mlx-community/Lance-3B-Video-bf16