Lance-3B-Alis-MLX-Traced

ByteDance Lance 3B (image + video) converted to Apple MLX, byte-clean against the original PyTorch.

Layout

Two standalone weight files, one per variant, matching ByteDance's Lance_3B/ + Lance_3B_Video/ layout:

| Path | Variant | Size | dtype | Keys |
| --- | --- | --- | --- | --- |
| `Lance_3B/model.safetensors` | image (LLM + adapters) | 24.7 GB | F32 | 1021 |
| `Lance_3B_Video/model.safetensors` | video (standalone: backbone + 31-frame pos-embed + video ViT) | 28.4 GB | F32 | 1411 |
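
The key counts and dtypes in the table above can be checked without loading any tensor data, because a safetensors file starts with an 8-byte little-endian header length followed by a JSON header. The following is a minimal sketch using only the standard library; the function name `summarize_safetensors` is illustrative and not part of the repo.

```python
import json
import struct
from collections import Counter


def summarize_safetensors(path):
    """Return (tensor_count, dtype_counts) read from a safetensors header."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    tensors = {k: v for k, v in header.items() if k != "__metadata__"}
    dtypes = Counter(entry["dtype"] for entry in tensors.values())
    return len(tensors), dict(dtypes)
```

Assuming the download lands where the Usage section puts it, calling `summarize_safetensors("checkpoints/Lance-Alis/Lance_3B/model.safetensors")` should report 1021 tensors, all F32, if the file matches the table.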

The image ViT and the Wan 2.2 VAE are separate files (see the repo for setup).

The weight is not the point — the verification is

The image weight (Lance_3B/) is bit-identical (SHA256 5ede2f0a…547817) to RockTalk/Lance-3B-MLX: both are the same deterministic F32 conversion from bytedance-research/Lance. (This is the F32 build; a separate bf16 build is available as mlx-community/Lance-3B-bf16.)
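
A bit-identity claim like this can be re-checked locally by hashing both files. The sketch below streams a file through SHA-256 with the standard library; the helper names `sha256_file` and `same_bytes` are illustrative and do not come from the repo's harness.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so a multi-GB weight never sits in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b):
    """True when both files hash to the same SHA-256 digest."""
    return sha256_file(path_a) == sha256_file(path_b)
```

Two files that pass `same_bytes` are byte-for-byte the same for all practical purposes, which is a stronger statement than matching tensor values within a tolerance.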

The differentiator of this release is not the weight. It is the verification trace: every stage of the port was cross-validated against the original PyTorch via byte-diff before the next stage started, and the full harness and lesson log are public:

👉 github.com/avlp12/lance_alis_mlx

Verification

Every gate imports the original PyTorch code directly (not a clean re-implementation) under a shim, draws its random inputs from the same NumPy PRNG on both sides, and byte-diffs the outputs at every layer. The log records 23 lessons across stages 1–9; see the repo's LEARNING_LOG/.
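
The shared-PRNG and byte-diff idea can be sketched as follows. This is an illustration under assumptions, not the repo's harness: the helper names, the use of `np.random.default_rng`, and the float32 input dtype are all hypothetical choices.

```python
import numpy as np


def shared_input(shape, seed=0):
    """Draw the test input once with NumPy so both frameworks get the same bytes."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape).astype(np.float32)


def byte_diff(ref, out):
    """Count the bytes that differ between two arrays of equal shape and dtype."""
    ref = np.ascontiguousarray(ref)
    out = np.ascontiguousarray(out)
    if ref.shape != out.shape or ref.dtype != out.dtype:
        raise ValueError("shape or dtype mismatch")
    a = np.frombuffer(ref.tobytes(), dtype=np.uint8)
    b = np.frombuffer(out.tobytes(), dtype=np.uint8)
    return int(np.count_nonzero(a != b))
```

In a gate of this kind, the same `shared_input(...)` array would be fed to the PyTorch layer and to the MLX layer, both outputs converted back to NumPy, and `byte_diff` expected to return 0 where exact parity is required.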

| Stage | Component | Gate |
| --- | --- | --- |
| 1 | PT → MLX weight conversion | bit-exact (SHA256 match) |
| 5 | Wan 2.2 VAE image path (T=1) | ~40 dB PSNR round-trip vs PT |
| 6 | Flow matching + CFG (T2I) | end-to-end cos ≥ 0.999 vs PT 30-step |
| 7 | ViT + X→T + TI2I | cos ≥ 0.999 + real-photo perceptual |
| 8 | 3D causal video VAE | 4 gates cos = 1.000000 (encode + decode) |
| 9 | T2V (video DiT + flow matching) | 30-step latent cos ≥ 0.999, video pixel cos = 0.999338 vs PT |
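
The two tolerance-based metrics in the table, cosine similarity and PSNR, can be computed as in the sketch below. It assumes tensors are flattened before the cosine is taken and that pixel values lie in [0, 1] (so `peak=1.0`); the repo's gates may define either metric differently.

```python
import numpy as np


def cosine(ref, out):
    """Cosine similarity between two arrays, flattened and computed in float64."""
    a = np.asarray(ref, dtype=np.float64).ravel()
    b = np.asarray(out, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def psnr(ref, out, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite when the arrays are equal."""
    a = np.asarray(ref, dtype=np.float64)
    b = np.asarray(out, dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))
```

A gate such as "cos ≥ 0.999" then reduces to `cosine(pt_out, mlx_out) >= 0.999` on outputs converted to NumPy, where `pt_out` and `mlx_out` are placeholder names.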

The video weight reproduces the Stage 9 T2V results exactly (single-step cos = 0.999916 / 0.999848 / 0.999452 vs PT). Its converted video supplement is 391/391 byte-identical to RockTalk's, which is in turn byte-clean vs the original PT supplement.

Honesty note. T2V is verified end-to-end and uses only the 1021-key subset. The video weight also bundles the video ViT (vit_model, byte-clean vs PT), but the x2t_video / video_edit pipelines that would consume it are not yet implemented in MLX.

Usage

Inference is pure MLX — no PyTorch at runtime (PyTorch is imported only by the verification harnesses in tools/).

```bash
git clone https://github.com/avlp12/lance_alis_mlx
cd lance_alis_mlx
python3.12 -m venv .venv && source .venv/bin/activate
pip install mlx mlx-vlm transformers safetensors einops pillow huggingface_hub numpy

# weights: image (Lance_3B/) + video (Lance_3B_Video/)
hf download avlp12/Lance-3B-Alis-MLX-Traced --local-dir checkpoints/Lance-Alis
hf download RockTalk/Wan2.2-VAE-MLX --local-dir checkpoints/Wan2.2-VAE-MLX

# generate (see the repo README for the exact checkpoints/ layout)
PYTHONPATH=. .venv/bin/python tools/stage6_t2i_smoke.py    # text-to-image
PYTHONPATH=. .venv/bin/python tools/stage7_ti2i_smoke.py   # image edit
```

Apple Silicon required (developed on M3 Ultra). Python 3.12.

License & citation

Apache 2.0 — same as upstream ByteDance Lance and Alibaba Wan 2.2 VAE.

@misc{fu2026lanceunifiedmultimodalmodeling,
      title         = {Lance: Unified Multimodal Modeling by Multi-Task Synergy},
      author        = {Fengyi Fu and Mengqi Huang and Shaojin Wu and Yunsheng Jiang and Yufei Huo and Hao Li and Yinghang Song and Fei Ding and Jianzhu Guo and Qian He and Zheren Fu and Zhendong Mao and Yongdong Zhang},
      year          = {2026},
      eprint        = {2605.18678},
      archivePrefix = {arXiv},
      primaryClass  = {cs.CV},
      url           = {https://arxiv.org/abs/2605.18678},
}

Acknowledgments

ByteDance Lance team — original PyTorch model and research
RockTalk — MLX checkpoint conversion used as the F32 parity reference (image + video supplement)
Alibaba Wan 2.2 team — 3D Causal VAE architecture