SCAIL-2 for MLX (work in progress)
⚠️ WIP: pre-release conversion, expect changes
These are Apple-MLX conversions of zai-org/SCAIL-2 for the xocialize/scail-2-mlx port, published from our own namespace while the port is under active development. File formats, key layouts, and dtypes may change without notice. Quantized (q8/q4) variants, golden end-to-end validation against the PyTorch reference, and an mlx-community release are planned but not done. Use for experimentation, not production.
SCAIL-2 (Zhipu AI, arXiv 2512.05905) is an end-to-end controlled character-animation model: a reference character image + a driving video → the character performing that motion. It supports cross-identity replacement, multi-character scenes, and animal driving, with no intermediate pose representations required. The backbone is a Wan2.1-I2V-14B fork with a 3-segment (reference / video / pose) RoPE design and dual mask conditioning.
Files
| file | component | dtype | size |
|---|---|---|---|
| `dit.safetensors` | SCAIL2 DiT (14B, Wan2.1-I2V fork) | bf16 | 33 GB |
| `umt5.safetensors` | umT5-XXL text encoder | bf16 | 11 GB |
| `clip.safetensors` | open-clip xlm-roberta ViT-H/14 visual tower | fp16 | 1.2 GB |
| `vae.safetensors` | Wan2.1 VAE (16-ch) | fp32 | 0.5 GB |
Keys follow the scail-2-mlx module
tree (MLX nn.Sequential uses .layers.N; conv weights are NDHWC/NHWC).
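As a rough illustration of the channels-last note above: PyTorch stores conv weights as (out, in, *kernel), while MLX convolution layers expect (out, *kernel, in). The sketch below only computes the axis permutation and the resulting shape. The helper names are hypothetical and are not taken from the scail-2-mlx converter, whose actual handling may differ.

```python
def channels_last_axes(ndim):
    """Axis order that moves PyTorch's input-channel axis (index 1) to the end.

    ndim is 4 for a 2D conv weight and 5 for a 3D conv weight.
    """
    if ndim < 3:
        raise ValueError("conv weights have at least 3 dimensions")
    return (0, *range(2, ndim), 1)


def permute_shape(shape, axes):
    """Shape obtained after transposing an array of `shape` by `axes`."""
    return tuple(shape[a] for a in axes)


if __name__ == "__main__":
    # Illustrative 3D conv weight: 16 output channels, 8 input channels, 3x3x3 kernel.
    torch_shape = (16, 8, 3, 3, 3)
    axes = channels_last_axes(len(torch_shape))
    print(axes, permute_shape(torch_shape, axes))
```

With an array library, the same `axes` tuple would be passed to its transpose function to reorder the weight values themselves.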
Tokenizer: use google/umt5-xxl (or the umt5-xxl/ directory bundled with the
original checkpoint).
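To check key names, dtypes, and shapes in a downloaded file without loading any tensor data, one option is to read the safetensors header directly: the format begins with an 8-byte little-endian length followed by a JSON header. This is a minimal stdlib-only sketch; the `weights/mlx/vae.safetensors` path is an assumed local location, and `summarize` is a hypothetical helper name.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict of tensor entries."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned 64-bit little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional string map, not a tensor entry.
    header.pop("__metadata__", None)
    return header


def summarize(path, limit=10):
    """Print the first few tensor names with dtype and shape; return the tensor count."""
    header = read_safetensors_header(path)
    for name in sorted(header)[:limit]:
        entry = header[name]
        print(f"{name:60s} {entry['dtype']:5s} {entry['shape']}")
    print(f"{len(header)} tensors in {Path(path).name}")
    return len(header)


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the weights were downloaded.
    target = Path("weights/mlx/vae.safetensors")
    if target.exists():
        summarize(target)
```

Conv weights should show the channel dimension last in the printed shapes, which is a quick way to confirm the layout described above.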
Usage
```bash
git clone https://github.com/xocialize/scail-2-mlx && cd scail-2-mlx
uv venv --python 3.12 .venv
uv pip install -e refs/mlx-video -e .
hf download xocialize/SCAIL-2-bf16 --local-dir weights/mlx
.venv/bin/python scripts/generate.py \
  --weights-dir weights/mlx \
  --image ref.jpg --mask-image ref_mask.jpg \
  --pose driving.mp4 --mask-video driving_mask.mp4 \
  --prompt "the girl is dancing" \
  --target-h 480 --target-w 832 --save-file out.mp4
```
Requires Apple Silicon with ≥ 64 GB unified memory at bf16 (active ~34 GB, peak ~47 GB at 832×480×65 frames; ~3.7 min/step on an M5 Max, perf work ongoing). Driving-input preprocessing (masks / pose renders) comes from the upstream SCAIL-Pose toolchain.
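Before starting a long run, it can help to confirm that all four weight files are present and that the machine meets the 64 GB guideline. The following is a small preflight sketch, not part of the scail-2-mlx repository: `preflight` and its helpers are hypothetical names, and the `os.sysconf` memory query is an assumption that may not be available on every platform (the code then skips the memory check).

```python
import os
from pathlib import Path

# File names taken from the Files table on this card.
EXPECTED_FILES = (
    "dit.safetensors",
    "umt5.safetensors",
    "clip.safetensors",
    "vae.safetensors",
)
# The card asks for at least 64 GB of unified memory at bf16.
MIN_MEMORY_BYTES = 64 * 1024**3


def missing_weights(weights_dir):
    """Names of expected weight files that are absent from weights_dir."""
    root = Path(weights_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def physical_memory_bytes():
    """Total physical memory in bytes, or None if sysconf does not expose it."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None


def preflight(weights_dir, memory_bytes=None):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = [f"missing {name}" for name in missing_weights(weights_dir)]
    if memory_bytes is None:
        memory_bytes = physical_memory_bytes()
    if memory_bytes is not None and memory_bytes < MIN_MEMORY_BYTES:
        gib = memory_bytes / 1024**3
        problems.append(f"only {gib:.0f} GiB of memory; the card recommends 64 GB or more")
    return problems


if __name__ == "__main__":
    for problem in preflight("weights/mlx"):
        print(problem)
```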
Conversion provenance & fidelity
Converted by recipes/convert_scail2.py
from the original FSDP checkpoint via upstream convert.py key remapping
(1307/1307 strict key match). Component-level parity vs the PyTorch reference
(fp32, CPU): CLIP visual max_abs 2.7e-4 on real weights; chunked causal VAE
decode < 5e-4 per frame (canonical 1+(T−1)·4 frame mapping, see
Blaizzy/mlx-video#38); DiT
forward parity-locked at fp32 on the CPU oracle. End-to-end golden comparison
against the PyTorch pipeline is pending.
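The 1+(T−1)·4 frame mapping mentioned above relates T latent frames to decoded video frames. As a worked example, the 65-frame clip quoted in the memory figures corresponds to 17 latent frames. The function names below are illustrative only.

```python
def decoded_frames(latent_frames, temporal_stride=4):
    """Video frames produced from `latent_frames` latents: 1 + (T - 1) * stride."""
    if latent_frames < 1:
        raise ValueError("need at least one latent frame")
    return 1 + (latent_frames - 1) * temporal_stride


def latent_frames(video_frames, temporal_stride=4):
    """Inverse mapping; only frame counts of the form 1 + k * stride are valid."""
    if video_frames < 1 or (video_frames - 1) % temporal_stride != 0:
        raise ValueError("frame count must be 1 + k * temporal_stride")
    return 1 + (video_frames - 1) // temporal_stride


if __name__ == "__main__":
    print(latent_frames(65))   # 17
    print(decoded_frames(17))  # 65
```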
License
Weights: converted from zai-org/SCAIL-2 (model card: MIT; source repository:
Apache-2.0; this card is marked Apache-2.0, the stricter of the two, pending
upstream clarification). Conversion code: Apache-2.0. Derived from SCAIL-2
(Zhipu AI), Wan2.1 (Alibaba), open-clip.