SceneWorks/scail2-mlx
Turnkey, SceneWorks-converted weights of zai-org/SCAIL-2, an end-to-end controlled character-animation / motion-transfer video model, packaged for native Apple-Silicon (MLX) inference inside SceneWorks. This is not an original model; it is a format/dtype repackaging of the upstream release for first-class macOS use (no PyTorch at runtime).
Capabilities (from upstream): character animation from a reference image + driving video, cross-identity character replacement, zero-shot animal-driving, end-to-end and pose-rendered driving, and (experimental) multi-reference. Image output is `num_frames == 1`.
What changed vs. upstream
Every component is repackaged to the safetensors layout the SceneWorks Rust/MLX loaders consume, so no PyTorch is needed at runtime:
- DiT (`model/1/fsdp2_rank_0000_checkpoint.pt`, an FSDP2/SAT checkpoint) was key-remapped to the `SCAIL2Model` parameter naming using the upstream `convert.py` contract (fused `query_key_value` → `q`/`k`/`v`, `key_value` → `k`/`v`, `clip_feature_key_value_list` → `k_img`/`v_img`), cast fp32 → bf16, then pre-quantized to group-wise-affine Q4 on disk → `dit.safetensors`. The attention (`q`/`k`/`v`/`o` + I2V `k_img`/`v_img`) and FFN (`ffn.0`/`ffn.2`) Linears are packed (`weight` u32 codes + `scales` + `biases` via MLX `quantize`, byte-equal to `nn.quantize`, group size 64); the patch/text/time/image embeddings, norms, and output head stay dense bf16. A `config.json` `quantization` block marks the snapshot so the loader builds the quantized Linears directly from the packs (no dense bf16 materialized at load). Bit-faithful key remap (987 source keys → 1307 model keys; exact key+shape match against `SCAIL2Model.from_config(config-14b.json)`).
- VAE (`Wan2.1_VAE.pth`, the stock Wan2.1 z16 VAE) → `vae.safetensors` (f32, channels-last conv transpose, keys unchanged; the `sanitize_wan_vae_weights` contract shared with Bernini/wan). Loaded by `mlx_gen_wan::WanVae`.
- Text encoder (`umt5-xxl/models_t5_umt5-xxl-enc-bf16.pth`, stock UMT5-XXL) → `t5_encoder.safetensors` (bf16, sole rename `.ffn.gate.0.` → `.ffn.gate_proj.`). Loaded by `mlx_gen_wan::Umt5Encoder` with `tokenizer.json`.
- Image encoder (`models_clip_...onlyvisual.pth`, open-CLIP XLM-RoBERTa ViT-H/14) → `clip.safetensors` (f32, de-prefixed `visual.*` keys). Loaded by `mlx_gen_scail2::ScailClip` (32-layer visual tower, `use_31_block` penultimate features).
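The fused-projection split in the DiT remap can be illustrated with a toy example. This is a minimal sketch, not the upstream `convert.py`: it assumes the fused tensor stacks its parts in order along the output (row) axis, and the key names (`blocks.0.attn...`) and the helpers `split_fused` and `remap` are hypothetical.

```python
def split_fused(weight, parts):
    """Split a fused projection (a list of rows) into equal row blocks, one per part.

    Assumption: the fused tensor concatenates the parts in order along axis 0.
    """
    rows = len(weight)
    if rows % len(parts) != 0:
        raise ValueError(f"{rows} rows cannot be split into {len(parts)} parts")
    step = rows // len(parts)
    return {name: weight[i * step:(i + 1) * step] for i, name in enumerate(parts)}


# Hypothetical rename table mirroring two of the renames listed above.
FUSED_SPLITS = {
    "query_key_value": ("q", "k", "v"),
    "key_value": ("k", "v"),
}


def remap(state):
    """Return a new state dict with fused projection weights split into separate keys."""
    out = {}
    for key, tensor in state.items():
        prefix, _, leaf = key.rpartition(".")
        module_prefix, _, module = prefix.rpartition(".")
        if leaf == "weight" and module in FUSED_SPLITS:
            for name, block in split_fused(tensor, FUSED_SPLITS[module]).items():
                out[f"{module_prefix}.{name}.weight"] = block
        else:
            out[key] = tensor  # everything else passes through unchanged
    return out


# Toy 6x2 "fused" weight: two rows each for q, k, v.
toy = {
    "blocks.0.attn.query_key_value.weight": [[r, r] for r in range(6)],
    "blocks.0.norm.weight": [1.0, 1.0],
}
remapped = remap(toy)
print(sorted(remapped))
```

Here two source keys become four model keys, which is the same effect that makes the model key count larger than the source key count in the real remap.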
The converted VAE/UMT5 are byte-size-identical (modulo safetensors header) to Bernini/wan's already-validated Wan2.1 VAE + umt5-xxl safetensors, confirming that SCAIL-2 ships the stock components.
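A size comparison "modulo safetensors header" can be sketched as below. A safetensors file starts with an 8-byte little-endian length, followed by a JSON header of that length, followed by the raw tensor bytes. The helper names and the in-memory demo blobs are illustrative assumptions, not the check SceneWorks actually ran.

```python
import json
import struct


def safetensors_payload_size(blob: bytes) -> int:
    """Return the number of tensor-data bytes that follow the JSON header."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    json.loads(blob[8:8 + header_len])  # raises if the header is not valid JSON
    return len(blob) - 8 - header_len


def make_blob(header: dict, payload: bytes) -> bytes:
    """Build a tiny in-memory safetensors-style file for demonstration."""
    raw = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw + payload


payload = bytes(16)  # four f32 zeros
entry = {"dtype": "F32", "shape": [4], "data_offsets": [0, 16]}
a = make_blob({"w": entry}, payload)
b = make_blob({"w": entry, "__metadata__": {"format": "pt"}}, payload)

# Total sizes differ (different headers) while the tensor payloads match.
print(len(a) == len(b), safetensors_payload_size(a) == safetensors_payload_size(b))
```

For multi-gigabyte files, reading only the first 8 bytes and subtracting from the on-disk size avoids loading the whole file.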
Contents (turnkey MLX snapshot)
| file | source | loader | notes |
|---|---|---|---|
| `dit.safetensors` | converted | `Scail2Dit` | SCAIL-2 14B DiT, Q4 packed (attn + FFN) + dense bf16 (embeds/norms/head), ~8.9 GB |
| `vae.safetensors` | converted | `WanVae` | Wan2.1 z16 VAE, f32, stride (4,8,8) (~0.5 GB) |
| `t5_encoder.safetensors` | converted | `Umt5Encoder` | UMT5-XXL encoder, bf16 (~11 GB) |
| `clip.safetensors` | converted | `ScailClip` | open-CLIP ViT-H/14 visual tower, f32, 1280-dim (~2.5 GB) |
| `tokenizer.json` | upstream, stock | `load_tokenizer` | UMT5-XXL HF tokenizer (root copy) |
| `config.json` | upstream `configs/config-14b.json` + `quantization` block | `Scail2Config` | `model_type: i2v`, dim 5120, ffn 13824, 40 layers/heads, in_dim 20, mask_dim 28, out_dim 16; `quantization: {bits 4, group_size 64}` |
| `bias-aware-dpo-lora.pt` | upstream, stock | `mlx_gen_scail2` (sc-5451) | optional Bias-Aware DPO refinement LoRA |
The DiT ships pre-quantized to Q4 on disk (the SceneWorks worker default), so the loader reads the packs directly and there is no dense-bf16 load transient. The VAE / UMT5 / CLIP ship dense (f32 / bf16). This repo ships only the loadable safetensors + tokenizer + the optional DPO LoRA; the redundant raw upstream pickles (`Wan2.1_VAE.pth`, `umt5-xxl/models_t5_...pth`, `models_clip_...onlyvisual.pth`) have been pruned, since they are reproducible from the upstream release and the Rust loaders never used them.
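To make the Q4 pack layout concrete, here is a minimal pure-Python sketch of group-wise affine quantization with 4-bit codes packed into 32-bit words. It illustrates the general technique (`w ≈ scale * code + bias` per group of 64), not MLX's exact kernel; the bit order inside each word and all function names are assumptions.

```python
def quantize_group(values, bits=4):
    """Affine-quantize one group: w ~ scale * code + bias, codes in [0, 2**bits - 1]."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def pack_u32(codes, bits=4):
    """Pack codes into 32-bit words, lowest bits first (assumed order)."""
    per_word = 32 // bits
    words = []
    for i in range(0, len(codes), per_word):
        word = 0
        for j, code in enumerate(codes[i:i + per_word]):
            word |= code << (bits * j)
        words.append(word)
    return words


def unpack_u32(words, count, bits=4):
    """Inverse of pack_u32: recover `count` codes from the packed words."""
    mask = (1 << bits) - 1
    per_word = 32 // bits
    return [(words[i // per_word] >> (bits * (i % per_word))) & mask for i in range(count)]


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights from codes plus the per-group scale and bias."""
    return [scale * c + bias for c in codes]


group = [i / 63 - 0.5 for i in range(64)]  # one group of 64 weights in [-0.5, 0.5]
codes, scale, bias = quantize_group(group)
words = pack_u32(codes)
restored = dequantize_group(unpack_u32(words, len(codes)), scale, bias)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(len(words), max_err <= scale / 2 + 1e-12)
```

A group of 64 weights becomes 8 packed words plus one scale and one bias. If the scale and bias are each stored in 16 bits (an assumption), that is 4 + 32/64 = 4.5 bits per weight.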
Architecture (summary)
Wan2.1-14B I2V dense DiT. Conditioning is a token-axis packed stream (reference + video + pose, patch-embedded by three Conv3d stems, with additive 28-channel color-coded mask embeddings, concatenated into one self-attention sequence) plus a per-source RoPE with integer T/H/W shifts (the `replace_flag` flips the reference H-shift, toggling animation vs. replacement). The reference image is encoded by the CLIP visual tower and injected via Wan-I2V image cross-attention. Sampling is plain CFG (guide 5.0), flow-matching UniPC/DPM++.
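As a small illustration of the guidance step, the sketch below applies the standard classifier-free guidance combination with a guide scale of 5.0. It assumes "plain CFG" refers to this standard formula; the function name and the toy vectors are illustrative only.

```python
def cfg_combine(cond, uncond, guide=5.0):
    """Classifier-free guidance: uncond + guide * (cond - uncond), element-wise."""
    return [u + guide * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 0.5, -0.25]    # model output with the text prompt
uncond = [0.5, 0.5, 0.25]   # model output with the empty/negative prompt
print(cfg_combine(cond, uncond))  # [3.0, 0.5, -2.25]
```

With `guide=1.0` the result is the conditional prediction unchanged; larger values push further away from the unconditional one, at the cost of two DiT evaluations per sampling step.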
Runtime (Apple Silicon)
The production default, 832×480 / 5 s (one 81-frame driving segment), runs the DiT in f32 compute (bf16 overflows to NaN at that packed-sequence length), with shared FFN/attention activation chunking and a temporal-tiled VAE decode, at a measured process footprint of ~70 to 76 GB. SceneWorks gates SCAIL-2 to 96 GB-class Macs. The Q4 DiT keeps the resident weights and the snapshot download lean (≈ 24 GB total).
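For a rough sense of the packed-sequence length at the production default, the sketch below counts tokens from the VAE stride (4,8,8) listed above. It relies on two assumptions that this card does not state: the Wan2.1 patch size of (1,2,2), and the pose stream being tokenized on the same grid as the video. Treat the result as an estimate, not a measured figure.

```python
def latent_grid(frames, height, width, stride=(4, 8, 8)):
    """Latent (T, H, W) for a causal video VAE: first frame kept, then one per stride[0]."""
    st, sh, sw = stride
    return ((frames - 1) // st + 1, height // sh, width // sw)


def token_count(grid, patch=(1, 2, 2)):
    """Number of DiT tokens after patch-embedding a latent grid (assumed patch size)."""
    t, h, w = grid
    pt, ph, pw = patch
    return (t // pt) * (h // ph) * (w // pw)


video = latent_grid(81, 480, 832)      # one 81-frame driving segment
reference = latent_grid(1, 480, 832)   # single reference image
video_tokens = token_count(video)
reference_tokens = token_count(reference)
# Assumption: the pose stream contributes as many tokens as the video stream.
packed = reference_tokens + 2 * video_tokens
print(video, video_tokens, reference_tokens, packed)
```

Under these assumptions the latent grid is (21, 60, 104), giving 32760 video tokens, 1560 reference tokens, and about 67080 tokens in the packed self-attention sequence.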
License & attribution
This repackaging redistributes upstream weights under the license declared on the upstream model card (MIT); the upstream code repository is Apache-2.0. Please consult and cite the original:
- Model: https://huggingface.co/zai-org/SCAIL-2
- Code: https://github.com/zai-org/SCAIL-2
- Paper: SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning (arXiv:2606.10804)
- Built on Wan2.1 (Alibaba Wan team), UMT5-XXL, and OpenCLIP.
All credit for the model belongs to the original authors. This repo exists solely to make SCAIL-2 usable in SceneWorks on Apple Silicon.