How to use from the Diffusers library
pip install -U diffusers transformers accelerate
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image, export_to_video

# switch to "mps" for apple devices
pipe = DiffusionPipeline.from_pretrained("SceneWorks/scail2-mlx", dtype=torch.bfloat16, device_map="cuda")
pipe.to("cuda")

prompt = "A man with short gray hair plays a red electric guitar."
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/guitar-man.png"
)

output = pipe(image=image, prompt=prompt).frames[0]
export_to_video(output, "output.mp4")

SceneWorks/scail2-mlx

Turnkey, SceneWorks-converted weights of zai-org/SCAIL-2 (an end-to-end controlled character-animation / motion-transfer video model), packaged for native Apple-Silicon (MLX) inference inside SceneWorks. This is not an original model; it is a format/dtype repackaging of the upstream release for first-class macOS use (no PyTorch at runtime).

Capabilities (from upstream): character animation from a reference image + driving video, cross-identity character replacement, zero-shot animal-driving, end-to-end and pose-rendered driving, and (experimental) multi-reference. Image output is num_frames == 1.

What changed vs. upstream

Every component is repackaged to the safetensors layout the SceneWorks Rust/MLX loaders consume, so no PyTorch is needed at runtime:

  • DiT (model/1/fsdp2_rank_0000_checkpoint.pt, an FSDP2/SAT checkpoint) was key-remapped to the SCAIL2Model parameter naming using the upstream convert.py contract (fused query_key_value → q/k/v, key_value → k/v, clip_feature_key_value_list → k_img/v_img), cast fp32 → bf16, then pre-quantized to group-wise-affine Q4 on disk → dit.safetensors. The attention (q/k/v/o + I2V k_img/v_img) and FFN (ffn.0/ffn.2) Linears are packed (weight u32 codes + scales + biases via MLX quantize, byte-equal to nn.quantize, group size 64); the patch/text/time/image embeddings, norms, and output head stay dense bf16. A config.json quantization block marks the snapshot so the loader builds the quantized Linears directly from the packs (no dense bf16 materialized at load). Bit-faithful key remap (987 source keys → 1307 model keys; exact key+shape match against SCAIL2Model.from_config(config-14b.json)).
  • VAE (Wan2.1_VAE.pth, the stock Wan2.1 z16 VAE) → vae.safetensors (f32, channels-last conv transpose, keys unchanged; the sanitize_wan_vae_weights contract shared with Bernini/wan). Loaded by mlx_gen_wan::WanVae.
  • Text encoder (umt5-xxl/models_t5_umt5-xxl-enc-bf16.pth, stock UMT5-XXL) → t5_encoder.safetensors (bf16, sole rename .ffn.gate.0. → .ffn.gate_proj.). Loaded by mlx_gen_wan::Umt5Encoder with tokenizer.json.
  • Image encoder (models_clip_...onlyvisual.pth, open-CLIP XLM-RoBERTa ViT-H/14) → clip.safetensors (f32, de-prefixed visual.* keys). Loaded by mlx_gen_scail2::ScailClip (32-layer visual tower, use_31_block penultimate features).
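The two simplest renames in the list above (the text-encoder gate rename and the CLIP de-prefixing) can be sketched in plain Python. This is an illustration only, not the converter SceneWorks used: the example key names are hypothetical, and reading "de-prefixed" as "drop a leading visual." is an assumption.

```python
def rename_t5_key(key: str) -> str:
    """Apply the single rename named in the text: .ffn.gate.0. -> .ffn.gate_proj."""
    return key.replace(".ffn.gate.0.", ".ffn.gate_proj.")


def strip_visual_prefix(key: str) -> str:
    """Drop a leading 'visual.' (assumed meaning of 'de-prefixed')."""
    prefix = "visual."
    return key[len(prefix):] if key.startswith(prefix) else key


def remap(state: dict, rename) -> dict:
    """Rename every key, refusing to silently merge two tensors into one key."""
    out = {}
    for key, tensor in state.items():
        new_key = rename(key)
        if new_key in out:
            raise KeyError(f"collision on {new_key!r}")
        out[new_key] = tensor
    return out


# Hypothetical key names, chosen only to exercise the two rules.
t5_example = {
    "blocks.0.ffn.gate.0.weight": "W_gate",
    "blocks.0.ffn.fc1.weight": "W_fc1",
}
clip_example = {"visual.transformer.0.attn.weight": "W_attn"}

t5_remapped = remap(t5_example, rename_t5_key)
clip_remapped = remap(clip_example, strip_visual_prefix)
```

The collision check is a design choice worth keeping in any remap of this kind: a rename that maps two source keys onto one target would otherwise drop a tensor without warning.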

The converted VAE/UMT5 are byte-size-identical (modulo safetensors header) to Bernini/wan's already-validated Wan2.1 VAE + umt5-xxl safetensors, confirming that SCAIL-2 ships the stock components.

Contents (turnkey MLX snapshot)

| file | source | loader | notes |
| --- | --- | --- | --- |
| dit.safetensors | converted | Scail2Dit | SCAIL-2 14B DiT, Q4 packed (attn + FFN) + dense bf16 (embeds/norms/head), ~8.9 GB |
| vae.safetensors | converted | WanVae | Wan2.1 z16 VAE, f32, stride (4,8,8) (~0.5 GB) |
| t5_encoder.safetensors | converted | Umt5Encoder | UMT5-XXL encoder, bf16 (~11 GB) |
| clip.safetensors | converted | ScailClip | open-CLIP ViT-H/14 visual tower, f32, 1280-dim (~2.5 GB) |
| tokenizer.json | upstream, stock | load_tokenizer | UMT5-XXL HF tokenizer (root copy) |
| config.json | upstream configs/config-14b.json + quantization block | Scail2Config | model_type: i2v, dim 5120, ffn 13824, 40 layers/heads, in_dim 20, mask_dim 28, out_dim 16; quantization: {bits 4, group_size 64} |
| bias-aware-dpo-lora.pt | upstream, stock | mlx_gen_scail2 (sc-5451) | optional Bias-Aware DPO refinement LoRA |
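As a quick consistency check on the config.json values listed above, the sketch below derives the per-head width and confirms that the group size divides both Linear widths. The JSON key names are hypothetical guesses made for illustration; only the numeric values come from this card.

```python
import json

# Hypothetical config.json layout: the key names are illustrative guesses,
# only the values are taken from the contents listing.
CONFIG_TEXT = """
{
  "model_type": "i2v",
  "dim": 5120,
  "ffn_dim": 13824,
  "num_layers": 40,
  "num_heads": 40,
  "in_dim": 20,
  "mask_dim": 28,
  "out_dim": 16,
  "quantization": {"bits": 4, "group_size": 64}
}
"""

config = json.loads(CONFIG_TEXT)
quant = config["quantization"]

# Attention heads must split the model width evenly.
assert config["dim"] % config["num_heads"] == 0
head_dim = config["dim"] // config["num_heads"]

# Group-wise quantization needs each row to split into whole groups.
groups_per_row = {}
for name in ("dim", "ffn_dim"):
    assert config[name] % quant["group_size"] == 0
    groups_per_row[name] = config[name] // quant["group_size"]
```

With these values the head width comes out to 128, and a row of width 5120 or 13824 splits into 80 or 216 groups of 64 respectively.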

The DiT ships pre-quantized to Q4 on disk (the SceneWorks worker default), so the loader reads the packs directly; there is no dense-bf16 load transient. The VAE / UMT5 / CLIP ship dense (f32 / bf16). This repo ships only the loadable safetensors + tokenizer + the optional DPO LoRA; the redundant raw upstream pickles (Wan2.1_VAE.pth, umt5-xxl/models_t5_...pth, models_clip_...onlyvisual.pth) have been pruned: they are reproducible from the upstream release and the Rust loaders never used them.
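To make "group-wise-affine Q4" concrete, here is a minimal pure-Python sketch of the idea: each group of 64 weights is stored as 4-bit codes plus one scale and one bias, and eight codes are packed into each 32-bit word. It is not the MLX implementation and is not expected to be byte-compatible with it; the min/max scaling rule and the lowest-nibble-first packing order are assumptions made for illustration.

```python
import random

BITS = 4
GROUP_SIZE = 64
LEVELS = (1 << BITS) - 1  # 15 steps between a group's min and max


def quantize_group(values):
    """Map one group to 4-bit codes plus a (scale, bias) pair."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / LEVELS or 1.0  # constant group: any non-zero scale works
    codes = [min(LEVELS, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Affine reconstruction: w ~= code * scale + bias."""
    return [c * scale + bias for c in codes]


def pack_u32(codes):
    """Pack eight 4-bit codes per 32-bit word, lowest nibble first (assumed order)."""
    words = []
    for i in range(0, len(codes), 8):
        word = 0
        for j, c in enumerate(codes[i:i + 8]):
            word |= c << (BITS * j)
        words.append(word)
    return words


def unpack_u32(words, count):
    """Inverse of pack_u32 for the first `count` codes."""
    codes = [(w >> (BITS * j)) & LEVELS for w in words for j in range(8)]
    return codes[:count]


random.seed(0)
group = [random.uniform(-1.0, 1.0) for _ in range(GROUP_SIZE)]
codes, scale, bias = quantize_group(group)
packed = pack_u32(codes)
restored = dequantize_group(unpack_u32(packed, GROUP_SIZE), scale, bias)
max_error = max(abs(a - b) for a, b in zip(group, restored))
```

Rounding to the nearest code keeps the reconstruction error within half a quantization step per weight. If the scale and bias were each stored in 16 bits (an assumption, not stated in this card), the storage cost would be 4 + 32/64 = 4.5 bits per packed weight.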

Architecture (summary)

Wan2.1-14B I2V dense DiT. Conditioning is a token-axis packed stream: reference + video + pose patch-embedded (three Conv3d stems) with additive 28-channel color-coded mask embeddings, concatenated into one self-attention sequence, plus a per-source RoPE with integer T/H/W shifts (the replace_flag flips the reference H-shift, toggling animation vs. replacement). The reference image is encoded by the CLIP visual tower and injected via Wan-I2V image cross-attention. Sampling is plain CFG (guide 5.0), flow-matching UniPC/DPM++.
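For readers unfamiliar with the term, "plain CFG" is the standard classifier-free guidance blend of a conditional and an unconditional model prediction. The sketch below shows that formula with the guide value 5.0 on toy lists; the function name is hypothetical, and which conditioning inputs are dropped for the unconditional branch is not stated in this card.

```python
def cfg_combine(cond, uncond, guide=5.0):
    """Standard classifier-free guidance: uncond + guide * (cond - uncond)."""
    return [u + guide * (c - u) for c, u in zip(cond, uncond)]


# Toy per-element predictions standing in for the DiT outputs.
cond = [1.0, 2.0, -1.0]
uncond = [0.5, 2.0, 0.0]

guided = cfg_combine(cond, uncond)        # guide = 5.0
identity = cfg_combine(cond, uncond, 1.0)  # guide = 1.0 reproduces cond
```

Each sampling step costs two DiT forward passes under this scheme, one per branch.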

Runtime (Apple Silicon)

The production default, 832×480 / 5 s (one 81-frame driving segment), runs the DiT in f32 compute (bf16 overflows to NaN at that packed-sequence length), with shared FFN/attention activation chunking and a temporal-tiled VAE decode, at a measured process footprint of ~70–76 GB. SceneWorks gates SCAIL-2 to 96 GB-class Macs. The Q4 DiT keeps the resident weights and the snapshot download lean (≈ 24 GB total).
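A rough sense of the sequence length involved can be had from the VAE stride (4,8,8) listed in the contents. The sketch below computes the latent grid for one 81-frame 832×480 segment and a per-stream token count. Two things are assumptions rather than facts from this card: that the temporal axis keeps the first frame and compresses the rest by 4, and that the DiT patch size is (1,2,2) as in stock Wan2.1. The packed sequence concatenates several streams, so its real length is larger than the single-stream figure.

```python
def latent_grid(frames, height, width, stride=(4, 8, 8)):
    """Latent (T, H, W) for a clip, assuming causal temporal compression."""
    st, sh, sw = stride
    t = (frames - 1) // st + 1  # assumption: first frame kept, then 1 latent per st frames
    return t, height // sh, width // sw


def token_count(grid, patch=(1, 2, 2)):
    """Tokens in one stream for an assumed (t, h, w) patch size."""
    t, h, w = grid
    pt, ph, pw = patch
    return (t // pt) * (h // ph) * (w // pw)


grid = latent_grid(frames=81, height=480, width=832)
tokens_per_stream = token_count(grid)
```

Under these assumptions the latent grid is 21×60×104 and one video-length stream contributes 32,760 tokens before the other conditioning streams are appended.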

License & attribution

This repackaging redistributes upstream weights under the license declared on the upstream model card (MIT); the upstream code repository is Apache-2.0. Please consult and cite the original zai-org/SCAIL-2 release.

All credit for the model belongs to the original authors. This repo exists solely to make SCAIL-2 usable in SceneWorks on Apple Silicon.

