SceneWorks/scail2-mlx

Turnkey, SceneWorks-converted weights of zai-org/SCAIL-2, an end-to-end controlled character-animation / motion-transfer video model, packaged for native Apple-Silicon (MLX) inference inside SceneWorks. This is not an original model; it is a format/dtype repackaging of the upstream release for first-class macOS use (no PyTorch at runtime).

Capabilities (from upstream): character animation from a reference image + driving video, cross-identity character replacement, zero-shot animal-driving, end-to-end and pose-rendered driving, and (experimental) multi-reference. Image output is num_frames == 1.

What changed vs. upstream

Every component is repackaged to the safetensors layout the SceneWorks Rust/MLX loaders consume, with no PyTorch at runtime:

  • DiT (model/1/fsdp2_rank_0000_checkpoint.pt, an FSDP2/SAT checkpoint) was key-remapped to the SCAIL2Model parameter naming using the upstream convert.py contract (fused query_key_value→q/k/v, key_value→k/v, clip_feature_key_value_list→k_img/v_img), cast fp32 → bf16, then pre-quantized to group-wise-affine Q4 on disk → dit.safetensors. The attention (q/k/v/o + I2V k_img/v_img) and FFN (ffn.0/ffn.2) Linears are packed (weight u32 codes + scales + biases via MLX quantize, byte-equal to nn.quantize, group size 64); the patch/text/time/image embeddings, norms, and output head stay dense bf16. A config.json quantization block marks the snapshot so the loader builds the quantized Linears directly from the packs (no dense bf16 is materialized at load). The key remap is bit-faithful (987 source keys → 1307 model keys; exact key+shape match against SCAIL2Model.from_config(config-14b.json)).
  • VAE (Wan2.1_VAE.pth, the stock Wan2.1 z16 VAE) → vae.safetensors (f32, channels-last conv transpose, keys unchanged; this is the sanitize_wan_vae_weights contract shared with Bernini/wan). Loaded by mlx_gen_wan::WanVae.
  • Text encoder (umt5-xxl/models_t5_umt5-xxl-enc-bf16.pth, stock UMT5-XXL) → t5_encoder.safetensors (bf16, sole rename .ffn.gate.0.→.ffn.gate_proj.). Loaded by mlx_gen_wan::Umt5Encoder with tokenizer.json.
  • Image encoder (models_clip_...onlyvisual.pth, open-CLIP XLM-RoBERTa ViT-H/14) → clip.safetensors (f32, de-prefixed visual.* keys). Loaded by mlx_gen_scail2::ScailClip (32-layer visual tower, use_31_block penultimate features).
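
The fused-projection split described in the DiT bullet can be pictured with a minimal sketch. It assumes each fused weight stacks its parts as equal row blocks in the listed order, and every key name below (for example blocks.0.attn.query_key_value.weight) is hypothetical; the authoritative layout is whatever the upstream convert.py defines.

```python
# Hypothetical table: fused module name -> names of the split projections.
# The row-block order is an assumption, not taken from convert.py.
FUSED = {
    "query_key_value": ("q", "k", "v"),
    "key_value": ("k", "v"),
    "clip_feature_key_value_list": ("k_img", "v_img"),
}


def split_rows(rows, parts):
    """Split a weight (a list of rows) into `parts` equal row blocks."""
    if len(rows) % parts != 0:
        raise ValueError("row count is not divisible by the number of parts")
    step = len(rows) // parts
    return [rows[i * step:(i + 1) * step] for i in range(parts)]


def remap(state):
    """Return a new state dict with every fused tensor split and renamed."""
    out = {}
    for key, rows in state.items():
        path = key.split(".")
        fused = next((p for p in path if p in FUSED), None)
        if fused is None:
            out[key] = rows  # non-fused tensors pass through unchanged
            continue
        targets = FUSED[fused]
        for name, block in zip(targets, split_rows(rows, len(targets))):
            new_key = ".".join(name if p == fused else p for p in path)
            out[new_key] = block
    return out
```

Because each fused tensor becomes two or three separate tensors, the key count grows during the remap, which is the same direction as the 987 → 1307 change reported above.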

The converted VAE/UMT5 are byte-size-identical (modulo safetensors header) to Bernini/wan's already-validated Wan2.1 VAE + umt5-xxl safetensors, confirming that SCAIL-2 ships the stock components.
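
A size comparison "modulo safetensors header" could be done along the following lines. This is a sketch rather than the check that was actually run: it relies only on the published safetensors layout (an 8-byte little-endian header length, a JSON header, then the raw tensor bytes), and the function name payload_size is an illustrative choice.

```python
import io
import json
import struct


def payload_size(stream):
    """Return the number of tensor-data bytes in a safetensors stream.

    `stream` is any seekable binary file object. Only the header is read,
    so this stays cheap even for multi-gigabyte files.
    """
    stream.seek(0)
    (header_len,) = struct.unpack("<Q", stream.read(8))
    json.loads(stream.read(header_len))  # raises if the header is not JSON
    end = stream.seek(0, io.SEEK_END)
    return end - 8 - header_len


def make_blob(header, data):
    """Build a tiny in-memory safetensors image for demonstration."""
    raw = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw + data
```

Two files whose headers differ (for instance in metadata or key order) but whose payload sizes match will differ in total length while returning the same payload_size.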

Contents (turnkey MLX snapshot)

| file | source | loader | notes |
|---|---|---|---|
| dit.safetensors | converted | Scail2Dit | SCAIL-2 14B DiT, Q4 packed (attn + FFN) + dense bf16 (embeds/norms/head), ~8.9 GB |
| vae.safetensors | converted | WanVae | Wan2.1 z16 VAE, f32, stride (4,8,8) (~0.5 GB) |
| t5_encoder.safetensors | converted | Umt5Encoder | UMT5-XXL encoder, bf16 (~11 GB) |
| clip.safetensors | converted | ScailClip | open-CLIP ViT-H/14 visual tower, f32, 1280-dim (~2.5 GB) |
| tokenizer.json | upstream, stock | load_tokenizer | UMT5-XXL HF tokenizer (root copy) |
| config.json | upstream configs/config-14b.json + quantization block | Scail2Config | model_type: i2v, dim 5120, ffn 13824, 40 layers/heads, in_dim 20, mask_dim 28, out_dim 16; quantization: {bits 4, group_size 64} |
| bias-aware-dpo-lora.pt | upstream, stock | mlx_gen_scail2 (sc-5451) | optional Bias-Aware DPO refinement LoRA |
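
To make the role of the quantization block concrete, here is a small sketch of how a loader might branch on it. The real logic lives in the Rust Scail2Config loader and is not shown in this card; the function name quantization_plan and the returned dictionary are hypothetical, and only the quantization, bits, and group_size keys are taken from the notes above.

```python
import json


def quantization_plan(config_text):
    """Return packing parameters if the config marks a quantized snapshot.

    Returns None when no quantization block is present, in which case a
    loader would fall back to dense weights.
    """
    cfg = json.loads(config_text)
    block = cfg.get("quantization")
    if block is None:
        return None
    bits = int(block["bits"])
    group_size = int(block["group_size"])
    if bits <= 0 or 32 % bits != 0:
        raise ValueError("bits must divide 32 to pack codes into u32 words")
    return {
        "bits": bits,
        "group_size": group_size,
        "codes_per_u32": 32 // bits,  # 8 codes per word at 4 bits
    }
```

With the values listed for config.json (bits 4, group_size 64), each u32 word would hold eight weight codes and each group of 64 weights would share one scale and one bias.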

The DiT ships pre-quantized to Q4 on disk (the SceneWorks worker default), so the loader reads the packs directly; there is no dense-bf16 load transient. The VAE / UMT5 / CLIP ship dense (f32 / bf16). This repo ships only the loadable safetensors + tokenizer + the optional DPO LoRA; the redundant raw upstream pickles (Wan2.1_VAE.pth, umt5-xxl/models_t5_...pth, models_clip_...onlyvisual.pth) have been pruned, since they are reproducible from the upstream release and the Rust loaders never used them.
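
For readers unfamiliar with group-wise affine Q4, the following is a simplified sketch of the idea: each group of weights is mapped to 4-bit codes with one scale and one bias, and eight codes are packed per u32 word. It is illustrative only and is not byte-equal to MLX quantize; the low-bits-first packing order and the 16-bit scale/bias width used in bits_per_weight are assumptions.

```python
def quantize_group(values, bits=4):
    """Map one group of floats to integer codes plus (scale, bias)."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats: w ~= code * scale + bias."""
    return [c * scale + bias for c in codes]


def pack_u32(codes, bits=4):
    """Pack codes into 32-bit words, first code in the lowest bits."""
    per_word = 32 // bits
    words = []
    for i in range(0, len(codes), per_word):
        word = 0
        for j, code in enumerate(codes[i:i + per_word]):
            word |= code << (j * bits)
        words.append(word)
    return words


def unpack_u32(words, count, bits=4):
    """Inverse of pack_u32; `count` trims padding in the last word."""
    per_word = 32 // bits
    mask = (1 << bits) - 1
    codes = [(w >> (j * bits)) & mask for w in words for j in range(per_word)]
    return codes[:count]


def bits_per_weight(bits=4, group_size=64, meta_bits=16):
    """Effective storage cost: codes plus one scale and one bias per group."""
    return bits + 2 * meta_bits / group_size
```

Under these assumptions a group of 64 weights occupies eight u32 words plus its scale and bias, or about 4.5 bits per weight, and the reconstruction error per weight is bounded by half a quantization step.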

Architecture (summary)

Wan2.1-14B I2V dense DiT. Conditioning is a token-axis packed stream: reference + video + pose are patch-embedded (three Conv3d stems) with additive 28-channel color-coded mask embeddings and concatenated into one self-attention sequence. Each source also gets its own RoPE with integer T/H/W shifts (the replace_flag flips the reference H-shift, toggling animation vs. replacement). The reference image is encoded by the CLIP visual tower and injected via Wan-I2V image cross-attention. Sampling is plain CFG (guide 5.0), flow-matching UniPC/DPM++.
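
"Plain CFG" is taken here to mean the standard classifier-free guidance combination of a conditional and an unconditional prediction. The sketch below shows that formula on flat lists of floats; the function name is hypothetical, and the actual sampler operates on latent tensors inside the Rust/MLX pipeline.

```python
def cfg_combine(cond, uncond, guide=5.0):
    """Standard classifier-free guidance: uncond + guide * (cond - uncond)."""
    if len(cond) != len(uncond):
        raise ValueError("predictions must have the same length")
    return [u + guide * (c - u) for c, u in zip(cond, uncond)]
```

A guide value of 1.0 reproduces the conditional prediction, while larger values push the result further along the direction from the unconditional to the conditional prediction.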

Runtime (Apple Silicon)

The production default, 832×480 / 5 s (one 81-frame driving segment), runs the DiT in f32 compute (bf16 overflows to NaN at that packed-sequence length), with shared FFN/attention activation chunking and a temporal-tiled VAE decode, at a measured process footprint of ~70–76 GB. SceneWorks gates SCAIL-2 to 96 GB-class Macs. The Q4 DiT keeps the resident weights and the snapshot download lean (≈ 24 GB total).
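
To give a feel for the sequence length involved, the sketch below derives the latent grid for the production default from the VAE stride (4,8,8) listed in the contents table. Two things are assumptions based on common Wan2.1 conventions rather than on this card: the causal temporal rule (frames - 1) / 4 + 1 and the (1,2,2) patch size. The count covers a single video stream only; the reference and pose tokens in the packed sequence come on top of it.

```python
def latent_shape(frames, height, width, stride=(4, 8, 8)):
    """Latent (T, H, W) for a clip, assuming a causal temporal stride."""
    st, sh, sw = stride
    if (frames - 1) % st or height % sh or width % sw:
        raise ValueError("clip size is not aligned to the VAE stride")
    return ((frames - 1) // st + 1, height // sh, width // sw)


def token_count(latent, patch=(1, 2, 2)):
    """Tokens for one stream after patch embedding (patch size assumed)."""
    t, h, w = latent
    pt, ph, pw = patch
    return (t // pt) * (h // ph) * (w // pw)


# One 81-frame segment at 832x480 (width x height).
shape = latent_shape(frames=81, height=480, width=832)
tokens = token_count(shape)
```

Under these assumptions the video stream alone contributes tens of thousands of tokens before the reference and pose streams are concatenated, which gives some context for the activation chunking and memory footprint described above.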

License & attribution

This repackaging redistributes upstream weights under the license declared on the upstream model card (MIT); the upstream code repository is Apache-2.0. Please consult and cite the original release, zai-org/SCAIL-2.

All credit for the model belongs to the original authors. This repo exists solely to make SCAIL-2 usable in SceneWorks on Apple Silicon.
