Walkyrie-1.3B-v2.0 Core ML (Unquantized)

This repository contains the first native Apple Silicon Core ML conversion of the core diffusion transformer from Walkyrie-1.3B-v2.0, a text-to-image model built on the Wan 2.1 Diffusion Transformer (DiT) architecture.

Repository Layout

Walkyrie_1.3B_v2.0_float16.mlpackage: The complete 30-block DiT transformer stack in float16, converted to run on the Apple Neural Engine (ANE) and the GPU.

Implementation & Pipeline Notes

This asset contains only the core transformer. To build a complete text-to-image pipeline inside a native Swift application, you will need to pair it with a text encoder (and its tokenizer) and a VAE decoder:

Text Encoder (UMT5-XXL): Compiling an 11B-parameter text encoder directly to a static Core ML graph is memory-intensive and can exhaust RAM on 16GB machines. It is recommended to load the UMT5 encoder weights directly and run them on the CPU/GPU via libraries like swift-tokenizers or mlx-swift.
VAE Decoder: Can be converted to Core ML using standard convolution and upsampling operations. It decodes the finished transformer latents into viewable RGB images.
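
The memory concern for the text encoder can be checked with simple arithmetic. The sketch below is only a rough estimate of the raw weight size: it assumes float16 storage (2 bytes per parameter), takes the 11B and 1.3B parameter counts from this README, and ignores the extra working memory the Core ML compiler needs. The helper name weights_gib is hypothetical.

```python
def weights_gib(num_params, bytes_per_param=2):
    """Raw weight size in GiB, assuming every parameter uses bytes_per_param bytes."""
    return num_params * bytes_per_param / 2**30


# Parameter counts as quoted in this README; float16 storage is an assumption.
encoder_gib = weights_gib(11e9)
transformer_gib = weights_gib(1.3e9)

print(f"text encoder weights: {encoder_gib:.1f} GiB")
print(f"transformer weights:  {transformer_gib:.1f} GiB")
```

Under these assumptions the encoder weights alone come to roughly 20.5 GiB, which already exceeds 16GB of unified memory before any compilation overhead, while the transformer weights are about 2.4 GiB.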

🛠️ Replication & Conversion Process

If you want to redo or modify this conversion from scratch using the silicon-alloy converter or direct coremltools tracing, you must work around several structural mismatches hardcoded into older diffusion conversion scripts.

The original codebase must be patched with the following workflow modifications:

1. Alignment with Modern Diffusers Layer Naming

The newer Wan 2.1 architecture uses updated attribute names. Legacy scripts that look up sub-modules by their old names raise AttributeError immediately unless they are updated as follows:

Change .transformer_blocks references to .blocks
Change .patch_embed references to .patch_embedding
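
If a conversion script has to work with both naming schemes, one option is to resolve the attribute at runtime instead of editing every reference. The following is a minimal sketch, not part of silicon-alloy or diffusers: the RENAMED table and the get_submodule helper are hypothetical names, and SimpleNamespace stands in for a real model object.

```python
from types import SimpleNamespace

# Legacy attribute name -> name used by the newer Wan 2.1 model class.
RENAMED = {
    "transformer_blocks": "blocks",
    "patch_embed": "patch_embedding",
}


def get_submodule(model, legacy_name):
    """Return the sub-module, trying the new name first and the legacy name second."""
    for name in (RENAMED.get(legacy_name, legacy_name), legacy_name):
        if hasattr(model, name):
            return getattr(model, name)
    raise AttributeError(f"model has neither {legacy_name!r} nor its renamed equivalent")


# Stand-in object that exposes only the new attribute names.
fake_model = SimpleNamespace(blocks=["block0", "block1"], patch_embedding="conv3d")

blocks = get_submodule(fake_model, "transformer_blocks")
patch = get_submodule(fake_model, "patch_embed")
```

A script that previously read model.transformer_blocks would call get_submodule(model, "transformer_blocks") instead and keep working against either version.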

2. Migrating to the Unified Condition Embedder

Older models embed the prompt tokens and the timesteps through separate .text_embed() and .time_embed() calls. Wan 2.1 consolidates these into a single condition_embedder module.

Remove the standalone text and time embedding calls.
Call the unified module directly: temb, timestep_proj, encoder_hidden_states, _ = self.model.condition_embedder(timestep, encoder_hidden_states, None)
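Unflatten the resulting timestep projection into six chunks along dimension 1 before passing it along (the transformer blocks consume it as six separate modulation vectors, not as attention heads): timestep_proj = timestep_proj.unflatten(1, (6, -1))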
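
To make the effect of the unflatten call concrete, here is a dependency-free illustration in plain Python. It mimics what unflatten(1, (6, -1)) does to each batch row using nested lists; it is a shape demonstration only, and the helper name unflatten_rows and the toy hidden size of 4 are assumptions made for the example.

```python
def unflatten_rows(batch, groups=6):
    """Split each flat row into `groups` equal chunks, like tensor.unflatten(1, (groups, -1))."""
    out = []
    for row in batch:
        if len(row) % groups != 0:
            raise ValueError(f"row length {len(row)} is not divisible by {groups}")
        width = len(row) // groups
        out.append([row[i * width:(i + 1) * width] for i in range(groups)])
    return out


# Toy input: batch of 1, hidden size 4, so the flat projection holds 6 * 4 = 24 values.
flat = [list(range(24))]
modulation = unflatten_rows(flat)

# Shape goes from [1, 24] to [1, 6, 4].
print(len(modulation), len(modulation[0]), len(modulation[0][0]))
```

Each of the six chunks keeps its values contiguous, so the second chunk of the toy input is [4, 5, 6, 7].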

3. Spatial Tensor Flattening vs. 5D RoPE Tracking

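The patch embedding layer outputs a 5D tensor laid out as [Batch, Hidden_Dim, Frames, Height, Width]. The transformer blocks, however, expect a flattened 3D token sequence [Batch, Sequence_Length, Hidden_Dim]. Crucially, the Rotary Position Embedding (.rope) module still requires a 5D layout to calculate coordinates, so it must be called before flattening. Note that in the upstream diffusers forward pass, .rope() receives the 5D latent input before patch embedding, because the module divides the frame, height, and width sizes by the patch size itself; check which 5D tensor your wrapper passes to it.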

The correct execution sequence:
First, pass the 5D tensor into the .rope() module to obtain the rotary embeddings: image_rotary_emb = self.model.rope(hidden_states_5d)
Second, flatten and transpose the 5D tensor into a token sequence, right before the loop over the core transformer blocks: hidden_states = hidden_states_5d.flatten(2).transpose(1, 2)
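
The reason the order matters can be seen from the shapes alone. The sketch below uses plain Python tuples rather than tensors and only tracks dimensions; the helper name flattened_token_shape and the example sizes are illustrative assumptions, not values read from the model.

```python
def flattened_token_shape(shape_5d):
    """Shape after .flatten(2).transpose(1, 2) on a [B, C, F, H, W] tensor."""
    batch, hidden, frames, height, width = shape_5d
    # flatten(2) merges F, H, W into one axis; transpose(1, 2) moves it ahead of C.
    return (batch, frames * height * width, hidden)


# Two different spatial grids with the same number of tokens (illustrative sizes).
landscape = (1, 1536, 1, 30, 52)
portrait = (1, 1536, 1, 52, 30)

print(flattened_token_shape(landscape))
print(flattened_token_shape(portrait))
```

Both grids flatten to the same [1, 1560, 1536] shape, so once the tensor is flattened the frame, height, and width sizes can no longer be recovered from it. That is why the rotary embeddings have to be computed while the 5D layout is still available.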

Acknowledgements

Original model weights trained and released by kpsss34.
Core ML compilation achieved via the silicon-alloy framework.
