Qwen3-Embedding-0.6B: Core AI export
Qwen/Qwen3-Embedding-0.6B as a single static Core AI graph for macOS 27 / iOS 27: the full sentence-transformers pipeline (Qwen3-0.6B backbone → last-token pooling → L2 normalize) runs in-graph, so one call returns a normalized, MRL-truncatable 1024-d embedding. Multilingual (incl. Japanese) and instruction-aware, for on-device semantic search / RAG.
This is an encoder: one forward pass over the (right-padded) input yields one pooled vector. No
autoregressive loop, no KV cache, no LM head. It runs as a plain .aimodel via AIModel.run
(like the vision encoders), not the pipelined generate engine.
Graph contract
| | name | shape | dtype |
|---|---|---|---|
| input | input_ids | [1, 512] | int32 (right-padded; pad id 151643) |
| input | attention_mask | [1, 512] | int32 (1 = real token, 0 = padding) |
| output | embedding | [1, 1024] | fp16, L2-normalized |
The grid (512) is an export-time choice; a smaller grid is proportionally faster for short queries. Last-token pooling under the causal mask is right-pad safe (real tokens never attend to trailing pads), so the host just right-pads to the grid.
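The padding contract above can be sketched on the host with plain NumPy. This is a minimal illustration, not the exporter's code: the helper names `pad_to_grid` and `last_token_index` are hypothetical, the token ids are placeholders, and `last_token_index` only mirrors what the in-graph pooling is described as doing (picking the last real token of a right-padded row).

```python
import numpy as np

PAD_ID = 151643  # pad id from the graph contract
GRID = 512       # export-time sequence length

def pad_to_grid(token_ids, grid=GRID, pad_id=PAD_ID):
    """Right-pad (or truncate) a list of token ids to the static grid."""
    ids = list(token_ids)[:grid]           # truncate longer inputs
    n = len(ids)
    input_ids = np.full((1, grid), pad_id, dtype=np.int32)
    input_ids[0, :n] = ids                 # real tokens first, pads after
    attention_mask = np.zeros((1, grid), dtype=np.int32)
    attention_mask[0, :n] = 1              # 1 = real token, 0 = padding
    return input_ids, attention_mask

def last_token_index(attention_mask):
    """Index of the last real token per row (valid for right-padded rows)."""
    return attention_mask.sum(axis=1) - 1

# Placeholder ids, not real tokenizer output.
input_ids, attention_mask = pad_to_grid([101, 102, 103])
pooled_position = last_token_index(attention_mask)
```

Because the pads trail the real tokens, the pooled position depends only on the mask sum, which is why right-padding to a fixed grid does not change the result.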
Host recipe (everything else is in-graph)
- Query: prepend the instruction prefix
  Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:
- Document: no prefix.
- Tokenize, right-pad to 512 (truncate longer text). Run → 1024-d unit vector.
- Similarity = cosine = dot product (vectors are unit-norm).
- Matryoshka (MRL): to shrink, take the first D dims (32 ≤ D ≤ 1024) and re-L2-normalize on the host. Rankings are preserved down to 256; verified to 128.
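The Matryoshka step in the recipe amounts to a slice followed by a re-normalization. A minimal NumPy sketch, assuming the embedding arrives as a 1-D or batched array; the function name `truncate_mrl` is illustrative, and the random vector stands in for a real model output:

```python
import numpy as np

def truncate_mrl(embedding, dim):
    """Keep the first `dim` components and re-L2-normalize."""
    if not 32 <= dim <= 1024:
        raise ValueError("dim must be in [32, 1024]")
    # Cast up from fp16 so the norm is computed in float32.
    head = np.asarray(embedding, dtype=np.float32)[..., :dim]
    norm = np.linalg.norm(head, axis=-1, keepdims=True)
    return head / norm

# Stand-in for a model output: a random unit vector, not a real embedding.
rng = np.random.default_rng(0)
full = rng.standard_normal(1024).astype(np.float32)
full /= np.linalg.norm(full)

small = truncate_mrl(full, 256)
```

The re-normalization matters: a truncated vector is no longer unit-norm, so without it the dot product would stop being a cosine.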
# Core AI runtime (Python), GPU delegate
# Uses top-level await: run in an async context (e.g. a notebook or `python -m asyncio`).
import coreai.runtime as rt
import numpy as np
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("tokenizer")
m = await rt.AIModel.load(
    "qwen3-embedding-0.6b_float16_s512_static.aimodel",
    rt.SpecializationOptions.from_preferred_compute_unit_kind(rt.ComputeUnitKind.gpu()))
fn = m.load_function("main")

async def embed(text, is_query):
    # Queries get the instruction prefix; documents are embedded as-is.
    prefix = ("Instruct: Given a web search query, retrieve relevant passages that "
              "answer the query\nQuery:") if is_query else ""
    # Right-pad / truncate to the 512-token grid the graph was exported with.
    enc = tok(prefix + text, padding="max_length", truncation=True, max_length=512,
              return_tensors="np", padding_side="right")
    res = await fn({"input_ids": rt.NDArray(enc["input_ids"].astype(np.int32)),
                    "attention_mask": rt.NDArray(enc["attention_mask"].astype(np.int32))})
    return res["embedding"].numpy()[0]  # [1024], unit-norm
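Once embeddings are available as unit vectors, retrieval reduces to a matrix-vector product. The sketch below is independent of the Core AI runtime and uses toy vectors in place of real embeddings; `rank_documents` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (doc_index, score) pairs, best first, for unit-norm inputs."""
    q = np.asarray(query_vec, dtype=np.float32)
    d = np.asarray(doc_vecs, dtype=np.float32)
    scores = d @ q                      # dot product == cosine for unit vectors
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Toy data: three orthogonal unit vectors stand in for document embeddings.
docs = np.eye(4, dtype=np.float32)[:3]
query = docs[1]                         # identical to document 1
hits = rank_documents(query, docs, top_k=2)
```

With real data, `doc_vecs` would be the stacked outputs of the document embedding call and `query_vec` the output of the query call.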
Swift: CoreAIKit
Downloads this repo on first use and applies the prompts in-process:
import CoreAIKitEmbeddings
let embedder = try await TextEmbedder(model: .qwen3Embedding0_6B, prompts: .qwen3Embedding)
let query = try await embedder.embed(query: "What is the capital of Japan?")
let doc = try await embedder.embed(document: "Tokyo is the capital and largest city of Japan.")
let score = TextEmbedder.cosineSimilarity(query, doc) // unit vectors → dot product = cosine
Bundle layout
qwen3-embedding-0.6b_float16_s512_static.aimodel (~1.1 GB, fp16)
tokenizer/ (HF tokenizer files)
reference.json (torch reference embeddings + cosines)
Parity
Precision fp16. Verified against the official sentence-transformers pipeline (fp32):
per-text embedding cosine 1.000000, retrieval order identical, MRL rankings preserved at
512 / 256 / 128. On the Core AI GPU delegate the .aimodel reproduces the torch reference at
cosine 0.999998 end-to-end (host tokenize → run). Measured ~25 ms (256-grid) / ~45 ms
(512-grid) per embedding on an M4 Max GPU.
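A parity check of this kind can be approximated with a few lines of NumPy. This is only a sketch under assumptions: the layout of reference.json is not documented here, so the example works on two in-memory arrays of shape [n_texts, dim], and `parity_report` is a hypothetical helper. The fp16 cast simulates reduced precision; it does not reproduce the numbers reported above.

```python
import numpy as np

def parity_report(reference, candidate, query_index=0):
    """Return (min per-text cosine, whether retrieval order matches)."""
    ref = np.asarray(reference, dtype=np.float32)
    cand = np.asarray(candidate, dtype=np.float32)

    # Cosine between corresponding rows of the two embedding sets.
    cos = np.sum(ref * cand, axis=1) / (
        np.linalg.norm(ref, axis=1) * np.linalg.norm(cand, axis=1))

    def order(m):
        # Rank every row against the chosen query row within one set.
        unit = m / np.linalg.norm(m, axis=1, keepdims=True)
        return np.argsort(-(unit @ unit[query_index]), kind="stable")

    same_order = bool(np.array_equal(order(ref), order(cand)))
    return float(cos.min()), same_order

# Synthetic stand-ins: random fp32 "reference" and its fp16-rounded copy.
rng = np.random.default_rng(0)
reference = rng.standard_normal((4, 1024)).astype(np.float32)
candidate = reference.astype(np.float16)

min_cos, same_order = parity_report(reference, candidate)
```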
License
Apache-2.0 (upstream model and code are Apache-2.0). Conversion script: conversion/export_qwen3_embedding.py in the coreai-model-zoo.