Qwen3-Embedding-0.6B — Core AI export

Qwen/Qwen3-Embedding-0.6B as a single static Core AI graph for macOS 27 / iOS 27: the full sentence-transformers pipeline (Qwen3-0.6B backbone → last-token pooling → L2 normalize) runs in-graph, so one call returns a normalized, MRL-truncatable 1024-d embedding. Multilingual (incl. Japanese), instruction-aware on-device semantic search / RAG.

This model is used as an encoder: one forward pass over the right-padded input yields one pooled vector. There is no autoregressive loop, no KV cache, and no LM head. It runs as a plain .aimodel via AIModel.run (like the vision encoders), not through the pipelined generate engine.

Graph contract

role	name	shape	dtype
input	`input_ids`	[1, 512]	int32 (right-padded; pad id 151643)
input	`attention_mask`	[1, 512]	int32 (1 = real token, 0 = padding)
output	`embedding`	[1, 1024]	fp16, L2-normalized

The grid (512) is an export-time choice — a smaller grid is proportionally faster for short queries. Last-token pooling under the causal mask is right-pad safe (real tokens never attend to trailing pads), so the host just right-pads to the grid.
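The host-side padding step and the position that last-token pooling reads can be sketched as below. This is a dependency-free illustration only: the pooling itself happens in-graph, the helper names `pad_to_grid` and `last_token_index` are hypothetical, and in practice the tokenizer performs the padding and truncation.

```python
GRID = 512        # export-time sequence length from the graph contract
PAD_ID = 151643   # pad token id from the graph contract


def pad_to_grid(token_ids, grid=GRID, pad_id=PAD_ID):
    """Right-pad (or truncate) token ids to the grid.

    Returns (input_ids, attention_mask), both of length `grid`.
    Hypothetical helper; a real tokenizer normally does this.
    """
    ids = list(token_ids)[:grid]          # truncate longer inputs
    n_real = len(ids)
    n_pad = grid - n_real
    input_ids = ids + [pad_id] * n_pad    # pads go on the right
    attention_mask = [1] * n_real + [0] * n_pad
    return input_ids, attention_mask


def last_token_index(attention_mask):
    """Index of the last real token under right padding.

    With right padding the real tokens form a prefix, so the last one
    sits at (number of ones) - 1.
    """
    return sum(attention_mask) - 1


# Hypothetical token ids, for illustration only.
input_ids, attention_mask = pad_to_grid([11, 22, 33])
pooled_position = last_token_index(attention_mask)
```

Because the mask is causal, the hidden state at `pooled_position` depends only on positions at or before it, which is why trailing pads do not change the pooled vector.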

Host recipe (everything else is in-graph)

Query → prepend the instruction prefix "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:" (\n is a newline character).
Document → no prefix.
Tokenize, right-pad to 512 (truncate longer text). Run → 1024-d unit vector.
Similarity = cosine = dot product (vectors are unit-norm).
Matryoshka (MRL): to shrink, take the first D dims (32 ≤ D ≤ 1024) and re-L2-normalize on the host. Rankings are preserved down to 256; verified to 128.
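The similarity and MRL steps above can be sketched in plain Python. This is a minimal illustration, not the repo's code: the function names are hypothetical, and the 32-dimension lower bound simply mirrors the range stated above.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def truncate_mrl(embedding, dims):
    """Keep the first `dims` components and re-normalize on the host."""
    if not 32 <= dims <= len(embedding):
        raise ValueError("dims must be between 32 and the embedding size")
    return l2_normalize(embedding[:dims])


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit-norm."""
    return sum(x * y for x, y in zip(a, b))
```

Both vectors in a comparison need the same truncated size, so a stored index and the queries against it should use the same D.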

# Core AI runtime (Python), GPU delegate
# `await` is only valid inside a coroutine, so loading and embedding are async
# functions driven by asyncio.run().
import asyncio

import numpy as np
import coreai.runtime as rt
from transformers import AutoTokenizer

QUERY_PREFIX = ("Instruct: Given a web search query, retrieve relevant passages that "
                "answer the query\nQuery:")

tok = AutoTokenizer.from_pretrained("tokenizer")

async def load_fn():
    m = await rt.AIModel.load("qwen3-embedding-0.6b_float16_s512_static.aimodel",
            rt.SpecializationOptions.from_preferred_compute_unit_kind(rt.ComputeUnitKind.gpu()))
    return m.load_function("main")

async def embed(fn, text, is_query):
    prefix = QUERY_PREFIX if is_query else ""   # documents get no prefix
    enc = tok(prefix + text, padding="max_length", truncation=True, max_length=512,
              return_tensors="np", padding_side="right")
    res = await fn({"input_ids": rt.NDArray(enc["input_ids"].astype(np.int32)),
                    "attention_mask": rt.NDArray(enc["attention_mask"].astype(np.int32))})
    return res["embedding"].numpy()[0]   # [1024], unit-norm

async def main():
    fn = await load_fn()
    vec = await embed(fn, "What is the capital of Japan?", is_query=True)
    print(vec.shape)

asyncio.run(main())

Swift — CoreAIKit

Downloads this repo on first use and applies the prompts in-process:

import CoreAIKitEmbeddings

let embedder = try await TextEmbedder(model: .qwen3Embedding0_6B, prompts: .qwen3Embedding)
let query = try await embedder.embed(query: "What is the capital of Japan?")
let doc   = try await embedder.embed(document: "Tokyo is the capital and largest city of Japan.")
let score = TextEmbedder.cosineSimilarity(query, doc)   // unit vectors → dot product = cosine

Bundle layout

qwen3-embedding-0.6b_float16_s512_static.aimodel   (~1.1 GB, fp16)
tokenizer/                                          (HF tokenizer files)
reference.json                                      (torch reference embeddings + cosines)

Parity

Precision fp16. Verified against the official sentence-transformers pipeline (fp32): per-text embedding cosine 1.000000, retrieval order identical, MRL rankings preserved at 512 / 256 / 128. On the Core AI GPU delegate the .aimodel reproduces the torch reference at cosine 0.999998 end-to-end (host tokenize → run). Measured ~25 ms (256-grid) / ~45 ms (512-grid) per embedding on an M4 Max GPU.
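A retrieval-order check of the kind described above could look like the following sketch. It assumes embeddings are already available as lists of floats. The function names are hypothetical, and it does not read reference.json, whose layout is not described here.

```python
def dot(a, b):
    """Dot product; equals cosine similarity for unit-norm vectors."""
    return sum(x * y for x, y in zip(a, b))


def ranking(query, docs):
    """Document indices sorted by descending similarity to the query."""
    scores = [dot(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


def same_retrieval_order(query_ref, docs_ref, query_out, docs_out):
    """True if reference and exported embeddings rank the documents identically."""
    return ranking(query_ref, docs_ref) == ranking(query_out, docs_out)
```

The same comparison applies to MRL: pass the truncated, re-normalized vectors as the second pair to see whether the order matches the full 1024-d ranking.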

License

Apache-2.0 (upstream model and code are Apache-2.0). Conversion script: conversion/export_qwen3_embedding.py in the coreai-model-zoo.

