Qwen3-Reranker-0.6B: Core AI export
Qwen/Qwen3-Reranker-0.6B as a single static Core AI graph for macOS 27 / iOS 27. It is the cross-encoder that closes the on-device RAG loop: embed (with Qwen3-Embedding-0.6B-CoreAI) → rerank → generate, all local and private.
A cross-encoder reads one query + document sequence and asks the LM a yes/no question; the
relevance score is the softmax weight on "yes" vs "no" at the final token. So it keeps the
LM head (the embedder drops it), but it's still a plain .aimodel run via AIModel.run: one
forward pass, no generation. The scoring tail (gather last token → head on that one position →
2-way softmax) is baked into the graph.
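To make the scoring tail concrete, here is a minimal NumPy sketch of the three steps (gather, head, 2-way softmax). It is not the exported graph: the function name `scoring_tail`, the `no_id` / `yes_id` parameters, and the toy shapes are assumptions for illustration, and slicing the head to two rows before the matmul is simply one equivalent way to compute "full head, then pick two logits".

```python
import numpy as np

def scoring_tail(hidden, attention_mask, lm_head, no_id, yes_id):
    """hidden: [1, S, H], attention_mask: [1, S], lm_head: [V, H] -> probs [1, 2]."""
    # Index of the last real (non-padding) token in a right-padded sequence.
    last = int(attention_mask[0].sum()) - 1
    # Gather: keep only that position's hidden state, shape [H].
    h = hidden[0, last].astype(np.float32)
    # Head on that one position, restricted to the "no" and "yes" rows.
    logits = lm_head[[no_id, yes_id]].astype(np.float32) @ h  # [no, yes]
    # Numerically stable 2-way softmax.
    e = np.exp(logits - logits.max())
    return (e / e.sum())[None, :]

# Toy tensors standing in for the real model (hypothetical sizes and ids).
rng = np.random.default_rng(0)
S, H, V = 6, 8, 16
hidden = rng.standard_normal((1, S, H)).astype(np.float16)
lm_head = rng.standard_normal((V, H)).astype(np.float16)
mask = np.array([[1, 1, 1, 1, 0, 0]], dtype=np.int32)

probs = scoring_tail(hidden, mask, lm_head, no_id=3, yes_id=5)
relevance = float(probs[0, 1])  # P(yes)
```

Because only one position goes through the head, the cost of the tail is independent of the sequence length.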
Graph contract
| role | name | shape | dtype / notes |
|---|---|---|---|
| input | input_ids | [1, 512] | int32 (right-padded; pad id 151643) |
| input | attention_mask | [1, 512] | int32 (1 = real, 0 = padding) |
| output | probs | [1, 2] | fp16, softmax([no, yes]); relevance = probs[0,1] = P(yes) |
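Since the graph is static, malformed inputs are easier to catch on the host than inside the runtime. The sketch below checks a pair of arrays against the input rows of the contract; `check_contract` is a hypothetical helper, not part of any Core AI API, and the pad-id check is deliberately strict (the graph may well ignore the values at masked positions).

```python
import numpy as np

S = 512          # static sequence length from the contract
PAD_ID = 151643  # pad token id from the contract

def check_contract(input_ids, attention_mask):
    """Return the number of real tokens, or raise ValueError on a contract violation."""
    for name, arr in (("input_ids", input_ids), ("attention_mask", attention_mask)):
        if arr.shape != (1, S):
            raise ValueError(f"{name}: expected shape (1, {S}), got {arr.shape}")
        if arr.dtype != np.int32:
            raise ValueError(f"{name}: expected int32, got {arr.dtype}")
    mask = attention_mask[0]
    n = int(mask.sum())
    # Right-padded means n ones followed by S - n zeros, with at least one real token.
    if n == 0 or not (np.all(mask[:n] == 1) and np.all(mask[n:] == 0)):
        raise ValueError("attention_mask must be 1s followed by 0s (right-padded)")
    if not np.all(input_ids[0, n:] == PAD_ID):
        raise ValueError(f"padding positions must hold pad id {PAD_ID}")
    return n

# Example: three real tokens (arbitrary ids), the rest padding.
ids = np.full((1, S), PAD_ID, dtype=np.int32)
ids[0, :3] = [101, 102, 103]
mask = np.zeros((1, S), dtype=np.int32)
mask[0, :3] = 1
n_real = check_contract(ids, mask)
```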
Host recipe
Format the pair exactly like the upstream model card, then right-pad to 512:
import coreai.runtime as rt, numpy as np
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("tokenizer")
PREFIX = ("<|im_start|>system\nJudge whether the Document meets the requirements based on the "
          "Query and the Instruct provided. Note that the answer can only be \"yes\" or "
          "\"no\".<|im_end|>\n<|im_start|>user\n")
SUFFIX = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
INSTR = "Given a web search query, retrieve relevant passages that answer the query"

# Top-level await: run this inside an async context (e.g. a notebook or `python -m asyncio`).
m = await rt.AIModel.load("qwen3-reranker-0.6b_float16_s512_static.aimodel",
    rt.SpecializationOptions.from_preferred_compute_unit_kind(rt.ComputeUnitKind.gpu()))
fn = m.load_function("main")

async def score(query, doc, S=512):
    body = f"<Instruct>: {INSTR}\n<Query>: {query}\n<Document>: {doc}"
    ids = (tok.encode(PREFIX, add_special_tokens=False)
           + tok.encode(body, add_special_tokens=False)
           + tok.encode(SUFFIX, add_special_tokens=False))
    n = len(ids)
    if n > S:
        # The graph is static at S tokens; shorten the document before formatting.
        raise ValueError(f"sequence has {n} tokens, graph accepts at most {S}")
    ids = ids + [151643] * (S - n)   # right-pad with the pad id
    mask = [1] * n + [0] * (S - n)   # 1 = real token, 0 = padding
    res = await fn({"input_ids": rt.NDArray(np.asarray([ids], np.int32)),
                    "attention_mask": rt.NDArray(np.asarray([mask], np.int32))})
    return float(res["probs"].numpy()[0, 1])  # P(yes) = relevance; sort candidates by this
The instruction is swappable per task (the model is instruction-aware). Right-pad is equivalent to
the upstream left-pad + logits[:, -1] (the graph reads the true last token from the mask).
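The padding equivalence comes down to which position is read. The sketch below only illustrates that index arithmetic, under the assumption that the graph derives the position from the mask as described: the last 1 in a right-padded mask points at the same token that `[-1]` selects under left padding. It does not show the attention side (padding must also be masked out for the logits to agree), and `pad` and `last_real_index` are hypothetical helper names.

```python
import numpy as np

PAD_ID = 151643

def pad(ids, S, side):
    """Pad a token list to length S on the given side; returns (ids, mask)."""
    n = len(ids)
    if side == "right":
        return ids + [PAD_ID] * (S - n), [1] * n + [0] * (S - n)
    return [PAD_ID] * (S - n) + ids, [0] * (S - n) + [1] * n

def last_real_index(mask):
    """Position of the last real token, valid for either padding side."""
    return int(np.nonzero(np.asarray(mask))[0][-1])

tokens = [11, 22, 33]  # arbitrary toy ids
r_ids, r_mask = pad(tokens, 8, "right")
l_ids, l_mask = pad(tokens, 8, "left")

right_pick = r_ids[last_real_index(r_mask)]  # what a mask-driven gather reads
left_pick = l_ids[-1]                        # what logits[:, -1] reads upstream
```

Both picks land on the final token of the prompt scaffold, which is where the yes/no decision is read.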
Swift: CoreAIKit
Downloads this repo on first use and formats the pair in-process:
import CoreAIKitEmbeddings

let reranker = try await Reranker(model: .qwen3Reranker0_6B)
let ranked = try await reranker.rerank(
    query: "What is the capital of Japan?",
    documents: ["Tokyo is the capital of Japan.", "Python is a programming language."])
// ranked[0].document is most relevant; ranked[i].score is P(yes) in [0, 1]
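On the Python host path the same ranking step can be sketched in a few lines. This is an illustration rather than the CoreAIKit implementation: `Ranked`, `rerank`, and `toy_score` are hypothetical names, and the word-overlap scorer is only a stand-in so the example runs without the model. In practice the score function would be the model-backed P(yes) scorer (awaited, if it is async).

```python
from typing import Callable, List, NamedTuple

class Ranked(NamedTuple):
    document: str
    score: float

def rerank(query: str, documents: List[str],
           score_fn: Callable[[str, str], float]) -> List[Ranked]:
    """Score every (query, document) pair and sort by descending score."""
    scored = [Ranked(d, score_fn(query, d)) for d in documents]
    return sorted(scored, key=lambda r: r.score, reverse=True)

def toy_score(query: str, doc: str) -> float:
    """Stand-in scorer: fraction of query words that also appear in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q)

ranked = rerank(
    "What is the capital of Japan?",
    ["Python is a programming language.", "Tokyo is the capital of Japan."],
    toy_score,
)
```

Since each pair is scored independently, the cost grows linearly with the number of candidates, which is why a reranker is usually applied only to the top results from the embedding stage.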
Bundle layout
qwen3-reranker-0.6b_float16_s512_static.aimodel (~1.1 GB, fp16)
tokenizer/ (HF tokenizer files)
reference.json (pairs, scores, prompt scaffolding)
Parity
Precision: fp16. Verified against the official AutoModelForCausalLM scoring (fp32): the
in-graph wrapper reproduces P(yes) exactly (|Δ| = 0.00000 over 6 relevant/irrelevant pairs),
relevant pairs score 0.98 to 1.00 vs. irrelevant ≈ 0.0000, ranking preserved. On the Core AI GPU
delegate the .aimodel matches the torch reference within |Δ| < 0.0005 end-to-end. Measured
45.7 ms per pair-score on an M4 Max GPU (512 grid).
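A parity check of this kind can be sketched as below. The numbers in the example are made-up placeholders, not the measurements reported above, and `parity_report` is a hypothetical helper; loading the real pairs from reference.json is left out because its schema is not described here.

```python
import numpy as np

def parity_report(candidate, reference, tol=5e-4):
    """Compare two score lists: max |delta|, tolerance check, and ranking agreement."""
    candidate = np.asarray(candidate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    max_delta = float(np.max(np.abs(candidate - reference)))
    # Ranking is preserved if sorting by descending score gives the same order.
    same_order = bool(np.array_equal(np.argsort(-candidate, kind="stable"),
                                     np.argsort(-reference, kind="stable")))
    return {"max_abs_delta": max_delta,
            "within_tol": max_delta < tol,
            "ranking_preserved": same_order}

# Placeholder scores for illustration only.
reference_scores = [0.99, 0.0001, 0.98]
candidate_scores = [0.9902, 0.0002, 0.9797]
report = parity_report(candidate_scores, reference_scores)
```

Exact ties between scores can make the ordering comparison sensitive to input order, so a real check may want to compare ranks only where the reference scores differ by more than the tolerance.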
License
Apache-2.0 (upstream model and code are Apache-2.0). Conversion script:
conversion/export_qwen3_reranker.py in the coreai-model-zoo.