Qwen3-Reranker-0.6B: Core AI export

Qwen/Qwen3-Reranker-0.6B as a single static Core AI graph for macOS 27 / iOS 27. The cross-encoder that closes the on-device RAG loop: embed (with Qwen3-Embedding-0.6B-CoreAI) → rerank → generate, all local and private.

A cross-encoder reads one query + document sequence and asks the LM a yes/no question; the relevance score is the softmax weight on "yes" vs "no" at the final token. So it keeps the LM head (the embedder drops it), but it's still a plain .aimodel run via AIModel.run: one forward pass, no generation. The scoring tail (gather last token → head on that one position → 2-way softmax) is baked in-graph.
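The scoring tail described above can be sketched in NumPy. This is only an illustration under stated assumptions: it takes full-sequence logits for simplicity (the exported graph applies the head to the single gathered position), and `yes_id` / `no_id` are placeholder vocabulary indices for a toy vocabulary, not the real tokenizer ids.

```python
import numpy as np

def p_yes_from_logits(logits, attention_mask, yes_id, no_id):
    """Return [P(no), P(yes)] from the last real token of a right-padded sequence.

    logits: [S, V] array of LM-head outputs (assumed shape for this sketch).
    attention_mask: length-S sequence of 1s (real tokens) followed by 0s (padding).
    """
    logits = np.asarray(logits, dtype=np.float64)
    mask = np.asarray(attention_mask)
    last = int(mask.sum()) - 1                     # index of the last real token
    pair = np.array([logits[last, no_id], logits[last, yes_id]])
    pair = pair - pair.max()                       # subtract max for numerical stability
    return np.exp(pair) / np.exp(pair).sum()       # 2-way softmax over [no, yes]

# Toy data: 4 positions, vocabulary of 5, two real tokens.
toy_logits = np.zeros((4, 5))
toy_logits[1, 3] = 2.0    # "yes" logit at the last real token (position 1)
toy_logits[3, 2] = 9.0    # large "no" logit on a padding position; must be ignored
toy_probs = p_yes_from_logits(toy_logits, [1, 1, 0, 0], yes_id=3, no_id=2)
relevance = float(toy_probs[1])                    # P(yes)
```

Because only two logits enter the softmax, P(yes) is simply a sigmoid of the difference between the "yes" and "no" logits.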

Graph contract

kind    name            shape     dtype / notes
input   input_ids       [1, 512]  int32 (right-padded; pad id 151643)
input   attention_mask  [1, 512]  int32 (1 = real, 0 = padding)
output  probs           [1, 2]    fp16, softmax([no, yes]); relevance = probs[0,1] = P(yes)
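A host-side check of the input contract can catch shape, dtype, or padding mistakes before the graph is called. The following is a minimal sketch; `check_inputs` is a hypothetical helper, not part of any Core AI API, and it works on plain NumPy arrays.

```python
import numpy as np

def check_inputs(inputs, seq_len=512):
    """Validate input_ids / attention_mask against the contract; return the real-token count."""
    for name in ("input_ids", "attention_mask"):
        arr = inputs[name]
        if arr.shape != (1, seq_len):
            raise ValueError(f"{name}: expected shape (1, {seq_len}), got {arr.shape}")
        if arr.dtype != np.int32:
            raise TypeError(f"{name}: expected int32, got {arr.dtype}")
    mask = inputs["attention_mask"][0]
    if not np.isin(mask, (0, 1)).all():
        raise ValueError("attention_mask must contain only 0 and 1")
    n = int(mask.sum())
    # Right-padded means: n ones first, then only zeros.
    if n == 0 or not mask[:n].all() or mask[n:].any():
        raise ValueError("attention_mask must be a non-empty run of 1s followed by 0s")
    return n

# Example: three real tokens (arbitrary ids), the rest padded with pad id 151643.
example = {
    "input_ids": np.full((1, 512), 151643, dtype=np.int32),
    "attention_mask": np.zeros((1, 512), dtype=np.int32),
}
example["input_ids"][0, :3] = [11, 22, 33]
example["attention_mask"][0, :3] = 1
n_real = check_inputs(example)
```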

Host recipe

Format the pair exactly like the upstream model card, then right-pad to 512:

import numpy as np
import coreai.runtime as rt
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("tokenizer")
PREFIX = ("<|im_start|>system\nJudge whether the Document meets the requirements based on the "
          "Query and the Instruct provided. Note that the answer can only be \"yes\" or "
          "\"no\".<|im_end|>\n<|im_start|>user\n")
SUFFIX = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
INSTR  = "Given a web search query, retrieve relevant passages that answer the query"

# Top-level await: run this inside an async function, or in a REPL/notebook that supports it.
m = await rt.AIModel.load("qwen3-reranker-0.6b_float16_s512_static.aimodel",
        rt.SpecializationOptions.from_preferred_compute_unit_kind(rt.ComputeUnitKind.gpu()))
fn = m.load_function("main")

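async def score(query, doc, S=512):
    # Declared async because the function call below is awaited; call it as `await score(...)`.
    body = f"<Instruct>: {INSTR}\n<Query>: {query}\n<Document>: {doc}"
    ids = (tok.encode(PREFIX, add_special_tokens=False)
           + tok.encode(body, add_special_tokens=False)
           + tok.encode(SUFFIX, add_special_tokens=False))
    n = len(ids)
    if n > S:
        # The graph is static: longer pairs must be shortened by the caller.
        raise ValueError(f"pair is {n} tokens; the graph accepts at most {S}")
    ids = ids + [151643] * (S - n)          # right-pad with pad id 151643
    mask = [1] * n + [0] * (S - n)          # 1 = real token, 0 = padding
    res = await fn({"input_ids": rt.NDArray(np.asarray([ids], np.int32)),
                    "attention_mask": rt.NDArray(np.asarray([mask], np.int32))})
    return float(res["probs"].numpy()[0, 1])   # P(yes) = relevance; sort candidates by this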
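Once every candidate has a P(yes) score, reranking is a descending sort. A minimal sketch, assuming the scores have already been collected into a list; `rank_by_score` is a hypothetical helper and the scores below are made-up values for illustration, not model output.

```python
def rank_by_score(documents, scores):
    """Return (document, score) pairs sorted from most to least relevant."""
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    # sorted() is stable, so documents with equal scores keep their input order.
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [(documents[i], scores[i]) for i in order]

docs = ["Python is a programming language.", "Tokyo is the capital of Japan."]
made_up_scores = [0.01, 0.99]        # illustrative values only
ranked = rank_by_score(docs, made_up_scores)
best_doc, best_score = ranked[0]
```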

The instruction is swappable per task (the model is instruction-aware). Right-padding is equivalent to the upstream left-padding plus logits[:, -1], because the graph reads the true last token from the mask.
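Swapping the instruction only changes the first line of the user body; the rest of the scaffolding stays the same. A small sketch of that, where `format_body` is a hypothetical helper and the support-desk instruction is an invented example, not one taken from the upstream model card:

```python
DEFAULT_INSTRUCTION = (
    "Given a web search query, retrieve relevant passages that answer the query"
)

def format_body(query, doc, instruction=DEFAULT_INSTRUCTION):
    """Build the user body that sits between the fixed prefix and suffix."""
    return f"<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}"

# Hypothetical task-specific instruction for a support-desk search.
faq_body = format_body(
    "How do I reset my password?",
    "Open Settings, then choose Reset password.",
    instruction="Given a support question, retrieve help passages that resolve it",
)
```

The prefix and suffix strings are left untouched, so only the tokens of the body change between tasks.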

Swift: CoreAIKit

Downloads this repo on first use and formats the pair in-process:

import CoreAIKitEmbeddings

let reranker = try await Reranker(model: .qwen3Reranker0_6B)
let ranked = try await reranker.rerank(
    query: "What is the capital of Japan?",
    documents: ["Tokyo is the capital of Japan.", "Python is a programming language."])
// ranked[0].document is most relevant; ranked[i].score is P(yes) in [0, 1]

Bundle layout

qwen3-reranker-0.6b_float16_s512_static.aimodel   (~1.1 GB, fp16)
tokenizer/                                          (HF tokenizer files)
reference.json                                      (pairs, scores, prompt scaffolding)

Parity

Precision: fp16. Verified against the official AutoModelForCausalLM scoring (fp32): the in-graph wrapper reproduces P(yes) exactly (|Δ| = 0.00000 over 6 relevant/irrelevant pairs), with relevant pairs at 0.98–1.00 vs irrelevant ≈ 0.0000, ranking preserved. On the Core AI GPU delegate, the .aimodel matches the torch reference within |Δ| < 0.0005 end-to-end. Measured latency: 45.7 ms per pair-score on an M4 Max GPU (512 grid).
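A parity check of this kind can be sketched as a comparison of two score lists: the maximum absolute difference and whether the ranking order agrees. The helper name `parity_report` and the numbers below are illustrative assumptions; they are not the contents of reference.json and not the measurements quoted above.

```python
import numpy as np

def parity_report(reference, candidate, tol=5e-4):
    """Compare candidate P(yes) scores against reference scores."""
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    if ref.shape != cand.shape:
        raise ValueError("score lists must have the same length")
    max_delta = float(np.abs(ref - cand).max())
    # Stable argsort of negated scores gives a descending ranking.
    ref_order = np.argsort(-ref, kind="stable")
    cand_order = np.argsort(-cand, kind="stable")
    return {
        "max_abs_delta": max_delta,
        "within_tol": max_delta < tol,
        "ranking_preserved": bool((ref_order == cand_order).all()),
    }

# Made-up scores for three pairs (relevant, irrelevant, relevant).
report = parity_report([0.99, 0.0, 0.98], [0.9897, 0.0001, 0.9803])
```

Near-ties among irrelevant pairs (all close to zero) can swap places without affecting usefulness, so a stricter or looser ordering check may suit a given test set better.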

License

Apache-2.0 (upstream model and code are Apache-2.0). Conversion script: conversion/export_qwen3_reranker.py in the coreai-model-zoo.
