Qwen3-Reranker-0.6B — Core AI export

Qwen/Qwen3-Reranker-0.6B exported as a single static Core AI graph for macOS 27 / iOS 27. It is the cross-encoder that closes the on-device RAG loop: embed (with Qwen3-Embedding-0.6B-CoreAI) → rerank → generate, all local and private.

A cross-encoder reads the query and the document together as one sequence and asks the LM a yes/no question; the relevance score is the softmax weight on "yes" versus "no" at the final token. The export therefore keeps the LM head (the embedder drops it), but it is still a plain .aimodel run via AIModel.run: one forward pass, no generation. The scoring tail (gather last token → head on that one position → 2-way softmax) is baked into the graph.
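As a rough illustration of that scoring tail, the sketch below reimplements it in plain Python on toy data. The function name `scoring_tail`, the list-based tensors, and the choice to pass only the "no" and "yes" rows of the LM head are assumptions made for this example; the exported graph's actual internals are not shown in this card.

```python
import math

def scoring_tail(hidden, mask, w_no, w_yes):
    """Toy version of: gather last token -> head on that position -> 2-way softmax.

    hidden: S vectors of width H (final-layer states); mask: S ints (1 = real);
    w_no / w_yes: LM-head rows for the "no" and "yes" tokens (assumed inputs).
    """
    last = sum(mask) - 1                      # last real token under right-padding
    h = hidden[last]
    logit_no = sum(a * b for a, b in zip(h, w_no))
    logit_yes = sum(a * b for a, b in zip(h, w_yes))
    top = max(logit_no, logit_yes)            # subtract the max for numerical stability
    e_no, e_yes = math.exp(logit_no - top), math.exp(logit_yes - top)
    return [e_no / (e_no + e_yes), e_yes / (e_no + e_yes)]   # [P(no), P(yes)]

# Three positions; the last one is padding and must be ignored.
probs = scoring_tail(hidden=[[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]],
                     mask=[1, 1, 0], w_no=[0.0, 0.0], w_yes=[0.0, 2.0])
```

Because only two logits enter the softmax, the result is the same whether the head is applied to the full vocabulary and then sliced, or to just those two rows.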

Graph contract

role	name	shape	dtype (notes)
input	`input_ids`	[1, 512]	int32 (right-padded; pad id 151643)
input	`attention_mask`	[1, 512]	int32 (1 = real, 0 = padding)
output	`probs`	[1, 2]	fp16, `softmax([no, yes])` — relevance = `probs[0,1]` = P(yes)
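A host-side check of this contract can catch padding mistakes before the graph runs. The following is a minimal sketch in plain Python; `check_inputs` and its messages are hypothetical helpers written for this example, not part of any Core AI API.

```python
SEQ_LEN = 512
PAD_ID = 151643

def check_inputs(input_ids, attention_mask):
    """Return a list of contract violations; an empty list means the inputs look valid."""
    if len(input_ids) != 1 or len(attention_mask) != 1:
        return ["batch dimension must be 1"]
    ids, mask = input_ids[0], attention_mask[0]
    if len(ids) != SEQ_LEN or len(mask) != SEQ_LEN:
        return ["sequence dimension must be 512"]
    problems = []
    if any(m not in (0, 1) for m in mask):
        problems.append("mask must contain only 0 and 1")
    n = sum(mask)
    if n == 0:
        problems.append("mask has no real tokens")
    if mask != [1] * n + [0] * (SEQ_LEN - n):
        problems.append("mask must be right-padded (all 1s, then all 0s)")
    if any(t != PAD_ID for t, m in zip(ids, mask) if m == 0):
        problems.append("padded positions must hold pad id 151643")
    return problems

good = check_inputs([[7] * 10 + [PAD_ID] * 502], [[1] * 10 + [0] * 502])
left_padded = check_inputs([[PAD_ID] * 502 + [7] * 10], [[0] * 502 + [1] * 10])
```

Here `good` is empty, while `left_padded` reports the padding direction, which is the easiest mistake to make when porting the upstream left-padded recipe.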

Host recipe

Format the pair exactly like the upstream model card, then right-pad to 512:

import coreai.runtime as rt, numpy as np
from transformers import AutoTokenizer

PAD_ID = 151643  # pad token id from the graph contract
tok = AutoTokenizer.from_pretrained("tokenizer")
PREFIX = ("<|im_start|>system\nJudge whether the Document meets the requirements based on the "
          "Query and the Instruct provided. Note that the answer can only be \"yes\" or "
          "\"no\".<|im_end|>\n<|im_start|>user\n")
SUFFIX = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
INSTR  = "Given a web search query, retrieve relevant passages that answer the query"

# Top-level `await` needs an async context (a notebook, or wrap this in an async main()).
m = await rt.AIModel.load("qwen3-reranker-0.6b_float16_s512_static.aimodel",
        rt.SpecializationOptions.from_preferred_compute_unit_kind(rt.ComputeUnitKind.gpu()))
fn = m.load_function("main")

async def score(query, doc, S=512):   # async, because it awaits the graph call
    body = f"<Instruct>: {INSTR}\n<Query>: {query}\n<Document>: {doc}"
    pre = tok.encode(PREFIX, add_special_tokens=False)
    suf = tok.encode(SUFFIX, add_special_tokens=False)
    # Added guard: truncate only the body so the scaffolding and final token fit in S.
    mid = tok.encode(body, add_special_tokens=False)[: S - len(pre) - len(suf)]
    ids = pre + mid + suf
    n = len(ids)                       # number of real tokens
    mask = [1] * n + [0] * (S - n)     # 1 = real, 0 = padding
    ids = ids + [PAD_ID] * (S - n)     # right-pad to the static length
    res = await fn({"input_ids": rt.NDArray(np.asarray([ids], np.int32)),
                    "attention_mask": rt.NDArray(np.asarray([mask], np.int32))})
    return float(res["probs"].numpy()[0, 1])   # P(yes) = relevance; sort candidates by this

The instruction (INSTR) is swappable per task, since the model is instruction-aware. Right-padding here is equivalent to the upstream left-pad + logits[:, -1] recipe, because the graph locates the true last token from the attention mask.
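The index bookkeeping behind that equivalence can be sketched in plain Python. The helpers `right_pad`, `left_pad`, and `last_real_index` are hypothetical names for this example. The sketch only shows that both layouts select the same token; it does not by itself prove the model outputs match, which also depends on how the graph masks the padded positions.

```python
PAD_ID = 151643

def right_pad(ids, S):
    """Real tokens first, padding after (the layout this graph expects)."""
    n = len(ids)
    return ids + [PAD_ID] * (S - n), [1] * n + [0] * (S - n)

def left_pad(ids, S):
    """Padding first, real tokens last (the upstream layout)."""
    n = len(ids)
    return [PAD_ID] * (S - n) + ids, [0] * (S - n) + [1] * n

def last_real_index(mask):
    """Position of the final real token, derived from the mask alone."""
    return max(i for i, m in enumerate(mask) if m == 1)

ids = [11, 22, 33]
r_ids, r_mask = right_pad(ids, 8)
l_ids, l_mask = left_pad(ids, 8)

# Upstream reads position -1 of the left-padded row; the mask-derived index
# on the right-padded row points at the same token.
same_token = r_ids[last_real_index(r_mask)] == l_ids[-1]
```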

Swift — CoreAIKit

CoreAIKit downloads this repo on first use and formats the pair in-process:

import CoreAIKitEmbeddings

let reranker = try await Reranker(model: .qwen3Reranker0_6B)
let ranked = try await reranker.rerank(
    query: "What is the capital of Japan?",
    documents: ["Tokyo is the capital of Japan.", "Python is a programming language."])
// ranked[0].document is most relevant; ranked[i].score is P(yes) in [0, 1]
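On the Python host-recipe side, the ranking step amounts to sorting candidates by their P(yes). The sketch below assumes the scores were already computed per document; `rank_by_score` is a hypothetical helper, the score values are made-up placeholders, and the internals of the Swift `rerank` call are not shown in this card.

```python
def rank_by_score(documents, scores):
    """Pair each document with its P(yes) and sort, most relevant first."""
    if len(documents) != len(scores):
        raise ValueError("need exactly one score per document")
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)

docs = ["Python is a programming language.", "Tokyo is the capital of Japan."]
placeholder_scores = [0.01, 0.99]   # illustrative values only, not measured output
ranked = rank_by_score(docs, placeholder_scores)
```

After sorting, `ranked[0]` holds the document with the highest score together with that score.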

Bundle layout

qwen3-reranker-0.6b_float16_s512_static.aimodel   (~1.1 GB, fp16)
tokenizer/                                          (HF tokenizer files)
reference.json                                      (pairs, scores, prompt scaffolding)

Parity

Precision: fp16. Verified against the official AutoModelForCausalLM scoring (fp32) over 6 relevant/irrelevant pairs: the in-graph wrapper reproduces P(yes) exactly (|Δ| = 0.00000), with relevant pairs scoring 0.98–1.00, irrelevant pairs ≈ 0.0000, and ranking preserved. On the Core AI GPU delegate, the .aimodel matches the torch reference within |Δ| < 0.0005 end-to-end. Measured latency is 45.7 ms per pair score on an M4 Max GPU (512-token grid).
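A parity check of this kind compares two score lists on maximum absolute difference and on ranking order. The sketch below is one plain-Python way to compute both; `parity_report` is a hypothetical helper, the numbers are placeholders rather than the measured values, and the layout of reference.json is not assumed.

```python
def parity_report(reference, candidate):
    """Compare two lists of P(yes) scores for the same pairs, in the same order."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("score lists must be non-empty and of equal length")
    max_delta = max(abs(r - c) for r, c in zip(reference, candidate))

    def order(xs):
        # Indices sorted from highest to lowest score.
        return sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)

    return {"max_abs_delta": max_delta,
            "ranking_preserved": order(reference) == order(candidate)}

ref = [0.99, 0.0001, 0.98]        # placeholder reference scores
cand = [0.9903, 0.0002, 0.9799]   # placeholder device scores
report = parity_report(ref, cand)
```

With these placeholders the largest gap stays under the 0.0005 tolerance quoted above and the ordering is unchanged.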

License

Apache-2.0 (upstream model and code are Apache-2.0). Conversion script: conversion/export_qwen3_reranker.py in the coreai-model-zoo.
