Qwen3-Embedding-0.6B → LiteRT

LiteRT (.tflite) exports of Qwen/Qwen3-Embedding-0.6B for on-device inference on Android (XNNPACK CPU and OpenCL GPU delegates) and other LiteRT-compatible runtimes.

Each artifact is a self-contained frozen graph: the encoder body, last-token pooling, and L2 normalization are all baked in. The consumer only needs to tokenize, pad, and feed input_ids + attention_mask; the graph returns the L2-normalized 1024-dim sentence embedding directly.

Files

| seq_len | quant | size | file | intended backend |
| --- | --- | --- | --- | --- |
| 512 | dynamic_int8 | 603 MB | qwen3-embedding-0.6b_seq512_int8.tflite | LiteRT CPU (XNNPACK) |
| 2048 | dynamic_int8 | 603 MB | qwen3-embedding-0.6b_seq2048_int8.tflite | LiteRT CPU (XNNPACK) |
| 8192 | dynamic_int8 | 603 MB | qwen3-embedding-0.6b_seq8192_int8.tflite | LiteRT CPU (XNNPACK) |
| 512 | fp16 | 1.19 GB | qwen3-embedding-0.6b_seq512_fp16.tflite | LiteRT GPU (OpenCL) |
| 2048 | fp16 | 1.19 GB | qwen3-embedding-0.6b_seq2048_fp16.tflite | LiteRT GPU (OpenCL) |
| 8192 | fp16 | 1.19 GB | qwen3-embedding-0.6b_seq8192_fp16.tflite | LiteRT GPU (OpenCL) |

seq_len is baked into each graph. If you need a different length, either pad shorter inputs up to the closest larger variant (wastes compute but always correct) or re-run the converter (see "Provenance" below).
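The padding rule above amounts to choosing the smallest exported seq_len that fits the tokenized input. A minimal sketch, assuming the file names from the table; `pick_variant` is a hypothetical helper, not part of this repository, and the token count is assumed to already include the appended EOS token:

```python
# Hypothetical helper: choose the smallest exported graph that fits the input.
SEQ_LENS = (512, 2048, 8192)  # seq_len values listed in the Files table


def pick_variant(n_tokens: int, quant: str = "int8") -> tuple[int, str]:
    """Return (seq_len, file name) for the smallest variant holding n_tokens."""
    if quant not in ("int8", "fp16"):
        raise ValueError(f"unknown quant: {quant!r}")
    for seq_len in SEQ_LENS:
        if n_tokens <= seq_len:
            return seq_len, f"qwen3-embedding-0.6b_seq{seq_len}_{quant}.tflite"
    # Longer inputs have to be truncated, or the model re-converted.
    raise ValueError(f"{n_tokens} tokens exceed the largest variant ({SEQ_LENS[-1]})")
```

For example, a 600-token input would land on the seq_len=2048 graph and be padded up to 2048 positions.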

dynamic_int8 = weight-quantized int8 matmuls via XNNPACK (best on ARM CPUs). fp16 = half-precision throughout (best on mobile GPUs that dispatch fp16 kernels).
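For readers unfamiliar with weight-only int8 quantization, the general idea can be sketched as follows. This is a generic symmetric per-row scheme chosen for illustration; it is an assumption, not a description of what litert-torch or XNNPACK actually emit, and `quantize_row` / `dequantize_row` are hypothetical names:

```python
# Illustration only: symmetric per-row int8 weight quantization.
# The scheme the converter actually applies may differ in detail.


def quantize_row(weights: list[float]) -> tuple[list[int], float]:
    """Map a row of float weights to int8 values plus one float scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_row(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from int8 values and the scale."""
    return [v * scale for v in q]


row = [0.50, -0.25, 0.125, -1.0]
q, scale = quantize_row(row)
approx = dequantize_row(q, scale)
# Rounding error per weight is at most half a quantization step.
max_err = max(abs(a - b) for a, b in zip(row, approx))
```

Storing one byte per weight instead of two is what roughly halves the file size relative to fp16; the small rounding error is the reason the int8 artifacts are validated at a looser cosine threshold.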

Numerics validation

Validated against the upstream SentenceTransformer("Qwen/Qwen3-Embedding-0.6B") loaded in fp32, over a 6-string test suite (English + Japanese + near-pair sentences like "cats are better pets than dogs" vs. "dogs are better pets than cats"):

| artifact | mean cosine | min cosine | threshold |
| --- | --- | --- | --- |
| ..._seq512_int8.tflite | (≥ 0.99) | ≥ 0.98 | 0.98 |
| ..._seq512_fp16.tflite | (≈ 1.000) | ≥ 0.999 | 0.999 |

Longer-seq artifacts were not validated explicitly: the seq=2048 and seq=8192 graphs take ~35 s and ~120 s per string in the tflite interpreter on CPU, which is prohibitive on desktop for a 6×6 matrix. They come from the same converter, the same weights, and the same graph logic; only the shape of intermediate activations differs. Since seq=512 passes at these thresholds, the longer variants are expected to pass as well, but this has not been measured.

Padding-side agnostic: validated with both left- and right-padding, cos=1.000000 at matched fp32 precision on both sides.
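The mean/min cosine check described in this section can be sketched as below. This is a minimal sketch, not the validation script used for these artifacts: `validate` is a hypothetical helper, it compares each string's LiteRT embedding with its fp32 reference embedding, and it assumes the threshold applies to the minimum cosine, which the table suggests but the text does not state.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def validate(reference, candidate, threshold):
    """Return (mean cosine, min cosine, passed) over paired embeddings."""
    scores = [cosine(r, c) for r, c in zip(reference, candidate)]
    mean_cos = sum(scores) / len(scores)
    min_cos = min(scores)
    return mean_cos, min_cos, min_cos >= threshold
```

In practice `reference` would hold the SentenceTransformer outputs and `candidate` the outputs of one .tflite artifact for the same strings.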

Architecture details (important for any consumer)

Qwen3-Embedding is a decoder-only model (Qwen3ForCausalLM architecture, per arxiv:2506.05176v3: "we utilize LLMs with causal attention, appending an [EOS] token at the end of the input sequence"). Consequences baked into the exported graph:

  • Causal self-attention: the exported mask is a combined causal+padding mask (upper-triangle suppression + pad-key suppression), not a bidirectional padding-only mask.
  • Last-token pool: uses the sentence-transformers formulation (attention_mask.flip(1).max(1) to locate the last real-token index), which is correct under left-, right-, or mixed-padding.
  • L2 normalize: the output is unit-norm, ready for cosine-similarity retrieval.
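The three steps above can be illustrated in plain Python. This is a sketch of the logic only, under the assumption of a single unbatched sequence; the function names are hypothetical and the exported graph's actual tensor ops (for example, how suppressed positions are encoded numerically) are not shown here.

```python
import math


def combined_mask(attention_mask: list[int]) -> list[list[bool]]:
    """allowed[q][k] is True where query q may attend to key k."""
    n = len(attention_mask)
    return [
        # Causal: key must not be in the future. Padding: key must be real.
        [k <= q and attention_mask[k] == 1 for k in range(n)]
        for q in range(n)
    ]


def last_token_index(attention_mask: list[int]) -> int:
    """Index of the last real token, found via the flipped mask."""
    flipped = attention_mask[::-1]
    first_one = flipped.index(1)  # raises ValueError on an all-padding mask
    return len(attention_mask) - 1 - first_one


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]
```

Because the last real token is located from the mask rather than assumed to sit at a fixed position, the same pooling works for `[0, 0, 1, 1]` (left-padded) and `[1, 1, 0, 0]` (right-padded).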

Tensor shapes (all variants):

  • Input input_ids: [1, seq_len] int64
  • Input attention_mask: [1, seq_len] int64 (1 for real tokens, 0 for padding)
  • Output: [1, 1024] float32
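To make the input contract concrete, here is a small tokenizer-free sketch that pads a list of token ids to the fixed graph length and builds the matching mask. `pad_to_seq_len` is a hypothetical helper; the pad id should come from your tokenizer, and the value used in the usage note is a placeholder.

```python
def pad_to_seq_len(token_ids, seq_len, pad_id, side="left"):
    """Return (input_ids, attention_mask), each shaped [1][seq_len]."""
    if len(token_ids) > seq_len:
        raise ValueError(f"{len(token_ids)} tokens do not fit in seq_len={seq_len}")
    n_pad = seq_len - len(token_ids)
    pad_ids = [pad_id] * n_pad
    pad_mask = [0] * n_pad               # 0 marks padding
    real_mask = [1] * len(token_ids)     # 1 marks real tokens
    if side == "left":
        return [pad_ids + list(token_ids)], [pad_mask + real_mask]
    if side == "right":
        return [list(token_ids) + pad_ids], [real_mask + pad_mask]
    raise ValueError(f"side must be 'left' or 'right', got {side!r}")
```

For example, `pad_to_seq_len([7, 8, 9], 5, pad_id=0)` gives ids `[[0, 0, 7, 8, 9]]` and mask `[[0, 0, 1, 1, 1]]`. Both nested lists still need converting to int64 arrays (for instance with `np.asarray(x, dtype=np.int64)`) before being passed to the interpreter.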

Inference notes for the bridge

Two things the tokenizer/bridge layer has to get right; the exported graph doesn't handle either for you:

  1. EOS token appending. Qwen3-Embedding was trained with an [EOS] token appended to every input. If your tokenizer path doesn't auto-append, do it manually before padding: eos_token_id=151645 (<|im_end|> per tokenizer_config.json). The reference code in the upstream HF model card appends manually; sentence_transformers's default Transformer.tokenize does not.

  2. Query instruction prefix. For retrieval queries (not documents), the upstream recommends prefixing:

    Instruct: Given a web search query, retrieve relevant passages that answer the query
    Query: <your query>
    

    config_sentence_transformers.json declares this as the query prompt. Document-side inputs go in unmodified.
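Both bridge responsibilities can be sketched without a tokenizer. `format_query` and `ensure_eos` are hypothetical helpers; the prompt layout follows the template shown above, and the EOS id is the one quoted in this section.

```python
EOS_ID = 151645  # <|im_end|>, as quoted in this section
DEFAULT_TASK = (
    "Given a web search query, retrieve relevant passages that answer the query"
)


def format_query(query: str, task: str = DEFAULT_TASK) -> str:
    """Prefix a retrieval query with the instruction; documents skip this."""
    return f"Instruct: {task}\nQuery: {query}"


def ensure_eos(token_ids, max_len, eos_id=EOS_ID):
    """Return unpadded ids that end in exactly one EOS and fit in max_len."""
    ids = list(token_ids)
    if ids and ids[-1] == eos_id:
        ids = ids[:-1]           # drop it so truncation cannot cut it off
    ids = ids[: max_len - 1]     # leave one slot for EOS
    return ids + [eos_id]
```

Run `ensure_eos` on the unpadded token ids and pad afterwards, so the EOS token ends up as the last real token regardless of padding side.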

Reference Python usage

import numpy as np
from transformers import AutoTokenizer
from ai_edge_litert.interpreter import Interpreter

SEQ_LEN = 512
tok = AutoTokenizer.from_pretrained("ckg/qwen3-embedding-0.6b-litert",
                                    padding_side="left")
eos_id = tok.eos_token_id  # 151645 = <|im_end|>

interp = Interpreter(model_path=f"qwen3-embedding-0.6b_seq{SEQ_LEN}_int8.tflite")
interp.allocate_tensors()
# Assumes input 0 is input_ids and input 1 is attention_mask;
# check the "name" field of each entry if the order differs.
in_details  = interp.get_input_details()
out_details = interp.get_output_details()

def embed(text: str) -> np.ndarray:
    # Tokenize without padding, reserving one slot for the EOS token.
    ids = tok(text, truncation=True, max_length=SEQ_LEN - 1)["input_ids"]
    if not ids or ids[-1] != eos_id:
        ids = ids + [eos_id]  # the graph does not append EOS for you
    # Left-pad to the fixed graph length and build the mask from the real
    # token count, so it stays correct even if pad and EOS share an id.
    n_pad = SEQ_LEN - len(ids)
    input_ids = np.array([[tok.pad_token_id] * n_pad + ids], dtype=np.int64)
    attn = np.array([[0] * n_pad + [1] * len(ids)], dtype=np.int64)

    interp.set_tensor(in_details[0]["index"], input_ids)
    interp.set_tensor(in_details[1]["index"], attn)
    interp.invoke()
    return interp.get_tensor(out_details[0]["index"])[0]  # [1024]

q = embed("Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: what is a supernova?")
d = embed("A supernova is the bright explosion of a massive star at the end of its life.")
cos = float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))
print(f"cosine = {cos:.4f}")

Provenance

Converted with litert-torch (the supported successor to the deprecated ai-edge-torch) from the upstream Qwen/Qwen3-Embedding-0.6B checkpoint.

  • Source model: Qwen/Qwen3-Embedding-0.6B (595,776,512 parameters, 28 layers × 16 heads × 128 head_dim, 1024 hidden, 3072 intermediate, GQA 2:1, RoPE theta 1e6, tie_word_embeddings=true, vocab 151669)
  • Conversion script: convert_qwen3_embedding.py from wafer-systems/project-switchboard (the on-device/conversion/ directory on the gemma4-ondevice-20260410 branch)
  • litert-torch version: 0.8.0
  • PyTorch version: 2.9.1+cu128 (CPU path used for conversion; no CUDA required)
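The parameter count quoted for the source model can be reproduced from the listed dimensions. The arithmetic below assumes the usual Qwen3 block layout (bias-free q/k/v/o projections, a gated MLP with three matrices, two RMSNorms per layer, per-head q/k norms, and one final norm); that layout is an assumption here, not something this card states.

```python
VOCAB, HIDDEN, INTERMEDIATE = 151669, 1024, 3072
LAYERS, HEADS, HEAD_DIM = 28, 16, 128
KV_HEADS = HEADS // 2  # GQA 2:1

embedding = VOCAB * HIDDEN  # shared with lm_head (tie_word_embeddings=true)
attn = (
    HIDDEN * HEADS * HEAD_DIM            # q_proj
    + 2 * HIDDEN * KV_HEADS * HEAD_DIM   # k_proj and v_proj
    + HEADS * HEAD_DIM * HIDDEN          # o_proj
)
mlp = 3 * HIDDEN * INTERMEDIATE          # gate, up, and down projections
norms = 2 * HIDDEN + 2 * HEAD_DIM        # two RMSNorms plus q_norm and k_norm
per_layer = attn + mlp + norms

total = embedding + LAYERS * per_layer + HIDDEN  # + final norm
fp16_bytes = total * 2   # two bytes per parameter
int8_bytes = total       # one byte per parameter, before scales and metadata
```

Under these assumptions `total` comes to 595,776,512, matching the figure above, and `fp16_bytes` is about 1.19 GB, matching the fp16 file size. The int8 estimate is about 596 MB; the gap to the listed 603 MB is presumably quantization scales and tensors left unquantized, which was not verified here.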

The converter re-authors Qwen3-Embedding on top of litert-torch's Qwen3 chat example with three embedding-specific overrides:

  1. vocab_size=151669 (chat variant's 151936 was hard-coded);
  2. tensor-name mapping without the model. prefix (the embedding checkpoint saves Qwen3Model directly, not Qwen3ForCausalLM);
  3. lm_head=None in the tensor-names mapping (tied to tok_embedding via tie_word_embeddings=true).

License

Apache-2.0, propagated from the upstream Qwen/Qwen3-Embedding-0.6B model card. The model weights were not fine-tuned or otherwise altered beyond the int8/fp16 conversion listed under "Files"; these artifacts are re-encodings of the upstream checkpoint for a different runtime (LiteRT vs. HuggingFace Transformers).
