Qwen3-Embedding-0.6B → LiteRT

LiteRT (.tflite) exports of Qwen/Qwen3-Embedding-0.6B for on-device inference on Android (XNNPACK CPU and OpenCL GPU delegates) and other LiteRT-compatible runtimes.

Each artifact is a self-contained frozen graph: the encoder body, last-token pooling, and L2 normalization are all baked in. The consumer only needs to tokenize, pad, and feed input_ids + attention_mask; the graph returns the L2-normalized 1024-dim sentence embedding directly.

Files

| seq_len | quant | size | file | intended backend |
|---|---|---|---|---|
| 512 | dynamic_int8 | 603 MB | `qwen3-embedding-0.6b_seq512_int8.tflite` | LiteRT CPU (XNNPACK) |
| 2048 | dynamic_int8 | 603 MB | `qwen3-embedding-0.6b_seq2048_int8.tflite` | LiteRT CPU (XNNPACK) |
| 8192 | dynamic_int8 | 603 MB | `qwen3-embedding-0.6b_seq8192_int8.tflite` | LiteRT CPU (XNNPACK) |
| 512 | fp16 | 1.19 GB | `qwen3-embedding-0.6b_seq512_fp16.tflite` | LiteRT GPU (OpenCL) |
| 2048 | fp16 | 1.19 GB | `qwen3-embedding-0.6b_seq2048_fp16.tflite` | LiteRT GPU (OpenCL) |
| 8192 | fp16 | 1.19 GB | `qwen3-embedding-0.6b_seq8192_fp16.tflite` | LiteRT GPU (OpenCL) |

seq_len is baked into each graph — if you need a different length you will have to either pad shorter inputs up to the closest larger variant (wastes compute but always correct), or re-run the converter (see "Provenance" below).
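A minimal sketch of the "pad up to the closest larger variant" rule, assuming the three sequence lengths and the file-name pattern from the table above. The helper name `pick_variant` is hypothetical and not part of any shipped tooling.

```python
from bisect import bisect_left

# Sequence lengths published in the Files table.
SEQ_LENS = (512, 2048, 8192)

def pick_variant(n_tokens: int, quant: str = "int8") -> str:
    """Return the file name of the smallest variant that fits n_tokens.

    n_tokens should already include the appended EOS token.
    (Hypothetical helper, for illustration only.)
    """
    if quant not in ("int8", "fp16"):
        raise ValueError(f"unknown quant: {quant!r}")
    i = bisect_left(SEQ_LENS, n_tokens)
    if i == len(SEQ_LENS):
        raise ValueError(
            f"{n_tokens} tokens exceeds the largest variant ({SEQ_LENS[-1]})"
        )
    return f"qwen3-embedding-0.6b_seq{SEQ_LENS[i]}_{quant}.tflite"

print(pick_variant(300))           # fits the seq512 graph
print(pick_variant(513, "fp16"))   # one token too long for seq512
```

Inputs longer than 8192 tokens have no matching variant here; truncating them or re-running the converter is left to the caller.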

dynamic_int8 = weight-quantized int8 matmuls via XNNPACK (best on ARM CPUs). fp16 = half-precision throughout (best on mobile GPUs that dispatch fp16 kernels).
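To make "weight-quantized int8" concrete, here is an illustrative NumPy sketch of symmetric per-output-channel int8 weight quantization. It is an assumption about the general technique, not the converter's exact recipe or XNNPACK's kernel behavior; the function names are hypothetical.

```python
import numpy as np

def quantize_weights_int8(w: np.ndarray):
    """Symmetric per-row int8 quantization (illustrative scheme, assumed).

    w has shape [out_features, in_features]. Returns (q, scale) such that
    w is approximately q * scale, with one float scale per output row.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)   # guard against all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float32 weight matrix."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16)).astype(np.float32)
q, scale = quantize_weights_int8(w)
err = np.abs(w - dequantize(q, scale)).max()
print(f"int8 bytes: {q.nbytes}, fp32 bytes: {w.nbytes}, max abs error: {err:.5f}")
```

Storing one byte per weight instead of two is consistent with the int8 files being roughly half the size of the fp16 ones.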

Numerics validation

Validated against the upstream SentenceTransformer("Qwen/Qwen3-Embedding-0.6B") loaded in fp32, over a 6-string test suite (English + Japanese + near-pair sentences like "cats are better pets than dogs" vs. "dogs are better pets than cats"):

| artifact | mean cosine | min cosine | threshold |
|---|---|---|---|
| `..._seq512_int8.tflite` | (≥ 0.99) | ≥ 0.98 | 0.98 |
| `..._seq512_fp16.tflite` | (≈ 1.000) | ≥ 0.999 | 0.999 |
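The comparison behind such a table can be sketched as follows. This assumes the threshold is applied to the minimum per-string cosine between reference and exported embeddings; the function name `cosine_report` and the synthetic data are hypothetical and do not reproduce the actual test suite.

```python
import numpy as np

def cosine_report(ref, test, threshold: float) -> dict:
    """Row-wise cosine between reference and test embeddings.

    ref and test have shape [n_strings, dim]. The pass rule (min cosine
    at or above threshold) is an assumption about how the table was built.
    """
    ref = np.asarray(ref, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    cos = np.sum(ref * test, axis=1) / (
        np.linalg.norm(ref, axis=1) * np.linalg.norm(test, axis=1)
    )
    return {
        "mean": float(cos.mean()),
        "min": float(cos.min()),
        "passed": bool(cos.min() >= threshold),
    }

# Synthetic stand-ins for 6 fp32 reference embeddings and a slightly
# perturbed export; real use would feed model outputs instead.
rng = np.random.default_rng(0)
ref = rng.normal(size=(6, 1024))
test = ref + 1e-3 * rng.normal(size=ref.shape)
print(cosine_report(ref, test, threshold=0.999))
```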

Longer-seq artifacts were not validated explicitly: the seq=2048 and seq=8192 tflite interpreters take ~35 s and ~120 s per string on a desktop CPU, which is prohibitive for a 6×6 matrix. They come from the same converter, the same weights, and the same graph logic; only the shapes of the intermediate activations differ. The seq=512 results are therefore expected to carry over, but that is an inference rather than a measurement: accumulated int8/fp16 rounding over longer sequences has not been checked.

Padding-side agnostic: validated at both left- and right-padding — cos=1.000000 at matched fp32 precision, both sides.

Architecture details (important for any consumer)

Qwen3-Embedding is a decoder-only model (Qwen3ForCausalLM architecture, per arxiv:2506.05176v3: "we utilize LLMs with causal attention, appending an [EOS] token at the end of the input sequence"). Consequences baked into the exported graph:

Causal self-attention — the exported mask is a combined causal+padding mask (upper-triangle suppression + pad-key suppression), not a bidirectional padding-only mask.
Last-token pool — uses the sentence-transformers formulation (attention_mask.flip(1).max(1) to locate the last real-token index), which is correct under left-, right-, or mixed-padding.
L2 normalize — the output is unit-norm, ready for cosine-similarity retrieval.
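The mask and pooling logic described above can be sketched in NumPy. This is a reference illustration under stated assumptions, not the exported graph itself: the function names are hypothetical, and how the real graph treats pad-query rows is not specified here.

```python
import numpy as np

def combined_mask(attention_mask) -> np.ndarray:
    """[batch, seq] 0/1 mask -> [batch, seq, seq] boolean mask.

    True at [b, i, j] means query position i may attend key position j:
    j must not be in the future (causal) and must be a real token.
    Pad-query rows can end up fully masked in this sketch.
    """
    m = np.asarray(attention_mask).astype(bool)
    seq = m.shape[1]
    causal = np.tril(np.ones((seq, seq), dtype=bool))  # key j <= query i
    return causal[None, :, :] & m[:, None, :]

def last_token_index(attention_mask) -> np.ndarray:
    """Index of the last real token per row, for any padding side."""
    m = np.asarray(attention_mask)
    seq = m.shape[1]
    flipped = m[:, ::-1]
    idx = flipped.argmax(axis=1)  # offset of the last 1 from the right edge
    # Rows with no real tokens fall back to position 0.
    idx = np.where(flipped.max(axis=1) == 0, seq - 1, idx)
    return seq - 1 - idx

masks = np.array([
    [1, 1, 1, 0, 0],   # right-padded
    [0, 0, 1, 1, 1],   # left-padded
    [0, 1, 1, 0, 0],   # mixed
])
print(last_token_index(masks))   # positions 2, 4, 2
print(combined_mask(masks)[1, 4])  # last query of the left-padded row
```

Flipping the mask before taking the argmax is what makes the lookup independent of padding side: the first 1 in the reversed row is the last real token in the original row.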

Tensor shapes (all variants):

Input input_ids: [1, seq_len] int64
Input attention_mask: [1, seq_len] int64 (1 for real tokens, 0 for padding)
Output: [1, 1024] float32

Inference notes for the bridge

Two things the tokenizer/bridge layer has to get right; the exported graph doesn't handle either for you:

EOS token appending. Qwen3-Embedding was trained with an [EOS] token appended to every input. If your tokenizer path doesn't auto-append, do it manually before padding — eos_token_id=151645 (<|im_end|> per tokenizer_config.json). The reference code in the upstream HF model card appends manually; sentence_transformers's default Transformer.tokenize does not.
Query instruction prefix. For retrieval queries (not documents), the upstream recommends prefixing:
```
Instruct: Given a web search query, retrieve relevant passages that answer the query
Query: <your query>
```
config_sentence_transformers.json declares this as the query prompt. Document-side inputs go in unmodified.
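A small sketch of the query/document asymmetry, using the prompt format shown above. The helper names `format_query` and `format_document` are hypothetical; the default task string is the one quoted in this section.

```python
DEFAULT_TASK = (
    "Given a web search query, retrieve relevant passages that answer the query"
)

def format_query(query: str, task: str = DEFAULT_TASK) -> str:
    """Prefix a retrieval query with the instruction prompt."""
    return f"Instruct: {task}\nQuery: {query}"

def format_document(text: str) -> str:
    """Documents are embedded unmodified."""
    return text

print(format_query("what is a supernova?"))
print(format_document("A supernova is the bright explosion of a massive star."))
```

Keeping both sides behind explicit functions makes it harder to embed a query without its prefix by accident.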

Reference Python usage

import numpy as np
from transformers import AutoTokenizer
from ai_edge_litert.interpreter import Interpreter

SEQ_LEN = 512
tok = AutoTokenizer.from_pretrained("ckg/qwen3-embedding-0.6b-litert",
                                    padding_side="left")
eos_id = tok.eos_token_id  # 151645 = <|im_end|>

interp = Interpreter(model_path=f"qwen3-embedding-0.6b_seq{SEQ_LEN}_int8.tflite")
interp.allocate_tensors()
in_details  = interp.get_input_details()
out_details = interp.get_output_details()

def embed(text: str) -> np.ndarray:
    # Tokenize without padding, leaving one slot free for the EOS token.
    ids = tok(text, truncation=True, max_length=SEQ_LEN - 1)["input_ids"]
    if not ids or ids[-1] != eos_id:
        ids = ids + [eos_id]
    # Left-pad to the fixed graph length. The mask is built from the pad
    # count, so it stays correct even if pad and EOS share a token id.
    n_pad = SEQ_LEN - len(ids)
    input_ids = np.array([[tok.pad_token_id] * n_pad + ids], dtype=np.int64)
    attn = np.array([[0] * n_pad + [1] * len(ids)], dtype=np.int64)

    # Assumes input 0 is input_ids and input 1 is attention_mask; check
    # in_details[i]["name"] if the interpreter reports a different order.
    interp.set_tensor(in_details[0]["index"], input_ids)
    interp.set_tensor(in_details[1]["index"], attn)
    interp.invoke()
    return interp.get_tensor(out_details[0]["index"])[0]  # [1024]

q = embed("Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: what is a supernova?")
d = embed("A supernova is the bright explosion of a massive star at the end of its life.")
cos = float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))
print(f"cosine = {cos:.4f}")
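Because the outputs are unit-norm, ranking a set of documents against a query reduces to a matrix-vector product. The sketch below uses synthetic unit vectors in place of real embeddings; `top_k` is a hypothetical helper name.

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 3):
    """Rank documents by dot product (equal to cosine for unit-norm rows).

    query_vec: [dim], doc_matrix: [n_docs, dim].
    Returns a list of (doc_index, score), best first.
    """
    scores = doc_matrix @ query_vec
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Synthetic stand-ins for embeddings, normalized to unit length.
docs = np.eye(3, 8)
query = np.zeros(8)
query[:2] = [0.9, 0.1]
query /= np.linalg.norm(query)

for idx, score in top_k(query, docs, k=2):
    print(idx, round(score, 4))
```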

Provenance

Converted with litert-torch (the supported successor to the deprecated ai-edge-torch) from the upstream Qwen/Qwen3-Embedding-0.6B checkpoint.

Source model: Qwen/Qwen3-Embedding-0.6B (595,776,512 parameters, 28 layers × 16 heads × 128 head_dim, 1024 hidden, 3072 intermediate, GQA 2:1, RoPE theta 1e6, tie_word_embeddings=true, vocab 151669)
Conversion script: convert_qwen3_embedding.py from wafer-systems/project-switchboard (the on-device/conversion/ directory on the gemma4-ondevice-20260410 branch)
litert-torch version: 0.8.0
PyTorch version: 2.9.1+cu128 (CPU path used for conversion; no CUDA required)
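As a sanity check, the parameter count above can be reproduced from the listed dimensions. The breakdown assumes the usual Qwen3 layer layout, which is not spelled out in this card: bias-free q/k/v/o projections, a gated MLP with three projections, two hidden-size norms plus per-head q/k norms of size head_dim in each layer, one final norm, and no separate lm_head because of tied embeddings.

```python
vocab, hidden, inter = 151669, 1024, 3072
layers, heads, head_dim = 28, 16, 128
kv_heads = heads // 2                      # GQA 2:1

embed = vocab * hidden                     # token embedding (tied with lm_head)
attn = (
    hidden * heads * head_dim              # q_proj
    + 2 * hidden * kv_heads * head_dim     # k_proj and v_proj
    + heads * head_dim * hidden            # o_proj
)
mlp = 3 * hidden * inter                   # gate, up, down projections
norms = 2 * hidden + 2 * head_dim          # layer norms plus q/k norms
per_layer = attn + mlp + norms

total = embed + layers * per_layer + hidden   # + final norm
assert total == 595_776_512

print(f"parameters: {total:,}")
print(f"fp16 weights: {total * 2 / 1e9:.2f} GB")   # 2 bytes per parameter
print(f"int8 weights: {total / 1e6:.0f} MB before scales and graph overhead")
```

The fp16 figure lines up with the 1.19 GB files; the int8 figure lands just under the 603 MB file size, which leaves room for quantization scales and graph metadata.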

The converter re-authors Qwen3-Embedding on top of litert-torch's Qwen3 chat example with three embedding-specific overrides:

vocab_size=151669 (chat variant's 151936 was hard-coded);
tensor-name mapping without the model. prefix (the embedding checkpoint saves Qwen3Model directly, not Qwen3ForCausalLM);
lm_head=None in the tensor-names mapping (tied to tok_embedding via tie_word_embeddings=true).

License

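Apache-2.0, propagated from the upstream Qwen/Qwen3-Embedding-0.6B model card. The weights were not fine-tuned or otherwise altered beyond the precision conversion (fp16 cast or dynamic int8 weight quantization). These artifacts are re-encodings of the upstream checkpoint for a different runtime (LiteRT vs. HuggingFace Transformers) and match it within the tolerances listed under "Numerics validation" rather than bit-for-bit.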

Paper

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv:2506.05176, published Jun 5, 2025.