Qwen3-Embedding-0.6B - LiteRT
LiteRT (.tflite) exports of Qwen/Qwen3-Embedding-0.6B
for on-device inference on Android (XNNPACK CPU and OpenCL GPU delegates)
and other LiteRT-compatible runtimes.
Each artifact is a self-contained frozen graph: the encoder body,
last-token pooling, and L2 normalization are all baked in. The consumer
only needs to tokenize, pad, and feed input_ids + attention_mask; the
graph returns the L2-normalized 1024-dim sentence embedding directly.
Files
| seq_len | quant | size | file | intended backend |
|---|---|---|---|---|
| 512 | dynamic_int8 | 603 MB | qwen3-embedding-0.6b_seq512_int8.tflite | LiteRT CPU (XNNPACK) |
| 2048 | dynamic_int8 | 603 MB | qwen3-embedding-0.6b_seq2048_int8.tflite | LiteRT CPU (XNNPACK) |
| 8192 | dynamic_int8 | 603 MB | qwen3-embedding-0.6b_seq8192_int8.tflite | LiteRT CPU (XNNPACK) |
| 512 | fp16 | 1.19 GB | qwen3-embedding-0.6b_seq512_fp16.tflite | LiteRT GPU (OpenCL) |
| 2048 | fp16 | 1.19 GB | qwen3-embedding-0.6b_seq2048_fp16.tflite | LiteRT GPU (OpenCL) |
| 8192 | fp16 | 1.19 GB | qwen3-embedding-0.6b_seq8192_fp16.tflite | LiteRT GPU (OpenCL) |
seq_len is baked into each graph. If you need a different length, either
pad shorter inputs up to the closest larger variant (wastes compute but
is always correct) or re-run the converter (see "Provenance" below).
dynamic_int8 = weight-quantized int8 matmuls via XNNPACK (best on ARM
CPUs). fp16 = half-precision throughout (best on mobile GPUs that
dispatch fp16 kernels).
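Because seq_len is fixed per artifact, a consumer will usually want the smallest variant that still fits the tokenized input. A minimal sketch of that selection, using the filenames from the table above. The helper name pick_artifact and its arguments are illustrative and not part of this repository.

```python
SEQ_LENS = (512, 2048, 8192)  # the three exported sequence lengths


def pick_artifact(n_tokens: int, quant: str = "int8") -> str:
    """Return the filename of the smallest variant that fits n_tokens.

    n_tokens should already include the appended EOS token.
    quant is "int8" (CPU / XNNPACK) or "fp16" (GPU / OpenCL).
    """
    if quant not in ("int8", "fp16"):
        raise ValueError(f"unknown quant: {quant!r}")
    for seq_len in SEQ_LENS:
        if n_tokens <= seq_len:
            return f"qwen3-embedding-0.6b_seq{seq_len}_{quant}.tflite"
    raise ValueError(f"{n_tokens} tokens exceed the largest variant ({SEQ_LENS[-1]})")


print(pick_artifact(300))          # qwen3-embedding-0.6b_seq512_int8.tflite
print(pick_artifact(513, "fp16"))  # qwen3-embedding-0.6b_seq2048_fp16.tflite
```

Inputs shorter than the chosen seq_len still have to be padded up to it before they are fed to the graph.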
Numerics validation
Validated against the upstream SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
loaded in fp32, over a 6-string test suite (English + Japanese + near-pair
sentences like "cats are better pets than dogs" vs. "dogs are better pets
than cats"):
| artifact | mean cosine | min cosine | threshold |
|---|---|---|---|
| ..._seq512_int8.tflite | (≥ 0.99) | ≥ 0.98 | 0.98 |
| ..._seq512_fp16.tflite | (≈ 1.000) | ≥ 0.999 | 0.999 |
Longer-seq artifacts were not validated explicitly: the seq=2048/8192 tflite interpreter on CPU takes ~35s / ~120s per string, which is prohibitive on desktop for a 6×6 matrix. They come from the same converter, the same weights, and the same graph logic; only the shape of intermediate activations differs. Since seq=512 passes at these thresholds, the longer variants are expected to pass as well.
Padding-side agnostic: validated with both left- and right-padding, giving cos=1.000000 at matched fp32 precision on both sides.
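The exact validation script is not shown in this card, so the following is only a sketch of how a mean/min cosine check against a threshold could be computed. It assumes each test string yields one reference embedding and one exported-graph embedding, compared pairwise. The function names are hypothetical and the vectors are toy data, not real model outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def validate(ref_embs, test_embs, threshold):
    """Compare paired embeddings; return (mean cosine, min cosine, passed)."""
    sims = [cosine(r, t) for r, t in zip(ref_embs, test_embs)]
    mean_sim = sum(sims) / len(sims)
    min_sim = min(sims)
    return mean_sim, min_sim, min_sim >= threshold


# Toy 2-dim vectors standing in for reference and exported embeddings.
ref = [[1.0, 0.0], [0.0, 1.0]]
exported = [[1.0, 0.0], [0.1, 1.0]]
mean_sim, min_sim, passed = validate(ref, exported, threshold=0.98)
print(f"mean={mean_sim:.4f} min={min_sim:.4f} passed={passed}")
```

Gating on the minimum rather than the mean means a single badly degraded string fails the check even when the average looks fine.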
Architecture details (important for any consumer)
Qwen3-Embedding is a decoder-only model (Qwen3ForCausalLM architecture,
per arxiv:2506.05176v3: "we utilize
LLMs with causal attention, appending an [EOS] token at the end of the
input sequence"). Consequences baked into the exported graph:
- Causal self-attention: the exported mask is a combined causal+padding mask (upper-triangle suppression + pad-key suppression), not a bidirectional padding-only mask.
- Last-token pool: uses the sentence-transformers formulation (attention_mask.flip(1).max(1) to locate the last real-token index), which is correct under left-, right-, or mixed-padding.
- L2 normalize: the output is unit-norm, ready for cosine-similarity retrieval.
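The three steps above can be illustrated in plain Python on a single sequence. This is a conceptual sketch, not the exported graph's implementation: the function names are made up for illustration, and a real graph may use a large finite negative value rather than -inf in the mask.

```python
NEG_INF = float("-inf")


def combined_mask(attention_mask):
    """Additive [seq, seq] mask: 0.0 where query i may attend to key j.

    Key j is visible to query i only if j <= i (causal) and j is a real token.
    """
    n = len(attention_mask)
    return [[0.0 if (j <= i and attention_mask[j] == 1) else NEG_INF
             for j in range(n)]
            for i in range(n)]


def last_token_index(attention_mask):
    """Index of the last real token: flip the mask and find the first 1."""
    n = len(attention_mask)
    flipped = attention_mask[::-1]
    return n - 1 - flipped.index(1)


def l2_normalize(vec):
    """Scale a vector to unit Euclidean norm."""
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec]


print(last_token_index([0, 0, 1, 1]))  # left-padded  -> 3
print(last_token_index([1, 1, 0, 0]))  # right-padded -> 1
print(combined_mask([1, 1, 0])[1])     # [0.0, 0.0, -inf]
```

Locating the last 1 by flipping the mask is what makes the pooling independent of padding side: it never assumes the real tokens sit at either end.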
Tensor shapes (all variants):
- Input input_ids: [1, seq_len] int64
- Input attention_mask: [1, seq_len] int64 (1 for real tokens, 0 for padding)
- Output: [1, 1024] float32
Inference notes for the bridge
Two things the tokenizer/bridge layer has to get right; the exported graph doesn't handle either for you:
EOS token appending. Qwen3-Embedding was trained with an [EOS] token appended to every input. If your tokenizer path doesn't auto-append, do it manually before padding: eos_token_id=151645 (<|im_end|> per tokenizer_config.json). The reference code in the upstream HF model card appends manually; sentence_transformers's default Transformer.tokenize does not.

Query instruction prefix. For retrieval queries (not documents), the upstream recommends prefixing:
Instruct: Given a web search query, retrieve relevant passages that answer the query
Query: <your query>
config_sentence_transformers.json declares this as the query prompt. Document-side inputs go in unmodified.
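Both responsibilities can be sketched independently of any tokenizer library. The helpers below are hypothetical (format_query and prepare_inputs are not part of this repository) and operate on plain token-id lists. The ids in the example are toy values; in practice eos_id and pad_id come from the tokenizer. The prompt string follows the form used in the reference usage in this card.

```python
DEFAULT_TASK = ("Given a web search query, retrieve relevant passages "
                "that answer the query")


def format_query(query: str, task: str = DEFAULT_TASK) -> str:
    """Wrap a retrieval query in the instruction prefix; documents skip this."""
    return f"Instruct: {task}\nQuery: {query}"


def prepare_inputs(token_ids, eos_id, pad_id, seq_len, padding_side="left"):
    """Ensure a trailing EOS, then pad to seq_len.

    Returns (input_ids, attention_mask) as plain lists of length seq_len.
    """
    ids = list(token_ids)
    if ids and ids[-1] == eos_id:
        ids = ids[:-1]                    # strip EOS so truncation cannot drop it
    ids = ids[:seq_len - 1] + [eos_id]    # truncate, then re-append EOS
    n_pad = seq_len - len(ids)
    if padding_side == "left":
        return [pad_id] * n_pad + ids, [0] * n_pad + [1] * len(ids)
    return ids + [pad_id] * n_pad, ids_mask(len(ids), n_pad)


def ids_mask(n_real, n_pad):
    """Right-padded attention mask: ones for real tokens, zeros for padding."""
    return [1] * n_real + [0] * n_pad


print(format_query("what is a supernova?"))
print(prepare_inputs([5, 6, 7], eos_id=9, pad_id=0, seq_len=6))
# ([0, 0, 5, 6, 7, 9], [0, 0, 1, 1, 1, 1])
```

Truncating to seq_len - 1 before appending EOS guarantees that the EOS token survives even for over-long inputs.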
Reference Python usage
import numpy as np
from transformers import AutoTokenizer
from ai_edge_litert.interpreter import Interpreter

SEQ_LEN = 512
tok = AutoTokenizer.from_pretrained("ckg/qwen3-embedding-0.6b-litert",
                                    padding_side="left")
eos_id = tok.eos_token_id  # 151645 = <|im_end|>
pad_id = tok.pad_token_id

interp = Interpreter(model_path=f"qwen3-embedding-0.6b_seq{SEQ_LEN}_int8.tflite")
interp.allocate_tensors()
in_details = interp.get_input_details()
out_details = interp.get_output_details()

def embed(text: str) -> np.ndarray:
    # Tokenize without padding, leaving one slot free for the EOS token.
    ids = tok(text, truncation=True, max_length=SEQ_LEN - 1)["input_ids"]
    if not ids or ids[-1] != eos_id:
        ids = ids + [eos_id]  # append EOS if the tokenizer did not
    # Left-pad manually so the EOS token is never truncated away.
    n_pad = SEQ_LEN - len(ids)
    input_ids = np.array([[pad_id] * n_pad + ids], dtype=np.int64)
    attn = np.array([[0] * n_pad + [1] * len(ids)], dtype=np.int64)
    # Assumes input 0 is input_ids and input 1 is attention_mask.
    interp.set_tensor(in_details[0]["index"], input_ids)
    interp.set_tensor(in_details[1]["index"], attn)
    interp.invoke()
    return interp.get_tensor(out_details[0]["index"])[0]  # [1024]

q = embed("Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: what is a supernova?")
d = embed("A supernova is the bright explosion of a massive star at the end of its life.")
cos = float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))
print(f"cosine = {cos:.4f}")
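Since the graph already returns unit-norm vectors, ranking documents for a query reduces to a dot product. A small sketch of that step on plain Python lists. The rank helper is illustrative, and the 2-dim vectors are toy stand-ins for the 1024-dim embeddings.

```python
def rank(query_emb, doc_embs):
    """Return (doc_index, score) pairs sorted by descending dot product.

    Assumes all embeddings are already L2-normalized, so the dot product
    equals the cosine similarity.
    """
    scores = [sum(q * d for q, d in zip(query_emb, doc)) for doc in doc_embs]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


query = [1.0, 0.0]
docs = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]
print(rank(query, docs))  # [(1, 1.0), (2, 0.6), (0, 0.0)]
```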
Provenance
Converted with litert-torch
(the supported successor to the deprecated ai-edge-torch) from the
upstream Qwen/Qwen3-Embedding-0.6B checkpoint.
- Source model: Qwen/Qwen3-Embedding-0.6B (595,776,512 parameters, 28 layers × 16 heads × 128 head_dim, 1024 hidden, 3072 intermediate, GQA 2:1, RoPE theta 1e6, tie_word_embeddings=true, vocab 151669)
- Conversion script: convert_qwen3_embedding.py from wafer-systems/project-switchboard (the on-device/conversion/ directory on the gemma4-ondevice-20260410 branch)
- litert-torch version: 0.8.0
- PyTorch version: 2.9.1+cu128 (CPU path used for conversion; no CUDA required)
The converter re-authors Qwen3-Embedding on top of litert-torch's Qwen3
chat example with three embedding-specific overrides:
- vocab_size=151669 (chat variant's 151936 was hard-coded);
- tensor-name mapping without the model. prefix (the embedding checkpoint saves Qwen3Model directly, not Qwen3ForCausalLM);
- lm_head=None in the tensor-names mapping (tied to tok_embedding via tie_word_embeddings=true).
License
Apache-2.0, propagated from the upstream Qwen/Qwen3-Embedding-0.6B model
card. The model weights were not fine-tuned or otherwise modified beyond
the int8/fp16 conversion described above; these artifacts are re-encodings
of the upstream checkpoint for a different runtime (LiteRT vs. HuggingFace
Transformers).