Instructions to use runanywhere/embeddinggemma_300m_HNPU with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- sentence-transformers
How to use runanywhere/embeddinggemma_300m_HNPU with sentence-transformers:
from sentence_transformers import SentenceTransformer model = SentenceTransformer("runanywhere/embeddinggemma_300m_HNPU") sentences = [ "That is a happy person", "That is a happy dog", "That is a very happy person", "Today is a sunny day" ] embeddings = model.encode(sentences) similarities = model.similarity(embeddings, embeddings) print(similarities.shape) # [4, 4] - Notebooks
- Google Colab
- Kaggle
EmbeddingGemma-300M — Hexagon NPU bundle (QHexRT, v79 + v81)
On-device text embeddings on the Qualcomm Hexagon NPU. google/embeddinggemma-300m — a Gemma-3-270M backbone
turned into a bidirectional text encoder producing a 768-d sentence vector — compiled to run with
QHexRT via the qhx_embed tool. This is the first embedding
family bundle (alongside the LLM/VLM/ASR/TTS *_HNPU repos).
Arch-pinned, multi-arch: v79/ = Hexagon v79 (Snapdragon 8 Elite / SM8750, soc_model 69); v81/ = Hexagon
v81 (Snapdragon 8 Elite Gen 5 / SM8850, soc_model 87). A context binary loads only on its target arch — pick the
<arch>/ dir matching your device. Both ship the identical file set (same int16-acts route).
What's optimized / how it was built
- int16 activations + int8 per-channel weights, with the graph I/O boundary kept FLOAT_32. Plain fp16
activations overflow to non-finite on this bidirectional encoder (the upstream ONNX card warns of it),
so the bundle is int16-acts quantized (
--act_bitwidth 16 --weights_bitwidth 8 --use_per_channel_quantization --preserve_io datatype, calibrated on a dozen query/document samples). int8 weights ⇒ the 108 MB bin. - Single encoder graph (no KV cache, no decode loop), O3 + 8 MB VTCM,
spill_bytes=0. The pool → 2× Dense → L2 sentence head + the sqrt(768) embed scaling run host-side (embed_gemma_encode); the 262k-row embed table is a host gather.
Files (v79/)
| file | role |
|---|---|
embeddinggemma-300m.json |
QHexRT manifest (the declarative run plan) |
embgemma_enc_f16.bin |
the int16-acts encoder context binary (graph name embgemma_enc_f16) |
embeddinggemma_embed_f16.bin |
host embed table [262144, 768] fp16, pre-scaled by sqrt(768) |
embeddinggemma_dense1_f32.bin |
sentence-head Dense 1 [3072, 768] f32 |
embeddinggemma_dense2_f32.bin |
sentence-head Dense 2 [768, 3072] f32 |
tokenizer.json |
Gemma tokenizer |
embedding_gold.json |
fp32 reference vectors (query/pos/neg) for the device cosine gate |
The QNN runtime libs (libQnnHtp.so / libQnnSystem.so + the v79 HTP skel) come from the QAIRT SDK, not
this repo.
Run
hf download runanywhere/embeddinggemma_300m_HNPU --local-dir eg
adb push eg/v79 /data/local/tmp/wq/eg # PowerShell + native paths on Windows
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq;/vendor/dsp/cdsp'; \
LD_LIBRARY_PATH=. ./qhx_embed eg/embeddinggemma-300m.json libQnnHtp.so libQnnSystem.so eg \
'What is the capital of France?' 'Paris is the capital of France.'"
qhx_embed prepends the task prefix (task: search result | query: for queries, title: none | text: for
documents), tokenizes (BOS + text + EOS), runs the encoder, and prints the L2-normalized 768-d vector (EMB:)
plus, with two texts, their cosine.
Measured (Samsung S25 / SM8750 / Hexagon v79)
- Cosine 0.985 vs the fp32 sentence-transformers gold (int16-acts quantization noise; the offline fp32 export gate is 1.000).
- Retrieval ranking preserved: sim(query, relevant)=0.595 ≫ sim(query, irrelevant)=0.054 — matches the gold. Ranking is what matters for retrieval, and it is intact.
- ~22 ms end-to-end @ 256 tokens (NPU encoder 14.6 ms; the rest is host tokenize/gather/pool/Dense/L2).
Caveats (honest)
- Quantized, not bit-exact. 0.985 cosine ≠ 1.0 — expected for a quantized embedder; use it for retrieval/ similarity, not as a drop-in fp32 oracle.
- Fixed 256-token window. Inputs are padded/truncated to 256; longer documents are truncated (chunk to ≤256 tokens, or wait for the 512/1024/2048 buckets).
- Matryoshka (512/256/128) is supported by the model but this bundle emits the full 768-d vector.
License
Derived from google/embeddinggemma-300m, under the Gemma Terms of Use
(NOT Apache). Modified: converted to a Hexagon-v79 QHexRT bundle (int16-activation quantized; host-side
pooling/Dense head). Use is subject to the Gemma Terms of Use and the
Gemma Prohibited Use Policy.
Measured (SM8850 / Hexagon v81)
- Cosine 0.9836 vs the fp32 sentence-transformers gold (matches v79's 0.985 — int16-acts quantization noise; the offline fp32 export gate is 1.000000).
- Retrieval ranking preserved: sim(query, relevant "Paris is the capital…")=0.618 ≫ sim(query, irrelevant "Photosynthesis…")=0.056. Device-validated on SM8850 (8977b1dd).
- NPU encoder ~17 ms @ 256 tokens. Same int16-acts bin route as v79, recompiled for v81 (soc_model 87) with
the
{O:3,vtcm_mb:8,dlbc:1}graph-config.
- Downloads last month
- 37
Model tree for runanywhere/embeddinggemma_300m_HNPU
Base model
google/embeddinggemma-300m