siglip2-base-256: QHexRT NPU bundle (Hexagon v79 + v81)
Precompiled SigLIP2-base-patch16-256 image-text embedding for the QHexRT runtime on Qualcomm Hexagon.
A dual-tower contrastive embedder: a vision encoder and a text encoder map an image and a caption into the
same 768-d space, so cosine(image, text) ranks photos against a text query. This is the building block for
photo search and zero-shot classification.
Both towers are Qualcomm-converted qnn_context_binary (fp16) graphs running entirely on the NPU with zero
custom ops. The text tower's 256k token-embedding gather (HTP-hostile) is removed from the graph and done
host-side.
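A minimal sketch of what the host-side gather amounts to, assuming the table is stored row-major as little-endian f16 (the [256000,768] layout comes from the Contents table; the byte order and the names `gather_rows` and `decode_f16` are assumptions for illustration, not QHexRT API). It seeks to each token's row instead of loading the whole table:

```python
import struct

def gather_rows(table_path, token_ids, dim=768, vocab=256000):
    """Return the raw f16 rows for token_ids, concatenated in order."""
    row_bytes = dim * 2  # 2 bytes per f16 value
    out = bytearray()
    with open(table_path, "rb") as f:
        for tid in token_ids:
            if not 0 <= tid < vocab:
                raise ValueError(f"token id {tid} outside [0, {vocab})")
            f.seek(tid * row_bytes)  # row-major: row i starts at i * row_bytes
            row = f.read(row_bytes)
            if len(row) != row_bytes:
                raise ValueError("table file is shorter than expected")
            out += row
    return bytes(out)

def decode_f16(buf):
    """Decode little-endian f16 bytes into Python floats (for inspection)."""
    return list(struct.unpack(f"<{len(buf) // 2}e", buf))
```

For a 64-token caption the result is 64 × 768 × 2 = 98,304 bytes, which matches the inputs_embeds[1,64,768] f16 input of the text tower.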
Arches shipped:
- v79/: Snapdragon 8 Elite / SM8750 (e.g. Galaxy S25). Device-validated: per-tower cosine 0.9998 vs the HF onnx-community/siglip2-base-patch16-256-ONNX reference; cross-modal cats.jpg cosine [0.134, 0.030, -0.006] for cats / dog / car (HF gold [0.135, 0.029, -0.008]), cats ranks #1.
- v81/: Snapdragon 8 Elite Gen 5 / SM8850 (soc_model 87). Device-validated on SM8850: cross-modal discrimination is correct both ways: "a photo of a flower" → flower 0.131 vs china 0.022; "the great wall of china" → china 0.091 vs flower 0.016. Encoder latency ≈ 3.6 ms (text) / 6.1 ms (vision). Compiled via the DLC route with the {"O":3,"vtcm_mb":8,"dlbc":1} HTP graph-config (required on v81: without it the graph finalizes in a degraded state and produces all-zero output).
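The SoC-to-directory pairing in the list above can be captured in a small lookup. This is a sketch only: it assumes the device reports its SoC name through the Android `ro.soc.model` system property (readable with `adb shell getprop ro.soc.model`) and that the string matches the SM numbers listed here. The `ARCH_DIRS` table and `arch_dir_for` are illustrative names, not part of QHexRT.

```python
# SoC name -> bundle directory, taken from the "Arches shipped" list.
ARCH_DIRS = {"SM8750": "v79", "SM8850": "v81"}

def arch_dir_for(soc_model):
    """Map an SoC name such as 'SM8750' to the matching bundle directory."""
    key = soc_model.strip().upper()
    if key not in ARCH_DIRS:
        raise ValueError(f"no precompiled context binary for SoC {soc_model!r}")
    return ARCH_DIRS[key]
```

Failing loudly on an unknown SoC is deliberate here, since a binary built for another arch will not load anyway.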
Arch-pinned: a context binary will not load on a different arch (the soc_model + dsp_arch are baked in). Pick the
<arch>/ dir matching your device.
Contents (v79/ and v81/: identical file set, arch-pinned bins)
| file | what |
|---|---|
| siglip2-base-256.json | QHexRT manifest (embedding family, two self-skipping host steps) |
| sig_vision.bin | vision encoder: pixel_values[1,3,256,256] f16 NCHW → pooler_output[1,768] |
| sig_text_ne.bin | text encoder (256k embed Gather removed): inputs_embeds[1,64,768] f16 + input_ids[1,64] i32 → pooler_output[1,768] |
| text_embed_f16.raw | [256000,768] f16 text-embed table (the host gathers it; the removed Gather) |
| tokenizer.json | Gemma tokenizer (vocab 256000) |
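The shapes in the table fix the size of every buffer, which gives a cheap sanity check before pushing files to a device. A small sketch of the arithmetic (the helper name `nbytes` is illustrative; f16 is 2 bytes per element and i32 is 4):

```python
from math import prod

BYTES = {"f16": 2, "i32": 4}

def nbytes(shape, dtype):
    """Byte size of a dense tensor with the given shape and element type."""
    return prod(shape) * BYTES[dtype]

SIZES = {
    "pixel_values": nbytes((1, 3, 256, 256), "f16"),     # 393,216
    "inputs_embeds": nbytes((1, 64, 768), "f16"),        # 98,304
    "input_ids": nbytes((1, 64), "i32"),                 # 256
    "pooler_output": nbytes((1, 768), "f16"),            # 1,536
    "text_embed_f16.raw": nbytes((256000, 768), "f16"),  # 393,216,000
}
```

If the table is stored as a bare dense array with no header (an assumption), text_embed_f16.raw should be 393,216,000 bytes on disk, which is exactly 375 MiB.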
Run (QHexRT CLI: qhx_clip photo search)
huggingface-cli download runanywhere/siglip2_base_HNPU --local-dir siglip2_base_HNPU
# QNN libs come from the QAIRT SDK (lib/aarch64-android) + the HTP skel for your arch; push them next to qhx_clip.
ARCH=v81 # or v79; must match your device
adb push siglip2_base_HNPU/$ARCH /data/local/tmp/wq/siglip
adb push photo1.jpg photo2.jpg /data/local/tmp/wq/siglip/
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq/dsp;/data/local/tmp/wq;/vendor/dsp/cdsp'; \
LD_LIBRARY_PATH=. ./qhx_clip siglip/siglip2-base-256.json libQnnHtp.so libQnnSystem.so siglip \
'a photo of two cats' siglip/photo1.jpg siglip/photo2.jpg"
# -> ranks the images by cosine to the query; prints the best match
Usage: qhx_clip <manifest> libQnnHtp.so libQnnSystem.so <root> "<query>" <img1> [img2 ...]. The text query is
embedded by the text tower and each image by the vision tower (each of the two host ops self-skips on the other
modality); the images are then ranked by cosine similarity to the query (both vectors are L2-normalized).
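Because both vectors are L2-normalized, cosine similarity reduces to a plain dot product. A minimal sketch of the ranking step in pure Python follows; it is not the qhx_clip implementation, and the function names and toy 3-d vectors are illustrative stand-ins for the 768-d embeddings.

```python
import math

def l2_normalize(v):
    """Scale v to unit Euclidean length."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in v]

def rank_images(text_vec, image_vecs):
    """Return (image_index, cosine) pairs sorted best match first."""
    t = l2_normalize(text_vec)
    scores = []
    for i, img in enumerate(image_vecs):
        u = l2_normalize(img)
        # Dot product of unit vectors equals their cosine similarity.
        scores.append((i, sum(a * b for a, b in zip(t, u))))
    return sorted(scores, key=lambda s: s[1], reverse=True)

query = [1.0, 0.0, 0.0]
images = [[0.0, 2.0, 0.0], [3.0, 4.0, 0.0], [5.0, 0.0, 0.0]]
ranking = rank_images(query, images)  # best match is ranking[0]
```

In the toy data the third image points in the same direction as the query, so it ranks first regardless of its larger magnitude.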
Notes
- Fixed text window 64 tokens, image 256×256 (SigLIP2-base-256). Captions are tokenized with the Gemma tokenizer (no BOS, trailing EOS, right-padded).
- fp16 weights + graph I/O; ZERO custom ops.
- Recipe (download + text-tower surgery + conversion) + the validated record: QHexRT
forge/recipes/siglip2-base/.
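The caption layout from the first note (no BOS, trailing EOS, right-padded to 64 tokens) can be sketched as below. The `eos_id` and `pad_id` defaults are assumptions for illustration and should be read from tokenizer.json. Truncating overlong captions while keeping the EOS is also an assumption, since the notes do not say how such captions are handled.

```python
def pad_caption_ids(token_ids, max_len=64, eos_id=1, pad_id=0):
    """Append EOS, then truncate or right-pad to exactly max_len ids."""
    body = list(token_ids)[: max_len - 1]  # leave room for the trailing EOS
    ids = body + [eos_id]
    return ids + [pad_id] * (max_len - len(ids))
```

The returned list always has `max_len` entries, matching the fixed input_ids[1,64] shape of the text tower.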
Base model: google/siglip2-base-patch16-256 (Apache-2.0). Runtime: QHexRT.