siglip2-base-256 — QHexRT NPU bundle (Hexagon v79 + v81)

Precompiled SigLIP2-base-patch16-256 image↔text embedding for the QHexRT runtime on Qualcomm Hexagon. A dual-tower contrastive embedder: a vision encoder and a text encoder map an image and a caption into the same 768-d space, so cosine(image, text) ranks photos against a text query — the photo-search / zero-shot-classification building block.

Both towers are Qualcomm-converted qnn_context_binary (fp16) graphs running entirely on the NPU with zero custom ops. The text tower's 256k token-embedding gather (HTP-hostile) is removed from the graph and done host-side.

Arches shipped:

v79/ — Snapdragon 8 Elite / SM8750 (e.g. Galaxy S25). Device-validated: per-tower cosine 0.9998 vs the HF onnx-community/siglip2-base-patch16-256-ONNX reference; cross-modal cats.jpg cosine [0.134, 0.030, −0.006] for cats / dog / car (HF gold [0.135, 0.029, −0.008]), cats ranks #1.
v81/ — Snapdragon 8 Elite Gen 5 / SM8850 (soc_model 87). Device-validated on SM8850: cross-modal discrimination is correct in both directions: "a photo of a flower" → flower 0.131 vs china 0.022; "the great wall of china" → china 0.091 vs flower 0.016. Encoder latency ≈ 3.6 ms (text) / 6.1 ms (vision). Compiled via the DLC route with the {"O":3,"vtcm_mb":8,"dlbc":1} HTP graph-config. This config is required on v81: without it the graph finalizes in a degraded state and produces all-zero output.

Arch-pinned: a context binary will not load on a different arch (the soc_model + dsp_arch are baked in). Pick the <arch>/ dir matching your device.
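As a small illustration of picking the directory, the sketch below maps a SoC name to the matching arch dir using only the two pairings listed above (SM8750 → v79, SM8850 → v81). The helper name `arch_dir_for` is hypothetical, and how you obtain the SoC string is up to you; reading it with `adb shell getprop ro.soc.model` is one option, assuming your device reports it in the `SM8750` form.

```python
# Pairings taken from the "Arches shipped" list; nothing else is assumed to be supported.
ARCH_BY_SOC = {"SM8750": "v79", "SM8850": "v81"}


def arch_dir_for(soc_model):
    """Return the bundle subdirectory ("v79" or "v81") for a SoC name.

    Hypothetical helper: raises ValueError for any SoC not listed in this bundle,
    since a context binary will not load on a different arch.
    """
    key = soc_model.strip().upper()
    if key not in ARCH_BY_SOC:
        raise ValueError(f"no precompiled bins in this bundle for SoC {key!r}")
    return ARCH_BY_SOC[key]


arch = arch_dir_for("SM8850")
```

Failing loudly on an unknown SoC is a deliberate choice here: falling back to a default directory would only defer the error to context-binary load time.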

Contents (`v79/` and `v81/` — identical file set, arch-pinned bins)

| file | what |
| --- | --- |
| `siglip2-base-256.json` | QHexRT manifest (embedding family, two self-skipping host steps) |
| `sig_vision.bin` | vision encoder: `pixel_values`[1,3,256,256] f16 NCHW → `pooler_output`[1,768] |
| `sig_text_ne.bin` | text encoder (256k embed Gather removed): `inputs_embeds`[1,64,768] f16 + `input_ids`[1,64] i32 → `pooler_output`[1,768] |
| `text_embed_f16.raw` | `[256000,768]` f16 text-embed table (the host performs the removed Gather on it) |
| `tokenizer.json` | Gemma tokenizer (vocab 256000) |
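To make the host-side gather concrete, here is a minimal sketch of what it amounts to: for each token id, copy one 768-value fp16 row out of the table. It assumes `text_embed_f16.raw` is a headerless, row-major, little-endian dump of the `[256000,768]` table; that layout is an assumption, not something the bundle documents. `gather_rows` is a hypothetical helper, not part of QHexRT, and the demo runs on a tiny synthetic table rather than the real file.

```python
import os
import struct
import tempfile

HIDDEN = 768            # embedding width, from the [256000,768] table shape
ROW_BYTES = HIDDEN * 2  # fp16 is 2 bytes per value


def gather_rows(table_path, token_ids, hidden=HIDDEN):
    """Return the fp16 rows for token_ids, concatenated in order, as raw bytes.

    Assumes a headerless row-major table; rows are copied without decoding.
    """
    row_bytes = hidden * 2
    n_rows = os.path.getsize(table_path) // row_bytes
    out = bytearray()
    with open(table_path, "rb") as f:
        for tid in token_ids:
            if not 0 <= tid < n_rows:
                raise IndexError(f"token id {tid} outside table of {n_rows} rows")
            f.seek(tid * row_bytes)
            out += f.read(row_bytes)
    return bytes(out)


# Demo: a synthetic 4-row table where every value in row r equals r.
with tempfile.NamedTemporaryFile(suffix=".raw", delete=False) as tmp:
    for r in range(4):
        tmp.write(struct.pack(f"<{HIDDEN}e", *([float(r)] * HIDDEN)))
    demo_path = tmp.name

embeds = gather_rows(demo_path, [2, 0, 3])
# Decode the first value of each gathered row to check the ordering.
first_values = [struct.unpack_from("<e", embeds, i * ROW_BYTES)[0] for i in range(3)]
os.remove(demo_path)
```

With the real table and a 64-id caption, the returned buffer would be 64 × 1536 bytes, which matches the `inputs_embeds`[1,64,768] f16 input size of `sig_text_ne.bin`.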

Run (QHexRT CLI — `qhx_clip` photo search)

huggingface-cli download runanywhere/siglip2_base_HNPU --local-dir siglip2_base_HNPU
# QNN libs come from the QAIRT SDK (lib/aarch64-android) + the HTP skel for your arch; push them next to qhx_clip.
ARCH=v81   # or v79 — must match your device
adb push siglip2_base_HNPU/$ARCH /data/local/tmp/wq/siglip
adb push photo1.jpg photo2.jpg /data/local/tmp/wq/siglip/
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq/dsp;/data/local/tmp/wq;/vendor/dsp/cdsp'; \
  LD_LIBRARY_PATH=. ./qhx_clip siglip/siglip2-base-256.json libQnnHtp.so libQnnSystem.so siglip \
  'a photo of two cats' siglip/photo1.jpg siglip/photo2.jpg"
# -> ranks the images by cosine to the query; prints the best match

Usage: qhx_clip <manifest> libQnnHtp.so libQnnSystem.so <root> "<query>" <img1> [img2 ...]. The text query is embedded by the text tower and each image by the vision tower (each of the two host ops skips itself when the other modality is being processed). The images are then ranked by cosine similarity to the query (both vectors are L2-normalized).
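The ranking step can be sketched in a few lines. This is an illustration of cosine ranking over L2-normalized vectors, not the `qhx_clip` implementation; the function names are hypothetical and the toy 3-d vectors stand in for the 768-d `pooler_output` embeddings.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit length; an all-zero vector is rejected rather than divided by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero embedding")
    return [x / norm for x in vec]


def rank_by_cosine(query_vec, image_vecs):
    """image_vecs maps name -> vector. Returns [(name, cosine)], best match first."""
    q = l2_normalize(query_vec)
    scores = []
    for name, vec in image_vecs.items():
        v = l2_normalize(vec)
        # For unit vectors, the dot product is the cosine similarity.
        scores.append((name, sum(a * b for a, b in zip(q, v))))
    return sorted(scores, key=lambda s: s[1], reverse=True)


# Toy vectors for illustration only; real inputs are the two towers' outputs.
ranking = rank_by_cosine(
    [1.0, 0.0, 0.0],
    {"photo1.jpg": [0.9, 0.1, 0.0], "photo2.jpg": [0.0, 1.0, 0.0]},
)
best_name, best_score = ranking[0]
```

Rejecting an all-zero embedding is a deliberate choice in this sketch: it surfaces a degraded graph (such as the all-zero v81 output described earlier) as an error instead of a meaningless ranking.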

Notes

Fixed text window 64 tokens, image 256×256 (SigLIP2-base-256). Captions are tokenized with the Gemma tokenizer (no BOS, trailing EOS, right-padded).
fp16 weights + graph I/O; ZERO custom ops.
Recipe (download + text-tower surgery + conversion) + the validated record: QHexRT forge/recipes/siglip2-base/.
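The caption layout from the first note (no BOS, trailing EOS, right-padded to 64) can be sketched as below. The EOS id 1 and pad id 0 are assumptions based on the usual Gemma vocabulary and should be confirmed against `tokenizer.json`; truncating over-long captions to leave room for EOS is also an assumption, as is the helper name `pack_caption`. The token ids in the demo are made up.

```python
SEQ_LEN = 64  # fixed text window, from the notes above
EOS_ID = 1    # assumption: Gemma EOS id; confirm against tokenizer.json
PAD_ID = 0    # assumption: Gemma pad id; confirm against tokenizer.json


def pack_caption(token_ids, seq_len=SEQ_LEN, eos_id=EOS_ID, pad_id=PAD_ID):
    """Lay out caption ids as: tokens (no BOS), one EOS, then right padding.

    Captions longer than seq_len - 1 tokens are truncated so EOS still fits
    (an assumption; the bundle does not document its truncation policy).
    """
    body = list(token_ids)[: seq_len - 1]
    ids = body + [eos_id]
    return ids + [pad_id] * (seq_len - len(ids))


packed = pack_caption([651, 2686, 576])  # made-up ids, for illustration only
```

The resulting 64 ids are what would feed the `input_ids`[1,64] i32 input and index the host-side embedding table.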

Base model: google/siglip2-base-patch16-256 (Apache-2.0). Runtime: QHexRT.
