siglip2-base-256: QHexRT NPU bundle (Hexagon v79 + v81)

Precompiled SigLIP2-base-patch16-256 image↔text embedding model for the QHexRT runtime on Qualcomm Hexagon. It is a dual-tower contrastive embedder: a vision encoder and a text encoder map an image and a caption into the same 768-d space, so cosine(image, text) ranks photos against a text query. This is the building block for photo search and zero-shot classification.

Both towers are Qualcomm-converted qnn_context_binary (fp16) graphs running entirely on the NPU with zero custom ops. The text tower's 256k token-embedding gather (HTP-hostile) is removed from the graph and done host-side.
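The host-side gather described above amounts to a row lookup in the embedding table. A minimal sketch, assuming text_embed_f16.raw is a headerless, row-major, little-endian f16 array of shape [256000,768] (the layout is an assumption; the function names are illustrative and not part of QHexRT):

```python
import numpy as np

VOCAB, DIM = 256000, 768  # shapes taken from the bundle's file list

def load_embed_table(path, vocab=VOCAB, dim=DIM):
    # Assumption: raw file with no header, row-major, little-endian float16.
    return np.memmap(path, dtype="<f2", mode="r", shape=(vocab, dim))

def gather_inputs_embeds(table, input_ids):
    """Replace the removed Gather: ids [T] -> (inputs_embeds [1,T,dim] f16, ids [1,T] i32)."""
    ids = np.asarray(input_ids, dtype=np.int32).reshape(1, -1)
    if ids.min() < 0 or ids.max() >= table.shape[0]:
        raise ValueError("token id out of range for the embedding table")
    # Integer-array indexing copies the selected rows out of the table.
    embeds = np.ascontiguousarray(table[ids])
    return embeds, ids
```

With 64 padded token ids this yields the inputs_embeds[1,64,768] f16 and input_ids[1,64] i32 pair that sig_text_ne.bin expects. Memory-mapping avoids loading the full table into RAM for a 64-row lookup.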

Arches shipped:

  • v79/: Snapdragon 8 Elite / SM8750 (e.g. Galaxy S25). Device-validated: per-tower cosine 0.9998 vs the HF onnx-community/siglip2-base-patch16-256-ONNX reference; cross-modal cats.jpg cosine [0.134, 0.030, −0.006] for cats / dog / car (HF gold [0.135, 0.029, −0.008]), cats ranks #1.
  • v81/: Snapdragon 8 Elite Gen 5 / SM8850 (soc_model 87). Device-validated on SM8850: cross-modal discrimination is correct both ways: "a photo of a flower" → flower 0.131 vs china 0.022; "the great wall of china" → china 0.091 vs flower 0.016. Encoder latency ≈ 3.6 ms (text) / 6.1 ms (vision). Compiled via the DLC route with the {"O":3,"vtcm_mb":8,"dlbc":1} HTP graph-config, which is required on v81: without it the graph finalizes in a degraded state and produces all-zero output.
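The v79 validation numbers above can be checked for agreement with the HF reference in a few lines. This sketch only re-uses the cosines reported in the list; the variable and function names are illustrative:

```python
import numpy as np

labels = ["cats", "dog", "car"]
npu_v79 = np.array([0.134, 0.030, -0.006])  # reported on-device cosines for cats.jpg
hf_gold = np.array([0.135, 0.029, -0.008])  # reported HF reference cosines

def top_label(scores):
    # Label with the highest cosine similarity.
    return labels[int(np.argmax(scores))]

# Same full ranking on device and in the reference?
same_order = bool((np.argsort(-npu_v79) == np.argsort(-hf_gold)).all())
# Largest per-caption deviation from the reference.
max_abs_diff = float(np.max(np.abs(npu_v79 - hf_gold)))

print(top_label(npu_v79), same_order, round(max_abs_diff, 3))
# -> cats True 0.002
```

The same comparison works as a quick regression check after recompiling a tower, given a fresh set of reference cosines.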

Arch-pinned: a context binary will not load on a different arch (the soc_model + dsp_arch are baked in). Pick the <arch>/ dir matching your device.
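Because the binaries are arch-pinned, a deployment script may want to fail early on an unsupported SoC. A small sketch, assuming the SoC name is available as a string (many Android builds expose one through the ro.soc.model property, but treat that as an assumption and check your device); the helper name is hypothetical:

```python
# Mapping taken from the arch list above; extend it only with SoCs you have validated.
SOC_TO_ARCH_DIR = {"SM8750": "v79", "SM8850": "v81"}

def pick_arch_dir(soc_model: str) -> str:
    """Return the bundle sub-directory for a SoC name, or raise if none is shipped."""
    key = soc_model.strip().upper()
    try:
        return SOC_TO_ARCH_DIR[key]
    except KeyError:
        raise ValueError(f"no precompiled context binary for SoC {soc_model!r}") from None
```

For example, pick_arch_dir("SM8850") returns "v81", while an unlisted SoC raises instead of silently pushing a binary that will not load.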

Contents (v79/ and v81/: identical file set, arch-pinned bins)

| file | what |
|---|---|
| siglip2-base-256.json | QHexRT manifest (embedding family, two self-skipping host steps) |
| sig_vision.bin | vision encoder: pixel_values[1,3,256,256] f16 NCHW → pooler_output[1,768] |
| sig_text_ne.bin | text encoder (256k embed Gather removed): inputs_embeds[1,64,768] f16 + input_ids[1,64] i32 → pooler_output[1,768] |
| text_embed_f16.raw | [256000,768] f16 text-embed table (the host gathers it; the removed Gather) |
| tokenizer.json | Gemma tokenizer (vocab 256000) |
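qhx_clip takes JPEG paths directly, so the following only matters if you feed sig_vision.bin yourself. It is a sketch of building the pixel_values[1,3,256,256] f16 NCHW tensor from an already-resized RGB image, assuming the usual SigLIP preprocessing of rescaling to [0, 1] and normalizing with mean 0.5 and std 0.5 per channel (an assumption; confirm against the base model's preprocessor config):

```python
import numpy as np

def to_pixel_values(rgb_u8, mean=0.5, std=0.5):
    """HWC uint8 RGB (already resized to 256x256) -> [1,3,256,256] f16 NCHW."""
    img = np.asarray(rgb_u8)
    if img.shape != (256, 256, 3) or img.dtype != np.uint8:
        raise ValueError("expected a 256x256x3 uint8 RGB array")
    x = img.astype(np.float32) / 255.0  # rescale to [0, 1]
    x = (x - mean) / std                # assumed normalization, maps to [-1, 1]
    x = x.transpose(2, 0, 1)[None]      # HWC -> CHW, then add the batch dim
    return np.ascontiguousarray(x, dtype=np.float16)
```

Resizing and JPEG decoding are left out on purpose; any image library that yields a 256x256 RGB uint8 array can sit in front of this function.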

Run (QHexRT CLI: qhx_clip photo search)

huggingface-cli download runanywhere/siglip2_base_HNPU --local-dir siglip2_base_HNPU
# QNN libs come from the QAIRT SDK (lib/aarch64-android) + the HTP skel for your arch; push them next to qhx_clip.
ARCH=v81   # or v79; must match your device
adb push siglip2_base_HNPU/$ARCH /data/local/tmp/wq/siglip
adb push photo1.jpg photo2.jpg /data/local/tmp/wq/siglip/
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq/dsp;/data/local/tmp/wq;/vendor/dsp/cdsp'; \
  LD_LIBRARY_PATH=. ./qhx_clip siglip/siglip2-base-256.json libQnnHtp.so libQnnSystem.so siglip \
  'a photo of two cats' siglip/photo1.jpg siglip/photo2.jpg"
# -> ranks the images by cosine to the query; prints the best match

Usage: qhx_clip <manifest> libQnnHtp.so libQnnSystem.so <root> "<query>" <img1> [img2 ...]. The text query is embedded by the text tower and each image by the vision tower (the two host ops self-skip on the other modality). The images are then ranked by cosine similarity to the query (both vectors are L2-normalized).
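The ranking step can be reproduced on the host from the two pooler_output vectors. A minimal sketch of cosine ranking with L2 normalization; the function names are illustrative and this is not qhx_clip's actual code:

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    # Divide each vector by its Euclidean norm (eps guards against zero vectors).
    v = np.asarray(v, dtype=np.float32)
    return v / np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)

def rank_images(text_emb, image_embs, names):
    """Return (name, cosine) pairs sorted from best to worst match."""
    q = l2_normalize(text_emb).reshape(-1)  # [D]
    m = l2_normalize(image_embs)            # [N, D]
    scores = m @ q                          # dot product of unit vectors = cosine
    order = np.argsort(-scores, kind="stable")
    return [(names[i], float(scores[i])) for i in order]
```

Because both sides are normalized first, the dot product is the cosine similarity, so the first entry of the result is the best match for the query.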

Notes

  • Fixed text window 64 tokens, image 256×256 (SigLIP2-base-256). Captions are tokenized with the Gemma tokenizer (no BOS, trailing EOS, right-padded).
  • fp16 weights + graph I/O; ZERO custom ops.
  • Recipe (download + text-tower surgery + conversion) + the validated record: QHexRT forge/recipes/siglip2-base/.
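The fixed 64-token caption layout noted above (no BOS, trailing EOS, right-padded) can be sketched as follows. The EOS and pad ids are deliberately left as parameters to be read from tokenizer.json, and truncating long captions so that EOS still fits is an assumption, not something the bundle documents:

```python
def pad_caption_ids(token_ids, eos_id, pad_id, window=64):
    """Token ids without BOS/EOS -> exactly `window` ids: tokens, EOS, then padding."""
    # Assumption: over-long captions are truncated so the EOS always fits.
    ids = list(token_ids)[: window - 1]
    ids.append(eos_id)
    # Right-pad up to the fixed window length.
    return ids + [pad_id] * (window - len(ids))
```

The result is the input_ids[1,64] row (after adding a batch dimension) that drives both the host-side gather and the text tower.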

Base model: google/siglip2-base-patch16-256 (Apache-2.0). Runtime: QHexRT.
