Qwen3-VL-2B on Hexagon NPU (QHexRT) — v79 / 512px

Prebuilt QNN context binaries that run Qwen3-VL-2B (image→text VLM) fully on the Qualcomm Hexagon v79 NPU (Snapdragon 8 Elite / SM8750) via the QHexRT runtime. The NPU does vision encode + prefill + decode + lm-head; the host does only tokenize / image-resize / sampling. No Python in the hot path.

Arch-pinned: these binaries are compiled for dsp_arch v79. They will not load on v75/v81 — those need a re-finalize (same model libs, different qnn-context-binary-generator target).

What's optimized

Native HMX vision: the 24-block ViT runs on the matrix engine (native QNN attention), not a custom HVX op-package. ~161 ms for the full encoder (256 image tokens).
Weight-shared LLM: prefill (AR-320) and decode (AR-1, KV-cache) share one weight set in a single context (saves 1.4 GB).
In-graph int8 lm-head: the [2048×151936] projection runs on the NPU (HMX), not the host (~11 ms).
O3 + 8 MB VTCM + DLBC finalize; weight-only int8 quantization (int8 weights, fp16 activations).
512×512 input → 32×32 patch grid → 2×2 spatial merge → 256 vision tokens (good detail/perf balance). The app resizes any image to 512×512.
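
The token count and the lm-head size quoted above can be checked with a little arithmetic. This is a minimal sketch, assuming Qwen3-VL's usual vision settings (16 px patches and a 2×2 spatial merge), which are not read from this repo; `vision_tokens` and `lmhead_mib` are hypothetical helper names.

```sh
#!/bin/sh
# Token-count and weight-size arithmetic for the 512px build.
# PATCH=16 and MERGE=2 are ASSUMED Qwen3-VL vision settings, not values from this repo.
PATCH=16
MERGE=2

# vision_tokens SIDE -> number of image tokens for a SIDE x SIDE input
vision_tokens() {
  side=$1
  grid=$((side / PATCH))       # patches per side: 512 / 16 = 32
  merged=$((grid / MERGE))     # per side after the 2x2 merge: 32 / 2 = 16
  echo $((merged * merged))
}

# lmhead_mib HIDDEN VOCAB -> int8 weight size in whole MiB (1 byte per weight)
lmhead_mib() {
  echo $(( $1 * $2 / 1048576 ))
}

vision_tokens 512          # prints 256
lmhead_mib 2048 151936     # prints 296
```

Under these assumptions the 32×32 grid holds 1024 patches, and the merge reduces them to the 256 tokens the encoder emits.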

Files (v79/)

File	Role
`vis_native_512.bin`	vision encoder (native HMX, 256 image tokens)
`llm_shared_512.bin`	weight-shared prefill (P=320, KV-emitting) + decode
`lmhead_wqo.bin`	in-graph int8 lm-head
`embed_f16.bin`	token embedding table (f16)
`tokenizer.json`	Qwen3 BPE tokenizer
`vis_patch_embed_f32.bin`, `vis_patch_embed_bias_f32.bin`, `vis_pos_embed_f32.bin`	Stage-F host preprocessing weights (patch-embed + pos-embed)
`qwen3vl-2b-vlm-512.json`	QHexRT manifest (the declarative run plan)
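
Since the manifest refers to every file in the table, it may help to confirm the bundle is complete before pushing it to the device. The sketch below checks for the file names listed above; `check_bundle` is a hypothetical helper, not part of QHexRT.

```sh
#!/bin/sh
# check_bundle DIR: report any file from the table above that is missing in DIR.
# Returns 0 when the bundle is complete, 1 otherwise.
check_bundle() {
  dir=$1
  missing=0
  for f in \
    vis_native_512.bin llm_shared_512.bin lmhead_wqo.bin embed_f16.bin \
    tokenizer.json vis_patch_embed_f32.bin vis_patch_embed_bias_f32.bin \
    vis_pos_embed_f32.bin qwen3vl-2b-vlm-512.json
  do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A typical use would be `check_bundle v79 && adb push v79/* /data/local/tmp/wq/`, so that an incomplete download fails on the host rather than on the device.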

Run (QHexRT, on-device)

adb push v79/* /data/local/tmp/wq/
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='.;./dsp;/vendor/dsp/cdsp' && \
  LD_LIBRARY_PATH=. ./qhx_generate qwen3vl-2b-vlm-512.json libQnnHtp.so libQnnSystem.so . 32 'Describe this image.' photo.jpg"

(QNN runtime libs libQnnHtp.so/libQnnSystem.so + the v79 HTP skel come from the QAIRT SDK, not this repo. The command also expects the qhx_generate binary and photo.jpg in /data/local/tmp/wq.)

Measured (Snapdragon 8 Elite, v79)

Vision encode: ~161 ms
Decode: ~9 tok/s (107 ms/tok; 96 ms decode graph + 11 ms in-graph lm-head)
End-to-end image→caption: vision encode and prefill run once per image, followed by steady-state decode
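
The figures above combine as simple arithmetic. The sketch below derives the tokens-per-second rate from the per-token latency and estimates caption time for a given output length; prefill time is not reported here, so it is left out of the estimate. The helper names are hypothetical.

```sh
#!/bin/sh
# Latency arithmetic from the measured figures above.
VISION_MS=161      # vision encode
DECODE_MS=107      # per token: 96 ms decode graph + 11 ms in-graph lm-head

# tok_per_s MS_PER_TOKEN -> tokens per second, two decimals
tok_per_s() {
  LC_ALL=C awk -v ms="$1" 'BEGIN { printf "%.2f\n", 1000 / ms }'
}

# caption_ms N -> vision encode + N decode steps in ms (prefill NOT included)
caption_ms() {
  echo $((VISION_MS + $1 * DECODE_MS))
}

tok_per_s "$DECODE_MS"   # prints 9.35
caption_ms 32            # prints 3585
```

For the 32-token run shown earlier, this gives a lower bound of roughly 3.6 s per caption before prefill is added.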

Output is coherent and reads real image structure (objects, scenes, large text). Vision output matches the HF fp32 reference (cosine similarity 1.000000 on the gold reference).

Text-chat (LLM) usage

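The same weights run as a pure text LLM (no image) via the added qwen3vl-2b-text-512.json manifest (decode-over-prompt on the shared decode graph + in-graph int8 lm-head). Device-validated on S25/v79. The command below assumes the bundle sits under /data/local/tmp/wq/v79/ on the device, unlike the flat layout pushed in the VLM example.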

adb shell "cd /data/local/tmp/wq && LD_LIBRARY_PATH=. ADSP_LIBRARY_PATH=. \
  ./qhx_generate v79/qwen3vl-2b-text-512.json libQnnHtp.so libQnnSystem.so v79 40 'What is the capital of France?'"
# -> text: "The capital of France is Paris."   (~362 ms/tok)

Use v79/qwen3vl-2b-vlm-512.json for image captioning (VLM), v79/qwen3vl-2b-text-512.json for text chat (LLM) — same downloaded bundle.
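
A script that switches between the two modes can map a mode name to its manifest. This is a small sketch using only the two manifest paths named above; `manifest_for` is a hypothetical helper.

```sh
#!/bin/sh
# manifest_for MODE: print the manifest path for "vlm" (image captioning)
# or "text" (chat); any other mode is an error.
manifest_for() {
  case "$1" in
    vlm)  echo "v79/qwen3vl-2b-vlm-512.json" ;;
    text) echo "v79/qwen3vl-2b-text-512.json" ;;
    *)    echo "unknown mode: $1 (expected vlm or text)" >&2; return 1 ;;
  esac
}
```

The result can be passed as the first argument to qhx_generate, for example `manifest=$(manifest_for text) || exit 1`.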

v81 (SM8850 / soc_model 87) — TEXT path

Device-validated on SM8850: "The capital of France is" -> "The capital of France is Paris." (greedy first token 12095 == HF gold), ~119 ms/tok, decode-over-prompt.

The decode graph was re-exported with 64-wide rope cos/sin to match QHexRT's generic rope_decode_f16. Forge's [1,128] duplicated-rope layout left the runtime's upper half stale, which caused repetitive output (see recipes/qwen3-vl-2b/export_text_decode.py, gated on greedy==12095). The path also uses a tied matmul lm-head (export_text_lmhead.py).

v81/ ships the text LLM path only. The v81 VLM image path additionally needs a vision graph with the merger and deepstack in-graph, which has not been exported yet.

Files: qwen3vl-2b-text-512.json, llm_text_dec_w8.bin, qwen3vl_lmh_w8.bin, qwen3vl_embed_f16.bin, tokenizer.json.
