Qwen3-VL-2B on Hexagon NPU (QHexRT) — v79 / 512px
Prebuilt QNN context binaries that run Qwen3-VL-2B (image→text VLM) fully on the Qualcomm Hexagon v79 NPU (Snapdragon 8 Elite / SM8750) via the QHexRT runtime. The NPU does vision encode + prefill + decode + lm-head; the host does only tokenize / image-resize / sampling. No Python in the hot path.
Arch-pinned: these binaries are compiled for dsp_arch v79. They will not load on v75/v81; those need a re-finalize (same model libs, different qnn-context-binary-generator target).
What's optimized
- Native HMX vision — the 24-block ViT runs on the matrix engine (native QNN attention), not a custom HVX op-package. ~161 ms for the full encoder (256 image tokens).
- Weight-shared LLM — prefill (AR-320) + decode (AR-1, KV-cache) share one weight set in a single context (−1.4 GB).
- In-graph int8 lm-head: the [2048×151936] projection runs on the NPU (HMX), not the host (~11 ms).
- O3 + 8 MB VTCM + DLBC finalize; weight-only int8 weights, fp16 activations.
- 512×512 input → 32×32 patch grid → 256 vision tokens (good detail/perf balance). The app resizes any image to 512².
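The token count in the last bullet follows from simple arithmetic. The sketch below assumes a 16-pixel patch and a 2×2 spatial merge, which is what a 32×32 grid and 256 tokens imply for a 512-pixel input; neither value is stated in this card, and `vision_tokens` is a hypothetical helper, not part of QHexRT.

```bash
# Hypothetical helper: vision tokens for a square input of SIDE pixels.
# patch=16 and merge=2 are assumptions inferred from 512 px -> 32x32 grid -> 256 tokens.
vision_tokens() {
  local side=$1 patch=${2:-16} merge=${3:-2}
  local grid=$(( side / patch ))                 # patches per side
  echo $(( (grid / merge) * (grid / merge) ))    # tokens after merge x merge pooling
}

vision_tokens 512   # prints 256
```

Under the same assumptions, halving the input side to 256 px would give 64 tokens, which is why the input resolution is the main lever on vision cost.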
Files (v79/)
| File | Role |
|---|---|
| vis_native_512.bin | vision encoder (native HMX, 256 image tokens) |
| llm_shared_512.bin | weight-shared prefill (P=320, KV-emitting) + decode |
| lmhead_wqo.bin | in-graph int8 lm-head |
| embed_f16.bin | token embedding table (f16) |
| tokenizer.json | Qwen3 BPE tokenizer |
| vis_patch_embed_f32.bin, vis_patch_embed_bias_f32.bin, vis_pos_embed_f32.bin | Stage-F host preprocessing weights (patch-embed + pos-embed) |
| qwen3vl-2b-vlm-512.json | QHexRT manifest (the declarative run plan) |
Run (QHexRT, on-device)
```bash
adb push v79/* /data/local/tmp/wq/
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='.;./dsp;/vendor/dsp/cdsp' && \
LD_LIBRARY_PATH=. ./qhx_generate qwen3vl-2b-vlm-512.json libQnnHtp.so libQnnSystem.so . 32 'Describe this image.' photo.jpg"
```
(QNN runtime libs libQnnHtp.so/libQnnSystem.so + the v79 HTP skel come from the QAIRT SDK, not this repo.)
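Before pushing, it can help to confirm that the local bundle is complete. The following is a minimal sketch, assuming the files listed in the table above sit in a local `v79/` directory; `check_bundle` is a hypothetical helper and does not check the SDK libraries, which are not part of this repo.

```bash
# Hypothetical pre-flight check: report bundle files missing from DIR (default: v79).
check_bundle() {
  local dir=${1:-v79} missing=0 f
  for f in vis_native_512.bin llm_shared_512.bin lmhead_wqo.bin embed_f16.bin \
           tokenizer.json vis_patch_embed_f32.bin vis_patch_embed_bias_f32.bin \
           vis_pos_embed_f32.bin qwen3vl-2b-vlm-512.json; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"   # 0 if everything is present, 1 otherwise
}

# Example: check_bundle v79 && adb push v79/* /data/local/tmp/wq/
```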
Measured (Snapdragon 8 Elite, v79)
- Vision encode: ~161 ms
- Decode: ~9 tok/s (107 ms/tok; 96 ms decode graph + 11 ms in-graph lm-head)
- End-to-end image→caption: a one-time vision encode and prefill, then steady-state decode
Output is coherent and reads real image structure (objects, scenes, large text). Vision parity is exact vs the HF fp32 reference (cosine 1.000000 on the gold reference).
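The decode throughput quoted above is just the reciprocal of the per-token latency. As a quick check, using only the measured values from this card (`tok_per_s` is a hypothetical helper):

```bash
# Convert per-token latency components (ms) into tokens per second.
tok_per_s() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 1000 / (a + b) }'
}

tok_per_s 96 11   # decode graph + lm-head = 107 ms/tok; prints 9.3
```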
Text-chat (LLM) usage
The same weights run as a pure text LLM (no image) via the added qwen3vl-2b-text-512.json manifest
(decode-over-prompt on the shared decode graph + in-graph int8 lm-head). Device-validated on S25/v79.
```bash
adb shell "cd /data/local/tmp/wq && LD_LIBRARY_PATH=. ADSP_LIBRARY_PATH=. \
./qhx_generate v79/qwen3vl-2b-text-512.json libQnnHtp.so libQnnSystem.so v79 40 'What is the capital of France?'"
# -> text: "The capital of France is Paris." (~362 ms/tok)
```
Use v79/qwen3vl-2b-vlm-512.json for image captioning (VLM), v79/qwen3vl-2b-text-512.json for text chat (LLM) — same downloaded bundle.
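A wrapper script can make that choice automatically. This is a small sketch, assuming the only deciding factor is whether an image path is supplied; `pick_manifest` is a hypothetical helper, not something shipped with QHexRT.

```bash
# Hypothetical helper: choose the manifest by whether an image path is supplied.
pick_manifest() {
  local image=${1:-}
  if [ -n "$image" ]; then
    echo "v79/qwen3vl-2b-vlm-512.json"    # image captioning (VLM)
  else
    echo "v79/qwen3vl-2b-text-512.json"   # text chat (LLM)
  fi
}

pick_manifest photo.jpg   # prints v79/qwen3vl-2b-vlm-512.json
pick_manifest             # prints v79/qwen3vl-2b-text-512.json
```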
v81 (SM8850 / soc_model 87) — TEXT path
Device-validated on SM8850: "The capital of France is" -> "The capital of France is Paris." (greedy first token 12095 == HF gold), ~119 ms/tok, decode-over-prompt.

Two graphs were re-exported for this path:
- Decode graph: re-exported with 64-wide rope cos/sin to match QHexRT's generic rope_decode_f16. Forge's [1,128] duplicated-rope layout left the upper half stale at runtime, which produced repetitive output. Export script: recipes/qwen3-vl-2b/export_text_decode.py, gated on greedy==12095.
- lm-head: a tied matmul lm-head (export_text_lmhead.py).

v81/ ships the text LLM path only; the v81 VLM image path additionally needs a merger+deepstack-in-graph vision graph (not yet exported).

Files: qwen3vl-2b-text-512.json, llm_text_dec_w8.bin, qwen3vl_lmh_w8.bin, qwen3vl_embed_f16.bin, tokenizer.json.