Qwen3.5-0.8B - Hexagon NPU (QHexRT) - v81
A prebuilt QHexRT bundle of Qwen/Qwen3.5-0.8B (the Qwen3-Next "GatedDeltaNet" hybrid lineage) running on the Qualcomm Hexagon v81 NPU (SM8850 / Snapdragon 8 Elite Gen-class). text→text.
The text decoder is a hybrid: 3 of every 4 layers use recurrent gated delta-rule linear attention
(`Qwen3_5GatedDeltaNet`: a short causal-conv FIFO + an SSM state + gated RMSNorm), and every 4th layer uses
gated softmax attention (q-proj output gate, per-head q/k RMSNorm, partial-rotary M-RoPE θ=1e7). The
runtime drives it with the `qwen3_5_generate` host-op (decode-over-prompt; conv + SSM + KV state carried
host-side), with the lm-head as a separate graph.
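The 3-to-1 layer pattern described above can be sketched as a small loop. This is only an illustration: it assumes 1-based layer numbering with layers 4, 8, ..., 24 being the softmax-attention ones, and the labels `linear_attention` / `full_attention` are names chosen here, not identifiers taken from the bundle.

```bash
# Sketch: label each decoder layer by attention type.
# Assumption: 1-based numbering, every 4th layer (4, 8, ..., 24) is softmax attention.
layer_types() {
  local n_layers=${1:-24} interval=${2:-4} i
  for ((i = 1; i <= n_layers; i++)); do
    if ((i % interval == 0)); then
      echo "$i full_attention"     # gated softmax attention, carries a KV cache
    else
      echo "$i linear_attention"   # GatedDeltaNet, carries conv + SSM state
    fi
  done
}

layer_types 24 4
```

Under these assumptions the 24-layer decoder has 18 GatedDeltaNet layers and 6 softmax-attention layers, so only a quarter of the layers hold a KV cache that grows with sequence length.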
Two precision bundles: pick the dir
| dir | precision | decode bin | decode | parity | when to use |
|---|---|---|---|---|---|
| `v81/` | fp16 | 961 MB | ~22.7 tok/s | strict greedy-exact (suite 10/10) | bit-faithful HF parity |
| `v81-w8/` | W8A16 (decode W8 + fp16 lm-head) | 489 MB | ~24.2 tok/s | single-prompt greedy-exact 64/64; length-swept suite 8/10 under a tolerant `greedy_tol` gate | ~half the decode size |
Both are device-validated on SM8850 / v81 (QAIRT 2.47). For the 0.8B the W8 win is mainly SIZE (decode 961 MB → 489 MB); the speedup is marginal (~1.07×) because a model this small is not yet weight-bandwidth-bound.
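The size and speed figures quoted above can be reproduced from the table values with a short calculation. This is a minimal sketch; `w8_tradeoff` is a helper name introduced here for illustration.

```bash
# Sketch: recompute the W8 vs fp16 trade-off from the table figures.
# awk handles the floating-point arithmetic that plain bash lacks.
w8_tradeoff() {
  awk 'BEGIN {
    fp16_mb = 961; w8_mb = 489       # decode binary sizes in MB
    fp16_tps = 22.7; w8_tps = 24.2   # decode throughput in tok/s
    printf "size ratio: %.2f\n", w8_mb / fp16_mb
    printf "speedup: %.2fx\n", w8_tps / fp16_tps
  }'
}

w8_tradeoff
# size ratio: 0.51
# speedup: 1.07x
```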
What's here
v81/ (fp16, strict parity)
| file | role | ~size |
|---|---|---|
| `qwen3.5-0.8b-1024.json` | QHexRT manifest (the declarative run plan) | 2 KB |
| `qwen3508b_decode_f16.bin` | decode context binary (24 layers, fp16) | 961 MB |
| `qwen3508b_lmhead_f16.bin` | lm-head context binary (tied embed, fp16) | 487 MB |
| `qwen3508b_embed_f16.bin` | embedding table (host lookup, fp16) | 485 MB |
| `tokenizer.json` | Qwen2-style BPE tokenizer | 20 MB |
v81-w8/ (W8A16, half the decode size)
| file | role | ~size |
|---|---|---|
| `qwen3.5-0.8b-1024.json` | QHexRT manifest (references the W8 decode) | 2 KB |
| `qwen3508b_decode_w8.bin` | decode context binary (24 layers, W8 weight-only) | 489 MB |
| `qwen3508b_lmhead_f16.bin` | lm-head context binary (tied embed, fp16) | 487 MB |
| `qwen3508b_embed_f16.bin` | embedding table (host lookup, fp16) | 485 MB |
| `tokenizer.json` | Qwen2-style BPE tokenizer | 20 MB |
Arch-pinned: a v81 binary will not load on another Hexagon arch (the soc_model/dsp_arch are baked in).
The QNN runtime libs (libQnnHtp.so/libQnnSystem.so + the v81 HTP skel) come from the QAIRT SDK, not this repo.
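Before pushing a bundle to the device, it can help to confirm that every file from the tables above is present locally. The sketch below does that for either bundle; `check_bundle` is a hypothetical helper written for this page, not part of QHexRT, and it does not check the QAIRT runtime libraries.

```bash
# Sketch: verify that a downloaded bundle directory holds the expected files.
# Usage: check_bundle <bundle_dir> <f16|w8>
check_bundle() {
  local dir=$1 decode_tag=$2 missing=0 f
  for f in \
    "qwen3.5-0.8b-1024.json" \
    "qwen3508b_decode_${decode_tag}.bin" \
    "qwen3508b_lmhead_f16.bin" \
    "qwen3508b_embed_f16.bin" \
    "tokenizer.json"; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"   # 0 when complete, 1 when any file is absent
}
```

For example, `check_bundle q35_08b/v81 f16 && echo "fp16 bundle complete"` would check the fp16 download, and `check_bundle q35_08b/v81-w8 w8` the W8 one.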
Run
```bash
# fp16 (strict parity):
hf download runanywhere/qwen3_5_0_8b_HNPU --include "v81/*" --local-dir q35_08b
adb push q35_08b/v81 /data/local/tmp/wq/qwen35
adb shell "cd /data/local/tmp/wq && LD_LIBRARY_PATH=. \
  ./qhx_generate qwen35/qwen3.5-0.8b-1024.json libQnnHtp.so libQnnSystem.so qwen35 64 'The capital of France is'"

# W8 (half the decode size):
hf download runanywhere/qwen3_5_0_8b_HNPU --include "v81-w8/*" --local-dir q35_08b
adb push q35_08b/v81-w8 /data/local/tmp/wq/qwen35-w8
adb shell "cd /data/local/tmp/wq && LD_LIBRARY_PATH=. \
  ./qhx_generate qwen35-w8/qwen3.5-0.8b-1024.json libQnnHtp.so libQnnSystem.so qwen35-w8 64 'The capital of France is'"
```
Tool arg order is invariant: `<tool> <manifest> libQnnHtp.so libQnnSystem.so <artifacts_root> <max_new> "<prompt>"`.
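Because the argument order is fixed, the on-device command can be assembled by a small helper. The following is a sketch only: `build_qhx_cmd` is a hypothetical function that prints the command string for use inside `adb shell "..."`; it assumes the prompt contains no single quotes.

```bash
# Sketch: assemble the qhx_generate command in the fixed argument order.
# build_qhx_cmd is a hypothetical helper; it prints the command, it does not run it.
# Assumption: the prompt contains no single quotes.
build_qhx_cmd() {
  local manifest=$1 artifacts_root=$2 max_new=$3 prompt=$4
  printf "./qhx_generate %s libQnnHtp.so libQnnSystem.so %s %s '%s'\n" \
    "$manifest" "$artifacts_root" "$max_new" "$prompt"
}

build_qhx_cmd qwen35/qwen3.5-0.8b-1024.json qwen35 64 'The capital of France is'
```

Switching to the W8 bundle then only changes the first two arguments (`qwen35-w8/qwen3.5-0.8b-1024.json` and `qwen35-w8`).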
Measured (device: SM8850 / v81, QAIRT 2.47)
- fp16 (`v81/`): ~22.7 tok/s; greedy-EXACT vs HF `Qwen/Qwen3.5-0.8B` (64/64). The fp16 SSM state did not drift.
- W8 (`v81-w8/`): ~24.2 tok/s (1.07×); single-prompt greedy-exact 64/64; length-swept parity suite 8/10 under a tolerant `greedy_tol` gate (2 W8 near-tie argmax drifts, coherent). The W8 benefit here is size, not speed: the decode halves (961 MB → 489 MB).
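One way to read a figure such as 64/64 is as the number of generated token positions that agree with the HF reference before the first divergence. The sketch below illustrates that reading; it is not the actual parity harness, and `greedy_match` plus its space-separated token-ID input format are assumptions made for this example.

```bash
# Sketch: count leading token IDs that match a reference, position by position.
# Hypothetical input format: two space-separated strings of numeric token IDs.
greedy_match() {
  local -a ref=($1) got=($2)
  local i n=${#ref[@]} same=0
  for ((i = 0; i < n; i++)); do
    # Stop at the first mismatch: later tokens are conditioned on a different prefix.
    [ "${ref[i]}" = "${got[i]:-}" ] || break
    same=$((same + 1))
  done
  echo "$same/$n"
}

greedy_match "101 7 42 9" "101 7 42 9"   # identical sequences
greedy_match "101 7 42 9" "101 7 13 9"   # diverges at the third token
```

Counting only the matching prefix is a deliberate choice here: once greedy decoding picks a different token, every later step sees a different context, so later agreement says little about numerical parity.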
Caveats
- Precision is your choice: `v81/` fp16 for strict bit-faithful parity, `v81-w8/` for half the decode size at a tolerant (8/10 suite) parity. W8 quantizes only weights; the fp32 SSM scan + host conv/SSM/KV state are untouched. W4 is HTP-toolchain-blocked on v81.
- Text-LLM path only: the vision tower + MTP head of the multimodal `qwen3_5` checkpoint are not exported.
- Needs a `qhx_generate` that includes the `qwen3_5_generate` host-op (QHexRT branch `smonga/qwen_fam`; the GatedDeltaNet decode loop is family-specific).
- Built by the in-repo `forge` pipeline (oracle-gated export 10/10 → QAIRT-2.47 O3 compile → device greedy gate).