Qwen3.5-4B β Hexagon NPU (QHexRT) β v81
A prebuilt QHexRT bundle of Qwen/Qwen3.5-4B (the Qwen3-Next "GatedDeltaNet" hybrid lineage) running on the Qualcomm Hexagon v81 NPU (SM8850). textβtext.
The text decoder is a hybrid: 3 of every 4 layers are a recurrent gated delta-rule linear attention
(Qwen3_5GatedDeltaNet: a short causal-conv FIFO + an SSM state + gated RMSNorm), and every 4th layer is
gated softmax attention (q-proj output gate, per-head q/k RMSNorm, partial-rotary M-RoPE ΞΈ=1e7). The
runtime drives it with the qwen3_5_generate host-op (decode-over-prompt; conv + SSM + KV state carried
host-side), the lm-head as a separate graph.
Multi-context decode. The 32-layer W8 decode (3.6 GB) exceeds the v81 cDSP per-context memory ceiling
(contextCreateFromBinary β 0x3ea MEM_ALLOC). So the decode is split into 2 context shards by layer range
(0β15 / 16β31, ~1.81 GB each); the host-op runs them in sequence, threading the residual stream shardβshard
(only the last shard applies the final norm). Each shard updates only its own layers' conv/SSM/KV state. This
is numerically identical to a single decode graph β parity stays greedy-exact.
What's here (v81/ β the flat artifacts root)
| file | role | ~size |
|---|---|---|
qwen3.5-4b-1024.json |
QHexRT manifest (the declarative run plan) | 2 KB |
qwen354b_decode_s0_w8.bin |
decode shard 0 (layers 0β15, W8) | 1.81 GB |
qwen354b_decode_s1_w8.bin |
decode shard 1 (layers 16β31, W8) | 1.81 GB |
qwen354b_lmhead_f16.bin |
lm-head context binary (tied embed, fp16) | 1.27 GB |
qwen354b_embed_f16.bin |
embedding table (host lookup, fp16) | 1.27 GB |
tokenizer.json |
Qwen2-style BPE tokenizer | 20 MB |
Arch-pinned: a v81 binary will not load on another Hexagon arch. The QNN runtime libs come from the QAIRT SDK, not this repo.
Run
hf download runanywhere/qwen3_5_4b_HNPU --local-dir q35_4b
adb push q35_4b/v81 /data/local/tmp/wq/qwen35-4b
# also stage the QAIRT runtime libs + a qhx_generate built with the qwen3_5_generate host-op (see caveats)
adb shell "cd /data/local/tmp/wq && LD_LIBRARY_PATH=. \
./qhx_generate qwen35-4b/qwen3.5-4b-1024.json libQnnHtp.so libQnnSystem.so qwen35-4b 64 'The capital of France is'"
Measured (device: SM8850 / v81, QAIRT 2.47)
- Decode: ~5.5 tok/s (W8 weight-only, 2 context shards).
- Parity: greedy-EXACT β on-device tokens match the HF
Qwen/Qwen3.5-4Bgreedy continuation 64/64 (prompt "The capital of France is").
Caveats
- W8 weight-only decode (the ~7 GB fp16 decode exceeds device RAM; W8 ~3.6 GB) + fp16 lm-head. Greedy parity is exact under W8 (64/64) for this model.
- Multi-context: the decode is 2 shards (
decode_s0+decode_s1) plus the lm-head β 3 context binaries loaded together. A single decode context this size fails on v81 (per-context alloc ceiling,0x3ea). - The W8 decode
.so(per shard ~1.8 GB) is built with a large-code-model path (weight blob β.lrodata-mcmodel=large); the lm-head ONNX is built directly (MatMul + external fp16 weight) to dodge the 2 GB protobuf limit. Both are baked into the published bins.
- Text-LLM path only β the vision tower + MTP head are not exported.
- Needs a
qhx_generatethat includes theqwen3_5_generatehost-op (QHexRT branchsmonga/qwen_fam). - Built by the in-repo
forgepipeline (oracle-gated export 10/10 β QAIRT-2.47 O3 compile β device greedy gate).
- Downloads last month
- 27