Qwen3.5-4B β€” Hexagon NPU (QHexRT) β€” v81

A prebuilt QHexRT bundle of Qwen/Qwen3.5-4B (the Qwen3-Next "GatedDeltaNet" hybrid lineage) running on the Qualcomm Hexagon v81 NPU (SM8850). text→text.

The text decoder is a hybrid: 3 of every 4 layers are a recurrent gated delta-rule linear attention (Qwen3_5GatedDeltaNet: a short causal-conv FIFO + an SSM state + gated RMSNorm), and every 4th layer is gated softmax attention (q-proj output gate, per-head q/k RMSNorm, partial-rotary M-RoPE ΞΈ=1e7). The runtime drives it with the qwen3_5_generate host-op (decode-over-prompt; conv + SSM + KV state carried host-side), the lm-head as a separate graph.

Multi-context decode. The 32-layer W8 decode (3.6 GB) exceeds the v81 cDSP per-context memory ceiling (contextCreateFromBinary β†’ 0x3ea MEM_ALLOC). So the decode is split into 2 context shards by layer range (0–15 / 16–31, ~1.81 GB each); the host-op runs them in sequence, threading the residual stream shardβ†’shard (only the last shard applies the final norm). Each shard updates only its own layers' conv/SSM/KV state. This is numerically identical to a single decode graph β€” parity stays greedy-exact.

What's here (v81/ β€” the flat artifacts root)

file role ~size
qwen3.5-4b-1024.json QHexRT manifest (the declarative run plan) 2 KB
qwen354b_decode_s0_w8.bin decode shard 0 (layers 0–15, W8) 1.81 GB
qwen354b_decode_s1_w8.bin decode shard 1 (layers 16–31, W8) 1.81 GB
qwen354b_lmhead_f16.bin lm-head context binary (tied embed, fp16) 1.27 GB
qwen354b_embed_f16.bin embedding table (host lookup, fp16) 1.27 GB
tokenizer.json Qwen2-style BPE tokenizer 20 MB

Arch-pinned: a v81 binary will not load on another Hexagon arch. The QNN runtime libs come from the QAIRT SDK, not this repo.

Run

hf download runanywhere/qwen3_5_4b_HNPU --local-dir q35_4b
adb push q35_4b/v81 /data/local/tmp/wq/qwen35-4b
# also stage the QAIRT runtime libs + a qhx_generate built with the qwen3_5_generate host-op (see caveats)
adb shell "cd /data/local/tmp/wq && LD_LIBRARY_PATH=. \
  ./qhx_generate qwen35-4b/qwen3.5-4b-1024.json libQnnHtp.so libQnnSystem.so qwen35-4b 64 'The capital of France is'"

Measured (device: SM8850 / v81, QAIRT 2.47)

  • Decode: ~5.5 tok/s (W8 weight-only, 2 context shards).
  • Parity: greedy-EXACT β€” on-device tokens match the HF Qwen/Qwen3.5-4B greedy continuation 64/64 (prompt "The capital of France is").

Caveats

  • W8 weight-only decode (the ~7 GB fp16 decode exceeds device RAM; W8 ~3.6 GB) + fp16 lm-head. Greedy parity is exact under W8 (64/64) for this model.
  • Multi-context: the decode is 2 shards (decode_s0 + decode_s1) plus the lm-head β€” 3 context binaries loaded together. A single decode context this size fails on v81 (per-context alloc ceiling, 0x3ea).
  • The W8 decode .so (per shard ~1.8 GB) is built with a large-code-model path (weight blob β†’ .lrodata
    • -mcmodel=large); the lm-head ONNX is built directly (MatMul + external fp16 weight) to dodge the 2 GB protobuf limit. Both are baked into the published bins.
  • Text-LLM path only β€” the vision tower + MTP head are not exported.
  • Needs a qhx_generate that includes the qwen3_5_generate host-op (QHexRT branch smonga/qwen_fam).
  • Built by the in-repo forge pipeline (oracle-gated export 10/10 β†’ QAIRT-2.47 O3 compile β†’ device greedy gate).
Downloads last month
27
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for runanywhere/qwen3_5_4b_HNPU

Finetuned
Qwen/Qwen3.5-4B
Finetuned
(343)
this model