Qwen3.5-4B — Hexagon NPU (QHexRT) — v81

A prebuilt QHexRT bundle of Qwen/Qwen3.5-4B (the Qwen3-Next "GatedDeltaNet" hybrid lineage) running on the Qualcomm Hexagon v81 NPU (SM8850). Text in, text out only (text→text).

The text decoder is a hybrid: 3 of every 4 layers are a recurrent gated delta-rule linear attention (Qwen3_5GatedDeltaNet: a short causal-conv FIFO + an SSM state + gated RMSNorm), and every 4th layer is gated softmax attention (q-proj output gate, per-head q/k RMSNorm, partial-rotary M-RoPE θ=1e7). The runtime drives it with the qwen3_5_generate host-op (decode-over-prompt; conv + SSM + KV state carried host-side), the lm-head as a separate graph.
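The 3:1 layer pattern described above can be written out as a schedule. This is a minimal sketch, not QHexRT code: it assumes the softmax layer is the last of each group of four (0-based indices 3, 7, 11, ...) and uses the 32-layer count stated for the decode graph. The function name `layer_kind` and the label strings are hypothetical.

```python
from collections import Counter

NUM_LAYERS = 32          # layer count stated for the decode graph in this card
FULL_ATTN_INTERVAL = 4   # "every 4th layer" is gated softmax attention

def layer_kind(index: int) -> str:
    """Return an illustrative kind label for a 0-based layer index.

    Assumption: the softmax layer closes each group of four layers.
    """
    if (index + 1) % FULL_ATTN_INTERVAL == 0:
        return "full_attention"    # state carried host-side: KV cache
    return "linear_attention"      # state carried host-side: conv FIFO + SSM state

schedule = [layer_kind(i) for i in range(NUM_LAYERS)]
counts = Counter(schedule)
print(counts["linear_attention"], counts["full_attention"])
```

Under these assumptions, 24 layers carry conv + SSM state and 8 carry a KV cache.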

Multi-context decode. The 32-layer W8 decode (3.6 GB) exceeds the v81 cDSP per-context memory ceiling (contextCreateFromBinary → 0x3ea MEM_ALLOC). So the decode is split into 2 context shards by layer range (0–15 / 16–31, ~1.81 GB each); the host-op runs them in sequence, threading the residual stream shard→shard (only the last shard applies the final norm). Each shard updates only its own layers' conv/SSM/KV state. This is numerically identical to a single decode graph — parity stays greedy-exact.
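The shard-threading idea can be illustrated with a toy model. The sketch below is an assumption-laden stand-in, not the qwen3_5_generate host-op: each "layer" is a small pure-Python function with its own recurrent state, and `make_layer`, `final_norm`, and `run_shard` are hypothetical names. It shows why splitting by layer range leaves the result unchanged: the same operations run in the same order, and only the last shard applies the final norm.

```python
from typing import Callable, Dict, List

Vector = List[float]
State = Dict[int, float]

def make_layer(i: int) -> Callable[[Vector, State], Vector]:
    """Toy stand-in for decoder layer i: residual update plus per-layer state."""
    def layer(x: Vector, state: State) -> Vector:
        prev = state.get(i, 0.0)
        state[i] = 0.5 * prev + sum(x)      # touches only layer i's state
        return [v + 0.001 * (i + 1) * state[i] for v in x]
    return layer

def final_norm(x: Vector) -> Vector:
    """Toy RMS normalization applied once, after the last layer."""
    rms = (sum(v * v for v in x) / len(x)) ** 0.5
    return [v / (rms + 1e-6) for v in x]

def run_shard(layers, first: int, last: int, x: Vector,
              state: State, apply_final_norm: bool) -> Vector:
    """Run layers first..last (inclusive) on the residual stream x."""
    for i in range(first, last + 1):
        x = layers[i](x, state)
    return final_norm(x) if apply_final_norm else x

layers = [make_layer(i) for i in range(32)]
steps = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.0, -0.1, 0.2]]  # two toy decode steps

state_single: State = {}
state_s0: State = {}
state_s1: State = {}
out_single, out_sharded = [], []
for x in steps:
    # Reference: one graph covering all 32 layers.
    out_single.append(run_shard(layers, 0, 31, list(x), state_single, True))
    # Sharded: layers 0-15, then 16-31; residual stream passed shard to shard.
    hidden = run_shard(layers, 0, 15, list(x), state_s0, False)
    out_sharded.append(run_shard(layers, 16, 31, hidden, state_s1, True))

assert out_single == out_sharded
```

Each shard's state dictionary ends up holding only its own layers' entries, mirroring the per-shard conv/SSM/KV ownership described above.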

What's here (`v81/` — the flat artifacts root)

| file | role | ~size |
|---|---|---|
| `qwen3.5-4b-1024.json` | QHexRT manifest (the declarative run plan) | 2 KB |
| `qwen354b_decode_s0_w8.bin` | decode shard 0 (layers 0–15, W8) | 1.81 GB |
| `qwen354b_decode_s1_w8.bin` | decode shard 1 (layers 16–31, W8) | 1.81 GB |
| `qwen354b_lmhead_f16.bin` | lm-head context binary (tied embed, fp16) | 1.27 GB |
| `qwen354b_embed_f16.bin` | embedding table (host lookup, fp16) | 1.27 GB |
| `tokenizer.json` | Qwen2-style BPE tokenizer | 20 MB |

Arch-pinned: a v81 binary will not load on another Hexagon arch. The QNN runtime libs come from the QAIRT SDK, not this repo.
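For planning device storage, the approximate sizes listed above can be totalled. This is a rough sketch using the rounded figures from the file list (decimal GB assumed); the variable names are illustrative, and actual on-disk sizes may differ slightly.

```python
# Approximate sizes in GB, copied from the file list in this card.
sizes_gb = {
    "qwen3.5-4b-1024.json": 0.000002,
    "qwen354b_decode_s0_w8.bin": 1.81,
    "qwen354b_decode_s1_w8.bin": 1.81,
    "qwen354b_lmhead_f16.bin": 1.27,
    "qwen354b_embed_f16.bin": 1.27,
    "tokenizer.json": 0.02,
}

# The three context binaries that are loaded together on the NPU.
context_bins = [
    "qwen354b_decode_s0_w8.bin",
    "qwen354b_decode_s1_w8.bin",
    "qwen354b_lmhead_f16.bin",
]

total_gb = sum(sizes_gb.values())
context_gb = sum(sizes_gb[name] for name in context_bins)
print(f"bundle: ~{total_gb:.1f} GB, context binaries: ~{context_gb:.1f} GB")
```

The push target under /data/local/tmp therefore needs a bit over 6 GB free, before counting the QAIRT runtime libs.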

Run

hf download runanywhere/qwen3_5_4b_HNPU --local-dir q35_4b
adb push q35_4b/v81 /data/local/tmp/wq/qwen35-4b
# also stage the QAIRT runtime libs + a qhx_generate built with the qwen3_5_generate host-op (see caveats)
adb shell "cd /data/local/tmp/wq && LD_LIBRARY_PATH=. \
  ./qhx_generate qwen35-4b/qwen3.5-4b-1024.json libQnnHtp.so libQnnSystem.so qwen35-4b 64 'The capital of France is'"

Measured (device: SM8850 / v81, QAIRT 2.47)

Decode: ~5.5 tok/s (W8 weight-only, 2 context shards).
Parity: greedy-EXACT — on-device tokens match the HF Qwen/Qwen3.5-4B greedy continuation 64/64 (prompt "The capital of France is").

Caveats

W8 weight-only decode (the ~7 GB fp16 decode exceeds device RAM; W8 ~3.6 GB) + fp16 lm-head. Greedy parity is exact under W8 (64/64) for this model.
Multi-context: the decode is 2 shards (decode_s0 + decode_s1) plus the lm-head — 3 context binaries loaded together. A single decode context this size fails on v81 (per-context alloc ceiling, 0x3ea).
The W8 decode .so (~1.8 GB per shard) is built with a large-code-model path (weight blob → .lrodata + -mcmodel=large); the lm-head ONNX is built directly (MatMul + external fp16 weight) to dodge the 2 GB protobuf limit. Both are baked into the published bins.
Text-LLM path only — the vision tower + MTP head are not exported.
Needs a qhx_generate that includes the qwen3_5_generate host-op (QHexRT branch smonga/qwen_fam).
Built by the in-repo forge pipeline (oracle-gated export 10/10 → QAIRT-2.47 O3 compile → device greedy gate).

