Qwen3.5-0.8B — Hexagon NPU (QHexRT) — v81

A prebuilt QHexRT bundle of Qwen/Qwen3.5-0.8B (the Qwen3-Next "GatedDeltaNet" hybrid lineage) running on the Qualcomm Hexagon v81 NPU (SM8850 / Snapdragon 8 Elite Gen-class). text→text.

The text decoder is a hybrid: 3 of every 4 layers use recurrent gated delta-rule linear attention (Qwen3_5GatedDeltaNet: a short causal-conv FIFO, an SSM state, and a gated RMSNorm), and every 4th layer uses gated softmax attention (q-proj output gate, per-head q/k RMSNorm, partial-rotary M-RoPE with θ=1e7). The runtime drives the decoder with the qwen3_5_generate host-op (decode-over-prompt, with the conv, SSM, and KV state carried host-side); the lm-head runs as a separate graph.
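
To make the 3:1 layer pattern concrete, here is a minimal Python sketch that labels the 24 decoder layers and notes which state each kind would carry. The labels (`linear_attention`, `full_attention`, `conv_fifo`, `ssm_state`, `kv_cache`) are hypothetical names for illustration, and placing the softmax layer last in each group of four is an assumption, not something read from the manifest.

```python
def layer_kinds(num_layers: int = 24, period: int = 4) -> list[str]:
    """Label each decoder layer as linear (GatedDeltaNet) or full (softmax) attention."""
    # Assumption: the softmax-attention layer is the last one in each group of `period`.
    return [
        "full_attention" if (i + 1) % period == 0 else "linear_attention"
        for i in range(num_layers)
    ]

# Hypothetical names for the per-layer state that is carried host-side.
STATE_BY_KIND = {
    "linear_attention": ("conv_fifo", "ssm_state"),
    "full_attention": ("kv_cache",),
}

kinds = layer_kinds()
print(kinds.count("linear_attention"), kinds.count("full_attention"))  # 18 6
```

Under these assumptions only 6 of the 24 layers hold a KV cache that grows with context length; the other 18 carry fixed-size recurrent state.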

Two precision bundles — pick the dir

dir	precision	decode bin size	decode speed	parity	when to use
`v81/`	fp16	961 MB	~22.7 tok/s	strict greedy-exact (suite 10/10)	bit-faithful HF parity
`v81-w8/`	W8A16 (decode W8 + fp16 lm-head)	489 MB	~24.2 tok/s	single-prompt greedy-exact 64/64; length-swept suite 8/10 under a tolerant `greedy_tol` gate	~half the decode size

Both are device-validated on SM8850 / v81 (QAIRT 2.47). For the 0.8B model, the W8 win is mainly size (decode 961 MB → 489 MB); the speedup is marginal (~1.07×) because a model this small is not yet weight-bandwidth-bound.

What's here

`v81/` (fp16 — strict parity)

file	role	~size
`qwen3.5-0.8b-1024.json`	QHexRT manifest (the declarative run plan)	2 KB
`qwen3508b_decode_f16.bin`	decode context binary (24 layers, fp16)	961 MB
`qwen3508b_lmhead_f16.bin`	lm-head context binary (tied embed, fp16)	487 MB
`qwen3508b_embed_f16.bin`	embedding table (host lookup, fp16)	485 MB
`tokenizer.json`	Qwen2-style BPE tokenizer	20 MB

`v81-w8/` (W8A16 — half the decode size)

file	role	~size
`qwen3.5-0.8b-1024.json`	QHexRT manifest (references the W8 decode)	2 KB
`qwen3508b_decode_w8.bin`	decode context binary (24 layers, W8 weight-only)	489 MB
`qwen3508b_lmhead_f16.bin`	lm-head context binary (tied embed, fp16)	487 MB
`qwen3508b_embed_f16.bin`	embedding table (host lookup, fp16)	485 MB
`tokenizer.json`	Qwen2-style BPE tokenizer	20 MB
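
As a rough check on what each bundle costs in storage, the following Python sketch adds up the approximate sizes from the two file tables. The dictionary keys and the helper `bundle_total_mb` are hypothetical names; the numbers are the rounded sizes listed above, so the totals are approximate.

```python
# Approximate sizes in MB, copied from the file tables (the 2 KB manifest is ignored).
SHARED = {"lmhead_f16": 487, "embed_f16": 485, "tokenizer": 20}
DECODE = {"v81": 961, "v81-w8": 489}

def bundle_total_mb(variant: str) -> int:
    """Total on-disk size of one bundle: its decode binary plus the shared files."""
    return DECODE[variant] + sum(SHARED.values())

fp16_total = bundle_total_mb("v81")    # 1953
w8_total = bundle_total_mb("v81-w8")   # 1481

print(f"decode ratio: {DECODE['v81-w8'] / DECODE['v81']:.2f}")
print(f"bundle ratio: {w8_total / fp16_total:.2f}")
```

Because the lm-head and embedding table stay fp16 in both bundles, the decode binary roughly halves while the whole download shrinks by only about a quarter.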

Arch-pinned: a v81 binary will not load on another Hexagon arch (the soc_model/dsp_arch are baked in). The QNN runtime libs (libQnnHtp.so/libQnnSystem.so + the v81 HTP skel) come from the QAIRT SDK, not this repo.
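
Since the bundle and the QNN runtime libraries come from different places, a small pre-flight check before pushing to the device can help. This is a sketch only: `missing_files` is a hypothetical helper, the file names are those listed in the tables above, and the v81 HTP skel is not checked because its file name is not given here.

```python
from pathlib import Path

# File names as listed in the bundle tables above.
BUNDLE_FILES = {
    "v81": [
        "qwen3.5-0.8b-1024.json",
        "qwen3508b_decode_f16.bin",
        "qwen3508b_lmhead_f16.bin",
        "qwen3508b_embed_f16.bin",
        "tokenizer.json",
    ],
    "v81-w8": [
        "qwen3.5-0.8b-1024.json",
        "qwen3508b_decode_w8.bin",
        "qwen3508b_lmhead_f16.bin",
        "qwen3508b_embed_f16.bin",
        "tokenizer.json",
    ],
}
# Runtime libraries that come from the QAIRT SDK, not from this repo.
RUNTIME_LIBS = ["libQnnHtp.so", "libQnnSystem.so"]

def missing_files(bundle_dir, variant, runtime_dir=None):
    """Return the expected files that are absent, in table order."""
    root = Path(bundle_dir)
    missing = [name for name in BUNDLE_FILES[variant] if not (root / name).is_file()]
    if runtime_dir is not None:
        libs = Path(runtime_dir)
        missing += [lib for lib in RUNTIME_LIBS if not (libs / lib).is_file()]
    return missing

print(missing_files("q35_08b/v81", "v81"))
```

An empty list means every expected file was found; this checks presence only and cannot tell whether a binary was built for the right Hexagon arch.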

Run

# fp16 (strict parity):
hf download runanywhere/qwen3_5_0_8b_HNPU --include "v81/*" --local-dir q35_08b
adb push q35_08b/v81 /data/local/tmp/wq/qwen35
adb shell "cd /data/local/tmp/wq && LD_LIBRARY_PATH=. \
  ./qhx_generate qwen35/qwen3.5-0.8b-1024.json libQnnHtp.so libQnnSystem.so qwen35 64 'The capital of France is'"

# W8 (half the decode size):
hf download runanywhere/qwen3_5_0_8b_HNPU --include "v81-w8/*" --local-dir q35_08b
adb push q35_08b/v81-w8 /data/local/tmp/wq/qwen35-w8
adb shell "cd /data/local/tmp/wq && LD_LIBRARY_PATH=. \
  ./qhx_generate qwen35-w8/qwen3.5-0.8b-1024.json libQnnHtp.so libQnnSystem.so qwen35-w8 64 'The capital of France is'"

Tool arg order is invariant: <tool> <manifest> libQnnHtp.so libQnnSystem.so <artifacts_root> <max_new> "<prompt>".
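
Because the argument order is fixed, the command line can be assembled mechanically. The Python sketch below does that with `shlex.quote` from the standard library; `qhx_command` is a hypothetical wrapper for illustration and is not part of QHexRT.

```python
import shlex

def qhx_command(manifest, artifacts_root, max_new, prompt,
                tool="./qhx_generate",
                htp_lib="libQnnHtp.so", system_lib="libQnnSystem.so"):
    """Build the qhx_generate command line in the fixed argument order."""
    argv = [tool, manifest, htp_lib, system_lib, artifacts_root, str(max_new), prompt]
    # Quote each argument so a prompt containing spaces stays a single argument.
    return " ".join(shlex.quote(arg) for arg in argv)

cmd = qhx_command("qwen35/qwen3.5-0.8b-1024.json", "qwen35", 64,
                  "The capital of France is")
print(cmd)
```

The resulting string is meant to run in the device shell from /data/local/tmp/wq; wrapping it in `adb shell "..."` adds one more level of quoting on the host side.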

Measured (device: SM8850 / v81, QAIRT 2.47)

fp16 (v81/): ~22.7 tok/s; greedy-exact vs HF Qwen/Qwen3.5-0.8B (64/64). The fp16 SSM state did not drift.
W8 (v81-w8/): ~24.2 tok/s (~1.07×); single-prompt greedy-exact 64/64; length-swept parity suite 8/10 under a tolerant greedy_tol gate (2 W8 near-tie argmax-drifts, coherent). The W8 benefit here is size, not speed: the decode binary roughly halves (961 MB → 489 MB).
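
To illustrate the difference between a strict greedy-exact check and a tolerant one, here is a Python sketch. It is not the actual greedy_tol gate, whose definition is not given here. It assumes a tolerant gate accepts a divergence only when the reference model's top-1 and top-2 logits at that step are within a tolerance (a near-tie); `ref_margins` and `tol` are hypothetical inputs.

```python
def first_mismatch(reference, candidate):
    """Index of the first differing token, or None if the sequences are identical."""
    for i, (r, c) in enumerate(zip(reference, candidate)):
        if r != c:
            return i
    if len(reference) != len(candidate):
        return min(len(reference), len(candidate))
    return None

def greedy_exact(reference, candidate):
    """Strict gate: every generated token must match the reference."""
    return first_mismatch(reference, candidate) is None

def tolerant_pass(reference, candidate, ref_margins, tol):
    """Assumed tolerant gate: allow a divergence only at a near-tie step.

    ref_margins[i] is the reference top-1 minus top-2 logit at step i.
    """
    i = first_mismatch(reference, candidate)
    if i is None:
        return True
    return i < len(ref_margins) and ref_margins[i] <= tol

print(greedy_exact([1, 2, 3], [1, 2, 3]))                         # True
print(tolerant_pass([1, 2, 3], [1, 9, 3], [5.0, 0.01, 4.0], 0.05))  # True
```

Once greedy decoding diverges at one step, all later tokens are conditioned on a different prefix, which is why the sketch judges only the first mismatch.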

Caveats

Precision is your choice: v81/ fp16 for strict bit-faithful parity, v81-w8/ for half the decode size at a tolerant (8/10 suite) parity. W8 quantizes only weights — the fp32 SSM scan + host conv/SSM/KV state are untouched. W4 is HTP-toolchain-blocked on v81.
Text-LLM path only — the vision tower + MTP head of the multimodal qwen3_5 checkpoint are not exported.
Needs a qhx_generate that includes the qwen3_5_generate host-op (QHexRT branch smonga/qwen_fam; the GatedDeltaNet decode loop is family-specific).
Built by the in-repo forge pipeline (oracle-gated export 10/10 → QAIRT-2.47 O3 compile → device greedy gate).
