Qwen3.5-0.8B - Hexagon NPU (QHexRT) - v81

A prebuilt QHexRT bundle of Qwen/Qwen3.5-0.8B (the Qwen3-Next "GatedDeltaNet" hybrid lineage) running on the Qualcomm Hexagon v81 NPU (SM8850 / Snapdragon 8 Elite Gen-class). text→text.

The text decoder is a hybrid: 3 of every 4 layers are recurrent gated delta-rule linear attention (Qwen3_5GatedDeltaNet: a short causal-conv FIFO + an SSM state + gated RMSNorm), and every 4th layer is gated softmax attention (q-proj output gate, per-head q/k RMSNorm, partial-rotary M-RoPE θ=1e7). The runtime drives it with the qwen3_5_generate host-op (decode-over-prompt; conv + SSM + KV state carried host-side), with the lm-head as a separate graph.
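The 3-to-1 layer pattern can be illustrated with a small sketch. It assumes the full-attention layers sit at 0-based indices 3, 7, 11, ... (the text only says "every 4th layer") and uses the 24-layer count listed for the decode binary. `layer_type` and `count_layers` are hypothetical helper names, not part of QHexRT.

```shell
# Hypothetical helper: print the layer kind for a 0-based layer index,
# ASSUMING every 4th layer (indices 3, 7, 11, ...) is full attention.
layer_type() {
  if [ $(( ($1 + 1) % 4 )) -eq 0 ]; then
    echo "full_attention"
  else
    echo "linear_attention"
  fi
}

# Walk the 24 decoder layers and count each kind.
count_layers() {
  linear=0; full=0; i=0
  while [ "$i" -lt 24 ]; do
    if [ "$(layer_type "$i")" = "full_attention" ]; then
      full=$((full + 1))
    else
      linear=$((linear + 1))
    fi
    i=$((i + 1))
  done
  echo "linear=$linear full=$full"
}

count_layers
```

Under that assumption, 18 layers carry recurrent conv + SSM state and 6 carry a KV cache, which is why the host-op has to track three kinds of state.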

Two precision bundles: pick the dir

| dir | precision | decode bin | decode speed | parity | when to use |
| --- | --- | --- | --- | --- | --- |
| v81/ | fp16 | 961 MB | ~22.7 tok/s | strict greedy-exact (suite 10/10) | bit-faithful HF parity |
| v81-w8/ | W8A16 (decode W8 + fp16 lm-head) | 489 MB | ~24.2 tok/s | single-prompt greedy-exact 64/64; length-swept suite 8/10 under a tolerant greedy_tol gate | ~half the decode size |

Both are device-validated on SM8850 / v81 (QAIRT 2.47). For the 0.8B the W8 win is mainly SIZE (decode 961 MB → 489 MB); the speedup is marginal (~1.07×) because a model this small is not yet weight-bandwidth-bound.
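The two ratios quoted above can be re-derived from the table figures. This is a minimal sketch using only integer shell arithmetic; `pct` is a hypothetical helper, and the throughputs are entered in tenths of tok/s to avoid floating point.

```shell
# Rounded integer percentage of a/b, using only POSIX shell arithmetic.
pct() {
  echo $(( ($1 * 100 + $2 / 2) / $2 ))
}

# Decode binary sizes in MB: W8 (489) relative to fp16 (961).
size_ratio=$(pct 489 961)

# Throughput in tenths of tok/s: W8 (24.2) relative to fp16 (22.7).
speed_ratio=$(pct 242 227)

echo "W8 decode size:  ${size_ratio}% of fp16"
echo "W8 decode speed: ${speed_ratio}% of fp16"
```

The size drops to roughly half while the speed gain stays in single-digit percent, which matches the "size, not speed" framing.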

What's here

v81/ (fp16, strict parity)

| file | role | ~size |
| --- | --- | --- |
| qwen3.5-0.8b-1024.json | QHexRT manifest (the declarative run plan) | 2 KB |
| qwen3508b_decode_f16.bin | decode context binary (24 layers, fp16) | 961 MB |
| qwen3508b_lmhead_f16.bin | lm-head context binary (tied embed, fp16) | 487 MB |
| qwen3508b_embed_f16.bin | embedding table (host lookup, fp16) | 485 MB |
| tokenizer.json | Qwen2-style BPE tokenizer | 20 MB |

v81-w8/ (W8A16, half the decode size)

| file | role | ~size |
| --- | --- | --- |
| qwen3.5-0.8b-1024.json | QHexRT manifest (references the W8 decode) | 2 KB |
| qwen3508b_decode_w8.bin | decode context binary (24 layers, W8 weight-only) | 489 MB |
| qwen3508b_lmhead_f16.bin | lm-head context binary (tied embed, fp16) | 487 MB |
| qwen3508b_embed_f16.bin | embedding table (host lookup, fp16) | 485 MB |
| tokenizer.json | Qwen2-style BPE tokenizer | 20 MB |
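Since only the decode binary changes between the two bundles, the total footprint shrinks by less than half. The sketch below sums the approximate per-file sizes listed above (the 2 KB manifest is ignored); `sum_mb` is a hypothetical helper, and the totals are only as accurate as the rounded "~size" figures.

```shell
# Sum a list of sizes given in MB.
sum_mb() {
  total=0
  for mb in "$@"; do
    total=$((total + mb))
  done
  echo "$total"
}

# decode + lm-head + embed + tokenizer, per bundle.
fp16_total=$(sum_mb 961 487 485 20)
w8_total=$(sum_mb 489 487 485 20)
saved=$((fp16_total - w8_total))

echo "v81/    total: ~${fp16_total} MB"
echo "v81-w8/ total: ~${w8_total} MB (saves ~${saved} MB)"
```

The fp16 lm-head and embedding table are shared by both bundles, so they set a floor of roughly 1 GB on the download either way.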

Arch-pinned: a v81 binary will not load on another Hexagon arch (the soc_model/dsp_arch are baked in). The QNN runtime libs (libQnnHtp.so/libQnnSystem.so + the v81 HTP skel) come from the QAIRT SDK, not this repo.
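Because a mismatched SoC or a missing runtime lib only shows up at load time, a small preflight can save a push. The sketch below is an assumption-laden illustration: `is_supported_soc` and `missing_qnn_libs` are hypothetical helpers, the check assumes the device reports its SoC through the Android `ro.soc.model` property, and the v81 HTP skel is not checked because its filename is not given here.

```shell
# Return 0 if the reported SoC model is the one this bundle targets.
is_supported_soc() {
  [ "$1" = "SM8850" ]
}

# Print the names of the required QNN libs that are missing from
# the local staging directory $1 (one name per line).
missing_qnn_libs() {
  for lib in libQnnHtp.so libQnnSystem.so; do
    if [ ! -f "$1/$lib" ]; then
      echo "$lib"
    fi
  done
}

# Example use (needs a connected device, so it is left commented out):
#   soc=$(adb shell getprop ro.soc.model | tr -d '\r')
#   is_supported_soc "$soc" || echo "unsupported SoC: $soc"
#   missing_qnn_libs ./qairt_libs
```

An empty result from `missing_qnn_libs` means both libs named above are staged; it says nothing about whether they come from the matching QAIRT release.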

Run

# fp16 (strict parity):
hf download runanywhere/qwen3_5_0_8b_HNPU --include "v81/*" --local-dir q35_08b
adb push q35_08b/v81 /data/local/tmp/wq/qwen35
adb shell "cd /data/local/tmp/wq && LD_LIBRARY_PATH=. \
  ./qhx_generate qwen35/qwen3.5-0.8b-1024.json libQnnHtp.so libQnnSystem.so qwen35 64 'The capital of France is'"

# W8 (half the decode size):
hf download runanywhere/qwen3_5_0_8b_HNPU --include "v81-w8/*" --local-dir q35_08b
adb push q35_08b/v81-w8 /data/local/tmp/wq/qwen35-w8
adb shell "cd /data/local/tmp/wq && LD_LIBRARY_PATH=. \
  ./qhx_generate qwen35-w8/qwen3.5-0.8b-1024.json libQnnHtp.so libQnnSystem.so qwen35-w8 64 'The capital of France is'"

Tool arg order is invariant: <tool> <manifest> libQnnHtp.so libQnnSystem.so <artifacts_root> <max_new> "<prompt>".
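To keep that argument order fixed across bundles, the on-device command can be assembled by a small wrapper. This is a sketch only: `qhx_cmd` is a hypothetical helper, it assumes the bundle directory was pushed under /data/local/tmp/wq as in the commands above, and it does not escape prompts that themselves contain single quotes.

```shell
# Build the on-device command line for a pushed bundle.
#   $1 = bundle dir under /data/local/tmp/wq (also the artifacts root)
#   $2 = max new tokens
#   $3 = prompt (must not contain single quotes)
qhx_cmd() {
  printf "cd /data/local/tmp/wq && LD_LIBRARY_PATH=. ./qhx_generate %s/qwen3.5-0.8b-1024.json libQnnHtp.so libQnnSystem.so %s %s '%s'" \
    "$1" "$1" "$2" "$3"
}

# Example use (needs a connected device, so it is left commented out):
#   adb shell "$(qhx_cmd qwen35-w8 64 'The capital of France is')"
qhx_cmd qwen35 64 'The capital of France is'
echo
```

Switching between fp16 and W8 then changes only the first argument, since both bundles use the same manifest filename.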

Measured (device: SM8850 / v81, QAIRT 2.47)

  • fp16 (v81/): ~22.7 tok/s; greedy-EXACT vs HF Qwen/Qwen3.5-0.8B (64/64). The fp16 SSM state did not drift.
  • W8 (v81-w8/): ~24.2 tok/s (1.07×); single-prompt greedy-exact 64/64; length-swept parity suite 8/10 under a tolerant greedy_tol gate (2 W8 near-tie argmax-drifts, coherent). The W8 benefit here is size, not speed: the decode halves (961 MB → 489 MB).
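A figure like "64/64" can be read as the number of generated tokens that match the reference before the first divergence. The sketch below shows one way to compute such a score from two space-separated token-id lists. `greedy_match` is a hypothetical helper, the stop-at-first-divergence rule is an assumption, and the tolerant greedy_tol gate is not modeled because its rule is not described here.

```shell
# Count how many leading tokens of the candidate ($2) match the
# reference ($1); both are space-separated token-id lists.
# Prints "<matched>/<reference length>".
greedy_match() {
  ref=$1
  got=$2
  total=0; same=0; diverged=0
  set -- $got            # positional params now hold candidate tokens
  for r in $ref; do
    total=$((total + 1))
    if [ "$diverged" -eq 0 ] && [ "$#" -gt 0 ] && [ "$1" = "$r" ]; then
      same=$((same + 1))
    else
      diverged=1         # after the first mismatch nothing else counts
    fi
    if [ "$#" -gt 0 ]; then
      shift
    fi
  done
  echo "$same/$total"
}

greedy_match "11 22 33 44" "11 22 33 44"
greedy_match "11 22 33 44" "11 22 99 44"
```

Counting only the matching prefix is deliberately strict: once greedy decoding picks a different token, later tokens are conditioned on a different context, so position-wise agreement after that point is not meaningful.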

Caveats

  • Precision is your choice: v81/ fp16 for strict bit-faithful parity, v81-w8/ for half the decode size at a tolerant (8/10 suite) parity. W8 quantizes only weights; the fp32 SSM scan + host conv/SSM/KV state are untouched. W4 is HTP-toolchain-blocked on v81.
  • Text-LLM path only: the vision tower + MTP head of the multimodal qwen3_5 checkpoint are not exported.
  • Needs a qhx_generate that includes the qwen3_5_generate host-op (QHexRT branch smonga/qwen_fam; the GatedDeltaNet decode loop is family-specific).
  • Built by the in-repo forge pipeline (oracle-gated export 10/10 → QAIRT-2.47 O3 compile → device greedy gate).