Llama-3.2-1B — QHexRT NPU bundle (Hexagon v79)

Precompiled Llama-3.2-1B for the QHexRT runtime on Qualcomm Hexagon v79 (Snapdragon 8 Elite / SM8750, e.g. Galaxy S25). The full 16-layer transformer and the lm-head execute on the NPU (HTP/HMX) via the `llama_generate` host-op, which runs batched prefill followed by decode against a resident KV cache. The host handles only the embedding lookup, llama3-scaled RoPE, the cache mask, and sampling. Output has been validated on-device and is coherent.
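For readers unfamiliar with the llama3-scaled RoPE step that stays on the host, here is a minimal sketch of how the scaled inverse frequencies are commonly computed. The constants are assumptions taken from the public Llama-3.2-1B configuration (`rope_theta` 500000, scaling factor 32, low/high frequency factors 1 and 4, original context 8192), not values read from this bundle, and the function names are hypothetical. QHexRT's own host code may differ.

```python
import math

# Assumed values from the public Llama-3.2-1B config; not read from this bundle.
ROPE_THETA = 500000.0
HEAD_DIM = 64             # matches the head_dim listed for the decode graph
FACTOR = 32.0
LOW_FREQ_FACTOR = 1.0
HIGH_FREQ_FACTOR = 4.0
ORIG_MAX_POS = 8192


def llama3_inv_freq(head_dim=HEAD_DIM, theta=ROPE_THETA):
    """Return head_dim/2 inverse frequencies with llama3-style scaling."""
    low_wavelen = ORIG_MAX_POS / LOW_FREQ_FACTOR    # longer than this: fully scaled
    high_wavelen = ORIG_MAX_POS / HIGH_FREQ_FACTOR  # shorter than this: untouched
    out = []
    for i in range(0, head_dim, 2):
        inv = 1.0 / (theta ** (i / head_dim))
        wavelen = 2.0 * math.pi / inv
        if wavelen < high_wavelen:
            out.append(inv)
        elif wavelen > low_wavelen:
            out.append(inv / FACTOR)
        else:
            # Smoothly interpolate between the scaled and unscaled frequency.
            smooth = (ORIG_MAX_POS / wavelen - LOW_FREQ_FACTOR) / (
                HIGH_FREQ_FACTOR - LOW_FREQ_FACTOR
            )
            out.append((1.0 - smooth) * inv / FACTOR + smooth * inv)
    return out


def rope_angles(position, inv_freq):
    """Return (cos, sin) pairs for one token position."""
    return [(math.cos(position * f), math.sin(position * f)) for f in inv_freq]


inv_freq = llama3_inv_freq()
cos_sin = rope_angles(5, inv_freq)
```

The highest frequencies pass through unchanged while the lowest are divided by the scaling factor, which is why this step cannot be replaced by a plain RoPE table.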

Faster than Genie on the same device: QHexRT decodes at ~16 tok/s versus Genie's 14.69 tok/s (fp16), and prefills at 1983 tok/s versus 1538 tok/s. Two things account for the gap. QHexRT runs a prepared QNN context binary with an in-graph NPU lm-head (the host-side lm-head was the previous bottleneck), and it uses W8A16 weights, which need half the weight-memory bandwidth of fp16.

Contents (`v79/`)

| file | what | size |
|---|---|---|
| `llama-3.2-1b.json` | QHexRT manifest (llm family, `llama_generate` plan) | - |
| `llama_full_wqo_o3.bin` | W8A16 batched prefill graph (AR=128) | 982 MB |
| `llama_dec_wqo_o3.bin` | W8A16 decode graph (MAXCTX=512, GQA 32q/8kv, head_dim 64) | 981 MB |
| `llama_lmhead.bin` | NPU lm-head graph, `hidden[1,2048] → logits[1,128256]`, fp16 on HMX | 526 MB |
| `llama_embed_f16.bin` | token embedding table `[128256,2048]` f16 (tied; host lookup) | 525 MB |
| `tokenizer.json` | Llama-3 tokenizer (vocab 128256) | - |

≈ 3 GB on disk; ~2.5 GB peak device RSS.
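The listed sizes can be cross-checked against the tensor shapes. The sketch below assumes the sizes are decimal megabytes (10^6 bytes) and that the embedding file is a raw fp16 table with no header; the card states neither, so treat the result as a sanity check only.

```python
# Shapes and sizes copied from the contents table above.
VOCAB, HIDDEN, FP16_BYTES = 128256, 2048, 2

# Raw size of a [128256, 2048] fp16 table.
embed_bytes = VOCAB * HIDDEN * FP16_BYTES
embed_mb = embed_bytes / 1e6  # decimal megabytes (assumption)

# Listed sizes of the four binaries, in MB.
sizes_mb = {
    "llama_full_wqo_o3.bin": 982,
    "llama_dec_wqo_o3.bin": 981,
    "llama_lmhead.bin": 526,
    "llama_embed_f16.bin": 525,
}
total_mb = sum(sizes_mb.values())

print(f"embedding table: {embed_mb:.1f} MB")
print(f"total binaries:  {total_mb} MB")
```

The computed embedding size (about 525.3 MB) agrees with the listed 525 MB, and the four binaries sum to 3014 MB, consistent with the ≈ 3 GB figure.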

Run (QHexRT CLI)

hf download runanywhere/llama3_2_1b_HNPU --local-dir llama3_2_1b_HNPU
adb push llama3_2_1b_HNPU/v79 /data/local/tmp/wq/llama
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq/dsp;/data/local/tmp/wq;/vendor/dsp/cdsp'; \
  LD_LIBRARY_PATH=. ./qhx_generate llama/llama-3.2-1b.json libQnnHtp.so libQnnSystem.so llama 64 'The capital of France is'"
# -> "Paris. The Eiffel Tower is located in Paris. ..."  (decode ~16 tok/s, prefill ~1983 tok/s)

Performance (measured, v79 / S25)

| metric | QHexRT (tok/s) | Genie fp16 (tok/s) |
|---|---|---|
| decode | ~16 | 14.69 |
| prefill | 1983 | 1538 |
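To put the table in relative terms, the speedups can be computed directly from the figures above. Note that the decode figure is approximate (~16 tok/s), so the decode ratio is only a rough estimate.

```python
# Throughput figures in tok/s, copied from the performance table.
qhexrt = {"decode": 16.0, "prefill": 1983.0}   # decode is approximate (~16)
genie = {"decode": 14.69, "prefill": 1538.0}

# Ratio > 1 means QHexRT is faster.
speedup = {k: qhexrt[k] / genie[k] for k in qhexrt}

for metric, ratio in speedup.items():
    print(f"{metric}: {ratio:.2f}x")
```

This works out to roughly 1.09x on decode and 1.29x on prefill.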

Notes

Arch: v79 only — context binaries are dsp-arch-pinned.
No custom op-package — pure-native HTP graphs (W8A16 matmuls on HMX).
Prompt ≤128 tokens (the batched-prefill graph is AR=128); generation extends to MAXCTX=512.
Precision: W8A16 decode/prefill + fp16 lm-head. W4A16 was tried, but the 1B model's output lost coherence (uniform 4-bit weights are too coarse at this size). W8A16 keeps the output coherent and still beats Genie.
Source: meta-llama/Llama-3.2-1B, compiled with QAIRT 2.45 for qualcomm-snapdragon-8-elite-for-galaxy.
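The prompt and context limits in the notes above imply a simple token budget. A minimal sketch, assuming prompt tokens count toward the 512-token context and that lengths are measured after tokenization (including any special tokens); `max_new_tokens` is a hypothetical helper, not part of QHexRT.

```python
MAX_PROMPT = 128   # AR of the batched-prefill graph
MAX_CTX = 512      # MAXCTX of the decode graph


def max_new_tokens(prompt_len, requested):
    """Clamp a requested generation length to what the graphs can hold."""
    if prompt_len < 1 or prompt_len > MAX_PROMPT:
        raise ValueError(f"prompt must be 1..{MAX_PROMPT} tokens, got {prompt_len}")
    # Assumption: the prompt occupies KV-cache slots, so it reduces the budget.
    return min(requested, MAX_CTX - prompt_len)


short_prompt_budget = max_new_tokens(6, 64)      # request fits as-is
full_prompt_budget = max_new_tokens(128, 1000)   # clamped to remaining context
```

Under these assumptions a full 128-token prompt leaves room for at most 384 generated tokens.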
