Llama-3.2-1B: QHexRT NPU bundle (Hexagon v79)
Precompiled Llama-3.2-1B for the QHexRT runtime on Qualcomm Hexagon v79 (Snapdragon 8 Elite /
SM8750, e.g. Galaxy S25). Runs entirely on the NPU: the full 16-layer transformer and the lm-head
execute on HTP/HMX via the llama_generate host-op (batched prefill + resident-KV decode); the host does
only embed lookup, llama3-scaled RoPE, the cache mask, and sampling. Device-validated, coherent output.
Beats Genie. On the same device, QHexRT decode is ~16 tok/s vs Genie's 14.69 (fp16), and prefill is 1983 tok/s vs 1538. The gain comes from running a prepared QNN context binary with an in-graph NPU lm-head (the host lm-head was the old bottleneck) and from W8A16 weights, which need half the memory bandwidth of fp16 weights.
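The host-side "llama3-scaled RoPE" step can be sketched as follows. This is a minimal illustration, not the QHexRT implementation (which is not shown here): the scaling constants (`theta=500000`, `factor=32`, `low_freq_factor=1`, `high_freq_factor=4`, `original_max_pos=8192`) are assumed from the base model's published config and should be checked against its `config.json`; the function names are hypothetical.

```python
import math

def llama3_scaled_inv_freq(head_dim=64, theta=500000.0, factor=32.0,
                           low_freq_factor=1.0, high_freq_factor=4.0,
                           original_max_pos=8192):
    """Per-pair RoPE inverse frequencies with Llama-3-style scaling.

    All default values are assumptions taken from the base model config.
    """
    low_freq_wavelen = original_max_pos / low_freq_factor
    high_freq_wavelen = original_max_pos / high_freq_factor
    scaled = []
    for i in range(0, head_dim, 2):
        inv_freq = 1.0 / (theta ** (i / head_dim))
        wavelen = 2.0 * math.pi / inv_freq
        if wavelen < high_freq_wavelen:
            # High-frequency band: left unchanged.
            scaled.append(inv_freq)
        elif wavelen > low_freq_wavelen:
            # Low-frequency band: divided by the full scaling factor.
            scaled.append(inv_freq / factor)
        else:
            # Middle band: linear blend between scaled and unscaled.
            smooth = (original_max_pos / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor)
            scaled.append((1.0 - smooth) * inv_freq / factor + smooth * inv_freq)
    return scaled

def rope_cos_sin(position, inv_freq):
    """cos/sin tables for one token position (one entry per rotated pair)."""
    angles = [position * f for f in inv_freq]
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]
```

With `head_dim` 64 this yields 32 frequencies per head. The cos/sin values for each position would then be passed to the NPU graphs alongside the embedded tokens.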
Contents (v79/)
| file | what | size |
|---|---|---|
| llama-3.2-1b.json | QHexRT manifest (llm family, llama_generate plan) | - |
| llama_full_wqo_o3.bin | W8A16 batched prefill graph (AR=128) | 982 MB |
| llama_dec_wqo_o3.bin | W8A16 decode graph (MAXCTX=512, GQA 32q/8kv, head_dim 64) | 981 MB |
| llama_lmhead.bin | NPU lm-head graph: hidden[1,2048] → logits[1,128256], fp16 on HMX | 526 MB |
| llama_embed_f16.bin | token embedding table [128256,2048] f16 (tied; host lookup) | 525 MB |
| tokenizer.json | Llama-3 tokenizer (vocab 128256) | - |
~3 GB on disk; ~2.5 GB peak device RSS.
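As a sanity check on the sizes above, the weight counts can be recomputed from the shapes. This is a rough sketch: the hidden size, layer count, vocab and GQA shape come from this card, while `INTERMEDIATE = 8192` is an assumption from the base model's published config, and norms and graph metadata are ignored.

```python
# Shapes stated in this card.
HIDDEN, LAYERS, VOCAB = 2048, 16, 128256
KV_HEADS, HEAD_DIM = 8, 64
# Assumption: MLP width from the base model config, not stated in this card.
INTERMEDIATE = 8192

def layer_weights():
    """Matmul weights in one transformer layer (norms ignored)."""
    # q and o projections are HIDDEN x HIDDEN; k and v are HIDDEN x (8 * 64).
    attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_HEADS * HEAD_DIM
    # gate, up and down projections.
    mlp = 3 * HIDDEN * INTERMEDIATE
    return attn + mlp

def size_mb(n_weights, bytes_per_weight):
    """Size in decimal megabytes."""
    return n_weights * bytes_per_weight / 1e6

transformer_w8_mb = size_mb(LAYERS * layer_weights(), 1)  # 8-bit weights
embed_f16_mb = size_mb(VOCAB * HIDDEN, 2)                 # fp16 table

print(f"transformer (W8): {transformer_w8_mb:.0f} MB")
print(f"embedding (f16):  {embed_f16_mb:.0f} MB")
```

This gives roughly 973 MB for the 8-bit transformer weights and 525 MB for the fp16 embedding table, which lines up with the 981/982 MB graph files and the 525 MB embedding file. The few-MB remainder in the graph files is presumably non-matmul tensors and graph metadata.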
Run (QHexRT CLI)
hf download runanywhere/llama3_2_1b_HNPU --local-dir llama3_2_1b_HNPU
adb push llama3_2_1b_HNPU/v79 /data/local/tmp/wq/llama
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq/dsp;/data/local/tmp/wq;/vendor/dsp/cdsp'; \
LD_LIBRARY_PATH=. ./qhx_generate llama/llama-3.2-1b.json libQnnHtp.so libQnnSystem.so llama 64 'The capital of France is'"
# -> "Paris. The Eiffel Tower is located in Paris. ..." (decode ~16 tok/s, prefill ~1983 tok/s)
Performance (measured, v79 / S25)
| metric | QHexRT | Genie (fp16) |
|---|---|---|
| decode | ~16 tok/s | 14.69 |
| prefill | 1983 tok/s | 1538 |
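To put the table in relative terms, the short sketch below derives the speedup ratios and a rough end-to-end latency estimate from the measured numbers. It assumes throughput stays constant over the run and ignores model load time; since the prefill graph is fixed at AR=128, shorter prompts may not prefill proportionally faster. The function names are illustrative.

```python
# Measured throughput from the table above, in tok/s.
QHEXRT = {"decode": 16.0, "prefill": 1983.0}
GENIE = {"decode": 14.69, "prefill": 1538.0}

def speedup(metric):
    """QHexRT throughput relative to Genie for 'decode' or 'prefill'."""
    return QHEXRT[metric] / GENIE[metric]

def estimated_latency_s(prompt_tokens, new_tokens):
    """Rough wall time: prefill the prompt, then decode token by token."""
    return prompt_tokens / QHEXRT["prefill"] + new_tokens / QHEXRT["decode"]

print(f"decode speedup:  {speedup('decode'):.2f}x")
print(f"prefill speedup: {speedup('prefill'):.2f}x")
print(f"128-token prompt + 64 new tokens: {estimated_latency_s(128, 64):.2f} s")
```

The ratios come out to about 1.09x for decode and 1.29x for prefill, and decode dominates the total time for typical generation lengths.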
Notes
- Arch: v79 only; context binaries are DSP-arch-pinned.
- No custom op-package: pure-native HTP graphs (W8A16 matmuls on HMX).
- Prompt ≤128 tokens (the batched-prefill graph is AR=128); generation extends to MAXCTX=512.
- Precision: W8A16 decode/prefill + fp16 lm-head. W4A16 was tried but collapses a 1B model's coherence (uniform 4-bit is too coarse here); W8A16 is the coherent sweet spot and still beats Genie.
- Source: meta-llama/Llama-3.2-1B, compiled with QAIRT 2.45 for qualcomm-snapdragon-8-elite-for-galaxy.
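The prompt and context limits in the notes translate into a simple host-side budget, sketched below together with an illustrative cache mask for decode. Both are assumptions for illustration: the card does not say that prompt and generated tokens share the 512-slot cache, nor does it describe the mask layout or fill value that QHexRT actually uses.

```python
AR = 128        # prefill graph width (max prompt tokens), from the notes
MAXCTX = 512    # KV-cache slots in the decode graph, from the notes
MASK_NEG = -65504.0  # assumed fill: most negative finite fp16 value

def max_new_tokens(prompt_len):
    """Tokens that can still be generated, assuming prompt and output
    share the MAXCTX cache slots."""
    if not 1 <= prompt_len <= AR:
        raise ValueError(f"prompt must be 1..{AR} tokens, got {prompt_len}")
    return MAXCTX - prompt_len

def decode_cache_mask(filled):
    """Additive attention mask over all cache slots: 0.0 for the first
    `filled` (valid) slots, MASK_NEG for the unused ones."""
    if not 0 < filled <= MAXCTX:
        raise ValueError(f"filled must be 1..{MAXCTX}, got {filled}")
    return [0.0 if slot < filled else MASK_NEG for slot in range(MAXCTX)]

print(max_new_tokens(128))             # budget left after a full-width prompt
print(decode_cache_mask(6)[:8])        # first slots after 6 cached tokens
```

A fixed-length mask like this is what lets a statically shaped decode graph attend over a cache that is only partly filled.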