Llama-3.2-1B: QHexRT NPU bundle (Hexagon v79)

Precompiled Llama-3.2-1B for the QHexRT runtime on Qualcomm Hexagon v79 (Snapdragon 8 Elite / SM8750, e.g. Galaxy S25). Runs entirely on the NPU: the full 16-layer transformer and the lm-head execute on HTP/HMX via the llama_generate host-op (batched prefill + resident-KV decode); the host handles only the embedding lookup, llama3-scaled RoPE, the cache mask, and sampling. Validated on device with coherent output.
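As an illustration of the host-side RoPE step, here is a minimal sketch of llama3-style frequency scaling in plain Python. It is not taken from the QHexRT source. head_dim=64 comes from this card; the remaining defaults (theta 500000, factor 32, low/high frequency factors 1 and 4, original context 8192) are assumed from the published meta-llama/Llama-3.2-1B config, and the function names are hypothetical.

```python
import math

def llama3_scaled_inv_freq(head_dim=64, theta=500000.0, factor=32.0,
                           low_freq_factor=1.0, high_freq_factor=4.0,
                           original_max_pos=8192):
    """Return head_dim/2 inverse frequencies with llama3-style scaling.

    All defaults except head_dim are assumptions (see the lead-in).
    """
    # Standard RoPE inverse frequencies: theta^(-2i/d) for i in [0, d/2).
    inv_freq = [1.0 / (theta ** (2 * i / head_dim)) for i in range(head_dim // 2)]
    low_freq_wavelen = original_max_pos / low_freq_factor
    high_freq_wavelen = original_max_pos / high_freq_factor
    scaled = []
    for f in inv_freq:
        wavelen = 2 * math.pi / f
        if wavelen < high_freq_wavelen:
            # High-frequency band: left unchanged.
            scaled.append(f)
        elif wavelen > low_freq_wavelen:
            # Low-frequency band: divided by the scaling factor.
            scaled.append(f / factor)
        else:
            # Middle band: linear blend between scaled and unscaled.
            smooth = (original_max_pos / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor)
            scaled.append((1 - smooth) * f / factor + smooth * f)
    return scaled

def rope_cos_sin(position, inv_freq):
    """Cos/sin tables for one token position (one entry per frequency)."""
    angles = [position * f for f in inv_freq]
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]

inv_freq = llama3_scaled_inv_freq()
cos0, sin0 = rope_cos_sin(0, inv_freq)
```

The tables depend only on the token position, so a host could compute them once per position and feed them to the graphs as inputs; whether QHexRT does exactly this is not stated in the card.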

Beats Genie. On the same device, QHexRT decodes at ~16 tok/s vs Genie's 14.69 tok/s (fp16) and prefills at 1983 tok/s vs 1538 tok/s. The gain comes from running a prepared QNN context binary with an in-graph NPU lm-head (the host lm-head was the old bottleneck) and from W8A16 weights, which need half the memory bandwidth of fp16 weights.

Contents (v79/)

| file | what | size |
|---|---|---|
| llama-3.2-1b.json | QHexRT manifest (llm family, llama_generate plan) | - |
| llama_full_wqo_o3.bin | W8A16 batched prefill graph (AR=128) | 982 MB |
| llama_dec_wqo_o3.bin | W8A16 decode graph (MAXCTX=512, GQA 32q/8kv, head_dim 64) | 981 MB |
| llama_lmhead.bin | NPU lm-head graph: hidden[1,2048] → logits[1,128256], fp16 on HMX | 526 MB |
| llama_embed_f16.bin | token embedding table [128256,2048] f16 (tied; host lookup) | 525 MB |
| tokenizer.json | Llama-3 tokenizer (vocab 128256) | - |

≈ 3 GB on disk; ~2.5 GB peak device RSS.
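The listed sizes can be sanity-checked with a little arithmetic. The sketch below assumes "MB" means 10^6 bytes (under that reading the fp16 embedding table matches its listed size) and uses only the shapes and sizes given above.

```python
# Shapes from the contents list; fp16 uses 2 bytes per element.
VOCAB, HIDDEN = 128256, 2048
FP16_BYTES = 2

# Token embedding table [128256, 2048] in fp16.
embed_bytes = VOCAB * HIDDEN * FP16_BYTES   # 525,336,576 bytes
embed_mb = embed_bytes / 1e6                # about 525.3 MB (decimal MB assumed)

# Listed sizes of the four binary files, in MB.
sizes_mb = {
    "llama_full_wqo_o3.bin": 982,
    "llama_dec_wqo_o3.bin": 981,
    "llama_lmhead.bin": 526,
    "llama_embed_f16.bin": 525,
}
total_mb = sum(sizes_mb.values())           # 3014 MB, i.e. roughly 3 GB
```

The lm-head graph is about 1 MB larger than the embedding table, which is consistent with it holding the same tied [128256,2048] fp16 matrix plus a small amount of graph metadata.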

Run (QHexRT CLI)

hf download runanywhere/llama3_2_1b_HNPU --local-dir llama3_2_1b_HNPU
adb push llama3_2_1b_HNPU/v79 /data/local/tmp/wq/llama
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq/dsp;/data/local/tmp/wq;/vendor/dsp/cdsp'; \
  LD_LIBRARY_PATH=. ./qhx_generate llama/llama-3.2-1b.json libQnnHtp.so libQnnSystem.so llama 64 'The capital of France is'"
# -> "Paris. The Eiffel Tower is located in Paris. ..."  (decode ~16 tok/s, prefill ~1983 tok/s)

Performance (measured, v79 / S25)

| metric | QHexRT | Genie (fp16) |
|---|---|---|
| decode (tok/s) | ~16 | 14.69 |
| prefill (tok/s) | 1983 | 1538 |
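To put the numbers above in relative terms, the following sketch computes the speedup ratios and the implied wall-clock time for a full 128-token prompt and a 64-token generation. It uses only the measured figures from this card; the decode figure is reported as approximately 16 tok/s, so the decode ratio is approximate too.

```python
# Measured throughput from this card, in tokens per second.
QHEXRT = {"decode": 16.0, "prefill": 1983.0}   # decode is reported as ~16
GENIE = {"decode": 14.69, "prefill": 1538.0}

# Relative speedup of QHexRT over Genie for each phase.
decode_speedup = QHEXRT["decode"] / GENIE["decode"]      # about 1.09x
prefill_speedup = QHEXRT["prefill"] / GENIE["prefill"]   # about 1.29x

def phase_seconds(tokens, tok_per_s):
    """Time to process `tokens` at a steady `tok_per_s` rate."""
    return tokens / tok_per_s

# A full 128-token prompt followed by 64 generated tokens.
qhexrt_total = phase_seconds(128, QHEXRT["prefill"]) + phase_seconds(64, QHEXRT["decode"])
genie_total = phase_seconds(128, GENIE["prefill"]) + phase_seconds(64, GENIE["decode"])
```

For short prompts like this, decode dominates the total time (4.0 s of decode vs about 65 ms of prefill for QHexRT), so the end-to-end gain tracks the decode ratio more closely than the prefill ratio.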

Notes

  • Arch: v79 only; context binaries are dsp-arch-pinned.
  • No custom op-package: pure-native HTP graphs (W8A16 matmuls on HMX).
  • Prompt ≤128 tokens (the batched-prefill graph is AR=128); generation extends to MAXCTX=512.
  • Precision: W8A16 decode/prefill + fp16 lm-head. W4A16 was tried but collapses a 1B model's coherence (uniform 4-bit is too coarse here); W8A16 is the coherent sweet spot and still beats Genie.
  • Source: meta-llama/Llama-3.2-1B, compiled with QAIRT 2.45 for qualcomm-snapdragon-8-elite-for-galaxy.
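
The prompt and context limits in the notes above imply a simple token budget. The sketch below is a hypothetical helper, not part of the QHexRT CLI, and it assumes MAXCTX=512 counts prompt tokens and generated tokens together, which the card does not state explicitly.

```python
PREFILL_AR = 128   # batched-prefill graph width (from the notes)
MAX_CTX = 512      # decode graph context length (from the notes)

def plan_generation(prompt_len, requested_new_tokens):
    """Return how many new tokens fit, or raise if the prompt is too long.

    Assumes the KV cache holds prompt + generated tokens (an assumption).
    """
    if prompt_len < 1:
        raise ValueError("prompt must contain at least one token")
    if prompt_len > PREFILL_AR:
        raise ValueError(
            f"prompt has {prompt_len} tokens; the prefill graph accepts at most {PREFILL_AR}")
    budget = MAX_CTX - prompt_len
    return min(requested_new_tokens, budget)

# A 6-token prompt with 64 requested tokens fits without clamping.
new_tokens = plan_generation(6, 64)
```

Checking the prompt length on the host before pushing work to the device gives a clear error instead of a failed or truncated prefill.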