gemma4_e4b_HNPU — Gemma-4-E4B on the Qualcomm Hexagon NPU (v81)
Prebuilt QHexRT bundle that runs google/gemma-4-E4B on the
Hexagon v81 NPU (SM8850 / Snapdragon 8 Elite Gen-class, soc_model 87) — text, vision, and the audio
encoder — entirely on-device (no Python in the compute loop). Download → adb push → run.
Arch-pinned. These context binaries are finalized for v81 (soc_model 87). A v81 binary will not load on another Hexagon arch (the soc/arch are baked in). Each arch is its own vXX/ dir.
Modalities & status
| Modality | Status on v81 | Notes |
|---|---|---|
| Text (LLM) | ✅ Working, device-validated | greedy-exact vs HF; ~4.4 tok/s, W8A16 |
| Vision (image→caption) | ✅ Working, device-validated | encoder cos 0.999992, soft-token 0.999996 vs HF; image-grounded captions |
| Audio encoder (conformer) | ✅ Working, device-validated | 12-layer conformer cos 0.999919 vs HF |
| Audio transcription (speech→text) | ⚠️ Experimental / not functional on v81 | see the caveat below |
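The "cos" figures in the table read as cosine similarities between a device output tensor and the HF reference tensor. As a minimal sketch of that metric, assuming both tensors have been flattened to equal-length lists of floats (the function name is illustrative and not part of this bundle):

```python
import math


def cosine_similarity(device_out, reference):
    """Cosine similarity between two flattened tensors of equal length."""
    if len(device_out) != len(reference):
        raise ValueError("tensors must have the same number of elements")
    dot = sum(x * y for x, y in zip(device_out, reference))
    norm_dev = math.sqrt(sum(x * x for x in device_out))
    norm_ref = math.sqrt(sum(y * y for y in reference))
    if norm_dev == 0.0 or norm_ref == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero tensor")
    return dot / (norm_dev * norm_ref)
```

A value of 1.0 means the two tensors point in exactly the same direction. The metric ignores overall scale, so it is usually paired with an end-to-end check such as the greedy-exact comparison listed for text.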
⚠️ Audio transcription caveat (read before using the audio path)
The audio front end is correct: the encoder is cos 0.9999 and the soft tokens are cos 0.996 vs HF, and HF, when fed the
device's own soft tokens, transcribes the clip perfectly. But on v81 the decode emits numbers/dates, not the
transcript. Root cause (conclusively bisected): the decode MLP's (gelu(gate)·up) @ down_proj runs an f16
accumulation that drifts on this HTP; over 42 layers it flips the sensitive audio-conditioned greedy chain
(TEXT greedy is unambiguous so LLM/VLM are unaffected). The one clean fix — int16/int32-accumulation on only
the down_proj — compose-fails on the v81 HTP (mixed int16/float in one context is unsupported), and
int16-everything is worse than f16. This is a hardware/toolchain limit, not a port bug. Full diagnosis, the
complete what-was-tried table, and the ranked future-work to fix it:
AUDIO_FINDINGS.md.
The audio encoder + pipeline are included here as a validated, experimental base for that work.
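To make the f16-accumulation failure mode concrete, here is a small host-side sketch. It is not the HTP kernel and does not reproduce the device's actual rounding behaviour; it only shows, in pure Python, how a running sum kept in half precision can diverge from the same sum kept in a wider accumulator. The helper names are illustrative, and the rounding relies on the `e` (IEEE-754 half) format of the standard `struct` module.

```python
import struct


def to_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE-754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot_f16_accumulate(a, b):
    """Dot product where every product and the running sum are rounded to f16."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_f16(to_f16(x) * to_f16(y))
        acc = to_f16(acc + prod)  # the accumulator never holds more than f16 precision
    return acc


def dot_wide_accumulate(a, b):
    """Same f16 inputs, but the running sum is kept in a wide (float64) accumulator."""
    return sum(to_f16(x) * to_f16(y) for x, y in zip(a, b))
```

With `ones = [1.0] * 4096`, `dot_f16_accumulate(ones, ones)` returns 2048.0 while `dot_wide_accumulate(ones, ones)` returns 4096.0: once the f16 sum reaches 2048 the spacing between representable values is 2, so adding 1 rounds back to 2048. Real activations are far less extreme, but the same mechanism produces small per-layer errors that can compound across a deep stack.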
What's optimized
- W8A16 weight-only quant on the decode + lm-head (the v81 floor; W4 is HTP-blocked), fp16 vision/encoder.
- 3-part split decode [0,12) [12,24) [24,42) (each ~1.1–1.7 GB, under the HTP serialize budget), chained by the C++ host-op gemma4_split_generate with an f16 hidden hand-off.
- Cross-layer KV-sharing (shared layers 24–41 reuse a same-type donor's K/V; donors 22 sliding / 23 full), dual head_dim (sliding 256 / full 512) + partial-0.25 RoPE on full-attn layers, per-layer PLE, in-graph lm-head.
- Vision: 16-layer ViT (768px → 256 soft tokens) with host patch-embed/pos-embed/projector.
- Audio: 12-layer conformer encoder (max-abs RMSNorm + 2D per-head attention + dim-1-gather lightconv — three v81 device fixes; see the recipe).
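The layer bookkeeping described in the list above can be sketched as a few lookup helpers. This is an illustration in Python, not the C++ host-op. The constants come from the list; the function names are hypothetical, the attention type of each layer is taken as an input because the per-layer sliding/full pattern is not spelled out here, and reading "partial-0.25 RoPE" as "a quarter of head_dim is rotated" is an assumption.

```python
# Half-open layer ranges of the three decode parts, as listed above.
DECODE_PARTS = [(0, 12), (12, 24), (24, 42)]
NUM_LAYERS = 42
FIRST_SHARED_LAYER = 24                      # layers 24..41 reuse a donor's K/V
KV_DONOR = {"sliding": 22, "full": 23}       # same-type donor layers
HEAD_DIM = {"sliding": 256, "full": 512}
FULL_ATTN_ROPE_FRACTION = 0.25               # assumed: fraction of head_dim rotated


def part_for_layer(layer: int) -> int:
    """Index (0, 1 or 2) of the decode part that contains `layer`."""
    for index, (start, stop) in enumerate(DECODE_PARTS):
        if start <= layer < stop:
            return index
    raise ValueError(f"layer {layer} is outside [0, {NUM_LAYERS})")


def kv_source_layer(layer: int, attn_type: str) -> int:
    """Layer whose K/V cache `layer` reads: itself, or the same-type donor."""
    if attn_type not in KV_DONOR:
        raise ValueError(f"unknown attention type: {attn_type!r}")
    if not 0 <= layer < NUM_LAYERS:
        raise ValueError(f"layer {layer} is outside [0, {NUM_LAYERS})")
    return KV_DONOR[attn_type] if layer >= FIRST_SHARED_LAYER else layer


def full_attn_rotary_dims() -> int:
    """Rotated dimensions per head on full-attention layers (assumed reading)."""
    return int(HEAD_DIM["full"] * FULL_ATTN_ROPE_FRACTION)
```

Under this mapping both donors (22 and 23) fall in the second part and every shared layer falls in the third, so the donor K/V would have to be made available across that part boundary in addition to the f16 hidden state.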
Files (v81/)
| file | role |
|---|---|
| gemma-4-E4B.json / gemma-4-E4B-vlm.json / gemma-4-E4B-audio.json | QHexRT manifests (text / vision / audio) |
| gemma4e4b_decode_p{0,1,2}_w8.bin | W8A16 decode, 3 parts |
| gemma4e4b_lmh_f16.bin | tied lm-head (f16) |
| gemma4e4b_embed_f16.bin | token embedding table |
| gemma4e4b_ple_table_f16.bin / ple_proj_f16.bin / ple_norm_f16.bin | per-layer-input (PLE) tables |
| gemma4_vis_16blk_r2_f16.bin + g4v_patch_embed_f32.bin / g4v_pos_table_f32.bin / g4v_proj_w_f32.bin / g4v_rope_inv.bin | vision ViT graph + host weights |
| gemma4_audio_enc_f16.bin + audio_host/ + audio_fix/ | audio conformer encoder + host fixtures + staged encoder_features (experimental) |
| tokenizer.json | tokenizer |
Run (after one-time QAIRT runtime-lib staging — see QHexRT docs/DEPLOY.md)
```bash
hf download runanywhere/gemma4_e4b_HNPU --local-dir gemma4_e4b_HNPU
adb push gemma4_e4b_HNPU/v81 /data/local/tmp/wq/g4   # PowerShell + native paths on Windows
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq/dsp;/data/local/tmp/wq;/vendor/dsp/cdsp'; \
  LD_LIBRARY_PATH=. ./qhx_generate g4/gemma-4-E4B.json libQnnHtp.so libQnnSystem.so g4 40 'The capital of France is'"
# VLM: ./qhx_generate g4/gemma-4-E4B-vlm.json libQnnHtp.so libQnnSystem.so g4 60 'Describe this image.' g4/my.jpg
# Audio (EXPERIMENTAL, emits numbers, see caveat): ./qhx_generate g4/gemma-4-E4B-audio.json libQnnHtp.so libQnnSystem.so g4 48 $'\nTranscription: '
```
Tool arg order is invariant: <tool> <manifest> libQnnHtp.so libQnnSystem.so <artifacts_root> …. QNN runtime
libs come from the QAIRT SDK (lib/aarch64-android/) + the v81 HTP skel — not in this repo.
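If you script runs from a host machine, the invariant argument order can be captured in a small helper. The sketch below is a hypothetical convenience wrapper, not something shipped in this repo. It only builds command strings; the names `max_tokens` and `image` are a reading of the example invocations (40 / 60 / 48 and the trailing image path) and not documented parameter names.

```python
import shlex


def qhx_generate_argv(manifest, artifacts_root, max_tokens, prompt,
                      image=None, tool="./qhx_generate"):
    """Argument vector in the order: tool, manifest, the two QNN libs, artifacts root, rest."""
    argv = [tool, manifest, "libQnnHtp.so", "libQnnSystem.so",
            artifacts_root, str(max_tokens), prompt]
    if image is not None:
        argv.append(image)  # the VLM example passes the image path last
    return argv


def device_command(argv, workdir="/data/local/tmp/wq"):
    """Shell line to hand to `adb shell`, mirroring the environment set up in the Run example."""
    adsp_path = f"{workdir}/dsp;{workdir};/vendor/dsp/cdsp"
    return (
        f"cd {shlex.quote(workdir)} && "
        f"export ADSP_LIBRARY_PATH={shlex.quote(adsp_path)}; "
        f"LD_LIBRARY_PATH=. {shlex.join(argv)}"
    )
```

The resulting string can be passed as a single argument to `adb shell` (for example with `subprocess.run(["adb", "shell", cmd])`). Quoting is delegated to `shlex` so prompts containing spaces survive the trip through the device shell.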
Caveats
- Base model (not -it): use the completion style shown; chat templating degenerates on the base.
- Audio transcription is experimental/blocked on v81 (above). Raw-wav input also needs a host mel+subsample frontend (not yet shipped); audio_fix/ stages encoder_features for one test clip.
- Built + validated with QAIRT 2.45/2.47, serial 8977b1dd (v81). Conversion SSOT: recipes/gemma-4-e4b/.