LFM2.5-230M — Hexagon v81 NPU (QHexRT)
LiquidAI/LFM2.5-230M running fully on the Qualcomm Hexagon v81 NPU (Snapdragon 8 Elite Gen 5 / SM8850) via QHexRT, RunAnywhere's inference engine for Qualcomm NPUs. 100% on the HTP. No Python in the hot path. W8 weight-only, GQA-native decode, batched prefill, on-NPU lm-head.
QHexRT is the first engine built to run LLM, VLM, STT, TTS, and embedding models fully on Qualcomm Hexagon NPUs. LFM2.5-230M is the first model in the catalog.
Why the NPU — measured on SM8850 (vs llama.cpp CPU, same device)
| metric | Hexagon v81 NPU | CPU (llama.cpp Q8_0) | NPU advantage |
|---|---|---|---|
| Prefill | 12,540 tok/s | 871 tok/s | ~14× faster |
| Time-to-first-token (512-token prompt) | ~36 ms (flat) | 588 ms | ~16× lower |
| End-to-end (512-token prompt + 128 new) | 0.77 s | 1.13 s | ~1.5× faster |
Batched prefill is O(1) in prompt length: TTFT holds flat at ~36 ms regardless of how long the prompt is, whereas CPU TTFT grows with it (588 ms at 512 tokens). The NPU therefore pulls further ahead as the context grows, at far lower power than driving 8 CPU cores at max clock.
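The scaling claim can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes CPU TTFT grows linearly with prompt length at the measured 871 tok/s (which reproduces the 588 ms figure at 512 tokens) and that NPU TTFT stays at ~36 ms. The function names are illustrative, and every prompt length other than 512 is a projection, not a measurement.

```shell
#!/bin/sh
# Back-of-envelope TTFT comparison from the table above.
# Assumptions (not measurements): CPU TTFT = prompt_len / 871 tok/s,
# NPU TTFT flat at 36 ms. Only the 512-token point was measured.

CPU_PREFILL_TOKS=871   # llama.cpp Q8_0 prefill rate from the table
NPU_TTFT_MS=36         # flat NPU TTFT from the table

# cpu_ttft_ms PROMPT_LEN -> projected CPU TTFT in whole milliseconds
cpu_ttft_ms() {
  LC_ALL=C awk -v n="$1" -v r="$CPU_PREFILL_TOKS" \
    'BEGIN { printf "%.0f\n", n / r * 1000 }'
}

# ttft_ratio PROMPT_LEN -> projected CPU/NPU TTFT ratio, one decimal place
ttft_ratio() {
  LC_ALL=C awk -v n="$1" -v r="$CPU_PREFILL_TOKS" -v t="$NPU_TTFT_MS" \
    'BEGIN { printf "%.1f\n", (n / r * 1000) / t }'
}

for n in 128 256 512; do
  echo "prompt=$n cpu_ttft_ms=$(cpu_ttft_ms "$n") ratio=$(ttft_ratio "$n")x"
done
```

Under these assumptions the projected gap is about 4.1× at 128 tokens, 8.2× at 256, and 16.3× at 512, which matches the ~16× row in the table.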
Full launch write-up: runanywhere.ai/blog · article draft in this repo
Run
hf download runanywhere/lfm2_5_230m_HNPU --local-dir lfm2_5_230m_HNPU
adb push lfm2_5_230m_HNPU/v81 /data/local/tmp/lfm230 # PowerShell + native paths on Windows
adb shell "cd /data/local/tmp/lfm230 && LD_LIBRARY_PATH=. \
./qhx_generate lfm2-5-230m.json libQnnHtp.so libQnnSystem.so . 64 'The capital of France is'"
Stage the QAIRT v81 runtime libraries (libQnnHtp.so, libQnnSystem.so, libQnnHtpV81Skel.so, libQnnHtpV81Stub.so) and the qhx_generate tool in the same directory (from the QAIRT SDK). Context binaries are arch-pinned to v81.
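A missing library usually only shows up as a load failure on the device, so it can help to check the staging directory locally before pushing. The following is a minimal sketch, not part of QHexRT or the QAIRT SDK: the function names are hypothetical, and the file list is an assumption read off this README, with "Skel.so/Stub.so" expanded to the two V81 library names.

```shell
#!/bin/sh
# Pre-flight check before `adb push`: confirm a local staging directory
# holds the runtime pieces this README names. The list is an assumption
# taken from the README text, not from an official manifest.

REQUIRED_FILES="libQnnHtp.so libQnnSystem.so libQnnHtpV81Skel.so libQnnHtpV81Stub.so qhx_generate"

# missing_files DIR -> prints each required file absent from DIR, one per line
missing_files() {
  for f in $REQUIRED_FILES; do
    [ -e "$1/$f" ] || echo "$f"
  done
}

# check_staging DIR -> returns 0 if complete, 1 (with a report on stderr) otherwise
check_staging() {
  missing=$(missing_files "$1")
  if [ -n "$missing" ]; then
    echo "missing from $1:" >&2
    echo "$missing" >&2
    return 1
  fi
  echo "staging dir $1 is complete"
}

# Example: check the directory given as the first argument, if any.
if [ -n "${1:-}" ]; then
  check_staging "$1"
fi
```

Saved under a name of your choosing, it could be run as `sh check_staging.sh lfm2_5_230m_HNPU/v81` before the `adb push` step; a non-zero exit status means at least one file still needs to be copied in.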
Contact san@runanywhere.ai for QHexRT deployment access.
v81/
lfm2-5-230m.json (manifest) · lfm230_dec_512_w8.bin (decode) · lfm230_pf_512_w8.bin (prefill) ·
lfm230_lmh_w8.bin (lm-head) · lfm_embed_f16.bin (embeddings) · tokenizer.json