Moonshine-tiny → Qualcomm Hexagon NPU (QHexRT)

Prebuilt QNN context binaries for running UsefulSensors/moonshine-tiny speech-to-text on-device on the Qualcomm Hexagon NPU via the QHexRT runtime. Arch: v81 (SM8850 / Snapdragon 8 Elite Gen 5). Device-validated at WER = 0 vs HF.

A context binary is arch-pinned: the dsp_arch and soc_model are baked in, so these v81/ binaries won't load on another Hexagon arch. Other arches need their own re-converted binaries, which go in sibling <arch>/ directories in this same repo.

How it runs

Moonshine is an encoder–decoder ASR model that operates directly on the raw 16 kHz waveform (no mel spectrogram). The pipeline:

1. Host: runs the 3-conv raw-audio stem (conv1 k127/s64 + tanh → GroupNorm → conv2 k7/s3 + gelu → conv3 k3/s2 + gelu). The stem is HTP-hostile, so it stays on the CPU; the encoder graph starts at the post-conv features.
2. NPU encoder (bidirectional, partial interleaved RoPE) → per-decoder-layer cross-attention K/V.
3. NPU decoder (one autoregressive step: causal self-attn + RoPE in-graph, cross-attn to the cached encoder states, gated SwiGLU; tied lm-head) → tokens, detokenized on the host.
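As a rough check on how the stem's strides relate to the fixed graph size, the sketch below computes how many feature frames the three convolutions emit for a given number of samples. It assumes unpadded ("valid") convolutions, which this card does not state explicitly; the helper names are illustrative only. Under that assumption, exactly 10 s of 16 kHz audio yields 415 frames, which agrees with n_audio = 415.

```python
def conv_out_len(length: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded (valid) 1-D convolution."""
    if length < kernel:
        return 0
    return (length - kernel) // stride + 1


# (kernel, stride) of the three stem convolutions listed in step 1.
# Assumption: no padding on any of them.
STEM = [(127, 64), (7, 3), (3, 2)]


def stem_frames(n_samples: int) -> int:
    """Number of feature frames the stem emits for n_samples of audio."""
    n = n_samples
    for kernel, stride in STEM:
        n = conv_out_len(n, kernel, stride)
    return n


frames_10s = stem_frames(10 * 16_000)
print(frames_10s)  # 415
```

With the same assumption, a 5.9 s clip (94,400 samples) produces 244 frames, so a little over 40% of the window would be padding.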

Variable-length audio runs on a fixed-shape graph (n_audio = 415, a ~10 s window): the host pads or truncates the features to that length and masks the padding (encoder_mask / cross_mask). Both the encoder and the decoder run in fp16.
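The pad/truncate-and-mask step can be sketched as follows. This is a minimal illustration, not the QHexRT host-op: the function name, the zero padding value, and the mask convention (1 for a real frame, 0 for padding) are all assumptions, and the actual graphs may expect a different encoding such as an additive mask.

```python
from typing import List, Tuple

N_AUDIO = 415  # fixed encoder length stated in this card


def pad_and_mask(
    features: List[List[float]], n_audio: int = N_AUDIO
) -> Tuple[List[List[float]], List[int]]:
    """Pad or truncate [frames][dim] features to n_audio rows.

    Returns the fixed-length features and a validity mask
    (assumed convention: 1 = real frame, 0 = padding).
    """
    if not features:
        raise ValueError("need at least one feature frame")
    dim = len(features[0])
    kept = features[:n_audio]            # truncate clips longer than the window
    n_valid = len(kept)
    n_pad = n_audio - n_valid
    padded = [list(row) for row in kept] + [[0.0] * dim for _ in range(n_pad)]
    mask = [1] * n_valid + [0] * n_pad
    return padded, mask


# Example: 244 frames of a hypothetical 4-dimensional feature.
feats = [[0.5] * 4 for _ in range(244)]
padded, mask = pad_and_mask(feats)
print(len(padded), sum(mask))  # 415 244
```

The same mask would plausibly serve both roles named above: hiding padded positions inside the encoder and hiding them from the decoder's cross-attention.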

Files (`v81/`)

| File | Role | Size |
|---|---|---|
| `moonshine-tiny.json` | QHexRT manifest (declarative run plan; `moonshine_transcribe` host-op) | ~1 KB |
| `moonshinetiny_enc_f16.bin` | encoder context binary | 17.5 MB |
| `moonshinetiny_dec_f16.bin` | decoder context binary | 56.5 MB |
| `moonshine_conv_stem.bin` | host raw-audio conv-stem weights `[c1w,c2w,c2b,c3w,c3b,gnw,gnb]` | 6.8 MB |
| `tokenizer.json` | SentencePiece-style BPE (byte_fallback; metaspace detok) | 3.8 MB |

The QNN runtime libs (libQnnHtp.so / libQnnSystem.so + the v81 HTP skel) come from the QAIRT SDK, not this repo. The qhx_asr tool comes from a QHexRT build.

Run

# Download the repo; the v81/ directory holds the files listed above.
hf download runanywhere/moonshine_tiny_HNPU --local-dir moonshine_tiny_HNPU
# Windows: adb push from PowerShell with native paths.
adb push moonshine_tiny_HNPU/v81 /data/local/tmp/wq/moonshine
adb push my_audio_16k_mono.wav /data/local/tmp/wq/moonshine/
# Assumes qhx_asr and the QAIRT runtime libs are already on the device under /data/local/tmp/wq.
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq/dsp;/data/local/tmp/wq;/vendor/dsp/cdsp'; \
  LD_LIBRARY_PATH=. ./qhx_asr moonshine/moonshine-tiny.json libQnnHtp.so libQnnSystem.so moonshine moonshine/my_audio_16k_mono.wav"

The argument order is fixed: qhx_asr <manifest> libQnnHtp.so libQnnSystem.so <artifacts_root> <audio16k.wav>. Input audio must be 16 kHz mono (PCM16 or float32).
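Before pushing a clip to the device, it can help to confirm on the host that it meets the input requirement. The sketch below uses Python's standard `wave` module; the function name is illustrative. It only covers the PCM16 case, because the stdlib reader handles integer PCM and will reject a float32 WAV even though the tool accepts one.

```python
import wave


def check_wav(path: str) -> float:
    """Return the clip duration in seconds.

    Raises ValueError if the file is not 16 kHz, mono, 16-bit PCM.
    """
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        channels = w.getnchannels()
        width = w.getsampwidth()   # bytes per sample
        frames = w.getnframes()
    if rate != 16_000:
        raise ValueError(f"expected 16000 Hz, got {rate} Hz")
    if channels != 1:
        raise ValueError(f"expected mono, got {channels} channels")
    if width != 2:
        raise ValueError(f"expected 16-bit samples, got {8 * width}-bit")
    return frames / rate
```

A call such as `check_wav("my_audio_16k_mono.wav")` returns the duration, which also shows whether the clip fits the ~10 s window.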

Measured (v81, SM8850, soc_model 87, QAIRT 2.47)

| Metric | Value |
|---|---|
| Parity | WER = 0.0000 vs HF `UsefulSensors/moonshine-tiny` (LibriSpeech sample) |
| Latency | 26 tokens in 551 ms for a 5.9 s clip |
| Precision | fp16 encoder + decoder |

Example: a 5.9 s clip → "Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel." (matches HF exactly).
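The reported latency can be turned into two derived figures: an average time per emitted token and a real-time factor. The arithmetic below uses only the numbers reported above. The card does not say whether the 551 ms includes the host conv stem and the encoder pass, so the per-token value is an amortized average rather than a pure decoder step time.

```python
tokens = 26      # tokens emitted for the clip
wall_ms = 551.0  # reported wall time
clip_s = 5.9     # clip duration

ms_per_token = wall_ms / tokens      # amortized over all emitted tokens
rtf = (wall_ms / 1000.0) / clip_s    # below 1.0 means faster than real time

print(f"{ms_per_token:.1f} ms/token, RTF {rtf:.3f}")
# 21.2 ms/token, RTF 0.093
```

An RTF of about 0.093 corresponds to transcribing roughly 10.7 times faster than real time for this clip.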

Caveats

Only v81 binaries are provided here (context binaries are arch-pinned). Weights are fp16. The audio window is ~10 s (n_audio = 415); longer clips are truncated to the window.
Parity was checked with greedy decoding (temperature 0) against the HF reference, with WER measured on a standard LibriSpeech sample.
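For readers who want to repeat the parity check on their own clips, word error rate can be computed as the word-level edit distance divided by the reference length. The sketch below is a generic implementation, not the script used for the numbers in this card. It splits on whitespace and applies no text normalization (case, punctuation), which is an assumption, since the card does not describe how transcripts were normalized.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far
    # (before the current word) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            ))
        prev = cur
    return prev[-1] / len(ref)


hf_text = ("Mr. Quilter is the apostle of the middle classes, "
           "and we are glad to welcome his gospel.")
print(wer(hf_text, hf_text))  # 0.0
```

Passing the HF transcript as the reference and the on-device transcript as the hypothesis gives 0.0 only when the two word sequences are identical.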

Converted + device-validated with the QHexRT forge pipeline (recipes/moonshine-tiny).
