Moonshine-tiny → Qualcomm Hexagon NPU (QHexRT)

Prebuilt QNN context binaries for running UsefulSensors/moonshine-tiny speech-to-text on-device on the Qualcomm Hexagon NPU via the QHexRT runtime. Arch: v81 (SM8850 / Snapdragon 8 Elite Gen 5). Device-validated at WER = 0 vs HF.

A context binary is arch-pinned (the dsp_arch + soc_model are baked in), so these v81/ bins won't load on another Hexagon arch. Other arches need a re-conversion and are added to this same repo as sibling <arch>/ dirs.

How it runs

Moonshine is an encoder-decoder ASR model that operates on the raw 16 kHz waveform (no mel spectrogram). The pipeline:

  • Host runs the 3-conv raw-audio stem (conv1 k127/s64 + tanh → GroupNorm → conv2 k7/s3 + gelu → conv3 k3/s2 + gelu). The conv is HTP-hostile so it stays on the CPU; the encoder graph starts at the post-conv features.
  • NPU encoder (bidirectional, partial interleaved RoPE) → per-decoder-layer cross-attention K/V.
  • NPU decoder (one autoregressive step: causal self-attn + RoPE in-graph, cross-attn to the cached encoder states, gated SwiGLU; tied lm-head) → tokens, detokenized on the host.
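
The host-side loop around the single-step decoder can be sketched as below. This is a minimal sketch under stated assumptions, not QHexRT's code: `step_fn`, `bos_id`, `eos_id`, and `max_tokens` are hypothetical names, and `step_fn` stands in for one NPU decoder invocation that returns next-token logits. A real implementation would feed only the newest token and reuse cached self-attention state rather than pass the whole prefix.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    step_fn: Callable[[List[int]], Sequence[float]],
    bos_id: int,
    eos_id: int,
    max_tokens: int,
) -> List[int]:
    """Greedy (temperature 0) decoding around a one-step decoder.

    step_fn is a hypothetical stand-in for one NPU decoder call: it takes
    the tokens generated so far and returns logits for the next token.
    """
    tokens = [bos_id]
    for _ in range(max_tokens):
        logits = step_fn(tokens)
        # Greedy choice: index of the largest logit.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the start token before detokenizing


def toy_step(tokens: List[int]) -> List[float]:
    """Toy decoder that emits tokens 3, 4, then end-of-sequence (id 2)."""
    script = [3, 4, 2]
    target = script[min(len(tokens) - 1, len(script) - 1)]
    return [1.0 if i == target else 0.0 for i in range(5)]


decoded = greedy_decode(toy_step, bos_id=1, eos_id=2, max_tokens=10)
print(decoded)  # [3, 4]
```

The `max_tokens` cap matters on a fixed graph: it bounds the loop even if the end-of-sequence token never appears.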

Variable-length audio is handled on a fixed graph (n_audio = 415, ~10 s window): the host pads or truncates the features to that length and masks the padding (encoder_mask / cross_mask). Precision is fp16 for both encoder and decoder.
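
The window arithmetic and the pad/mask step can be illustrated with a short sketch. It assumes the three stem convolutions are unpadded ("valid"), which is consistent with 10 s of 16 kHz audio giving exactly 415 frames. The 1/0 mask encoding and the helper names are assumptions for illustration; the real encoder_mask / cross_mask tensors may use a different encoding (for example additive masks).

```python
from typing import List, Tuple

N_AUDIO = 415  # fixed graph length stated above

# (kernel, stride) of the three stem convolutions, as listed in the pipeline.
STEM = [(127, 64), (7, 3), (3, 2)]


def stem_frames(num_samples: int) -> int:
    """Frames after the stem, assuming unpadded ("valid") convolutions."""
    n = num_samples
    for kernel, stride in STEM:
        if n < kernel:
            return 0
        n = (n - kernel) // stride + 1
    return n


def pad_or_truncate(
    feats: List[List[float]], n_audio: int = N_AUDIO
) -> Tuple[List[List[float]], List[int]]:
    """Fit features to n_audio frames; mask is 1 for real frames, 0 for padding."""
    dim = len(feats[0]) if feats else 0
    kept = feats[:n_audio]
    pad = [[0.0] * dim for _ in range(n_audio - len(kept))]
    mask = [1] * len(kept) + [0] * len(pad)
    return kept + pad, mask


print(stem_frames(160_000))  # 10 s at 16 kHz -> 415
print(stem_frames(94_400))   # 5.9 s at 16 kHz -> 244

# Hypothetical 2-dim features for a 5.9 s clip, padded to the fixed window.
feats = [[0.5, -0.5] for _ in range(stem_frames(94_400))]
fixed, mask = pad_or_truncate(feats)
print(len(fixed), sum(mask))  # 415 244
```

Clips longer than the window are cut at frame 415 in this sketch, matching the truncation behaviour described in the caveats.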

Files (v81/)

| file | role | size |
| --- | --- | --- |
| moonshine-tiny.json | QHexRT manifest (declarative run plan; moonshine_transcribe host-op) | ~1 KB |
| moonshinetiny_enc_f16.bin | encoder context binary | 17.5 MB |
| moonshinetiny_dec_f16.bin | decoder context binary | 56.5 MB |
| moonshine_conv_stem.bin | host raw-audio conv-stem weights [c1w,c2w,c2b,c3w,c3b,gnw,gnb] | 6.8 MB |
| tokenizer.json | SentencePiece-style BPE (byte_fallback; metaspace detok) | 3.8 MB |

The QNN runtime libs (libQnnHtp.so / libQnnSystem.so + the v81 HTP skel) come from the QAIRT SDK, not this repo. The qhx_asr tool comes from a QHexRT build.

Run

hf download runanywhere/moonshine_tiny_HNPU --local-dir moonshine_tiny_HNPU
# Windows: adb push from PowerShell with native paths.
adb push moonshine_tiny_HNPU/v81 /data/local/tmp/wq/moonshine
adb push my_audio_16k_mono.wav /data/local/tmp/wq/moonshine/
adb shell "cd /data/local/tmp/wq && export ADSP_LIBRARY_PATH='/data/local/tmp/wq/dsp;/data/local/tmp/wq;/vendor/dsp/cdsp'; \
  LD_LIBRARY_PATH=. ./qhx_asr moonshine/moonshine-tiny.json libQnnHtp.so libQnnSystem.so moonshine moonshine/my_audio_16k_mono.wav"

Tool arg order is invariant: qhx_asr <manifest> libQnnHtp.so libQnnSystem.so <artifacts_root> <audio16k.wav>. Input audio must be 16 kHz mono (PCM16 or float32).
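
Before pushing a file to the device, the input format can be checked on the host. The sketch below uses Python's standard `wave` module and covers only the PCM16 case; `wave` reads PCM data only, so a float32 WAV would need a different reader. The function name is hypothetical and is not part of QHexRT.

```python
import os
import tempfile
import wave


def check_wav_16k_mono_pcm16(path: str) -> float:
    """Return the duration in seconds, or raise ValueError if the format is off."""
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != 16000:
            raise ValueError(f"expected 16000 Hz, got {wf.getframerate()}")
        if wf.getnchannels() != 1:
            raise ValueError(f"expected mono, got {wf.getnchannels()} channels")
        if wf.getsampwidth() != 2:
            raise ValueError(f"expected 16-bit PCM, got {8 * wf.getsampwidth()}-bit")
        return wf.getnframes() / wf.getframerate()


# Self-check: write one second of 16 kHz mono silence and validate it.
tmp_path = os.path.join(tempfile.mkdtemp(), "silence_16k_mono.wav")
with wave.open(tmp_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000)

duration = check_wav_16k_mono_pcm16(tmp_path)
print(duration)  # 1.0
```

The returned duration also shows whether a clip exceeds the ~10 s window and would be truncated.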

Measured (v81, SM8850, soc_model 87, QAIRT 2.47)

| metric | value |
| --- | --- |
| Parity | WER = 0.0000 vs HF UsefulSensors/moonshine-tiny (LibriSpeech sample) |
| Latency | 26 tokens in 551 ms for a 5.9 s clip |
| Precision | fp16 encoder + decoder |

Example: a 5.9 s clip → "Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel." (matches HF exactly).
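
The latency figure can be turned into a real-time factor with simple arithmetic. The numbers below are the ones reported above; which stages the 551 ms covers (host stem, encoder, decode loop) is not stated, so the per-token value is only an upper bound on the average decoder step time.

```python
clip_seconds = 5.9   # clip length from the table above
total_ms = 551.0     # reported time for the clip
num_tokens = 26      # reported token count

real_time_factor = (total_ms / 1000.0) / clip_seconds
speedup = 1.0 / real_time_factor
ms_per_token = total_ms / num_tokens

print(f"real-time factor: {real_time_factor:.3f}")             # 0.093
print(f"speed vs real time: {speedup:.1f}x")                   # 10.7x
print(f"upper bound per token: {ms_per_token:.1f} ms")         # 21.2 ms
```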

Caveats

  • v81 only here (arch-pinned). fp16 weights. Audio window ~10 s (n_audio = 415); longer clips are truncated to the window.
  • Parity is greedy (temperature 0) vs the HF reference. WER measured on a standard LibriSpeech sample.
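
For reference, word error rate is the word-level edit distance between the reference and hypothesis transcripts, divided by the number of reference words. The sketch below is a generic implementation, not the script used for the measurement above; lowercasing and whitespace splitting are assumed normalization choices.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


sentence = (
    "Mr. Quilter is the apostle of the middle classes, "
    "and we are glad to welcome his gospel."
)
print(wer(sentence, sentence))        # 0.0
print(wer("a b c d", "a x c d"))      # 0.25
```

A WER of 0 under this definition means the two transcripts match word for word after normalization.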

Converted + device-validated with the QHexRT forge pipeline (recipes/moonshine-tiny).
