moonshine-streaming-medium: transcribe.cpp GGUF

GGUF conversions of UsefulSensors/moonshine-streaming-medium for use with transcribe.cpp.

Ported from upstream commit 57b8436, pinned 2026-05-06. Validated against the HF Transformers v5.7.0 reference at transcribe.cpp commit 0d312ce on 2026-05-06.

Offline English speech-to-text. A 245M-parameter encoder-decoder ASR model designed for streaming use (ergodic encoder + sliding-window attention, 50 Hz time-domain frontend). Same family as moonshine-streaming-tiny and moonshine-streaming-small; deepest of the three (14 / 14 layers) and widest hidden dims (encoder 768 / decoder 640). Takes a 16 kHz mono WAV and produces a transcript. No translation, no multilingual capability, no timestamps.

Downloads

Quantization	Download	Size	WER (LibriSpeech test-clean)
F32	moonshine-streaming-medium-F32.gguf	1015 MB	2.16%
F16	moonshine-streaming-medium-F16.gguf	509 MB	2.16%
Q8_0	moonshine-streaming-medium-Q8_0.gguf	282 MB	2.16%

WER is measured on the full LibriSpeech test-clean split (2620 utterances) with greedy decoding (num_beams=1, do_sample=False). The F32 reference baseline is 2.16%, and the F16 and Q8_0 quants score the same 2.16% on this manifest. Useful Sensors' self-reported number on this split is 2.08%, taken from the Open ASR Leaderboard table. The +0.08 pp residual matches the scoring / text-normalization difference seen across the tiny and small variants and is not numerical drift: a cross-check against HF Transformers on tiny found 99.6% of hypotheses identical to our port's. Q6_K / Q5_K_M / Q4_K_M GGUFs are not currently shipped for this variant.
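
For readers unfamiliar with the metric, the following is a minimal sketch of corpus-level WER: total word-level edit distance divided by total reference words. The function names are hypothetical and the whitespace tokenization is a simplification. It applies no text normalization, which is one place where scoring pipelines can legitimately differ, so it is not the evaluation script used for the table above.

```python
from typing import Iterable, List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word lists (substitute/insert/delete cost 1)."""
    # prev[j] holds the distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = cur[j - 1] + 1  # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """WER over (reference, hypothesis) pairs: total errors / total reference words."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        errors += edit_distance(ref_words, hypothesis.split())
        words += len(ref_words)
    if words == 0:
        raise ValueError("no reference words to score against")
    return errors / words
```

Because errors and words are summed before dividing, long utterances weigh more than short ones, which differs from averaging per-utterance WER.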

Usage

Build transcribe.cpp from source:

git clone git@github.com:handy-computer/transcribe.cpp.git
cd transcribe.cpp
cmake -B build && cmake --build build

Run on a 16 kHz mono WAV:

build/bin/transcribe-cli \
  -m moonshine-streaming-medium-Q8_0.gguf \
  input.wav

If your audio isn't already 16 kHz mono WAV, convert it first:

ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
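
To check whether a file needs that conversion, a small sketch using Python's standard-library wave module is shown here. The helper name is hypothetical, and the check covers only sample rate and channel count; it makes no claim about which sample widths transcribe-cli accepts.

```python
import wave


def is_16k_mono_wav(path: str) -> bool:
    """Return True if the file is a PCM WAV with a 16 kHz sample rate and one channel.

    wave.open raises wave.Error for files it cannot parse (non-WAV or non-PCM data),
    which callers can treat as "needs conversion".
    """
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1
```

A file that returns False, or raises wave.Error, is a candidate for the ffmpeg command above.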

See the transcribe.cpp model page for performance numbers, numerical validation, and reproduction steps.

License

Inherited from the base model: MIT. See the upstream model card for full terms.

Original Model Card

The section below is reproduced from UsefulSensors/moonshine-streaming-medium at commit 57b8436 for offline reference. The upstream card is the authoritative source.

Moonshine Streaming

[Paper]

This is the model card for the Moonshine Streaming automatic speech recognition (ASR) models trained and released by Useful Sensors. Moonshine Streaming pairs a lightweight 50 Hz audio frontend with a sliding-window Transformer encoder to deliver low-latency streaming ASR on edge-class hardware. The encoder uses bounded local attention and no positional embeddings (an "ergodic" encoder), while an adapter injects positional information before a standard autoregressive decoder.

This model card follows the recommendations from Model Cards for Model Reporting (Mitchell et al.). See the paper draft in this repository for full details.

Usage

Moonshine Streaming is supported by the Moonshine Voice framework for edge devices and in Hugging Face Transformers. The following example matches the standard seq2seq ASR API and uses the streaming model checkpoint:

pip install --upgrade pip
pip install --upgrade git+https://github.com/huggingface/transformers.git#egg=transformers datasets[audio]

from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    "usefulsensors/moonshine-streaming-small"
).to(device).to(torch_dtype)
processor = AutoProcessor.from_pretrained("usefulsensors/moonshine-streaming-small")

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

inputs = processor(
    sample["array"],
    return_tensors="pt",
    sampling_rate=processor.feature_extractor.sampling_rate,
)
inputs = inputs.to(device, torch_dtype)

# Limit max output length to avoid hallucination loops.
token_limit_factor = 6.5 / processor.feature_extractor.sampling_rate
seq_lens = inputs.attention_mask.sum(dim=-1)
max_length = int((seq_lens * token_limit_factor).max().item())

generated_ids = model.generate(**inputs, max_length=max_length)
print(processor.decode(generated_ids[0], skip_special_tokens=True))

Note: the current Transformers code path does not yet implement fully efficient streaming for these models. It uses the flash-attention backend's sliding-window attention when available.
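
The output-length cap in the example amounts to 6.5 tokens per second of audio. The sketch below isolates that arithmetic, assuming the attention mask counts raw audio samples (which is how the snippet reads) and a 16 kHz sampling rate. The function name and its default arguments are illustrative and not part of the Transformers API.

```python
def max_length_for_clip(
    num_samples: int,
    sampling_rate: int = 16000,
    tokens_per_second: float = 6.5,
) -> int:
    """Token cap for one clip, mirroring the token_limit_factor logic above."""
    # Tokens allowed per audio sample, e.g. 6.5 / 16000.
    token_limit_factor = tokens_per_second / sampling_rate
    # int() truncates toward zero, as in the original snippet.
    return int(num_samples * token_limit_factor)
```

Under these assumptions a 5-second clip (80,000 samples) is capped at 32 tokens, so very short clips get a tight budget and a runaway repetition loop is cut off early.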

Model Details

Model type

Sequence-to-sequence ASR model with a streaming, sliding-window Transformer encoder and an autoregressive Transformer decoder.

Supported languages

English (trained and evaluated on English datasets).

Model sizes

Size	Parameters	Encoder / Decoder layers	Encoder dim	Decoder dim
Tiny	34M	6 / 6	320	320
Small	123M	10 / 10	620	512
Medium	245M	14 / 14	768	640

Architecture summary

Audio frontend: 50 Hz features using simple time-domain operations, CMVN, and two causal stride-2 convolutions.
Encoder: sliding-window self-attention with no positional embeddings (ergodic encoder). Windowing uses (16, 4) for the first two and last two layers and (16, 0) for intermediate layers, giving an 80 ms lookahead in the lookahead layers.
Adapter: adds learned positional embeddings and aligns dimensions before the decoder.
Decoder: causal Transformer with RoPE, autoregressively generating text.
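
The encoder windowing can be illustrated with a small sketch. It assumes each pair is (frames of left context, frames of right lookahead), that the current frame is always visible, and that layers are 0-indexed; none of these conventions is confirmed by the card, and the function names are hypothetical. The 80 ms figure follows from 4 frames at 50 Hz.

```python
from typing import List, Tuple

FRAME_RATE_HZ = 50  # frontend frame rate stated in the card


def layer_windows(num_layers: int) -> List[Tuple[int, int]]:
    """(left, right) window per encoder layer: lookahead only in the first and last two."""
    windows = []
    for layer in range(num_layers):
        is_edge_layer = layer < 2 or layer >= num_layers - 2
        windows.append((16, 4) if is_edge_layer else (16, 0))
    return windows


def sliding_window_mask(num_frames: int, left: int, right: int) -> List[List[bool]]:
    """mask[i][j] is True when frame i may attend to frame j."""
    return [
        [(i - left) <= j <= (i + right) for j in range(num_frames)]
        for i in range(num_frames)
    ]


def lookahead_ms(right: int) -> float:
    """Per-layer lookahead in milliseconds for a given right-context size."""
    return right * 1000 / FRAME_RATE_HZ
```

For the medium model, layer_windows(14) yields four lookahead layers and ten purely causal ones, and lookahead_ms(4) gives 80.0.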

Model Use

Intended use

These models are intended for low-latency, on-device English speech transcription on memory- and compute-constrained platforms (roughly 0.1-1 TOPS and sub-1 GB memory budgets). Typical applications include live captioning, voice commands, and real-time transcription.

Out-of-scope use

These models are not intended for non-consensual surveillance, speaker identification, or high-stakes decision-making contexts. They have not been robustly evaluated for tasks outside English ASR.

Training Data

Moonshine Streaming was trained on roughly 300K hours of speech data. This includes the original Moonshine training sources (about 200K hours of public web data and open datasets) plus an additional 100K hours of internally prepared speech data. See the paper for details and dataset sources.

Performance and Limitations

Open ASR benchmark results (WER %)

Dataset	Tiny (34M)	Small (123M)	Medium (245M)
AMI	19.03	12.54	10.68
Earnings-22	20.27	13.53	11.90
GigaSpeech	13.90	10.41	9.46
LibriSpeech (clean)	4.49	2.49	2.08
LibriSpeech (other)	12.09	6.78	5.00
SPGISpeech	6.16	3.19	2.58
TED-LIUM	6.12	3.77	2.99
VoxPopuli	14.02	9.98	8.54
Average	12.01	7.84	6.65
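
The Average row is consistent with an unweighted mean over the eight datasets. A quick check, with the values copied from the table above and a hypothetical helper name, might look like this:

```python
from typing import Dict

# WER (%) per dataset, copied from the table above.
OPEN_ASR_WER: Dict[str, Dict[str, float]] = {
    "tiny": {
        "AMI": 19.03, "Earnings-22": 20.27, "GigaSpeech": 13.90,
        "LibriSpeech (clean)": 4.49, "LibriSpeech (other)": 12.09,
        "SPGISpeech": 6.16, "TED-LIUM": 6.12, "VoxPopuli": 14.02,
    },
    "small": {
        "AMI": 12.54, "Earnings-22": 13.53, "GigaSpeech": 10.41,
        "LibriSpeech (clean)": 2.49, "LibriSpeech (other)": 6.78,
        "SPGISpeech": 3.19, "TED-LIUM": 3.77, "VoxPopuli": 9.98,
    },
    "medium": {
        "AMI": 10.68, "Earnings-22": 11.90, "GigaSpeech": 9.46,
        "LibriSpeech (clean)": 2.08, "LibriSpeech (other)": 5.00,
        "SPGISpeech": 2.58, "TED-LIUM": 2.99, "VoxPopuli": 8.54,
    },
}


def mean_wer(scores: Dict[str, float]) -> float:
    """Unweighted mean across datasets, rounded to two decimals like the table."""
    return round(sum(scores.values()) / len(scores), 2)
```

Each dataset counts equally in this mean regardless of its size, so it is a different quantity from a WER pooled over all utterances.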

Known limitations

The decoder is autoregressive, so full-output latency grows with transcript length even when time to first token (TTFT) is low.
The Transformers implementation does not yet perform fully efficient streaming; it relies on the flash-attention backend for sliding-window attention.
Like other seq2seq ASR models, Moonshine Streaming can hallucinate words that are not present in the audio, and may repeat phrases, especially on short or noisy segments.

Broader Implications

Moonshine Streaming enables low-cost, low-latency transcription, which benefits accessibility and user interaction on edge devices. At the same time, ASR capabilities can be misused for surveillance or other harmful purposes. Users should consider consent, privacy, and domain-specific evaluation before deployment.

Citation

TBD
