SenseVoiceSmall - GGUF (ggml-quantised)
GGUF / ggml conversion of FunAudioLLM/SenseVoiceSmall for use with the sensevoice backend in CrispStrobe/CrispASR.
SenseVoiceSmall is Alibaba's multi-task encoder-only ASR: one forward pass through a 70-block SANM encoder emits the full transcript plus the spoken language ID, emotion, and audio-event tags through a single CTC head. The non-autoregressive design makes it 15× faster than Whisper-Large (70 ms for 10 s of audio in upstream's measurements).
- 70-block SenseVoiceEncoderSmall (1 entry block @ 560→512 + 49 main blocks + 20 tp blocks, all 512-dim, 4 heads, FSMN k=11 depthwise convolution branch; this is the same encoder body Fun-ASR-Nano-2512 ships, here paired with a CTC head instead of an LLM decoder)
- 4 query embeddings (language / event / emotion / textnorm) prepended to the LFR fbank features so the encoder can emit rich annotations at those positions
- CTC head (ctc.ctc_lo, 25055 SentencePiece pieces)
- 50+ languages with native LID (no whisper-tiny pre-step needed)
- Three quants shipped (May 2026): F16 (448 MB), Q8_0 (240 MB), Q4_K (129 MB, recommended default). All three produce byte-identical transcripts on English (JFK) and Japanese (JSUT) clips end-to-end on M1 Metal. 72 tensors stay F16 in the Q4_K/Q8_0 quants because their leading dim isn't quant-block-aligned: 70× attn.fsmn.w (kernel=11 depthwise convolution) and 2× attn.qkv.w (560-dim input from the SANM context concat); the other ~280 weight matrices quantize cleanly.
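The block-alignment constraint in the last bullet can be checked with a little arithmetic. The sketch below assumes that "leading dim" means the row length ggml quantises along, and uses ggml's block sizes (32 weights per Q8_0 block, 256 per Q4_K super-block). The tensor labels and the helper names `is_block_aligned` and `alignment_report` are illustrative, not part of CrispASR.

```python
# ggml packs quantised weights in fixed-size blocks along each row:
# Q8_0 uses 32 weights per block, Q4_K uses 256 weights per super-block.
BLOCK_SIZE = {"Q8_0": 32, "Q4_K": 256}

# Row lengths taken from the description above; the labels are illustrative.
ROW_LENGTHS = {
    "attn.fsmn.w (kernel=11)": 11,
    "attn.qkv.w (560-dim input)": 560,
    "512-dim weight matrix": 512,
}


def is_block_aligned(row_len: int, quant: str) -> bool:
    """True when a row of `row_len` weights splits into whole quant blocks."""
    return row_len % BLOCK_SIZE[quant] == 0


def alignment_report() -> dict:
    """Map each tensor label to its per-quant alignment flags."""
    return {
        name: {quant: is_block_aligned(n, quant) for quant in BLOCK_SIZE}
        for name, n in ROW_LENGTHS.items()
    }


if __name__ == "__main__":
    for name, flags in alignment_report().items():
        print(name, flags)
    # 70 attn.fsmn.w tensors + 2 attn.qkv.w tensors stay in F16.
    print("tensors kept in F16:", 70 + 2)
```

Under these assumptions, rows of 11 and 560 fail both checks while 512-dim rows pass, which matches the 72 tensors kept in F16.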
What you get in the output
By default, stdout shows the clean transcript:
And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.
With -oj the JSON output exposes the four rich-annotation tags as
explicit fields:
{
"text": "And so my fellow Americans...",
"language": "en",
"audio_event": "Speech",
"emotion": "ANGRY",
"itn_flag": "withitn"
}
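As a minimal sketch of consuming that JSON downstream, the snippet below parses the sample object with Python's standard `json` module. It assumes the payload is a single object shaped exactly like the example; the helper name `summarise` is hypothetical.

```python
import json

# Sample payload copied from the example above (transcript shortened there too).
SAMPLE = """{
  "text": "And so my fellow Americans...",
  "language": "en",
  "audio_event": "Speech",
  "emotion": "ANGRY",
  "itn_flag": "withitn"
}"""


def summarise(payload: str) -> str:
    """Format one result as "[language/emotion/event] text"."""
    result = json.loads(payload)
    # .get() with a default keeps this tolerant of a missing tag field.
    tags = "/".join(
        result.get(key, "?") for key in ("language", "emotion", "audio_event")
    )
    return f"[{tags}] {result['text']}"


if __name__ == "__main__":
    print(summarise(SAMPLE))
```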
The legacy sensevoice_transcribe() C ABI still returns the original
prefixed string for callers that want it that way:
<|en|><|HAPPY|><|Speech|><|withitn|>And so my fellow Americans...
<|zh|><|NEUTRAL|><|Speech|><|withitn|>开饭时间早上9点至下午5点。
New callers should use sensevoice_transcribe_structured(), which
returns the same fields as a struct sensevoice_result.
Tag value sets:
- Languages: zh/en/yue/ja/ko/nospeech
- Emotions: HAPPY/SAD/ANGRY/NEUTRAL/EMO_UNKNOWN
- Audio events: Speech/Music/Applause/Laughter/Cry/BGM (and more; the upstream set is open-ended)
- Text norm: withitn (Arabic digits, punctuation) or woitn (raw)
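Callers that still receive the legacy prefixed string can split it back into fields themselves. The following is a sketch, not CrispASR code: it assumes exactly four leading tags in the order the examples show (language, emotion, audio event, text norm), and the names `Tagged` and `parse_prefixed` are hypothetical.

```python
import re
from typing import NamedTuple


class Tagged(NamedTuple):
    language: str
    emotion: str
    audio_event: str
    itn_flag: str
    text: str


# Four <|...|> tags in the order shown in the examples, then the transcript.
_PREFIX = re.compile(
    r"^<\|([^|]*)\|><\|([^|]*)\|><\|([^|]*)\|><\|([^|]*)\|>(.*)$",
    re.DOTALL,
)


def parse_prefixed(line: str) -> Tagged:
    """Split a legacy prefixed string into its four tags and the transcript."""
    match = _PREFIX.match(line)
    if match is None:
        raise ValueError("expected four leading <|...|> tags")
    return Tagged(*match.groups())


if __name__ == "__main__":
    sample = "<|en|><|HAPPY|><|Speech|><|withitn|>And so my fellow Americans..."
    print(parse_prefixed(sample))
```

Because the audio-event set is open-ended upstream, the parser accepts any tag value rather than validating against the lists above.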
Files
| File | Size | Notes |
|---|---|---|
| sensevoice-small-q4_k.gguf | 129 MB | Recommended default. 2× faster on M1 vs F16; byte-identical transcript on tested clips. Auto-download target for --backend sensevoice -m auto. |
| sensevoice-small-q8_0.gguf | 240 MB | Larger but slightly closer to F16 numerically on borderline emotion-tag argmax cases. |
| sensevoice-small-f16.gguf | 448 MB | F16 reference weights. Use when you want bit-stability against the upstream PyTorch reference for diff testing. |
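The file sizes are consistent with the quant formats plus the tensors that stay in F16. As a rough check, the sketch below derives an effective bits-per-weight figure from the table, treating the F16 file as 16 bits per weight and ignoring metadata and size rounding. The nominal figures come from ggml's block layouts (34 bytes per 32 weights for Q8_0, 144 bytes per 256 weights for Q4_K); `effective_bpw` is an illustrative name.

```python
SIZES_MB = {"F16": 448, "Q8_0": 240, "Q4_K": 129}       # from the table above
NOMINAL_BPW = {"F16": 16.0, "Q8_0": 8.5, "Q4_K": 4.5}   # ggml block layouts


def effective_bpw(quant: str) -> float:
    """Average bits per weight implied by file size, taking F16 as 16 bits."""
    return 16.0 * SIZES_MB[quant] / SIZES_MB["F16"]


if __name__ == "__main__":
    for quant in SIZES_MB:
        print(f"{quant}: effective {effective_bpw(quant):.2f} bpw, "
              f"nominal {NOMINAL_BPW[quant]:.1f} bpw")
```

Both quantised files land slightly above their nominal rate, which is what one would expect when a subset of tensors remains in F16.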
Quick Start
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build --target crispasr-cli
./build/bin/crispasr \
--backend sensevoice \
-m /path/to/sensevoice-small-q4_k.gguf \
-f samples/jfk.wav -l en
# Or auto-download (resolves to Q4_K by default):
./build/bin/crispasr --backend sensevoice -m auto -f samples/jfk.wav -l en
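For scripted use, the same invocation can be wrapped in a few lines of Python. This is a sketch that uses only the flags shown in the commands above and assumes, per the output section of this card, that stdout carries the clean transcript. The helper names `build_command` and `transcribe` are hypothetical.

```python
import os
import subprocess
from typing import List

CRISPASR = "./build/bin/crispasr"  # binary path from the build steps above


def build_command(audio: str, model: str = "auto", language: str = "en") -> List[str]:
    """Assemble the argument list used in the Quick Start invocation."""
    return [CRISPASR, "--backend", "sensevoice",
            "-m", model, "-f", audio, "-l", language]


def transcribe(audio: str, model: str = "auto", language: str = "en") -> str:
    """Run the CLI and return its stdout with surrounding whitespace removed."""
    done = subprocess.run(build_command(audio, model, language),
                          capture_output=True, text=True, check=True)
    return done.stdout.strip()


if __name__ == "__main__":
    if os.path.isfile(CRISPASR):
        print(transcribe("samples/jfk.wav"))
    else:
        # Nothing to run yet; show the command that would be executed.
        print("build the CLI first:", " ".join(build_command("samples/jfk.wav")))
```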
Verification
crispasr-diff sensevoice is 76/76 PASS, byte-identical generated_text,
on Alibaba's own example zh.mp3; 75/76 PASS on samples/jfk.wav with
the single difference being the emotion-tag argmax flipping between
<|ANGRY|> and <|EMO_UNKNOWN|> (F16/op-order pushes that one slot
across a near-tied boundary; the transcript itself is byte-identical
in both runs). On Apple M1 Metal the runtime hits 15-22× realtime.
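To put the realtime figures in concrete terms, the sketch below converts between a realtime factor and wall-clock time. The 10 s clip length is only an example value, and the function names are illustrative.

```python
def realtime_factor(audio_s: float, wall_s: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    return audio_s / wall_s


def wall_time(audio_s: float, factor: float) -> float:
    """Wall-clock seconds needed for `audio_s` seconds of audio."""
    return audio_s / factor


if __name__ == "__main__":
    # The 15-22x realtime range quoted for M1 Metal, applied to a 10 s clip.
    for factor in (15, 22):
        print(f"{factor}x realtime: {wall_time(10.0, factor):.2f} s for 10 s of audio")
    # Upstream's quoted figure of 70 ms for 10 s of audio, as a factor.
    print(f"upstream figure: {realtime_factor(10.0, 0.070):.0f}x realtime")
```

The upstream figure and the M1 figure come from different hardware and runtimes, so they are not directly comparable.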
Licence + attribution
Upstream FunAudioLLM/SenseVoiceSmall:
- Code (the funasr Python package): Apache-2.0.
- Model weights: FunASR Model License v1.1 (Alibaba); commercial use OK with attribution. Confirmed on the upstream-tracking discussion in CrispStrobe/CrispASR#99.
These GGUF files are a quantised / repackaged distribution of the upstream weights and inherit the FunASR Model License v1.1. Please attribute Alibaba / FunAudioLLM in downstream products.
If you use this model, please also cite the upstream FunASR work. See the upstream model card for the canonical citation.