Paraformer-zh - GGUF (ggml-quantised)
GGUF / ggml conversion of funasr/paraformer-zh for use with the paraformer backend in CrispStrobe/CrispASR.
Paraformer-zh is Alibaba's non-autoregressive ASR model (~220M params): a single forward pass through 50 SANM encoder blocks, a CIF (continuous integrate-and-fire) predictor, and 16 NAR decoder blocks produces the full transcript, with no autoregressive token-by-token generation. Primarily Mandarin Chinese with English support. Character-level tokenizer (8404 vocab).
Architecture
Audio (16 kHz mono)
→ Kaldi fbank (80 mel, 25 ms / 10 ms)
→ LFR: stack 7, stride 6 → (T_lfr, 560)
→ CMVN (AddShift + Rescale, 560-dim)
→ SANMEncoder: 1 entry block (560→512) + 49 main blocks (512→512)
    each: LayerNorm → fused QKV → FSMN(k=11) + MHA(4 heads) → FFN(2048)
→ CifPredictorV2:
    Conv1d(512,512,k=3) → ReLU → Linear(512,1) → sigmoid
    → CIF accumulation (fire when alpha ≥ 1.0)
    → acoustic_embeds: (N_tokens, 512)
→ ParaformerSANMDecoder: 16 blocks
    each: norm1 → FFN → norm2 → FSMN(k=11) → norm3 → cross-attn(Q=dec, KV=enc)
→ decoders3: 1 post-processing block (FFN only)
→ after_norm → output_layer(512→8404) → argmax → characters
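The CIF step in the diagram can be illustrated with a minimal pure-Python sketch. It assumes the textbook integrate-and-fire rule: accumulate alpha per encoder frame, emit one embedding each time the running sum reaches 1.0, and carry the leftover weight into the next token. The name `cif` and the list-of-lists layout are illustrative, not taken from CrispASR or FunASR, and tail handling is omitted.

```python
def cif(hidden, alphas, threshold=1.0):
    """Integrate-and-fire over encoder frames (illustrative sketch).

    hidden: list of T frames, each a list of D floats (encoder output).
    alphas: list of T weights in [0, 1] (sigmoid output of the predictor).
    Returns one D-dim embedding per fired token.
    """
    if len(hidden) != len(alphas):
        raise ValueError("hidden and alphas must have the same length")
    if not hidden:
        return []
    dim = len(hidden[0])
    integrate = 0.0        # running sum of alpha for the current token
    frame = [0.0] * dim    # weighted sum of frames for the current token
    fired = []
    for h, alpha in zip(hidden, alphas):
        if integrate + alpha < threshold:
            # Not enough weight yet: keep accumulating.
            integrate += alpha
            frame = [f + alpha * x for f, x in zip(frame, h)]
        else:
            # Fire: spend only the part of alpha needed to reach the threshold.
            used = threshold - integrate
            fired.append([f + used * x for f, x in zip(frame, h)])
            # The leftover weight starts the next token.
            integrate = alpha - used
            frame = [integrate * x for x in h]
    return fired


# Three frames with one feature each; the second frame is split across two tokens.
print(cif([[4.0], [8.0], [2.0]], [0.75, 0.5, 0.75]))  # prints [[5.0], [3.5]]
```

Because alpha comes from a sigmoid it never exceeds 1.0, so at most one token fires per frame and the number of fired embeddings tracks the sum of the alphas.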
- Encoder reuses the same SANM block as Fun-ASR-Nano and SenseVoice
- Decoder block order is unusual: FFN → FSMN → cross-attn (not the more common self-attn → cross-attn → FFN)
- FSMN = depthwise conv (no Q/K/V self-attention in the decoder)
- Cross-attention uses a fused K+V projection from encoder output
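To make the "FSMN = depthwise conv" note concrete, here is a minimal sketch of an FSMN memory block as a per-channel 1-D convolution over time. Symmetric zero padding, the residual connection, and the absence of a bias are assumptions of this sketch; `fsmn_memory` is a hypothetical name, and in the real model the k=11 kernels come from the GGUF weights.

```python
def fsmn_memory(x, kernels):
    """Depthwise 1-D convolution over time with a residual (illustrative).

    x:       T x D list of lists (one row per time step).
    kernels: D x K list of lists (one K-tap filter per channel).
    Assumptions: symmetric zero padding, no bias, residual add.
    """
    num_frames, dim = len(x), len(x[0])
    taps = len(kernels[0])
    left = (taps - 1) // 2  # frames of left context; k=11 gives 5
    out = []
    for t in range(num_frames):
        row = []
        for d in range(dim):
            acc = 0.0
            for k in range(taps):
                src = t + k - left
                if 0 <= src < num_frames:  # positions outside the clip are zero
                    acc += kernels[d][k] * x[src][d]
            # Each channel is filtered independently, then added back to the input.
            row.append(acc + x[t][d])
        out.append(row)
    return out


# One channel, 3 taps that copy the previous frame: out[t] = x[t-1] + x[t].
print(fsmn_memory([[1.0], [2.0], [3.0]], [[1.0, 0.0, 0.0]]))  # prints [[1.0], [3.0], [5.0]]
```

Unlike self-attention, each channel mixes only with its own neighbours in time, so the block needs no Q/K/V projections.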
Files
| File | Size | Notes |
|---|---|---|
| paraformer-zh-q4_k.gguf | 123 MB | Recommended default. Byte-identical transcript to F16 on both Chinese and English test clips. Auto-download target for --backend paraformer -m auto. |
| paraformer-zh-q8_0.gguf | 227 MB | Byte-identical transcript to F16. |
| paraformer-zh-f16.gguf | 421 MB | F16 reference weights (956 tensors). Use for diff testing against the upstream PyTorch reference. |
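As a rough sanity check on the sizes above, the effective bits per weight can be estimated from the F16 file. This sketch assumes the file size is dominated by weights, that the F16 file stores every weight at 16 bits, and that metadata is negligible, so the numbers are approximate.

```python
# File sizes in MB, copied from the table above.
SIZES_MB = {"f16": 421, "q8_0": 227, "q4_k": 123}


def approx_bits_per_weight(sizes_mb, reference="f16", reference_bits=16):
    """Estimate bits per weight for each file relative to a reference file.

    Assumption: the reference file spends exactly `reference_bits` per weight.
    """
    # Weight count in millions, inferred from the reference file.
    weights_m = sizes_mb[reference] * 8 / reference_bits
    return {name: round(mb * 8 / weights_m, 2) for name, mb in sizes_mb.items()}


print(approx_bits_per_weight(SIZES_MB))
```

The estimates come out a little above 8 and 4 bits for Q8_0 and Q4_K, as expected for block-quantised formats that also store per-block scales.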
Quick Start
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build --target crispasr-cli
# Chinese:
./build/bin/crispasr \
--backend paraformer \
-m /path/to/paraformer-zh-q4_k.gguf \
-f chinese_audio.wav --no-prints
# → 正是因为存在绝对正义所以我们接受现实的相对正义...
# English:
./build/bin/crispasr \
--backend paraformer \
-m /path/to/paraformer-zh-q4_k.gguf \
-f samples/jfk.wav --no-prints
# → and so my fellow americans ask not what your country can do for you ask what you can do for your country
# Or auto-download (resolves to Q4_K by default):
./build/bin/crispasr --backend paraformer -m auto -f audio.wav
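For scripting, the same invocation can be wrapped from Python. This is a sketch, not part of CrispASR: `build_command` and `transcribe` are hypothetical helpers, only the flags shown above are used, and it assumes the transcript is written to stdout.

```python
import subprocess


def build_command(audio, model="auto", binary="./build/bin/crispasr"):
    """Assemble the CLI call from the Quick Start as an argument list."""
    return [binary, "--backend", "paraformer", "-m", model, "-f", audio, "--no-prints"]


def transcribe(audio, model="auto", binary="./build/bin/crispasr"):
    """Run the CLI and return its output.

    Assumption: the transcript is printed to stdout.
    """
    result = subprocess.run(
        build_command(audio, model, binary),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.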
Output format
The output is raw character-level text:
- Chinese: characters concatenated directly (no spaces), which is standard for Chinese text
- English: word-level tokens with spaces inserted between consecutive English words; BPE continuation markers (@@) handled internally
- No punctuation or casing: the model's character vocabulary has only lowercase English. Use --punc-model for punctuation restoration if needed.
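The joining rules above can be sketched in a few lines. This illustrates the described behaviour and is not CrispASR's code: `join_tokens` is a hypothetical name, a token is treated as English when it is pure ASCII, and the @@ marker is assumed to sit at the end of a token that continues into the next one.

```python
def join_tokens(tokens):
    """Join decoded tokens into text (illustrative sketch).

    Assumptions: ASCII tokens are English; a trailing "@@" means the word
    continues in the next token.
    """
    pieces = []
    prev_english = False
    prev_continues = False
    for tok in tokens:
        continues = tok.endswith("@@")
        piece = tok[:-2] if continues else tok
        english = piece.isascii()
        # A space goes only between two English tokens that are separate words.
        if pieces and english and prev_english and not prev_continues:
            pieces.append(" ")
        pieces.append(piece)
        prev_english = english
        prev_continues = continues
    return "".join(pieces)


print(join_tokens(["coun@@", "try", "can", "do"]))  # prints country can do
```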
Verification
All three quants (F16, Q4_K, Q8_0) produce byte-identical transcripts vs the upstream Python reference (funasr.AutoModel.generate()) on:
- Chinese (13 s asr_example.wav): 66 characters, exact match
- English (11 s JFK samples/jfk.wav): 26 tokens, exact match
The crispasr-diff paraformer harness captures 73 intermediate stages (mel features, 50 encoder layers, CIF alphas, acoustic embeds, 16 decoder layers, decoder output, generated text) for element-wise cosine-similarity comparison.
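The per-stage comparison can be pictured with a small sketch of cosine similarity over flattened tensors. It is not the harness itself: the function names, the stage-name keys, and the 0.999 threshold are assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equally sized, flattened tensors."""
    if len(a) != len(b):
        raise ValueError("tensors must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Two all-zero tensors count as identical; zero vs non-zero does not.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def first_divergent_stage(reference, candidate, threshold=0.999):
    """Return the first stage whose similarity drops below threshold, or None.

    Both arguments map stage name -> flat list of floats, in pipeline order.
    The threshold is an arbitrary value picked for this sketch.
    """
    for name, ref in reference.items():
        if cosine_similarity(ref, candidate[name]) < threshold:
            return name
    return None
```

Walking the stages in pipeline order points at the earliest layer where a port drifts from the reference, which is usually the place to start debugging.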
Converting from upstream
If you want to convert from the upstream PyTorch model yourself:
# Download upstream model
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download('funasr/paraformer-zh',
local_dir='paraformer-zh-upstream',
local_dir_use_symlinks=False)
"
# Convert to GGUF
python3 models/convert-paraformer-to-gguf.py \
--input paraformer-zh-upstream \
--output paraformer-zh-f16.gguf
# Quantize
./build/bin/crispasr-quantize paraformer-zh-f16.gguf paraformer-zh-q4_k.gguf q4_k
./build/bin/crispasr-quantize paraformer-zh-f16.gguf paraformer-zh-q8_0.gguf q8_0
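After converting, a quick structural check is to read the GGUF header and compare the tensor count with the 956 tensors listed for the F16 file. The sketch assumes GGUF version 2 or later (little-endian, 64-bit counts); `parse_gguf_header` and `read_gguf_header` are illustrative helpers, not part of CrispASR.

```python
import struct

GGUF_HEADER_SIZE = 24  # magic (4) + version (4) + tensor count (8) + KV count (8)


def parse_gguf_header(raw):
    """Decode the fixed-size GGUF header from its first 24 bytes.

    Assumption: GGUF v2 or later, little-endian, with 64-bit counts.
    """
    if len(raw) < GGUF_HEADER_SIZE or raw[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:GGUF_HEADER_SIZE])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


def read_gguf_header(path):
    """Read and decode the header of a GGUF file on disk."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(GGUF_HEADER_SIZE))


# Example: header = read_gguf_header("paraformer-zh-f16.gguf")
# then compare header["tensor_count"] with the 956 listed in the Files table.
```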
Licence + attribution
Upstream funasr/paraformer-zh:
- Code (the funasr Python package): Apache-2.0.
- Model weights: FunASR Model License (Alibaba); commercial use OK with attribution.
These GGUF files are a quantised / repackaged distribution of the upstream weights and inherit the FunASR Model License. Please attribute Alibaba / FunAudioLLM in downstream products.
If you use this model, please also cite the upstream FunASR work. See the upstream model card for the canonical citation.