Model support in CrispASR: pure C++ inference with GGUF quantisation (no Python needed)

#12

by cstr - opened Apr 16


We've built a complete C++ runtime for Qwen3-ASR in CrispASR, a multi-backend ASR tool based on ggml. One binary, one GGUF file — no Python, no PyTorch, no pip install.

What works:

Full pipeline: mel → Whisper-style audio encoder → Qwen3 0.6B LLM decode
Q4_K / Q5_0 / Q8_0 / F16 quantisation (513 MB Q4_K vs 1.8 GB F16)
30 languages + 22 Chinese dialects
Slightly faster than realtime on CPU (10.5 s for 11 s of audio on a 4-core Xeon, a real-time factor of about 0.95)
Word-level timestamps via Qwen3-ForcedAligner (-am qwen3-forced-aligner.gguf)
Temperature sampling + best-of-N decoding (--best-of 5 -tp 0.3)
Streaming from mic/stdin (--stream, --mic, --live)
Speaker diarisation, language ID, SRT/VTT/JSON output
GPU acceleration via CUDA / Metal / Vulkan (ggml backends)
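To compare the CPU timing in the list above with your own hardware, it helps to express it as a real-time factor (processing time divided by audio duration, where anything below 1.0 is faster than realtime). A minimal sketch, assuming you measure both durations in seconds yourself; `rtf` is a hypothetical helper name, not something CrispASR ships:

```shell
# Real-time factor = processing time / audio duration.
# Values below 1.0 mean the run was faster than realtime.
# `rtf` is a hypothetical helper, not a CrispASR command.
rtf() {
  awk -v proc="$1" -v dur="$2" 'BEGIN { printf "%.2f\n", proc / dur }'
}

# The 4-core Xeon figure quoted above: 10.5 s of compute for 11 s of audio.
rtf 10.5 11   # prints 0.95
```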

Quick start:

git clone https://github.com/CrispStrobe/CrispASR && cd CrispASR
cmake -S . -B build && cmake --build build -j8

# Auto-download and transcribe
./build/bin/crispasr --backend qwen3 -m auto -f audio.wav

# Or use pre-quantised GGUF
./build/bin/crispasr -m qwen3-asr-0.6b-q4_k.gguf -f audio.wav -osrt
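For more than one file, the second command above can be wrapped in a loop. This is only a sketch using the flags already shown (`-m`, `-f`, `-osrt`); the `CRISPASR` and `MODEL` variables, the `DRY_RUN` switch, and the `transcribe_dir` function name are my own assumptions, not part of CrispASR:

```shell
# Transcribe every .wav file in a directory to SRT, one crispasr call per file.
# CRISPASR and MODEL are assumptions about where the binary was built and
# where the GGUF is stored; override them to match your setup.
# Set DRY_RUN=1 to print the commands instead of running them.
CRISPASR="${CRISPASR:-./build/bin/crispasr}"
MODEL="${MODEL:-qwen3-asr-0.6b-q4_k.gguf}"

transcribe_dir() {
  dir="$1"
  for wav in "$dir"/*.wav; do
    [ -e "$wav" ] || continue   # skip when the glob matches nothing
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "$CRISPASR -m $MODEL -f $wav -osrt"
    else
      "$CRISPASR" -m "$MODEL" -f "$wav" -osrt
    fi
  done
}
```

Running `DRY_RUN=1 transcribe_dir ./recordings` first shows what would be executed. Where the resulting subtitle files are written depends on CrispASR's own output handling, which this post does not describe.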

Pre-quantised GGUFs: cstr/qwen3-asr-0.6b-GGUF

CrispASR supports 11 ASR backends in the same binary, including Whisper, Parakeet, Canary, Cohere, Granite, Voxtral 3B/4B, wav2vec2, and Qwen3-ASR.
