Parakeet GGUF: models for parakeet.cpp

GGUF-format weights for parakeet.cpp, a C++/ggml port of NVIDIA NeMo Parakeet that matches the upstream PyTorch models on CPU. This single repo collects every supported model × quantization combination as a flat set of .gguf files, so you can download just the one you need.

F16 is the recommended default: same accuracy as F32, ~1.7× smaller, and typically the fastest on modern CPUs via ggml's F32×F16 matmul fast path.

Models

tdt_ctc-110m

Source: nvidia/parakeet-tdt_ctc-110m · Hybrid TDT+CTC (FastConformer) · heads: TDT + CTC

File                                 Variant  Size      WER vs NeMo
tdt_ctc-110m-f16.gguf ← recommended  F16      267.5 MB  0.0000
tdt_ctc-110m-q8_0.gguf               Q8_0     177.8 MB  0.0000
tdt_ctc-110m-q6_k.gguf               Q6_K     155.9 MB  not measured
tdt_ctc-110m-q5_k.gguf               Q5_K     143.3 MB  not measured
tdt_ctc-110m-q4_k.gguf               Q4_K     131.4 MB  0.0000

realtime_eou_120m-v1

Source: nvidia/parakeet_realtime_eou_120m-v1 · Cache-aware streaming RNNT (FastConformer, EOU/EOB) · heads: RNNT (streaming)

File                                         Variant  Size      WER vs NeMo
realtime_eou_120m-v1-f16.gguf ← recommended  F16      266.5 MB  not measured
realtime_eou_120m-v1-q8_0.gguf               Q8_0     176.0 MB  not measured
realtime_eou_120m-v1-q6_k.gguf               Q6_K     153.9 MB  not measured
realtime_eou_120m-v1-q5_k.gguf               Q5_K     141.2 MB  not measured
realtime_eou_120m-v1-q4_k.gguf               Q4_K     129.1 MB  not measured

ctc-0.6b

Source: nvidia/parakeet-ctc-0.6b · CTC (FastConformer) · heads: CTC

File                             Variant  Size       WER vs NeMo
ctc-0.6b-f16.gguf ← recommended  F16      1373.4 MB  0.0000
ctc-0.6b-q8_0.gguf               Q8_0      875.4 MB  0.0000
ctc-0.6b-q6_k.gguf               Q6_K      746.8 MB  not measured
ctc-0.6b-q5_k.gguf               Q5_K      676.3 MB  not measured
ctc-0.6b-q4_k.gguf               Q4_K      609.9 MB  not measured

rnnt-0.6b

Source: nvidia/parakeet-rnnt-0.6b · RNNT transducer (FastConformer) · heads: RNNT

File                              Variant  Size       WER vs NeMo
rnnt-0.6b-f16.gguf ← recommended  F16      1402.8 MB  0.0000
rnnt-0.6b-q8_0.gguf               Q8_0      903.9 MB  0.0000
rnnt-0.6b-q6_k.gguf               Q6_K      776.3 MB  not measured
rnnt-0.6b-q5_k.gguf               Q5_K      705.7 MB  not measured
rnnt-0.6b-q4_k.gguf               Q4_K      639.2 MB  not measured

tdt-0.6b-v2

Source: nvidia/parakeet-tdt-0.6b-v2 · TDT transducer (FastConformer) · heads: TDT

File                                Variant  Size       WER vs NeMo
tdt-0.6b-v2-f16.gguf ← recommended  F16      1404.2 MB  0.0000
tdt-0.6b-v2-q8_0.gguf               Q8_0      903.8 MB  0.0000
tdt-0.6b-v2-q6_k.gguf               Q6_K      775.9 MB  not measured
tdt-0.6b-v2-q5_k.gguf               Q5_K      705.0 MB  not measured
tdt-0.6b-v2-q4_k.gguf               Q4_K      638.4 MB  not measured

tdt-0.6b-v3

Source: nvidia/parakeet-tdt-0.6b-v3 · TDT transducer (FastConformer) · heads: TDT

File                                Variant  Size       WER vs NeMo
tdt-0.6b-v3-f16.gguf ← recommended  F16      1441.0 MB  0.0000
tdt-0.6b-v3-q8_0.gguf               Q8_0      940.7 MB  0.0000
tdt-0.6b-v3-q6_k.gguf               Q6_K      812.7 MB  not measured
tdt-0.6b-v3-q5_k.gguf               Q5_K      741.9 MB  not measured
tdt-0.6b-v3-q4_k.gguf               Q4_K      675.2 MB  not measured

ctc-1.1b

Source: nvidia/parakeet-ctc-1.1b · CTC (FastConformer) · heads: CTC

File                             Variant  Size       WER vs NeMo
ctc-1.1b-f16.gguf ← recommended  F16      2395.8 MB  0.0000
ctc-1.1b-q8_0.gguf               Q8_0     1526.3 MB  0.0000
ctc-1.1b-q6_k.gguf               Q6_K     1301.7 MB  not measured
ctc-1.1b-q5_k.gguf               Q5_K     1178.5 MB  not measured
ctc-1.1b-q4_k.gguf               Q4_K     1062.6 MB  not measured

rnnt-1.1b

Source: nvidia/parakeet-rnnt-1.1b · RNNT transducer (FastConformer) · heads: RNNT

File                              Variant  Size       WER vs NeMo
rnnt-1.1b-f16.gguf ← recommended  F16      2425.2 MB  0.0000
rnnt-1.1b-q8_0.gguf               Q8_0     1554.7 MB  0.0000
rnnt-1.1b-q6_k.gguf               Q6_K     1331.2 MB  not measured
rnnt-1.1b-q5_k.gguf               Q5_K     1207.9 MB  not measured
rnnt-1.1b-q4_k.gguf               Q4_K     1091.9 MB  not measured

tdt-1.1b

Source: nvidia/parakeet-tdt-1.1b · TDT transducer (FastConformer) · heads: TDT

File                             Variant  Size       WER vs NeMo
tdt-1.1b-f16.gguf ← recommended  F16      2425.3 MB  0.0000
tdt-1.1b-q8_0.gguf               Q8_0     1554.8 MB  0.0000
tdt-1.1b-q6_k.gguf               Q6_K     1331.2 MB  not measured
tdt-1.1b-q5_k.gguf               Q5_K     1207.9 MB  not measured
tdt-1.1b-q4_k.gguf               Q4_K     1091.9 MB  not measured

tdt_ctc-1.1b

Source: nvidia/parakeet-tdt_ctc-1.1b · Hybrid TDT+CTC (FastConformer) · heads: TDT + CTC

File                                 Variant  Size       WER vs NeMo
tdt_ctc-1.1b-f16.gguf ← recommended  F16      2429.5 MB  0.0000
tdt_ctc-1.1b-q8_0.gguf               Q8_0     1559.0 MB  0.0000
tdt_ctc-1.1b-q6_k.gguf               Q6_K     1335.4 MB  not measured
tdt_ctc-1.1b-q5_k.gguf               Q5_K     1212.1 MB  not measured
tdt_ctc-1.1b-q4_k.gguf               Q4_K     1096.1 MB  not measured

WER (word error rate) is computed against the upstream NeMo reference transcript on tests/fixtures/speech.wav (LibriSpeech 2086-149220-0033, ~7.4 s, English). A WER of 0.0000 means the transcript is byte-for-byte identical to the NeMo output. See parity.md and quantization.md.
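
For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate against a reference transcript can be computed. It is not the project's own evaluation script: the function name `word_error_rate` is hypothetical, and the sketch assumes plain whitespace tokenization with no case or punctuation normalization, which may differ from what parity.md describes.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

With this definition, identical transcripts score 0.0, and one substituted word in a three-word reference scores 1/3. Here the "reference" would be the NeMo transcript rather than a human ground truth, so the number measures parity with upstream, not absolute accuracy.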

Quantization notes

Quantization is applied only to the large linear weights fed directly into ggml_mul_mat (encoder FFN + attention projections, subsampling output projection, joint enc/pred projections). All other tensors (mel filterbank, LSTM prediction net, conv kernels, batch_norm stats, norms, biases, embeddings) stay F32.
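
Because only part of each model is quantized, file sizes shrink less than the nominal bit widths suggest. The back-of-the-envelope sketch below estimates the split from two published sizes. It rests on assumptions not stated in this README: that the F16 file stores the same "large linear weights" as F16 and everything else as F32, that ggml's Q8_0 costs 34 bytes per block of 32 weights, that 1 MB is 10^6 bytes, and that GGUF metadata is small enough to lump in with the F32 remainder. The helper names are hypothetical.

```python
# Assumed storage cost in bytes per weight.
F32_BYTES = 4.0
F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # 32 int8 values plus one 2-byte scale per block


def split_weights(f16_mb: float, q8_0_mb: float) -> tuple:
    """Return (millions of quantizable weights, MB of tensors kept in F32)."""
    # Only the quantizable weights change size between the two files.
    quantizable_m = (f16_mb - q8_0_mb) / (F16_BYTES - Q8_0_BYTES)
    f32_rest_mb = f16_mb - F16_BYTES * quantizable_m
    return quantizable_m, f32_rest_mb


def estimated_f32_mb(f16_mb: float, q8_0_mb: float) -> float:
    """Estimate the size of a hypothetical all-F32 file."""
    quantizable_m, f32_rest_mb = split_weights(f16_mb, q8_0_mb)
    return F32_BYTES * quantizable_m + f32_rest_mb


# Sizes for tdt_ctc-110m as listed in this README.
f16_mb, q8_0_mb = 267.5, 177.8
quantizable_m, f32_rest_mb = split_weights(f16_mb, q8_0_mb)
ratio = estimated_f32_mb(f16_mb, q8_0_mb) / f16_mb
print(f"quantizable weights: ~{quantizable_m:.1f} M")
print(f"tensors kept in F32: ~{f32_rest_mb:.1f} MB")
print(f"estimated F32 / F16 size ratio: ~{ratio:.2f}")
```

Under these assumptions the estimated F32-to-F16 ratio for tdt_ctc-110m comes out at roughly 1.7 rather than 2, which is consistent with the "~1.7× smaller" figure quoted for F16.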

Usage

# 1. Clone + build parakeet.cpp
git clone https://github.com/mudler/parakeet.cpp
cd parakeet.cpp
cmake -B build -DPARAKEET_BUILD_CLI=ON && cmake --build build -j

# 2. Download one quant (F16 recommended)
huggingface-cli download mudler/parakeet-cpp-gguf tdt_ctc-110m-f16.gguf --local-dir models/

# 3. Transcribe
build/examples/cli/parakeet-cli transcribe \
    --model models/tdt_ctc-110m-f16.gguf \
    --input audio.wav
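
If a transcription run fails right after downloading, it can help to confirm that the file really is a GGUF container and not, say, a truncated download or an HTML error page. The sketch below reads the fixed-size GGUF header. It assumes GGUF version 2 or later, where the two counts are 64-bit little-endian integers; the function name `read_gguf_header` is hypothetical and not part of parakeet.cpp.

```python
import struct


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file."""
    with open(path, "rb") as f:
        header = f.read(24)
    # A GGUF file starts with the 4-byte magic "GGUF".
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata count.
    return struct.unpack("<IQQ", header[4:24])
```

For example, `read_gguf_header("models/tdt_ctc-110m-f16.gguf")` would return the format version and the number of tensors and metadata entries, or raise `ValueError` if the magic bytes are missing.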

License

The GGUF weights are derived from the NVIDIA NeMo Parakeet checkpoints, released under the CC-BY-4.0 license. The parakeet.cpp runtime is MIT-licensed.
