Granite Speech 4.1 2B NAR — MLX (5-bit)

5-bit quantized MLX port of ibm-granite/granite-speech-4.1-2b-nar for Apple Silicon. Runs via mlx-audio.

The bf16 baseline lives at mlx-community/granite-speech-4.1-2b-nar-mlx. This 5-bit variant trades a small amount of punctuation precision for a ~47% smaller weight file, at the cost of about 11% longer inference time on the benchmark below.

Size and performance

Variant	`model.safetensors`	Inference (24.9 s audio, M-series)	Speed (audio duration / inference time)
bf16 (baseline)	4.51 GB	1.33 s	18.7× real-time
5-bit (this repo)	2.37 GB	1.48 s	16.8× real-time
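
The derived figures in this table follow from the raw measurements by simple arithmetic. The sketch below recomputes them; the function and variable names are illustrative and not part of mlx-audio.

```python
# Raw measurements copied from the table above.
AUDIO_SECONDS = 24.9
VARIANTS = {
    "bf16": {"size_gb": 4.51, "inference_s": 1.33},
    "5-bit": {"size_gb": 2.37, "inference_s": 1.48},
}


def speed_factor(audio_s: float, inference_s: float) -> float:
    """Seconds of audio transcribed per second of compute."""
    return audio_s / inference_s


def size_reduction(baseline_gb: float, quantized_gb: float) -> float:
    """Fraction of the baseline file size saved by quantization."""
    return 1.0 - quantized_gb / baseline_gb


def slowdown(baseline_s: float, quantized_s: float) -> float:
    """Relative increase in inference time versus the baseline."""
    return quantized_s / baseline_s - 1.0


bf16, q5 = VARIANTS["bf16"], VARIANTS["5-bit"]
for name, v in VARIANTS.items():
    factor = speed_factor(AUDIO_SECONDS, v["inference_s"])
    print(f"{name}: {factor:.1f}x real-time")
print(f"size reduction: {size_reduction(bf16['size_gb'], q5['size_gb']):.1%}")
print(f"slowdown: {slowdown(bf16['inference_s'], q5['inference_s']):.1%}")
```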

Quantization is applied only to the editor's Linear projections (1.6 B params; attention_multiplier / embedding_multiplier / logits_scaling / residual_multiplier are preserved as configured). The Conformer encoder (~540 M params) and Q-Former projector (~80 M params) stay at full precision because they are noise-sensitive and small enough that quantizing them would yield little memory benefit.
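
As a rough sanity check on the file sizes, the footprint of this mixed-precision layout can be estimated from the parameter counts. The sketch below is a back-of-the-envelope calculation, not a description of how this repo was built. It assumes group-wise affine quantization with a group size of 64 and a 16-bit scale and bias per group (the MLX default, but not confirmed for this repo), and it treats all 1.6 B editor parameters as quantized, which slightly overstates the savings because embeddings and norms may stay at higher precision.

```python
# Parameter counts as stated in this card.
EDITOR_PARAMS = 1.6e9
ENCODER_PARAMS = 543e6
PROJECTOR_PARAMS = 80e6

BF16_BITS = 16


def quantized_bits_per_weight(bits: int, group_size: int,
                              scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Effective bits per weight, including per-group scale and bias overhead."""
    return bits + (scale_bits + bias_bits) / group_size


def gigabytes(params: float, bits_per_weight: float) -> float:
    """Storage in GB (10^9 bytes) for `params` weights."""
    return params * bits_per_weight / 8 / 1e9


# ASSUMPTION: group_size=64 with fp16 scale and bias; not confirmed by the card.
editor_5bit_gb = gigabytes(EDITOR_PARAMS, quantized_bits_per_weight(5, 64))
full_precision_gb = gigabytes(ENCODER_PARAMS + PROJECTOR_PARAMS, BF16_BITS)
estimated_5bit_total_gb = editor_5bit_gb + full_precision_gb
estimated_bf16_total_gb = gigabytes(
    EDITOR_PARAMS + ENCODER_PARAMS + PROJECTOR_PARAMS, BF16_BITS
)

print(f"editor at 5-bit:      {editor_5bit_gb:.2f} GB")
print(f"encoder + projector:  {full_precision_gb:.2f} GB")
print(f"estimated 5-bit file: {estimated_5bit_total_gb:.2f} GB")
print(f"estimated bf16 file:  {estimated_bf16_total_gb:.2f} GB")
```

Under these assumptions the estimates land close to the measured 2.37 GB and 4.51 GB; the small gap is expected given the rounded parameter counts.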

Quality drift vs bf16 on the multilingual reference sample is limited to:

One added comma (la nuit suivante, vs la nuit suivante) — arguably more correct French.
Spacing inside a quoted dialogue ("si vous vs " si vous).

Word-level transcript is otherwise identical to the bf16 baseline. No characters or accents are corrupted (paraîtra, pêcheur, soeur all intact).
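
One way to check a "word-level identical" claim like this is to strip punctuation from both transcripts and compare the remaining word sequences. The sketch below is a minimal illustration using only the standard library; `words` and `same_words` are hypothetical helpers, not the validation script used for this repo.

```python
import unicodedata


def words(text: str) -> list[str]:
    """Split text into words, treating every Unicode punctuation mark as a space.

    Accented letters are left untouched, so an accent error still counts
    as a word-level difference.
    """
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    return cleaned.split()


def same_words(a: str, b: str) -> bool:
    """True when two transcripts differ only in punctuation or spacing."""
    return words(a) == words(b)


# The two differences listed above vanish at the word level.
print(same_words("la nuit suivante,", "la nuit suivante"))
print(same_words('"si vous', '" si vous'))

# A dropped accent would still be reported as a mismatch.
print(same_words("paraîtra", "paraitra"))
```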

Architecture

Non-autoregressive ASR via CTC + bidirectional LM editing:

16-layer Conformer encoder (543 M params) produces an initial BPE CTC hypothesis.
2-layer windowed Q-Former projector (80 M params) converts multi-layer encoder states into audio embeddings.
40-layer bidirectional Granite editor (1.6 B params) takes [audio | hypothesis_tokens] and emits edited logits in a single forward pass — no autoregression, no KV cache.
Final CTC collapse on text-position logits yields the transcript.
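
The final step can be illustrated with the standard greedy CTC collapse: merge consecutive repeated tokens, then drop blanks. The sketch below operates on per-position token ids (for example, the argmax of the logits); the blank id of 0 and the example ids are assumptions for illustration, not values taken from this model.

```python
from itertools import groupby


def ctc_collapse(token_ids: list[int], blank_id: int = 0) -> list[int]:
    """Greedy CTC collapse: merge consecutive repeats, then remove blanks.

    ASSUMPTION: blank_id=0 is illustrative; the real blank index depends on
    the model's vocabulary.
    """
    return [tok for tok, _ in groupby(token_ids) if tok != blank_id]


# A blank between two identical tokens keeps them as separate outputs,
# while directly repeated tokens are merged into one.
position_ids = [0, 7, 7, 0, 0, 7, 3, 3, 0]
print(ctc_collapse(position_ids))  # [7, 7, 3]
```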

Total: ~2.25 B params. Editor quantized to 5-bit; encoder + projector remain bf16.

Quickstart

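from pathlib import Path

from mlx_audio.stt.utils import load_model

# Load the 5-bit model (repo id as given in this card).
model = load_model(Path("mlx-community/granite-speech-4.1-2b-nar-mlx-5bit"))

# Transcribe a single audio file; "audio.wav" is a placeholder path.
out = model.generate("audio.wav")
print(out.text)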

Limitations

Batch size 1 only.
No streaming inference.
macOS 14+, Apple Silicon (M-series).
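
Because only batch size 1 is supported, several files have to be transcribed one after another. The sketch below shows one way to do that while recording per-file wall-clock time. `transcribe_sequentially` is a hypothetical helper, and `fake_transcribe` is a stand-in so the example runs without the model; with the Quickstart model loaded, the callable could be `lambda p: model.generate(p).text`.

```python
import time
from typing import Callable, Iterable


def transcribe_sequentially(
    paths: Iterable[str],
    transcribe: Callable[[str], str],
) -> dict[str, tuple[str, float]]:
    """Transcribe files one at a time, returning {path: (text, seconds)}."""
    results: dict[str, tuple[str, float]] = {}
    for path in paths:
        start = time.perf_counter()
        text = transcribe(path)
        results[path] = (text, time.perf_counter() - start)
    return results


def fake_transcribe(path: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return f"transcript of {path}"


results = transcribe_sequentially(["a.wav", "b.wav"], fake_transcribe)
for path, (text, seconds) in results.items():
    print(f"{path}: {text!r} ({seconds:.4f} s)")
```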

Reference

Upstream model card: https://huggingface.co/ibm-granite/granite-speech-4.1-2b-nar
bf16 MLX variant: https://huggingface.co/mlx-community/granite-speech-4.1-2b-nar-mlx

Validated against the upstream PyTorch reference: exact transcript match on the bf16 baseline. The 5-bit variant matches the bf16 baseline at the word level (with the two minor punctuation differences listed above).

License

Apache-2.0, matching the upstream model.
