Granite Speech 4.1 2B NAR — MLX (5-bit)
5-bit quantized MLX port of ibm-granite/granite-speech-4.1-2b-nar for Apple Silicon. Runs via mlx-audio.
The bf16 baseline lives at mlx-community/granite-speech-4.1-2b-nar-mlx. This 5-bit variant trades a small amount of punctuation precision for a ~47% smaller weights file and a comparable runtime (1.48 s vs 1.33 s on the 24.9 s reference clip, about 11% slower).
Size and performance
| Variant | model.safetensors | Inference (24.9s audio, M-series) | RTF |
|---|---|---|---|
| bf16 (baseline) | 4.51 GB | 1.33 s | 18.7× real-time |
| 5-bit (this repo) | 2.37 GB | 1.48 s | 16.8× real-time |
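The last column and the quoted size reduction follow directly from the other columns. A quick check in Python, using only the figures from the table above (the helper names are illustrative):

```python
# Figures copied from the table above.
AUDIO_SECONDS = 24.9
VARIANTS = {
    "bf16": {"size_gb": 4.51, "inference_s": 1.33},
    "5-bit": {"size_gb": 2.37, "inference_s": 1.48},
}


def speed_factor(audio_s: float, inference_s: float) -> float:
    """Seconds of audio transcribed per second of compute."""
    return audio_s / inference_s


def size_reduction(baseline_gb: float, quantized_gb: float) -> float:
    """Fraction of the baseline file size saved by quantization."""
    return 1.0 - quantized_gb / baseline_gb


for name, v in VARIANTS.items():
    factor = speed_factor(AUDIO_SECONDS, v["inference_s"])
    print(f"{name}: {factor:.1f}x real-time")

saved = size_reduction(VARIANTS["bf16"]["size_gb"], VARIANTS["5-bit"]["size_gb"])
print(f"size reduction: {saved:.1%}")
```

Note that the RTF column reports a speed-up over real time (audio duration divided by inference time), so higher is better.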
Quantization is applied only to the editor's Linear projections (1.6 B params; attention_multiplier, embedding_multiplier, logits_scaling, and residual_multiplier are preserved as configured). The Conformer encoder (543 M params) and Q-Former projector (80 M params) stay at full precision because they are noise-sensitive and small enough that quantizing them would yield little memory benefit.
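One way to express this kind of selective quantization is a predicate that accepts only Linear layers under the editor. The sketch below is an illustration, not the conversion script used for this repo: the `editor.` path prefix and the layer paths are hypothetical, and stand-in classes are used so the snippet runs without MLX installed.

```python
def make_editor_predicate(linear_type, prefix="editor."):
    """Build a predicate that accepts only `linear_type` layers under `prefix`.

    `prefix` is a hypothetical module path; the real attribute name in the
    model implementation may differ.
    """
    def predicate(path, module):
        return path.startswith(prefix) and isinstance(module, linear_type)
    return predicate


# Stand-in layer classes so the predicate can be exercised without MLX.
class FakeLinear:
    pass


class FakeLayerNorm:
    pass


# Hypothetical module paths, for illustration only.
layers = {
    "encoder.layers.0.ff.linear": FakeLinear(),
    "projector.query_proj": FakeLinear(),
    "editor.layers.0.mlp.up_proj": FakeLinear(),
    "editor.layers.0.norm": FakeLayerNorm(),
}

predicate = make_editor_predicate(FakeLinear)
selected = [path for path, module in layers.items() if predicate(path, module)]
print(selected)  # only the editor's linear layer is selected
```

With MLX installed, a predicate of this shape could be built from `mlx.nn.Linear` and passed as the `class_predicate` argument of `mlx.nn.quantize`, which calls it with each module's path and the module itself. That assumes the installed MLX version supports 5-bit quantization; the group size used for this repo is not stated in this card.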
Quality drift vs bf16 on the multilingual reference sample is limited to:
- One added comma (`la nuit suivante,` vs `la nuit suivante`), arguably more correct French.
- Spacing inside a quoted dialogue (`"si vous` vs `" si vous`).
Word-level transcript is otherwise identical to the bf16 baseline. No characters or accents are corrupted (paraîtra, pêcheur, soeur all intact).
Architecture
Non-autoregressive ASR via CTC + bidirectional LM editing:
- 16-layer Conformer encoder (543 M params) produces an initial BPE CTC hypothesis.
- 2-layer windowed Q-Former projector (80 M params) converts multi-layer encoder states into audio embeddings.
- 40-layer bidirectional Granite editor (1.6 B params) takes `[audio | hypothesis_tokens]` and emits edited logits in a single forward pass, with no autoregression and no KV cache.
- Final CTC collapse on text-position logits yields the transcript.
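The final CTC collapse step can be sketched in a few lines. This is the standard greedy rule (merge consecutive repeats, then drop blanks) applied to per-position token ids, which in the real model would come from an argmax over the editor's text-position logits. The blank index of 0 is an assumption; the actual id depends on the tokenizer.

```python
from itertools import groupby

BLANK = 0  # assumed blank index, for illustration only


def ctc_collapse(token_ids, blank=BLANK):
    """Merge runs of identical ids, then remove blank tokens."""
    return [tok for tok, _ in groupby(token_ids) if tok != blank]


# Per-position ids; the blank between the two runs of 7 keeps them distinct.
position_ids = [0, 7, 7, 0, 0, 7, 3, 3, 3, 0]
print(ctc_collapse(position_ids))  # [7, 7, 3]
```

The collapsed ids would then be decoded to text by the model's tokenizer.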
Total: ~2.25 B params. Editor quantized to 5-bit; encoder + projector remain bf16.
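As a back-of-envelope check, the parameter counts above roughly reproduce the file sizes in the table. The sketch assumes a quantization group size of 64 with one 16-bit scale and one 16-bit bias per group, which is not stated in this card, and it ignores any editor parameters left unquantized.

```python
GB = 1e9  # decimal gigabytes

# Parameter counts from the architecture list above.
ENCODER_PARAMS = 543e6
PROJECTOR_PARAMS = 80e6
EDITOR_PARAMS = 1.6e9


def bf16_bytes(n_params: float) -> float:
    """bf16 stores each parameter in 2 bytes."""
    return n_params * 2


def quantized_bytes(n_params, bits=5, group_size=64, overhead_bits_per_group=32):
    """Packed weights plus per-group scale and bias (assumed 16 bits each)."""
    bits_per_weight = bits + overhead_bits_per_group / group_size
    return n_params * bits_per_weight / 8


baseline = bf16_bytes(ENCODER_PARAMS + PROJECTOR_PARAMS + EDITOR_PARAMS)
mixed = bf16_bytes(ENCODER_PARAMS + PROJECTOR_PARAMS) + quantized_bytes(EDITOR_PARAMS)

print(f"bf16 estimate:  {baseline / GB:.2f} GB")
print(f"5-bit estimate: {mixed / GB:.2f} GB")
```

Under these assumptions both estimates land within a few percent of the reported 4.51 GB and 2.37 GB; the remainder is plausibly rounding in the parameter counts and tensors that stay unquantized.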
Quickstart
```python
from pathlib import Path

from mlx_audio.stt.utils import load_model

# Load the 5-bit weights.
model = load_model(Path("mlx-community/granite-speech-4.1-2b-nar-mlx-5bit"))

# Transcribe a single audio file and print the text.
out = model.generate("audio.wav")
print(out.text)
```
Limitations
- Batch size 1 only.
- No streaming inference.
- macOS 14+, Apple Silicon (M-series).
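Given these limits, a caller may want to check the platform up front and transcribe multiple files one at a time. A minimal sketch, assuming only the `model.generate(path).text` interface shown in the Quickstart; both helper names are hypothetical:

```python
import platform


def is_supported_platform() -> bool:
    """True on Apple Silicon running macOS 14 or newer."""
    if platform.system() != "Darwin" or platform.machine() != "arm64":
        return False
    release = platform.mac_ver()[0]  # e.g. "14.5"; empty string off macOS
    major = release.split(".")[0]
    return major.isdigit() and int(major) >= 14


def transcribe_files(model, paths):
    """Transcribe files sequentially, since batching is not supported."""
    return {str(path): model.generate(str(path)).text for path in paths}
```

With a loaded model, `transcribe_files(model, ["a.wav", "b.wav"])` would return a dict mapping each path to its transcript.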
Reference
- Upstream model card: https://huggingface.co/ibm-granite/granite-speech-4.1-2b-nar
- bf16 MLX variant: https://huggingface.co/mlx-community/granite-speech-4.1-2b-nar-mlx
Validated against the upstream PyTorch reference: exact transcript match on the bf16 baseline. The 5-bit variant matches the bf16 baseline at the word level (with the two minor punctuation differences listed above).
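A word-level comparison of this kind can be approximated by stripping punctuation and whitespace differences before comparing. The sketch below is one simple way to do that, not necessarily the procedure used for the validation above:

```python
import re
import unicodedata


def words(text: str) -> list:
    """Split into word tokens, ignoring punctuation and spacing."""
    # NFC keeps accented letters as single code points so \w matches them whole.
    return re.findall(r"\w+", unicodedata.normalize("NFC", text))


def same_words(a: str, b: str) -> bool:
    """True when two transcripts contain the same word sequence."""
    return words(a) == words(b)


# The two punctuation differences reported for the 5-bit variant.
print(same_words("la nuit suivante,", "la nuit suivante"))  # True
print(same_words('"si vous', '" si vous'))                  # True
# A dropped accent is a real word-level difference.
print(same_words("pêcheur", "pecheur"))                     # False
```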
License
Apache-2.0, matching the upstream model.