diffusion-gemma-asr-small
Audio-native, multilingual speech recognition that transcribes through DiffusionGemma's own discrete-diffusion decoder: not autoregressive, and not an external ASR decoder. Audio is projected directly into the Gemma embedding space, and the transcript is produced by parallel diffusion denoising (~8-16 steps), giving real-time-or-better throughput where cost is set by the number of denoising steps, not the length of the transcript.
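To make the cost claim concrete, here is a minimal sketch that counts decoder forward passes. It assumes a plain autoregressive decoder needs one pass per output token, and a diffusion decoder needs one pass over the whole canvas per denoising step. The function names are illustrative, and the sketch counts passes only; the cost of a single pass differs between the two designs, so this is not a wall-clock comparison.

```python
def autoregressive_passes(transcript_tokens: int) -> int:
    """One decoder pass per generated token (assumes no speculative decoding)."""
    return transcript_tokens


def diffusion_passes(denoising_steps: int) -> int:
    """One decoder pass per denoising step, independent of transcript length."""
    return denoising_steps


# Pass counts for short, medium, and canvas-length transcripts at 16 steps.
for tokens in (20, 80, 192):
    print(tokens, autoregressive_passes(tokens), diffusion_passes(16))
```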
This repo ships the trained adapter only (projector + LoRA, ~42M params, about 0.16% of the model). The frozen 26B DiffusionGemma backbone and the frozen whisper-small encoder load from their own repos.
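The parameter share can be checked with quick arithmetic. This sketch assumes the denominator is the 26B backbone alone. The ~19M projector size is the figure quoted in this card's pipeline description, and the LoRA size is inferred by subtraction; the card does not state it.

```python
adapter_params = 42e6     # projector + LoRA, as quoted in this card
backbone_params = 26e9    # frozen DiffusionGemma backbone (assumed denominator)
projector_params = 19e6   # projector size quoted in the pipeline description

fraction = adapter_params / backbone_params
lora_params = adapter_params - projector_params  # inferred, not stated in the card

print(f"adapter share: {fraction:.2%}")                 # adapter share: 0.16%
print(f"implied LoRA size: ~{lora_params / 1e6:.0f}M")  # implied LoRA size: ~23M
```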
How it works
raw audio ─► whisper-small encoder (frozen) ─► projector (trained, ~19M)
          ─► scatter into <audio> token slots of DiffusionGemma's encoder
          ─► DiffusionGemma decoder denoises a 192-token canvas (bidirectional, cross-attends audio)
          ─► transcript
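The "scatter into `<audio>` token slots" step can be pictured with a toy example. This is a sketch under stated assumptions, not the code in model.py: the placeholder id `AUDIO_TOKEN_ID`, the function name `scatter_audio`, and the 2-dimensional embeddings are all hypothetical, and plain Python lists stand in for tensors.

```python
AUDIO_TOKEN_ID = 99  # hypothetical id of the <audio> placeholder token


def scatter_audio(token_ids, text_embeddings, audio_embeddings):
    """Replace the embedding at each <audio> slot with the next projected audio frame."""
    slots = [i for i, t in enumerate(token_ids) if t == AUDIO_TOKEN_ID]
    if len(slots) != len(audio_embeddings):
        raise ValueError("number of <audio> slots must match number of audio frames")
    merged = [list(e) for e in text_embeddings]  # copy so the input is not mutated
    for slot, frame in zip(slots, audio_embeddings):
        merged[slot] = list(frame)
    return merged


token_ids = [1, 99, 99, 99, 2]                     # e.g. <bos>, three <audio> slots, <eos>
text_emb = [[0.0, 0.0] for _ in token_ids]         # toy text embeddings
audio_emb = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # toy projector output
merged = scatter_audio(token_ids, text_emb, audio_emb)
print(merged)
```

Non-audio positions keep their text embeddings, so the backbone sees one ordinary embedding sequence in which some positions happen to carry audio.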
- Backbone: google/diffusiongemma-26B-A4B-it, frozen, with small LoRA adapters on encoder/decoder attention.
- Audio frontend: openai/whisper-small encoder, a frozen feature extractor (NOT a decoder).
- Grounding: trained with three losses: uniform-diffusion (the generator), an AR auxiliary, and a CTC loss on the projector via the frozen lm_head (the key unlock that makes the audio embeddings transcript-predictive).
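For readers unfamiliar with CTC, the standard rule for turning frame-level predictions into a token sequence is: merge consecutive repeats, then drop blanks. The sketch below shows only that collapse rule on made-up ids. It is not the training loss and not code from this repo, and the blank id of 0 is an assumption.

```python
BLANK = 0  # hypothetical blank id


def ctc_greedy_collapse(frame_ids, blank=BLANK):
    """Collapse per-frame ids: merge consecutive repeats, then remove blanks."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out


# A blank between two identical ids keeps them as two separate tokens.
print(ctc_greedy_collapse([0, 7, 7, 0, 7, 3, 3, 0]))  # [7, 7, 3]
```

Because the CTC loss is computed through the frozen lm_head, each audio frame is pushed toward an embedding that the backbone's own output layer already maps to the right token.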
Usage
Install
pip install torch peft soundfile librosa huggingface_hub \
"transformers @ git+https://github.com/huggingface/transformers.git" # DiffusionGemma support
Transcribe in Python
import sys, soundfile as sf
from huggingface_hub import snapshot_download
repo = snapshot_download("interfaze-ai/diffusion-gemma-asr-small") # this adapter (~170 MB)
sys.path.insert(0, repo)
from inference import load, transcribe # bundled in this repo
# Loads frozen DiffusionGemma-26B + whisper-small + this adapter (downloads bases on first run).
model, tok, fe = load(f"{repo}/diffusion_asr_small.pt", device="cuda")
wav, sr = sf.read("audio.wav", dtype="float32") # 16 kHz mono float32 (inference.py resamples if needed)
print(transcribe(wav, model, tok, fe, max_steps=16))
Or from the command line
python inference.py audio.wav # run inside the downloaded repo dir
Long audio is split at silence (the encoder has a 30 s window, like Whisper). max_steps trades
speed for accuracy: 8 is near-best and fastest, 16 is the default.
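The actual segmentation lives in the bundled inference.py and is not reproduced here. As a rough illustration of the idea only, the sketch below cuts a signal into windows no longer than a maximum length, placing each cut in the quietest short frame found in the second half of the window. The function name, the energy criterion, and the frame size are all assumptions.

```python
def split_at_silence(samples, sample_rate=16_000, max_seconds=30.0, frame_seconds=0.02):
    """Cut samples into segments of at most max_seconds, preferring quiet frames."""
    max_len = int(max_seconds * sample_rate)
    if max_len <= 0:
        raise ValueError("max_seconds * sample_rate must be at least one sample")
    frame = max(1, int(frame_seconds * sample_rate))
    segments, start = [], 0
    while len(samples) - start > max_len:
        # Search the second half of the window for the lowest-energy frame.
        lo, hi = start + max_len // 2, start + max_len
        best_cut, best_energy = hi, float("inf")
        for pos in range(lo, hi - frame + 1, frame):
            energy = sum(x * x for x in samples[pos:pos + frame])
            if energy < best_energy:
                best_cut, best_energy = pos + frame // 2, energy
        segments.append(samples[start:best_cut])
        start = best_cut
    segments.append(samples[start:])
    return segments


# Toy signal at 100 Hz: constant tone with two short silent gaps.
signal = [1.0] * 250
signal[80:90] = [0.0] * 10
signal[170:180] = [0.0] * 10
segments = split_at_silence(signal, sample_rate=100, max_seconds=1.0, frame_seconds=0.1)
print([len(s) for s in segments])
```

Each segment would then be transcribed independently and the texts joined in order.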
Languages & accuracy
Trained on FLEURS (6 languages) + LibriSpeech (en) + VoxPopuli (en/de/fr/es). WER/CER are Whisper-normalized (Open-ASR / Artificial-Analysis convention), 16 diffusion steps:
| benchmark | metric | score |
|---|---|---|
| LibriSpeech test-clean (en) | WER | 6.6% |
| FLEURS English | WER | 15.7% |
| VoxPopuli English | WER | 18.5% |
| FLEURS Hindi | CER | 15.8% |
| FLEURS Mandarin | CER | 29.6% |
Among diffusion / non-autoregressive ASR models it leads (6.6% on LibriSpeech vs Whisfusion's 8.3%, with a smaller encoder). It trails autoregressive Whisper, which reflects a training-data gap (~219 h seen), not the architecture.
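For reference, WER and CER are both edit distance divided by reference length, counted over words and characters respectively. The sketch below is a minimal pure-Python version. It does not implement the Whisper text normalization used for the scores above, so it will not reproduce those numbers on raw transcripts.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors over 6 words
print(cer("kitten", "sitting"))                             # 3 errors over 6 characters
```

CER is the usual choice for languages such as Mandarin, where text is not whitespace-delimited into words.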
Files
- diffusion_asr_small.pt: trained adapter ({"projector": ..., "lora": ...})
- model.py, audio.py: model definition (self-contained)
- inference.py: runnable example (load + segment + transcribe)
- requirements.txt
Requirements / licensing
- Needs transformers from main (DiffusionGemma support) + torch, peft.
- Base models load from their own repos under their licenses: google/diffusiongemma-26B-A4B-it (Gemma terms) and openai/whisper-small (MIT).
- This adapter: Apache-2.0.
Limitations
- Per-segment window is ≤30 s (encoder limit); long audio is chunked at silence, same as Whisper.
- Mandarin is the weakest language; more data is the lever.