diffusion-gemma-asr-small

Audio-native, multilingual speech recognition that transcribes through DiffusionGemma's own discrete-diffusion decoder rather than an autoregressive or external ASR decoder. Audio is projected directly into the Gemma embedding space, and the transcript is produced by parallel diffusion denoising (~8–16 steps). This gives faster-than-real-time throughput, with decoding cost set by the number of denoising steps, not the length of the transcript.
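To make the cost claim concrete, here is a rough sketch that counts decoder forward passes only. It assumes autoregressive decoding needs one pass per output token and diffusion decoding needs one pass per denoising step over the whole canvas. The function names are illustrative and not part of this repo, and the cost of a single pass differs between the two schemes.

```python
def ar_forward_passes(transcript_tokens: int) -> int:
    # Autoregressive decoding emits one token per decoder pass.
    return transcript_tokens


def diffusion_forward_passes(max_steps: int) -> int:
    # Diffusion decoding refines every canvas position in parallel,
    # so the pass count equals the number of denoising steps.
    return max_steps


for n_tokens in (20, 80, 160):
    print(n_tokens, ar_forward_passes(n_tokens), diffusion_forward_passes(16))
```

The diffusion column stays flat as the transcript grows, which is the behaviour described above.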

This repo ships the trained adapter only (projector + LoRA, ~42M params — 0.16% of the model). The frozen 26B DiffusionGemma backbone and the frozen whisper-small encoder load from their own repos.
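The figures quoted in this card can be cross-checked with a few lines of arithmetic. The parameter counts come from the card itself; the 4-bytes-per-parameter (fp32) storage is an assumption, not something the card states.

```python
adapter_params = 42e6     # projector + LoRA, from this card
projector_params = 19e6   # projector size, from the pipeline diagram
backbone_params = 26e9    # DiffusionGemma total parameters

lora_params = adapter_params - projector_params
fraction = adapter_params / backbone_params
size_mb_fp32 = adapter_params * 4 / 1e6   # assumption: 4 bytes per parameter

print(f"LoRA share: ~{lora_params / 1e6:.0f}M params")
print(f"adapter / backbone: {fraction:.2%}")
print(f"fp32 checkpoint: ~{size_mb_fp32:.0f} MB")
```

Under that assumption the checkpoint comes to roughly 168 MB, in line with the ~170 MB download size mentioned in the usage section.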

How it works

raw audio ─► whisper-small encoder (frozen) ─► projector (trained, ~19M)
          ─► scatter into <audio> token slots of DiffusionGemma's encoder
          ─► DiffusionGemma decoder denoises a 192-token canvas (bidirectional, cross-attends audio)
          ─► transcript
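The scatter step can be pictured as replacing the embedding at each <audio> placeholder with one projected audio frame. The sketch below uses plain lists instead of tensors; `scatter_audio` and `AUDIO_TOKEN_ID` are hypothetical names, and the actual code in model.py may differ.

```python
AUDIO_TOKEN_ID = -1  # hypothetical placeholder id for the <audio> token


def scatter_audio(input_ids, token_embeds, audio_embeds, audio_token_id=AUDIO_TOKEN_ID):
    """Replace the embedding at each <audio> slot with the next projected audio frame."""
    slots = [i for i, t in enumerate(input_ids) if t == audio_token_id]
    if len(slots) != len(audio_embeds):
        raise ValueError(f"{len(slots)} <audio> slots but {len(audio_embeds)} audio frames")
    out = [list(e) for e in token_embeds]  # copy so the text embeddings stay untouched
    for slot, frame in zip(slots, audio_embeds):
        out[slot] = list(frame)
    return out


# Toy example: 5 prompt positions, 2 of them <audio> slots, embedding width 3.
input_ids = [101, AUDIO_TOKEN_ID, AUDIO_TOKEN_ID, 7, 102]
token_embeds = [[0.0, 0.0, 0.0] for _ in input_ids]
audio_embeds = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # stand-in for projector output
mixed = scatter_audio(input_ids, token_embeds, audio_embeds)
print(mixed[1], mixed[2])
```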

Backbone: google/diffusiongemma-26B-A4B-it — frozen, small LoRA adapters on encoder/decoder attention.
Audio frontend: openai/whisper-small encoder — frozen feature extractor (NOT a decoder).
Grounding: trained with three losses — uniform-diffusion (the generator), an AR auxiliary, and a CTC loss on the projector via the frozen lm_head (the key unlock that makes the audio embeddings transcript-predictive).
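One way to read the CTC term: the projector's per-frame outputs are scored against the vocabulary through lm_head, and CTC relates that frame sequence to the transcript by merging repeats and dropping a blank symbol. Below is a minimal sketch of that standard collapse rule. The blank id is hypothetical, since this card doesn't say which token plays that role, and this isn't the training code.

```python
BLANK = 0  # hypothetical blank id; the card does not specify it


def ctc_collapse(frame_ids, blank=BLANK):
    """Standard CTC mapping: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out


# Nine frames collapse to three tokens; the blank between the two 5s keeps them distinct.
frames = [0, 5, 5, 0, 0, 5, 9, 9, 0]
print(ctc_collapse(frames))  # [5, 5, 9]
```

Because many frame sequences collapse to the same transcript, the loss can push each audio frame toward a token without needing a frame-level alignment.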

Usage

Install

pip install torch peft soundfile librosa huggingface_hub \
  "transformers @ git+https://github.com/huggingface/transformers.git"   # DiffusionGemma support

Transcribe in Python

import sys, soundfile as sf
from huggingface_hub import snapshot_download

repo = snapshot_download("interfaze-ai/diffusion-gemma-asr-small")   # this adapter (~170 MB)
sys.path.insert(0, repo)
from inference import load, transcribe                       # bundled in this repo

# Loads frozen DiffusionGemma-26B + whisper-small + this adapter (downloads bases on first run).
model, tok, fe = load(f"{repo}/diffusion_asr_small.pt", device="cuda")

wav, sr = sf.read("audio.wav", dtype="float32")   # 16 kHz mono expected (inference.py resamples if needed)
print(transcribe(wav, model, tok, fe, max_steps=16))

Or from the command line

python inference.py audio.wav        # run inside the downloaded repo dir

Long audio is split at silence (the encoder has a 30 s window, like Whisper). max_steps trades speed for accuracy — 8 is near-best and fastest, 16 is the default.
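Silence-based chunking could be sketched as follows: for each window, cut at the quietest short frame found near the end of the 30 s limit. This is an illustration in pure Python, not the segmentation code in inference.py; `split_at_silence` and its parameters are hypothetical.

```python
def split_at_silence(samples, sr=16000, max_s=30.0, frame_s=0.02, search_s=5.0):
    """Cut `samples` into chunks shorter than max_s seconds, preferring quiet frames.

    For each chunk, the cut goes at the lowest-energy frame found in the
    last `search_s` seconds of the window.
    """
    frame = max(1, int(sr * frame_s))
    max_len = int(sr * max_s)
    search = int(sr * search_s)
    chunks, start = [], 0
    while len(samples) - start > max_len:
        hi = start + max_len
        lo = max(start + frame, hi - search)
        best_cut, best_energy = hi, float("inf")
        for pos in range(lo, hi - frame + 1, frame):
            energy = sum(x * x for x in samples[pos:pos + frame])
            if energy < best_energy:
                best_cut, best_energy = pos + frame // 2, energy
        chunks.append(samples[start:best_cut])
        start = best_cut
    chunks.append(samples[start:])
    return chunks


# Toy signal at a low sample rate so the pure-Python loop stays fast.
sr = 100
samples = [1.0] * (70 * sr)                 # 70 s of constant "speech"
samples[28 * sr:28 * sr + 20] = [0.0] * 20  # a short pause at 28 s
chunks = split_at_silence(samples, sr=sr)
print([round(len(c) / sr, 2) for c in chunks])
```

Each chunk can then be passed to transcribe on its own and the results joined. For real 16 kHz audio a vectorized energy computation would be much faster than this loop.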

Languages & accuracy

Trained on FLEURS (6 languages) + LibriSpeech (en) + VoxPopuli (en/de/fr/es). WER/CER are Whisper-normalized (Open-ASR / Artificial-Analysis convention), 16 diffusion steps:

benchmark	metric	score
LibriSpeech test-clean (en)	WER	6.6%
FLEURS English	WER	15.7%
VoxPopuli English	WER	18.5%
FLEURS Hindi	CER	15.8%
FLEURS Mandarin	CER	29.6%
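For reference, WER and CER are edit distances divided by the reference length, computed over words and characters respectively. The sketch below uses a crude lowercase-and-strip-punctuation step as a stand-in for the Whisper normalizer, so its numbers won't match the table exactly. All function names are illustrative.

```python
import re


def normalize(text):
    # Crude stand-in for the Whisper normalizer: lowercase, strip punctuation.
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def edit_distance(ref, hyp):
    # Levenshtein distance over two sequences, using a single rolling row.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            prev, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, prev + (r != h))
    return row[-1]


def wer(ref, hyp):
    r, h = normalize(ref), normalize(hyp)
    return edit_distance(r, h) / len(r)


def cer(ref, hyp):
    r, h = "".join(normalize(ref)), "".join(normalize(hyp))
    return edit_distance(r, h) / len(r)


# One dropped word out of six reference words.
print(f"{wer('The cat sat on the mat.', 'the cat sat on mat'):.3f}")
```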

Among diffusion / non-autoregressive ASR systems it leads on LibriSpeech test-clean (6.6% WER vs Whisfusion's 8.3%, with a smaller encoder). It trails autoregressive Whisper, a gap attributed to training data (~219 h seen) rather than to the architecture.

Files

diffusion_asr_small.pt — trained adapter ({"projector": ..., "lora": ...})
model.py, audio.py — model definition (self-contained)
inference.py — runnable example (load + segment + transcribe)
requirements.txt

Requirements / licensing

Needs transformers from main (DiffusionGemma support) + torch, peft.
Base models load from their own repos under their licenses: google/diffusiongemma-26B-A4B-it (Gemma terms) and openai/whisper-small (MIT).
This adapter: Apache-2.0.

Limitations

Per-segment window is ≤30 s (encoder limit) — long audio is chunked at silence, same as Whisper.
Mandarin is the weakest language (29.6% CER on FLEURS); more training data is the most likely fix.
