Gemma 4 12B Unified zero-shot ASR underperforms Gemma 4 E4B on FLEURS
Hi, we are trying to reproduce and understand the ASR behavior of google/gemma-4-12B-it.
The model card reports strong FLEURS performance for Gemma 4 12B Unified, with the note that Chinese languages are excluded. However, in our local zero-shot ASR tests, Gemma 4 12B Unified performs substantially worse than Gemma 4 E4B, even when Chinese/Cantonese are excluded.
Environment
- Model: `google/gemma-4-12B-it`
- Transformers: `5.10.1`
- PyTorch: `2.12.0+cu130`
- Processor class: `Gemma4UnifiedProcessor`
- Model class: `AutoModelForMultimodalLM`
- Hardware: NVIDIA A800 80GB
- Audio: FLEURS `.wav`, 16 kHz, mono, under 30 seconds
Reproduction Path
We tested the exact Hugging Face model-card audio example path:
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("google/gemma-4-12B-it")
model = AutoModelForMultimodalLM.from_pretrained(
    "google/gemma-4-12B-it",
    dtype="auto",
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.",
            },
            {"type": "audio", "audio": "/path/to/audio.wav"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
prediction = processor.parse_response(response)
We also tested the ASR prompt from the Best Practices section:
Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.
Follow these specific instructions for formatting the answer:
* Only output the transcription, with no newlines.
* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.
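To make the second setup easier to reproduce, here is a minimal sketch of how the `{LANGUAGE}` placeholder can be filled and wrapped in the same message structure as the model-card example. The names `ASR_PROMPT_TEMPLATE` and `build_asr_messages` are hypothetical helpers introduced for illustration, not part of Transformers.

```python
# Hypothetical helper: fills the {LANGUAGE} placeholders of the ASR prompt
# quoted above and builds a chat message in the model-card layout
# (text part first, then the audio part).
ASR_PROMPT_TEMPLATE = (
    "Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.\n"
    "Follow these specific instructions for formatting the answer:\n"
    "* Only output the transcription, with no newlines.\n"
    "* When transcribing numbers, write the digits, i.e. write 1.7 and not "
    "one point seven, and write 3 instead of three."
)


def build_asr_messages(language: str, audio_path: str) -> list[dict]:
    """Return a single-turn user message for one audio file."""
    prompt = ASR_PROMPT_TEMPLATE.format(LANGUAGE=language)
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "audio", "audio": audio_path},
            ],
        }
    ]
```

The returned list can be passed to `processor.apply_chat_template` in place of `messages` in the example above.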
Observed Results
On a 5-language FLEURS subset:
| Setup | CMN CER | YUE CER | EN WER | JA CER | KO CER |
|---|---|---|---|---|---|
| Gemma 4 12B, exact HF audio example | 1.4648 | 3.2516 | 0.2906 | 1.1045 | 1.4916 |
| Gemma 4 12B, ASR prompt + output cleanup | 1.0795 | 1.2376 | 0.1854 | 0.2603 | 0.0763 |
| Gemma 4 E4B fine-tuned baseline | 0.0885 | 0.2283 | 0.0641 | 0.0693 | 0.0471 |
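Even excluding Chinese/Cantonese and evaluating only English/Japanese/Korean, Gemma 4 12B remains much worse than E4B in our tests: with the ASR prompt and output cleanup, EN WER is 0.1854 vs. 0.0641, JA CER is 0.2603 vs. 0.0693, and KO CER is 0.0763 vs. 0.0471.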
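Several values in the table exceed 1.0. That is possible because error rates are edit distance divided by reference length, so a hypothesis much longer than the reference (for example, one with a long repeated suffix) can produce more errors than there are reference units. The sketch below shows the standard definition; stripping spaces before computing CER is an assumption made here for illustration, and the exact scoring and normalization may differ from any official recipe.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; spaces are removed first (an assumption)."""
    ref = list(reference.replace(" ", ""))
    hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# A hypothesis that repeats the reference three times needs 6 insertions
# against a 3-character reference, so the CER is 2.0.
print(cer("abc", "abcabcabc"))
```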
Failure Patterns
We observed frequent non-ASR outputs from 12B:
- Empty or near-empty output such as `.`
- Refusals such as `I cannot fulfill this request`
- Mistaking audio for image/video/silent input
- Emitting `<channel|>`, `<turn|>`, or thinking-style text
- Long repeated suffixes, especially for Chinese/Cantonese/Japanese
- Occasionally outputting phonetic/IPA-like text instead of Korean transcription
The any-to-any pipeline showed similar issues on some samples, so the problem does not appear to be caused solely by our custom evaluation wrapper.
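Some of the failure patterns above can be flagged mechanically. The following is a minimal sketch of such a check, not the exact cleanup behind the table; `clean_asr_output`, `collapse_repeated_suffix`, and the thresholds are hypothetical, and the marker and refusal strings are only the ones quoted in the list.

```python
import re

# Strings taken from the failure list above; extend as needed.
CONTROL_MARKERS = ("<channel|>", "<turn|>")
REFUSAL_PHRASES = ("i cannot fulfill this request",)


def collapse_repeated_suffix(text: str, min_unit: int = 2, min_repeats: int = 3) -> str:
    """Collapse a trailing unit repeated at least min_repeats times to one copy."""
    for unit_len in range(min_unit, len(text) // min_repeats + 1):
        unit = text[-unit_len:]
        count, pos = 0, len(text)
        while pos >= unit_len and text[pos - unit_len:pos] == unit:
            count += 1
            pos -= unit_len
        if count >= min_repeats:
            return text[:pos] + unit
    return text


def clean_asr_output(raw: str) -> tuple[str, list[str]]:
    """Return (cleaned_text, flags) for one raw model output."""
    flags = []
    text = raw
    if any(marker in text for marker in CONTROL_MARKERS):
        flags.append("control_marker")
        for marker in CONTROL_MARKERS:
            text = text.replace(marker, " ")
    text = re.sub(r"\s+", " ", text).strip()
    collapsed = collapse_repeated_suffix(text)
    if collapsed != text:
        flags.append("repeated_suffix")
        text = collapsed
    if any(phrase in text.lower() for phrase in REFUSAL_PHRASES):
        flags.append("refusal")
    if not text.strip(" ."):
        flags.append("empty")
    return text, flags
```

Note that the suffix heuristic can also shorten legitimate repetition in the audio, so flagged samples are better inspected than silently rewritten.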
Questions
Could you clarify the official FLEURS ASR evaluation recipe for Gemma 4 12B Unified?
Specifically:
- Which exact languages are included or excluded?
- What prompt is used for ASR evaluation?
- What generation parameters are used?
- Is `enable_thinking` disabled during ASR evaluation?
- What output post-processing / normalization is applied?
- Is the public Transformers `AutoModelForMultimodalLM` path expected to reproduce the reported FLEURS number?
- Is Gemma 4 12B Unified expected to be stronger than E4B for zero-shot ASR, or is E4B’s dedicated audio encoder expected to be more stable?
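To illustrate why the normalization question matters, here is a sketch of one common style of text normalization applied to both reference and hypothesis before scoring. It is an assumption for illustration only, not the official recipe; `normalize_for_scoring` is a hypothetical name.

```python
import unicodedata


def normalize_for_scoring(text: str) -> str:
    """NFKC-normalize, lowercase, drop punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("P"):
            continue  # drop all Unicode punctuation, including CJK marks
        kept.append(" " if category.startswith("Z") else ch)
    return " ".join("".join(kept).split())


print(normalize_for_scoring("Hello, World!"))
```

This naive rule also removes the decimal point in `1.7`, so a real recipe would need to treat numbers explicitly; differences like this can shift WER/CER noticeably.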
Thanks. We are happy to provide specific sample IDs and outputs if useful.