Gemma 4 12B Unified zero-shot ASR underperforms Gemma 4 E4B on FLEURS

#6
by amor123hyk - opened


Hi, we are trying to reproduce and understand the ASR behavior of google/gemma-4-12B-it.

The model card reports strong FLEURS performance for Gemma 4 12B Unified, with the note that Chinese languages are excluded. However, in our local zero-shot ASR tests, Gemma 4 12B Unified performs substantially worse than Gemma 4 E4B, even when Chinese/Cantonese are excluded.

Environment

  • Model: google/gemma-4-12B-it
  • Transformers: 5.10.1
  • PyTorch: 2.12.0+cu130
  • Processor class: Gemma4UnifiedProcessor
  • Model class: AutoModelForMultimodalLM
  • Hardware: NVIDIA A800 80GB
  • Audio: FLEURS .wav, 16 kHz, mono, under 30 seconds

Reproduction Path

We first ran the audio example from the Hugging Face model card, unchanged:

from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("google/gemma-4-12B-it")
model = AutoModelForMultimodalLM.from_pretrained(
    "google/gemma-4-12B-it",
    dtype="auto",
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.",
            },
            {"type": "audio", "audio": "/path/to/audio.wav"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

input_len = inputs["input_ids"].shape[-1]
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
prediction = processor.parse_response(response)

We also tested the ASR prompt from the Best Practices section:

Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.

Follow these specific instructions for formatting the answer:
* Only output the transcription, with no newlines.
* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.
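For anyone reproducing this, the template above can be filled per language and dropped into the same message structure as the model-card example. This is only a sketch: `ASR_PROMPT_TEMPLATE` and `build_asr_messages` are hypothetical names, and passing the language as an English name such as "Korean" is an assumption, since the Best Practices text does not say which form `{LANGUAGE}` should take.

```python
# Hypothetical helper: fills the Best Practices ASR prompt for one language.
ASR_PROMPT_TEMPLATE = (
    "Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.\n"
    "\n"
    "Follow these specific instructions for formatting the answer:\n"
    "* Only output the transcription, with no newlines.\n"
    "* When transcribing numbers, write the digits, i.e. write 1.7 and not "
    "one point seven, and write 3 instead of three."
)


def build_asr_messages(language: str, audio_path: str) -> list[dict]:
    """Return a single-turn chat message list with the prompt followed by the audio."""
    prompt = ASR_PROMPT_TEMPLATE.format(LANGUAGE=language)
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "audio", "audio": audio_path},
            ],
        }
    ]
```

The returned list can be passed to `processor.apply_chat_template` in place of `messages` in the example above.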

Observed Results

On a 5-language FLEURS subset:

| Setup | CMN CER | YUE CER | EN WER | JA CER | KO CER |
|---|---|---|---|---|---|
| Gemma 4 12B, exact HF audio example | 1.4648 | 3.2516 | 0.2906 | 1.1045 | 1.4916 |
| Gemma 4 12B, ASR prompt + output cleanup | 1.0795 | 1.2376 | 0.1854 | 0.2603 | 0.0763 |
| Gemma 4 E4B fine-tuned baseline | 0.0885 | 0.2283 | 0.0641 | 0.0693 | 0.0471 |

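Even excluding Chinese/Cantonese and evaluating only English/Japanese/Korean, Gemma 4 12B remains much worse than E4B in our tests: its best configuration (ASR prompt + output cleanup) reaches 0.1854 EN WER, 0.2603 JA CER, and 0.0763 KO CER, against 0.0641, 0.0693, and 0.0471 for E4B.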
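For readers wondering how a CER can exceed 1.0: error rates are edit distance divided by reference length, so a hypothesis much longer than the reference (for example, one with a long repeated suffix) can score well above 1. The following is a generic sketch of that computation, not the scoring script used for the numbers above; stripping spaces before computing CER is an assumption.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units: list[str], hyp_units: list[str]) -> float:
    """Edit distance normalized by reference length; can exceed 1.0."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(ref: str, hyp: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    return error_rate(ref.split(), hyp.split())


def cer(ref: str, hyp: str) -> float:
    """Character error rate; spaces are removed first (an assumption)."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))
```

For instance, `cer("abc", "abcabcabcabc")` is 9 insertions over a 3-character reference, which gives 3.0.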

Failure Patterns

We observed frequent non-ASR outputs from 12B:

  • Empty or near-empty output, such as a single "."
  • Refusals such as "I cannot fulfill this request"
  • Mistaking audio for image/video/silent input
  • Emitting <channel|>, <turn|>, or thinking-style text
  • Long repeated suffixes, especially for Chinese/Cantonese/Japanese
  • Occasionally outputting phonetic/IPA-like text instead of Korean transcription

The any-to-any pipeline showed similar issues on some samples, so the problem does not appear to be caused solely by our custom evaluation wrapper.
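Several of the failure patterns above can be counted automatically, which makes per-language comparisons easier. The heuristics below are a rough sketch: the function names, regular expressions, and thresholds are all hypothetical and are not the cleanup step used for the results in this post.

```python
import re

# Matches control-style tokens containing a pipe, e.g. <channel|> or <turn|>.
SPECIAL_TOKEN_RE = re.compile(r"<[^<>\s]*\|[^<>\s]*>")
# Very rough English-only refusal detector (an assumption, easy to extend).
REFUSAL_RE = re.compile(r"\bI (?:cannot|can't|am unable to)\b", re.IGNORECASE)


def has_repeated_suffix(text: str, min_unit: int = 2, max_unit: int = 30,
                        min_repeats: int = 4) -> bool:
    """True if the text ends with one unit repeated at least min_repeats times."""
    for unit_len in range(min_unit, max_unit + 1):
        span = unit_len * min_repeats
        if len(text) < span:
            break  # longer units cannot fit either
        unit = text[-unit_len:]
        if text[-span:] == unit * min_repeats:
            return True
    return False


def tag_failures(output: str) -> list[str]:
    """Return heuristic failure tags for one raw model output."""
    tags = []
    stripped = output.strip()
    if not stripped.strip(".。 "):
        tags.append("empty")
    if SPECIAL_TOKEN_RE.search(output):
        tags.append("special_token")
    if REFUSAL_RE.search(output):
        tags.append("refusal")
    if has_repeated_suffix(stripped):
        tags.append("repeated_suffix")
    return tags
```

Legitimate transcripts can trip these checks (a number ending in many zeros looks like a repeated suffix), so flagged samples are worth inspecting by hand rather than discarding automatically.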

Questions

Could you clarify the official FLEURS ASR evaluation recipe for Gemma 4 12B Unified?

Specifically:

  1. Which exact languages are included or excluded?
  2. What prompt is used for ASR evaluation?
  3. What generation parameters are used?
  4. Is enable_thinking disabled during ASR evaluation?
  5. What output post-processing / normalization is applied?
  6. Is the public Transformers AutoModelForMultimodalLM path expected to reproduce the reported FLEURS number?
  7. Is Gemma 4 12B Unified expected to be stronger than E4B for zero-shot ASR, or is E4B’s dedicated audio encoder expected to be more stable?
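To make question 5 concrete, this is the kind of text normalization that is often applied to both reference and hypothesis before scoring. It is purely illustrative: `normalize_for_scoring` is a hypothetical helper, and nothing here is known to match the official evaluation or the numbers reported above.

```python
import unicodedata


def normalize_for_scoring(text: str, lowercase: bool = True) -> str:
    """Apply NFKC, optional lowercasing, punctuation removal, and whitespace collapse."""
    # NFKC folds full-width digits and letters into their ASCII forms.
    text = unicodedata.normalize("NFKC", text)
    if lowercase:
        text = text.lower()
    # Drop every character whose Unicode category starts with "P" (punctuation).
    # Note: this also removes decimal points and apostrophes.
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())
```

Because this strips decimal points ("1.7" becomes "17"), it interacts with the digit-formatting instruction in the prompt, which is one reason the exact normalization matters for reproducing a reported score.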

Thanks. We are happy to provide specific sample IDs and outputs if useful.
