Suddenly all my transcriptions are in English

#71
by WWCF - opened

All my transcriptions are now output in English, regardless of the source language, whenever no language parameter is provided in generate_kwargs.
I haven't changed anything in my pipeline code for at least two weeks.
I now have to manually set the source language in generate_kwargs to get the right transcription.

The same just happened to me. Not sure why.

If anyone here is not sure how to set the output to a specific language, set generate_kwargs in your pipeline call, like this:

pipe = pipeline(
  # ... other pipeline arguments
  generate_kwargs={"language": "portuguese"},
)

Just change "portuguese" to the language you want.

I fixed it by adding generate_kwargs={"task": "transcribe"} without the language param. This way it works as before.

I think it somehow defaults to "task": "translate".
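
For completeness, a minimal sketch of that workaround (the model id and audio path are just placeholders, adjust them to your setup):

from transformers import pipeline

pipe = pipeline(
  "automatic-speech-recognition",
  model="openai/whisper-large-v3",
  generate_kwargs={"task": "transcribe"},  # transcribe in the detected source language
)

print(pipe("audio.mp3")["text"])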

The same problem happens with the Inference API. I connected the Inference API for this model to my website, and now the output is always translated into English even though the audio is not in English. Do you have any suggestions for solving this?

I think I can reproduce it, but even Transformers 4.35 defaults to English when transcribing:

from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch
from datasets import load_dataset, Audio

model_id = "openai/whisper-small"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16)
model.to("cuda")

ds = load_dataset("mozilla-foundation/common_voice_11_0", "de", streaming=True, split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
sample = next(iter(ds))["audio"]

input_features = processor(sample["array"], sampling_rate=16_000, return_tensors="pt").input_features
input_features = input_features.to("cuda", dtype=torch.float16)

predicted_ids = model.generate(input_features)

# decode token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)

print(transcription)

Do we expect the above code snippet to default to German?
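
For comparison, forcing the language and task explicitly on the same snippet would look like this (a sketch using the language/task arguments of Whisper's generate):

# force German transcription instead of relying on the checkpoint defaults
predicted_ids = model.generate(input_features, language="de", task="transcribe")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
print(transcription)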

cc @sanchit-gandhi

Ah ok it's not related to Transformers, but to this commit: https://huggingface.co/openai/whisper-large-v3/commit/acdaf855f7daa392007ed7f001b30bb7a859eb69.

What happened here is the following. When we added Whisper-v3, in my opinion we had a bug in generation_config.json, because the default behavior didn't align with v2 or any other Whisper checkpoint (which is why I changed the config here: https://huggingface.co/openai/whisper-large-v3/commit/acdaf855f7daa392007ed7f001b30bb7a859eb69).
This, however, changed the default behavior when transcribing non-English audio, hence the issue raised here by the community.

In my opinion, both default behaviors make sense:
a) Translating non-English audio to English text (the default behavior of every checkpoint from whisper-tiny to whisper-large-v2, and since this commit also the default behavior of v3).
b) Transcribing in the original language of the audio (the default behavior of whisper-v3 only, before this commit).

To me both a) and b) would make sense, but I'd tend to stick with a), because this way all Whisper checkpoints behave the same way, even if it means a change in the default behavior of whisper-v3. Either behavior can also be requested explicitly, as sketched below.
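
A sketch of how either behavior can be requested explicitly (the language "german" is just an example):

from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")

# a) translate German speech into English text
translate_ids = processor.get_decoder_prompt_ids(language="german", task="translate")

# b) transcribe German speech in German
transcribe_ids = processor.get_decoder_prompt_ids(language="german", task="transcribe")

# either list can be passed as model.generate(..., forced_decoder_ids=...),
# as in the older model card examples
print(translate_ids)
print(transcribe_ids)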

Thoughts?

To get the same behavior as before, @WWCF is correct: one should pass task="transcribe" to make sure the audio is transcribed in the language of the audio (no need to specify the target language in this case, Whisper will detect it automatically). This is the same thing one should do for whisper-v2, whisper-medium, ..., whisper-tiny.
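
At the model level, the equivalent of the pipeline workaround would be something like this (a sketch, reusing model, processor and input_features from the snippet earlier in the thread):

# task only, no language: Whisper detects the language and transcribes in it
predicted_ids = model.generate(input_features, task="transcribe")
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))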

The behaviour in the original codebase is to always transcribe in the original language: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/transcribe.py#L154

The forced decoder ids in the generation config for large-v2 are set for automatic language detection and transcription: https://huggingface.co/openai/whisper-large-v2/blob/a3710f8deb6f932b0c5e5b213ab1f11f736fdc70/generation_config.json#L106

=> so these forced decoder ids are correct. However, the forced decoder ids in the config file are incorrect, forcing speech translation into English: https://huggingface.co/openai/whisper-large-v2/blob/a3710f8deb6f932b0c5e5b213ab1f11f736fdc70/config.json#L29
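
For anyone who wants to check this locally, a quick sketch (pinning the revision from the links above; on newer revisions of the repo these values may have changed):

from transformers import AutoConfig, GenerationConfig

model_id = "openai/whisper-large-v2"
revision = "a3710f8deb6f932b0c5e5b213ab1f11f736fdc70"  # commit linked above

config = AutoConfig.from_pretrained(model_id, revision=revision)
generation_config = GenerationConfig.from_pretrained(model_id, revision=revision)

# config.json: forces speech translation into English (the incorrect default)
print("config.json:           ", config.forced_decoder_ids)
# generation_config.json: set for automatic language detection and transcription
print("generation_config.json:", generation_config.forced_decoder_ids)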

When we migrated the forced ids from the config.json to the generation_config.json, we were careful to respect the Transformers convention of prioritising the generation config over the config: https://github.com/huggingface/transformers/pull/21965

=> this would have preserved the correct behaviour of speech transcription, since we'd use the forced ids from the generation config

E.g. if you run transformers==4.26.0, you get the correct behaviour:

pip install transformers==4.26.0

And then:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset, Audio

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")

dataset = load_dataset("mozilla-foundation/common_voice_16_1", "de", split="validation", streaming=True)
dataset = dataset.cast_column("audio", Audio(16_000))

sample = next(iter(dataset))
input_features = processor(sample["audio"]["array"], return_tensors="pt").input_features

pred_ids = model.generate(input_features)
pred_text = processor.batch_decode(pred_ids)
print(pred_text)

Gives:

['<|startoftranscript|><|de|><|transcribe|><|notimestamps|> Seine Gebine rohnen heute auf dem Friedhof von Alpenminister bei Gemunden.<|endoftext|>']

It looks like the priority of the generation config over the config was mistakenly swapped in this PR: https://github.com/huggingface/transformers/pull/22496
Since then, we favour the config over the generation config, meaning we take the forced decoder ids that give English translation (rather than transcription in the detected language).

So there are two bugs going on here:

  1. In transformers, we should prioritise generation config over config
  2. In large-v3, we should change the forced decoder ids in the generation config to match those from large-v2 (i.e. automatic language detection)

OK, after some discussion it was decided to revert to the original Whisper-v3 behavior, which is to automatically detect the language and then transcribe the audio in the detected language. This will become the default behavior for all Whisper models starting with Transformers v4.38.

See: https://huggingface.co/openai/whisper-large-v3/discussions/75
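
For anyone who was relying on the old English-translation default, the explicit way to keep it under the new default would be something like this (a sketch, with a placeholder audio path):

from transformers import pipeline

pipe = pipeline(
  "automatic-speech-recognition",
  model="openai/whisper-large-v3",
  generate_kwargs={"task": "translate"},  # always translate non-English audio to English
)

print(pipe("audio.mp3")["text"])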
