`decoder_start_token_id` discrepancy

#1 by b0r3k

Hey,

thanks for the model!

I have noticed a discrepancy: both your mikr/whisper-large-v3-czech-cv13/generation_config.json and the original openai/whisper-large-v3/config.json state:

"decoder_start_token_id": 50258,

while your mikr/whisper-large-v3-czech-cv13/config.json states:

"decoder_start_token_id": 50257,

Is that intentional? Which token should be used as the decoder_start_token_id? I believe the values in the two configs should match.
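
For reference, the mismatch can be checked directly (a minimal sketch using the config classes from transformers):

from transformers import GenerationConfig, WhisperConfig

model_path = "mikr/whisper-large-v3-czech-cv13"
config = WhisperConfig.from_pretrained(model_path)
gen_config = GenerationConfig.from_pretrained(model_path)
print(config.decoder_start_token_id)      # 50257
print(gen_config.decoder_start_token_id)  # 50258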

If 50257 is the correct value, maybe it should also be fixed somewhere in the tokenizer, because:

from transformers import WhisperTokenizerFast
tokenizer = WhisperTokenizerFast.from_pretrained("mikr/whisper-large-v3-czech-cv13")
ids = tokenizer("Hello")["input_ids"]
print(ids)
print(tokenizer.decode(ids))

prints:

[50258, 50283, 50360, 50364, 15947, 50257]
<|startoftranscript|><|cs|><|transcribe|><|notimestamps|>Hello<|endoftext|>

This means the IDs are in the wrong format for use with e.g. teacher forcing. I've checked, and tokenizer.build_inputs_with_special_tokens also prepends the 50258 token.
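
To make the two IDs concrete, they can be looked up in the tokenizer (the values match the output above):

print(tokenizer.convert_tokens_to_ids("<|startoftranscript|>"))  # 50258
print(tokenizer.convert_tokens_to_ids("<|endoftext|>"))          # 50257

So 50257 is the end-of-text token, which seems like an odd choice for starting the decoder.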

Hello,

I did some more testing, and apparently 50257 is the wrong decoder_start_token_id. The following code:

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_path = "mikr/whisper-large-v3-czech-cv13"
processor = WhisperProcessor.from_pretrained(model_path)
processor.tokenizer.set_prefix_tokens(task="transcribe", language="cs", predict_timestamps=False)
model = WhisperForConditionalGeneration.from_pretrained(model_path)

# `dataset` is assumed to be loaded beforehand (audio plus a "normalized" transcript column)
audio = dataset[1]["audio"]
in_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
labels = processor.tokenizer(dataset[1]["normalized"], return_tensors="pt", add_special_tokens=True).input_ids
labels = labels[:, 1:]  # Drop the leading 50258 added by the tokenizer; the model prepends decoder_start_token_id itself
res = model(input_features=in_features, labels=labels)
print(torch.argmax(res.logits, dim=-1))

produces just a sequence of 50258 (<|startoftranscript|>) tokens:

tensor([[50258, 50258, 50258, 50258, 50258, 50258, 50258, 50258]])

When given labels, Whisper prepends self.config.decoder_start_token_id to the decoder input (snippet from the transformers source on GitHub):

decoder_input_ids = shift_tokens_right(labels, self.config.pad_token_id, self.config.decoder_start_token_id)

So in this case it prepends 50257, as per your config, and produces the wrong output.
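
For context, shift_tokens_right does roughly the following (a simplified sketch of the transformers helper, not the exact source):

import torch

def shift_tokens_right(labels, pad_token_id, decoder_start_token_id):
    # Shift labels one position to the right and prepend the decoder start token
    shifted = labels.new_zeros(labels.shape)
    shifted[:, 1:] = labels[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    # Replace label padding (-100) with the real pad token in the decoder input
    shifted.masked_fill_(shifted == -100, pad_token_id)
    return shifted

With decoder_start_token_id = 50257, the decoder is therefore started from <|endoftext|> instead of <|startoftranscript|>.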

If I manually set the decoder_start_token_id first:

model.config.decoder_start_token_id = 50258

and run the same code, I get the correct output.

I believe it would be desirable to fix decoder_start_token_id to 50258 in the model's config.json.
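
On the repo side, the fix could be as simple as the following (assuming write access to the repo):

from transformers import WhisperConfig

config = WhisperConfig.from_pretrained("mikr/whisper-large-v3-czech-cv13")
config.decoder_start_token_id = 50258
config.push_to_hub("mikr/whisper-large-v3-czech-cv13")  # uploads the corrected config.json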
