Add "<|startoftranscript|>" to forced decoder ids

#14
Opened by sanchit-gandhi (HF staff)

This PR replaces the forced decoder ids <|translate|><|notimestamps|> with <|startoftranscript|><|en|><|transcribe|><|notimestamps|>.

That's a pretty big change, and you are also adding more tokens.
I think the reason we only have the two tokens by default is for testing purposes. I agree that, depending on the usage, we should rather hard-code them in the tests.

Also, the reason we don't have <|startoftranscript|> in forced_decoder_ids is that it is already set via decoder_start_token_id.
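
To illustrate the point about decoder_start_token_id, here is a minimal sketch of how the forced prefix could be thought of: position 0 comes from decoder_start_token_id, and forced_decoder_ids supplies [position, token_id] pairs from position 1 onwards. The helper decoder_prefix is hypothetical (it is not part of transformers), and the token ids are assumed for illustration, so check the checkpoint's tokenizer before relying on them.

```python
# Token ids assumed for illustration; verify against the checkpoint's tokenizer.
TOKEN_IDS = {
    "<|startoftranscript|>": 50258,
    "<|en|>": 50259,
    "<|translate|>": 50358,
    "<|transcribe|>": 50359,
    "<|notimestamps|>": 50363,
}
ID_TO_TOKEN = {token_id: token for token, token_id in TOKEN_IDS.items()}


def decoder_prefix(decoder_start_token_id, forced_decoder_ids):
    """Hypothetical helper: list the tokens the decoder is forced to start with.

    Position 0 is taken from decoder_start_token_id; forced_decoder_ids
    holds [position, token_id] pairs for the positions after it.
    """
    prefix = {0: decoder_start_token_id}
    for position, token_id in forced_decoder_ids:
        if position in prefix:
            # Forcing position 0 again would clash with the start token.
            raise ValueError(f"position {position} is already set")
        prefix[position] = token_id
    return [ID_TO_TOKEN[prefix[position]] for position in sorted(prefix)]


# The two-token setting discussed here still yields a prefix that begins
# with <|startoftranscript|>, because the start token is set separately.
print(decoder_prefix(50258, [[1, 50358], [2, 50363]]))
```

Under these assumptions, adding <|startoftranscript|> to forced_decoder_ids as well would either duplicate the start token or shift the remaining tokens by one position.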

sanchit-gandhi changed pull request status to closed

We should still set the language in the forced decoder ids though, no? As we do for, say, the medium checkpoint:
https://huggingface.co/openai/whisper-medium/blob/main/config.json#L26-L39

For the large checkpoint, we're currently setting <|translate|><|notimestamps|>

For all the other multilingual checkpoints, we're setting <|en|><|transcribe|><|notimestamps|>
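
As a rough sketch of the difference between the two settings, the snippet below builds the [position, token_id] pairs for each. The helper build_forced_decoder_ids is hypothetical, and the token ids are assumed for illustration rather than read from the checkpoints.

```python
# Token ids assumed for illustration; verify against the checkpoint's tokenizer.
TOKEN_IDS = {
    "<|en|>": 50259,
    "<|translate|>": 50358,
    "<|transcribe|>": 50359,
    "<|notimestamps|>": 50363,
}


def build_forced_decoder_ids(tokens):
    """Hypothetical helper: turn token strings into [position, token_id] pairs.

    Positions start at 1 because position 0 is left to decoder_start_token_id.
    """
    return [[position, TOKEN_IDS[token]] for position, token in enumerate(tokens, start=1)]


# Setting described for the large checkpoint: no language token, translate task.
large_current = build_forced_decoder_ids(["<|translate|>", "<|notimestamps|>"])

# Setting described for the other multilingual checkpoints.
other_multilingual = build_forced_decoder_ids(["<|en|>", "<|transcribe|>", "<|notimestamps|>"])

print(large_current)       # [[1, 50358], [2, 50363]]
print(other_multilingual)  # [[1, 50259], [2, 50359], [3, 50363]]
```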
