Whisper Tokenizer bos_token may be wrong

#108
by moncefbenaicha - opened

I was checking commit 9ef08b92d5117306d28a6bcba2ca4ba407155f65 and noticed that in config.json the bos_token_id is 50257, which maps to "<|endoftext|>". I believe it should be 50258, "<|startoftranscript|>".

@patrickvonplaten

Oddly, this version of config.json sets both the bos and eos token IDs to 50257, which is "<|endoftext|>", while decoder_start_token_id is 50258. This could cause the prompt to be handled improperly.

However, the tokenizer's encoder does add the correct bos token, "<|startoftext|>".
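The mismatch described above can be made explicit with a few lines of Python. This is only a minimal sketch: the dictionary is a hand-written stand-in for the relevant fields of config.json using the IDs quoted in this thread (nothing is loaded from the Hub), and `describe_start_tokens` is a hypothetical helper, not part of transformers.

```python
# Hand-written stand-in for the relevant fields of config.json,
# using the IDs quoted in this discussion (not loaded from the Hub).
config = {
    "bos_token_id": 50257,
    "eos_token_id": 50257,
    "decoder_start_token_id": 50258,
}

# Token strings for the two IDs, as reported in this thread.
id_to_token = {
    50257: "<|endoftext|>",
    50258: "<|startoftranscript|>",
}


def describe_start_tokens(cfg, vocab):
    """Hypothetical helper: map each special-token field to its token string."""
    fields = ("bos_token_id", "eos_token_id", "decoder_start_token_id")
    return {name: vocab[cfg[name]] for name in fields}


report = describe_start_tokens(config, id_to_token)
for name, token in report.items():
    print(f"{name}: {config[name]} -> {token}")

# True when bos_token_id and decoder_start_token_id disagree.
mismatch = config["bos_token_id"] != config["decoder_start_token_id"]
print("bos differs from decoder start:", mismatch)
```

With the values from this thread, the bos and eos fields both resolve to "<|endoftext|>" while the decoder start field resolves to "<|startoftranscript|>".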


Yes, that's correct. WhisperTokenizer uses <|endoftext|> as the default value for eos/bos_token, but that value is not used anywhere. Instead, there is a property, prefix_tokens, that returns the correct prefix IDs:

@property
def prefix_tokens(self) -> List[int]:
    # Look up the special-token IDs by name rather than via bos_token_id.
    bos_token_id = self.convert_tokens_to_ids("<|startoftranscript|>")
    translate_token_id = self.convert_tokens_to_ids("<|translate|>")
    transcribe_token_id = self.convert_tokens_to_ids("<|transcribe|>")
    notimestamps_token_id = self.convert_tokens_to_ids("<|notimestamps|>")
    # (excerpt: the rest of the method builds and returns the list of prefix IDs)

That probably also explains why no one has raised this issue before: the token-ID values in config.json are ignored.
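To illustrate why the default bos_token never reaches the prompt, here is a self-contained sketch of how such a prefix could be assembled. `ToyWhisperTokenizer` is a hypothetical stand-in, not the real WhisperTokenizer. The first two IDs in `VOCAB` are the ones quoted in this thread; the remaining IDs, the constructor arguments, and the exact ordering of the prefix are assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Illustrative subset of the vocabulary. The first two IDs are quoted in this
# thread; the others are assumed values for a multilingual checkpoint.
VOCAB: Dict[str, int] = {
    "<|endoftext|>": 50257,
    "<|startoftranscript|>": 50258,
    "<|en|>": 50259,
    "<|translate|>": 50358,
    "<|transcribe|>": 50359,
    "<|notimestamps|>": 50363,
}


class ToyWhisperTokenizer:
    """Hypothetical stand-in for illustration, not the real WhisperTokenizer."""

    def __init__(
        self,
        language: Optional[str] = None,
        task: Optional[str] = None,
        predict_timestamps: bool = False,
    ) -> None:
        # Default bos token, as described in this thread; never read below.
        self.bos_token = "<|endoftext|>"
        self.language = language
        self.task = task
        self.predict_timestamps = predict_timestamps

    def convert_tokens_to_ids(self, token: str) -> int:
        return VOCAB[token]

    @property
    def prefix_tokens(self) -> List[int]:
        # The prefix starts from <|startoftranscript|>, looked up by name.
        prefix = [self.convert_tokens_to_ids("<|startoftranscript|>")]
        if self.language is not None:
            prefix.append(self.convert_tokens_to_ids(f"<|{self.language}|>"))
        if self.task is not None:
            prefix.append(self.convert_tokens_to_ids(f"<|{self.task}|>"))
        if not self.predict_timestamps:
            prefix.append(self.convert_tokens_to_ids("<|notimestamps|>"))
        return prefix


tok = ToyWhisperTokenizer(language="en", task="transcribe")
prefix = tok.prefix_tokens
print(prefix)
```

In this sketch the ID of the default bos_token (50257) never appears in the prefix, because the start token is resolved by its string name instead.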

moncefbenaicha changed discussion status to closed
