Whisper Tokenizer bos_token may be wrong
#108
by
moncefbenaicha
- opened
I was checking commit 9ef08b92d5117306d28a6bcba2ca4ba407155f65 of config.json and noticed that bos_token_id is 50257, which maps to '<|endoftext|>'. I believe it should be 50258, '<|startoftranscript|>'.
Yes, that's correct. WhisperTokenizer uses <|endoftext|> as the default value for eos/bos_token, but it's not used anywhere. In fact, there is a property, prefix_tokens, that returns the correct prefix IDs:
@property
def prefix_tokens(self) -> List[int]:
    bos_token_id = self.convert_tokens_to_ids("<|startoftranscript|>")
    translate_token_id = self.convert_tokens_to_ids("<|translate|>")
    transcribe_token_id = self.convert_tokens_to_ids("<|transcribe|>")
    notimestamps_token_id = self.convert_tokens_to_ids("<|notimestamps|>")
    # ... (excerpt; the rest of the method is omitted here)
That probably also explains why no one has raised this issue before: the token_id values in config.json are ignored.
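To make the distinction in this thread concrete, here is a minimal, self-contained sketch. It is not the library's code: ToyWhisperTokenizer and TOY_VOCAB are hypothetical stand-ins, and only the IDs 50257 and 50258 come from the discussion above. The other IDs are illustrative assumptions and may differ between checkpoints. The sketch shows how a default bos_token of <|endoftext|> can coexist with a decoder prefix that starts with <|startoftranscript|>.

```python
from typing import Dict, List, Optional

# Toy vocabulary. Only the first two IDs are taken from the thread;
# the remaining IDs are assumed, illustrative values.
TOY_VOCAB: Dict[str, int] = {
    "<|endoftext|>": 50257,
    "<|startoftranscript|>": 50258,
    "<|en|>": 50259,
    "<|translate|>": 50358,
    "<|transcribe|>": 50359,
    "<|notimestamps|>": 50363,
}


class ToyWhisperTokenizer:
    """Hypothetical stand-in for illustration, not the real WhisperTokenizer."""

    # Default bos token, as described in the thread.
    bos_token = "<|endoftext|>"

    def __init__(
        self,
        language: Optional[str] = "en",
        task: Optional[str] = "transcribe",
        predict_timestamps: bool = False,
    ) -> None:
        self.language = language
        self.task = task
        self.predict_timestamps = predict_timestamps

    def convert_tokens_to_ids(self, token: str) -> int:
        # Look the special token up in the toy vocabulary.
        return TOY_VOCAB[token]

    @property
    def bos_token_id(self) -> int:
        # This is the value the original question was about.
        return self.convert_tokens_to_ids(self.bos_token)

    @property
    def prefix_tokens(self) -> List[int]:
        # The prefix starts with <|startoftranscript|>, independent of bos_token.
        prefix = [self.convert_tokens_to_ids("<|startoftranscript|>")]
        if self.language is not None:
            prefix.append(self.convert_tokens_to_ids(f"<|{self.language}|>"))
        if self.task is not None:
            prefix.append(self.convert_tokens_to_ids(f"<|{self.task}|>"))
        if not self.predict_timestamps:
            prefix.append(self.convert_tokens_to_ids("<|notimestamps|>"))
        return prefix


tok = ToyWhisperTokenizer()
print(tok.bos_token_id)   # 50257
print(tok.prefix_tokens)  # [50258, 50259, 50359, 50363]
```

In this toy setup the first prefix token is 50258 even though bos_token_id is 50257, which mirrors the behaviour described in the reply above. To check a real checkpoint, compare tokenizer.bos_token_id with tokenizer.prefix_tokens on a loaded WhisperTokenizer.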
moncefbenaicha
changed discussion status to
closed