Changes for fast tokenizer

#5

See https://github.com/huggingface/transformers/pull/21222.

This adds tokenizer.json to allow loading the fast Whisper tokenizer directly. It also changes the configured unknown token from "" to "<|endoftext|>", which matches the English checkpoints and addresses some issues with "" as a token.

jonatanklosko changed pull request status to open
jonatanklosko changed pull request title from jk-whisper-tokenizer-fast to Changes for fast tokenizer

Hmm, changing the unknown token seems to break special token ids when loading the slow tokenizer:

```python
from transformers import WhisperTokenizer

# Revision that adds tokenizer.json
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny", revision="dcca07232bfb1028e499333730f868b87fd3d043")
print(tokenizer.eos_token_id) #=> 50257

# Revision that updates the unknown token
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny", revision="8852c40b30c9b7b981faf4fa77167fd862fd5fdb")
print(tokenizer.eos_token_id) #=> None
```

Please hold off on merging until we figure this out.

Moving <|endoftext|> to vocab.json resolves the issue, as outlined in the GitHub comment.
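The failure mode can be sketched without loading the real checkpoint: the slow tokenizer resolves special token ids by looking the token string up in the plain vocabulary, so if `<|endoftext|>` is absent from vocab.json the lookup yields `None` instead of the expected id. A minimal illustration with a toy vocabulary (the dicts and helper below are hypothetical, not the real Whisper files or transformers internals):

```python
# Toy vocabularies standing in for vocab.json contents (hypothetical ids).
vocab_without_eot = {"hello": 0, "world": 1}
vocab_with_eot = {"hello": 0, "world": 1, "<|endoftext|>": 50257}

def eos_token_id(vocab, eos_token="<|endoftext|>"):
    # A slow tokenizer maps a special token to its id via a plain vocab
    # lookup; a missing entry silently produces None rather than an error.
    return vocab.get(eos_token)

print(eos_token_id(vocab_without_eot))  # None
print(eos_token_id(vocab_with_eot))     # 50257
```

This is why adding `<|endoftext|>` back into vocab.json restores `eos_token_id` for the slow tokenizer path.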

I will merge, thanks for working on this!

ArthurZ changed pull request status to merged
