Missing Merge Pair (based on the original BPE from whisper)

#16
by mert-kurttutan

When you run the original tokenizer from the whisper package:

    from whisper import tokenizer

    whisper_tokenizer = tokenizer.get_tokenizer(multilingual=True, language="en", task="transcribe")
    whisper_tokenizer.encoding.encode(" the")

You get

[264]

But the tokenizer.json included in this repo gives a different result:

    import tokenizers

    hf_tok = tokenizers.Tokenizer.from_pretrained("openai/whisper-large")
    hf_tok.encode(" the", add_special_tokens=False).ids

You get

[220, 3322]
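For what it's worth, looking up what those ids map to (a quick check with the same hf_tok; id_to_token is part of the tokenizers API) shows that the leading space is left as its own token instead of being merged into " the":

    # Expected: 220 -> 'Ġ' (bare space), 3322 -> 'the', while 264 -> 'Ġthe'
    print(hf_tok.id_to_token(220), hf_tok.id_to_token(3322))
    print(hf_tok.id_to_token(264))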

I think this is because of a missing merge (one that is present in the original BPE).

Using the function here, I detected only one missing merge (a rough sketch of a similar comparison is included after the snippets below).
The correct tokenizer.json should be as follows:

    "merges": [
      "Ġ t",
      "Ġ a",
      "Ġt h",

whereas the current one has:

    "merges": [
      "Ġ a",
      "Ġt h",

Accordingly, the other Whisper repos and their merges.txt files should be modified as well.
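For reference, a rough sketch of such a comparison (not the exact helper linked above; file paths are illustrative):

    import json

    # Illustrative paths: the merges.txt shipped with the original whisper
    # assets, and the tokenizer.json from this repo.
    with open("original_merges.txt", encoding="utf-8") as f:
        original_merges = [
            line.rstrip("\n") for line in f
            if line.strip() and not line.startswith("#")
        ]

    with open("tokenizer.json", encoding="utf-8") as f:
        hf_merges = json.load(f)["model"]["merges"]
    # Entries may be "Ġ t" strings or ["Ġ", "t"] pairs depending on the
    # tokenizers version that wrote the file.
    hf_merges = {" ".join(m) if isinstance(m, list) else m for m in hf_merges}

    missing = [m for m in original_merges if m not in hf_merges]
    print(missing)  # with the current files this should report ['Ġ t']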

cc @ArthurZ when you get the chance 🙌

I just tripped over this too. It would be good to see it fixed.

Oh wow, sorry. Just saw that as well, will have a look.

I'll check the conversion script, there might be a bug in there.

What's weird is that after pip install openai-whisper==20230117, doing:

    >>> from whisper.tokenizer import get_tokenizer
    >>> tokenizer = get_tokenizer(True)
    >>> tokenizer.encode(" the")
    [220, 3322]

also gives the wrong result.
However, the merge has been there since day 1: https://github.com/openai/whisper/blame/ad3250a846fe7553a25064a2dc593e492dadf040/whisper/assets/gpt2/merges.txt.
The conversion from slow to fast is probably what removes it.

See:

    tokenizer.tokenizer.save_pretrained("local_path/whisper-old")

Opening the saved merges.txt, it no longer has the first merge.
If we remove the tokenizer.json and add the correct merge back to merges.txt, the conversion works well with tokenizer.from_pretrained("local_path/whisper-old"), meaning it's probably the layer on top of transformers that was causing this.
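For anyone who wants to reproduce this end to end, here is a rough sketch of the steps described above, assuming openai-whisper==20230117 and illustrative local paths (the whisper pip package wraps a GPT2TokenizerFast, so that is what gets re-converted from the patched files):

    import os
    from transformers import GPT2TokenizerFast
    from whisper.tokenizer import get_tokenizer

    # 1. Dump the wrapped GPT2TokenizerFast files locally.
    tokenizer = get_tokenizer(True)  # multilingual
    tokenizer.tokenizer.save_pretrained("local_path/whisper-old")

    # 2. The saved merges.txt is missing the first merge; add "Ġ t" back as
    #    the first rule and drop the stale tokenizer.json so that the
    #    slow-to-fast conversion runs again from the text files.
    merges_path = "local_path/whisper-old/merges.txt"
    with open(merges_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    print("Ġ t" in lines)  # reported above to be False
    insert_at = 1 if lines and lines[0].startswith("#") else 0
    lines.insert(insert_at, "Ġ t")
    with open(merges_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    os.remove("local_path/whisper-old/tokenizer.json")

    # 3. Re-run the conversion from the patched files.
    fixed = GPT2TokenizerFast.from_pretrained("local_path/whisper-old")
    print(fixed(" the", add_special_tokens=False).input_ids)  # expected: [264]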
