FIX special token tokenization errors in MT5 tokenizer

#10
by Dhurmir - opened

As previously discussed in https://huggingface.co/google/mt5-small/discussions/8, the fix proposed in PR https://github.com/huggingface/transformers/pull/27738 did not resolve the reported issue; it only expanded the initial vocabulary from 250100 to 250200. While it does correct the extra_ids parameter issue, that PR fails to address the reported tokenization problem and needlessly adds 100 extra special tokens that should already be available in the vocabulary, introducing new tokenization errors into the model.
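To make the side effect concrete, here is a minimal sketch (illustrative arithmetic only, not the actual transformers implementation) of what appending the 100 sentinels as brand-new tokens does to the vocabulary size:

```python
# Illustration of the PR's side effect: registering the 100 sentinel tokens
# as *new* entries appends them after the existing vocabulary instead of
# reusing the slots already reserved for them.
base_vocab_size = 250100  # mT5's original vocabulary size
extra_ids = 100           # sentinel tokens added as new entries by the PR

expanded = base_vocab_size + extra_ids
print(expanded)  # 250200 — ids 250100..250199 fall outside the original vocabulary
```

Any id in the 250100–250199 range points past the model's embedding matrix, which is why the expansion introduces new errors rather than fixing the old ones.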

The fix proposed here simply adds the 100 special tokens to the configuration files' special-tokens mapping and leaves the extra_ids parameter at 0, because, as mentioned above, setting it to 100 forcibly expands the vocabulary from 250100 to 250200. (A new tokenizer class for MT5 may be needed, or the T5Tokenizer class behavior may be worth reviewing.)
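As a rough illustration of the intended behavior (a sketch, not the transformers implementation): with extra_ids left at 0 and the sentinels declared in the configuration mapping, each `<extra_id_i>` should resolve to an id already inside the original 250100-entry vocabulary. T5-style tokenizers conventionally place the sentinels at the top of the vocabulary, with `<extra_id_0>` getting the highest id:

```python
# Sketch of the expected sentinel-id mapping for mT5 (illustrative only).
# Following the T5 convention, <extra_id_0> maps to the highest id and
# <extra_id_99> to the lowest id of the sentinel block, all of which
# already exist inside the original vocabulary.
MT5_VOCAB_SIZE = 250100  # original mT5 vocabulary size

def sentinel_id(i: int, vocab_size: int = MT5_VOCAB_SIZE) -> int:
    """Map <extra_id_i> to its id inside the existing vocabulary."""
    assert 0 <= i < 100, "mT5 defines 100 sentinel tokens"
    return vocab_size - 1 - i

# All sentinel ids fall inside the original vocabulary, so no new rows
# are needed in the embedding matrix:
ids = [sentinel_id(i) for i in range(100)]
assert max(ids) == 250099 and min(ids) == 250000
```

Under this mapping the vocabulary stays at 250100, which is the point of leaving extra_ids at 0.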

Any notes or comments are warmly welcomed. The known issues are documented in https://huggingface.co/google/mt5-small/discussions/8.

