Problems with LaTeX tokenization

#2
by DimOgu - opened

I would like to report a bug: after updating the transformers library (transformers 4.16.2 -> 4.20.1), the version of the tokenizers library also changed (tokenizers == 0.10.3 -> 0.12.1), which changed the behavior of the tokenizer.
Consider an example.
This figure shows the output of the tokenizer with tokenizers version 0.10.3:

[image: tokenizer output under tokenizers 0.10.3]

This figure shows the output of the tokenizer with tokenizers version 0.12.1:

[image: tokenizer output under tokenizers 0.12.1]

The difference in this case is that "" is split off as a separate token.

There are also problems, in both versions of the tokenizer, with LaTeX commands such as "\cite" and "\Omega" not being kept as single tokens.
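A minimal sketch to inspect this, assuming the witiko/mathberta checkpoint discussed below, is to print the wordpieces produced for such commands:

from transformers import AutoTokenizer

# Inspect how LaTeX commands are split into wordpieces.
tokenizer = AutoTokenizer.from_pretrained("witiko/mathberta")
for command in [r"\cite", r"\Omega"]:
    print(command, "->", tokenizer.tokenize(command))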

Hi @DimOgu ,
Please note that mathberta has been trained with transformers==4.18.0, which requires tokenizers>=0.11.1,!=0.11.3,<0.13. Therefore, we recommend not using mathberta with versions of transformers older than 4.18.0, and we recommend using it with transformers==4.20.1 due to an issue that we fixed in the meantime.

If you still need to use it with an older version, check that the wordpieces produced by calling the tokenizer are the same:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("witiko/mathberta")
text = r"This \emph{Extended Patience Sorting Algorithm} is similar."

# on transformers==4.20.1 + tokenizers==0.12.1:
tokenizer(text).input_ids
>>> [0, 152, 1437, 57042, 50619, 1437, 11483, 6228, 3769, 11465, 208, 23817, 83, 53143, 50, 3432, 54598, 16, 16207, 4882, 55021, 2]

If the input_ids on your desired versions of the libraries match the input_ids in the supported version (transformers==4.20.1), you do not have to worry about the spaces in decoding (normally done by calling decode or batch_decode) and you can use the model with confidence.
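For instance, a minimal check on your environment might look like the following sketch; the reference ids are the transformers==4.20.1 output listed above, and the assertion simply flags a mismatch:

from transformers import AutoTokenizer

# Reference input_ids produced under transformers==4.20.1 + tokenizers==0.12.1 (listed above).
expected_input_ids = [0, 152, 1437, 57042, 50619, 1437, 11483, 6228, 3769, 11465, 208,
                      23817, 83, 53143, 50, 3432, 54598, 16, 16207, 4882, 55021, 2]

tokenizer = AutoTokenizer.from_pretrained("witiko/mathberta")
text = r"This \emph{Extended Patience Sorting Algorithm} is similar."

# If this assertion passes, your versions produce the same tokenization as the supported setup.
assert tokenizer(text).input_ids == expected_input_ids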
