Bug in tokenizer?

#2
by birgitbrause - opened

I might have noticed a bug in the tokenizer. It seems that strip_accents is automatically set to True in its normalizer when the tokenizer is loaded, because it is not explicitly set to False in tokenizer_config.json.
Because of this, the tokenizer strips the accents from German umlauts. In the tokenizer's vocabulary, however, I can see tokens that contain umlauts.
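For anyone unfamiliar with what strip_accents does: it is essentially the classic BERT accent-stripping step, which NFD-decomposes the text and drops combining marks. A minimal stand-alone sketch of that step (using only the standard library, not the tokenizer itself):

```python
import unicodedata

def strip_accents(text: str) -> str:
    # NFD-decompose, then drop combining marks (Unicode category "Mn").
    # This is the same kind of accent stripping BERT's normalizer applies.
    return "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if unicodedata.category(ch) != "Mn"
    )

print(strip_accents("Grüße"))  # -> "Gruße": the umlaut mark is removed
```

This is why "ü" becomes "u" before tokenization, so vocabulary entries containing umlauts can never be matched.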

When loaded from the hub, the tokenizer behaves like this:
(screenshot: dbmdz_tokenizer_before.png)

If I deactivate the normalization:
(screenshot: dbmdz_tokenizer_after.png)
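For reference, the effect of strip_accents can be reproduced with the `tokenizers` library directly. This is only a sketch with a toy WordPiece vocabulary (the vocab entries here are made up); on a real checkpoint you would instead reassign `tokenizer.backend_tokenizer.normalizer` with `strip_accents=False`:

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers

# Toy vocabulary containing both the accented and the accent-stripped form.
vocab = {"[UNK]": 0, "Grüße": 1, "Gruße": 2}
tok = Tokenizer(models.WordPiece(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

# With accent stripping on, the umlaut token can never be produced.
tok.normalizer = normalizers.BertNormalizer(lowercase=False, strip_accents=True)
print(tok.encode("Grüße").tokens)  # -> ['Gruße']

# With strip_accents explicitly off, the umlaut survives normalization.
tok.normalizer = normalizers.BertNormalizer(lowercase=False, strip_accents=False)
print(tok.encode("Grüße").tokens)  # -> ['Grüße']
```

The same reassignment of the normalizer works on a tokenizer loaded via `AutoTokenizer.from_pretrained`, since its Rust backend is exposed as `backend_tokenizer`.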

Given the umlauts in the vocabulary, I assume strip_accents == False is the originally intended behaviour?
Do you know whether the model was trained with or without umlauts being stripped?

I tested this with transformers versions 2.3.0, 4.6.1, and 4.25.1.
