Tokenizer problems with Polish letters

#1
by awawrzynski - opened

Some Polish letters are missing from vocab.json: Ż, Ź, Ś. For the letters Ó, Ź, Ą, Ę, Ń, the tokenizer assigns wrong tokens.
I tried AutoTokenizer, RobertaTokenizer, and RobertaTokenizerFast. I am using transformers==4.18.0, tokenizers==0.21.1, and Python 3.6.9.
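
For reference, a minimal reproduction sketch (the model id below is a placeholder, since the repository id isn't named here; the rest uses only the standard transformers API):

```python
from transformers import AutoTokenizer

# Placeholder id; replace with this repository's actual model id.
tokenizer = AutoTokenizer.from_pretrained("your-org/polish-roberta")

for word in ["Óstka", "Źstka", "Ąstka", "Ęstka", "Ństka"]:
    enc = tokenizer(word, return_tensors="pt")
    print("word:", word)
    print("input_ids:", enc["input_ids"][0])
    print("tokens:", tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist()))
```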

Examples (the words are artificial; what matters is the first letter):

word: Óstka
input_ids: tensor([0, 2666, 246, 3154, 2])
tokens: ['<s>', ' �', '�', 'stka', '</s>']

word: Źstka
input_ids: tensor([0, 552, 122, 3154, 2])
tokens: ['<s>', ' �', '�', 'stka', '</s>']

word: Ąstka
input_ids: tensor([0, 6327, 231, 3154, 2])
tokens: ['<s>', ' �', '�', 'stka', '</s>']

word: Ęstka
input_ids: tensor([0, 6327, 251, 3154, 2])
tokens: ['<s>', ' �', '�', 'stka', '</s>']

word: Ństka
input_ids: tensor([0, 552, 230, 3154, 2])
tokens: ['<s>', ' �', '�', 'stka', '</s>']
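
Each of these letters is a two-byte character in UTF-8, so the '�' pairs above suggest the byte-level BPE is falling back to raw bytes: each byte on its own is not valid UTF-8 and renders as a replacement character. A sketch of a check for which letters have a dedicated vocabulary entry (same placeholder model id as above; note that byte-level BPE stores bytes under a remapped alphabet, so grepping vocab.json for the literal letters can be misleading):

```python
from transformers import AutoTokenizer

# Placeholder id; replace with this repository's actual model id.
tokenizer = AutoTokenizer.from_pretrained("your-org/polish-roberta")

for letter in ["Ż", "Ź", "Ś", "Ó", "Ą", "Ę", "Ń"]:
    ids = tokenizer.encode(letter, add_special_tokens=False)
    # A letter with its own vocab entry encodes to a single id;
    # two ids means it was split into its two UTF-8 bytes.
    print(letter, ids, "single token" if len(ids) == 1 else "split into bytes")
```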

Thanks for contacting me!
This LM was trained a long time ago for learning purposes, so it might be missing something.
