Tokenizer <unk>s

#3
by BramVanroy - opened

Using the tokenizer, all my encodings end with <unk>. Is this the intended behavior (some kind of end-of-sentence token, which is usually </s>, as in RobBERT)?

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FremyCompany/roberta-large-nl-oscar23")
encoded = tokenizer("De volgende recensie is positief.")
tokenizer.convert_ids_to_tokens(encoded.input_ids)
# ['<s>', 'ĠDe', 'Ġvolgende', 'Ġrecensie', 'Ġis', 'Ġpositief', 'Ġ', '.', '<unk>']

encoded = tokenizer("I like cookies!")
tokenizer.convert_ids_to_tokens(encoded.input_ids)
# ['<s>', 'ĠI', 'Ġlike', 'Ġcookies', 'Ġ', '!', '<unk>']

encoded = tokenizer("Ik wil naar huis gaan")
tokenizer.convert_ids_to_tokens(encoded.input_ids)
# ['<s>', 'ĠIk', 'Ġwil', 'Ġnaar', 'Ġhuis', 'Ġgaan', '<unk>']

Yes, there is an EOS token, but it was supposed to be </s>, not <unk>. The tokenizer was trained by Pieter, but I had to reorder the tokens due to a bug in the Transformers library, so I might have introduced an inconsistency :/

I took a quick look, and it seems this was caused by an off-by-one error in the tokenizer.json config file.

  "post_processor": {
    "type": "RobertaProcessing",
    "sep": [
      "</s>",
      2
    ],

The number 2 is not correct; it should be 3, per the vocab:

    "vocab": {
      "<s>": 0,
      "<pad>": 1,
      "<unk>": 2,
      "</s>": 3,
      "<mask>": 4,

That said, the model has been trained with this config now, maybe it's best not to change anything?

I'll talk to Pieter about this. I don't think this changes much for the MLM pre-trained model, but it might cause problems for NLI-type finetuning. An easy solution would be to correct the tokenizer for the final release and copy the weights of <unk> to those of </s>. This yields the exact same results, but it allows </s> to be finetuned without affecting the <unk> token for tasks where this is relevant.
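That weight copy can be sketched as below. This is a minimal, self-contained illustration with a toy embedding table; in the real model the table would be something like model.roberta.embeddings.word_embeddings (path hypothetical, and tied output embeddings would follow automatically if weight tying is enabled).

```python
import torch

# Token ids from the vocab shown above.
UNK_ID, EOS_ID = 2, 3

# Toy stand-in for the model's word embedding table (5 special tokens, dim 8).
word_embeddings = torch.nn.Embedding(num_embeddings=5, embedding_dim=8)

# Copy the pre-trained <unk> row into the </s> slot, so the fixed tokenizer
# (which now emits </s> at position 3) produces identical model outputs.
with torch.no_grad():
    word_embeddings.weight[EOS_ID] = word_embeddings.weight[UNK_ID].clone()

# After the copy, </s> behaves exactly like <unk> did, but it can now be
# finetuned independently without touching the <unk> embedding.
assert torch.equal(word_embeddings.weight[EOS_ID], word_embeddings.weight[UNK_ID])
```

Because the rows start out identical, finetuning can then move the </s> row away from <unk> wherever a task actually needs a distinct end-of-sentence representation.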

Very good catch, thanks!

Thank you @BramVanroy for letting us know about this issue! I just pushed an update that fixes the tokenizer to use </s> as was originally intended. This update will not affect the outputs of the model in any way (even if it is accidentally used with the old version of the tokenizer).

However, the old model should not be used with the new tokenizer, though I don't think that combination is likely. We are also going to migrate this model to the usual repository of RobBERT models, so new users will get a clean start.

FremyCompany changed discussion status to closed
