Are the MT5 tokenizer and generation model working properly?

#8
by Dhurmir - opened

Versions
Python: 3.10.12
Transformers: 4.35.2

I've been doing some work on migrating a T5 system fine-tuned on a span-filling task to a multilingual setup, and I've come across the following problem with sentinel tokens and text generation.

First, I illustrate how it's done with T5 and what the expected behavior is:

from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config

t5_config = T5Config.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base",
                                                    config=t5_config)
tokenizer = T5Tokenizer.from_pretrained("t5-base", legacy=False)
# illustrative training for fine-tuning
input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids
outputs = model(input_ids=input_ids, labels=labels)
loss = outputs.loss
logits = outputs.logits

# inference
original = """<extra_id_0> <extra_id_1> that <extra_id_2> really<extra_id_3>
    barber from andy griffith<extra_id_4>"""
print(tokenizer.tokenize(original))
print(tokenizer.all_special_tokens)
input_ids = tokenizer(original, return_tensors="pt").input_ids  # Batch size 1
for i in range(5):
  outputs = model.generate(input_ids, do_sample=True, max_new_tokens=50)
  inp = tokenizer.decode(outputs[0])
  print(f'{i}: {inp}')
['<extra_id_0>', '<extra_id_1>', '▁that', '<extra_id_2>', '▁really', '<extra_id_3>', 'bar', 'ber', '▁from', '▁and', 'y', '▁', 'griff', 'i', 'th', '<extra_id_4>']
['</s>', '<unk>', '<pad>', '<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<extra_id_3>', '<extra_id_4>', '<extra_id_5>', '<extra_id_6>', '<extra_id_7>', '<extra_id_8>', '<extra_id_9>', '<extra_id_10>', '<extra_id_11>', '<extra_id_12>', '<extra_id_13>', '<extra_id_14>', '<extra_id_15>', '<extra_id_16>', '<extra_id_17>', '<extra_id_18>', '<extra_id_19>', '<extra_id_20>', '<extra_id_21>', '<extra_id_22>', '<extra_id_23>', '<extra_id_24>', '<extra_id_25>', '<extra_id_26>', '<extra_id_27>', '<extra_id_28>', '<extra_id_29>', '<extra_id_30>', '<extra_id_31>', '<extra_id_32>', '<extra_id_33>', '<extra_id_34>', '<extra_id_35>', '<extra_id_36>', '<extra_id_37>', '<extra_id_38>', '<extra_id_39>', '<extra_id_40>', '<extra_id_41>', '<extra_id_42>', '<extra_id_43>', '<extra_id_44>', '<extra_id_45>', '<extra_id_46>', '<extra_id_47>', '<extra_id_48>', '<extra_id_49>', '<extra_id_50>', '<extra_id_51>', '<extra_id_52>', '<extra_id_53>', '<extra_id_54>', '<extra_id_55>', '<extra_id_56>', '<extra_id_57>', '<extra_id_58>', '<extra_id_59>', '<extra_id_60>', '<extra_id_61>', '<extra_id_62>', '<extra_id_63>', '<extra_id_64>', '<extra_id_65>', '<extra_id_66>', '<extra_id_67>', '<extra_id_68>', '<extra_id_69>', '<extra_id_70>', '<extra_id_71>', '<extra_id_72>', '<extra_id_73>', '<extra_id_74>', '<extra_id_75>', '<extra_id_76>', '<extra_id_77>', '<extra_id_78>', '<extra_id_79>', '<extra_id_80>', '<extra_id_81>', '<extra_id_82>', '<extra_id_83>', '<extra_id_84>', '<extra_id_85>', '<extra_id_86>', '<extra_id_87>', '<extra_id_88>', '<extra_id_89>', '<extra_id_90>', '<extra_id_91>', '<extra_id_92>', '<extra_id_93>', '<extra_id_94>', '<extra_id_95>', '<extra_id_96>', '<extra_id_97>', '<extra_id_98>', '<extra_id_99>']

0: <pad> <extra_id_0> I didn’t know <extra_id_1> why <extra_id_2> ’s <extra_id_3> good. <extra_id_4> y.</s>
1: <pad> <extra_id_0> really <extra_id_1> good and <extra_id_2> I <extra_id_3> hate dis <extra_id_4>.</s>
2: <pad> <extra_id_0>'s when <extra_id_1> thought <extra_id_2> one <extra_id_3> got cold <extra_id_4>.</s>
3: <pad> <extra_id_0> we <extra_id_1> love <extra_id_2> you <extra_id_3> dig s <extra_id_4>!</s>
4: <pad> <extra_id_0> I'm <extra_id_1> liking <extra_id_2> pic <extra_id_3>. er <extra_id_4>.</s>

As we can see, the sentinel tokens are kept intact and treated as special tokens by the tokenizer, which is the expected behavior.
However, when I use MT5 the following happens. Please note that the commented lines show other ways of adding the sentinel tokens, which appear to be missing from the tokenizer vocabulary.

from transformers import MT5Tokenizer, MT5ForConditionalGeneration, MT5Config

t5_config = MT5Config.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base",
                                                    config=t5_config)
tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base",
                                         extra_ids=100,
                                        #truncation=True,
                                        legacy=False
                                         )
#tokenizer.add_tokens([f'<extra_id_{i}>' for i in range(100)], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))
# training
input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids
outputs = model(input_ids=input_ids, labels=labels)
loss = outputs.loss
logits = outputs.logits

# inference
original = """<extra_id_0> <extra_id_1> that <extra_id_2> really<extra_id_3>
    barber from andy griffith<extra_id_4>."""
print(tokenizer.tokenize(original))
print(tokenizer.all_special_tokens)
input_ids = tokenizer(original, return_tensors="pt").input_ids  # Batch size 1
for i in range(5):
  outputs = model.generate(input_ids, do_sample=True, max_new_tokens=50)
  inp = tokenizer.decode(outputs[0])
  print(f'{i}: {inp}')
['<extra_id_0>', '<extra_id_1>', '▁that', '<extra_id_2>', '▁', 'really', '<', 'extra', '_', 'id', '_', '3>', '▁', 'barber', '▁from', '▁and', 'y', '▁', 'griff', 'ith', '<', 'extra', '_', 'id', '_', '4>', '▁', '.']
['</s>', '<unk>', '<pad>', '<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<extra_id_3>', '<extra_id_4>', '<extra_id_5>', '<extra_id_6>', '<extra_id_7>', '<extra_id_8>', '<extra_id_9>', '<extra_id_10>', '<extra_id_11>', '<extra_id_12>', '<extra_id_13>', '<extra_id_14>', '<extra_id_15>', '<extra_id_16>', '<extra_id_17>', '<extra_id_18>', '<extra_id_19>', '<extra_id_20>', '<extra_id_21>', '<extra_id_22>', '<extra_id_23>', '<extra_id_24>', '<extra_id_25>', '<extra_id_26>', '<extra_id_27>', '<extra_id_28>', '<extra_id_29>', '<extra_id_30>', '<extra_id_31>', '<extra_id_32>', '<extra_id_33>', '<extra_id_34>', '<extra_id_35>', '<extra_id_36>', '<extra_id_37>', '<extra_id_38>', '<extra_id_39>', '<extra_id_40>', '<extra_id_41>', '<extra_id_42>', '<extra_id_43>', '<extra_id_44>', '<extra_id_45>', '<extra_id_46>', '<extra_id_47>', '<extra_id_48>', '<extra_id_49>', '<extra_id_50>', '<extra_id_51>', '<extra_id_52>', '<extra_id_53>', '<extra_id_54>', '<extra_id_55>', '<extra_id_56>', '<extra_id_57>', '<extra_id_58>', '<extra_id_59>', '<extra_id_60>', '<extra_id_61>', '<extra_id_62>', '<extra_id_63>', '<extra_id_64>', '<extra_id_65>', '<extra_id_66>', '<extra_id_67>', '<extra_id_68>', '<extra_id_69>', '<extra_id_70>', '<extra_id_71>', '<extra_id_72>', '<extra_id_73>', '<extra_id_74>', '<extra_id_75>', '<extra_id_76>', '<extra_id_77>', '<extra_id_78>', '<extra_id_79>', '<extra_id_80>', '<extra_id_81>', '<extra_id_82>', '<extra_id_83>', '<extra_id_84>', '<extra_id_85>', '<extra_id_86>', '<extra_id_87>', '<extra_id_88>', '<extra_id_89>', '<extra_id_90>', '<extra_id_91>', '<extra_id_92>', '<extra_id_93>', '<extra_id_94>', '<extra_id_95>', '<extra_id_96>', '<extra_id_97>', '<extra_id_98>', '<extra_id_99>']
0: <pad> <extra_id_24> <extra_id_32> <extra_id_36>. <extra_id_34> <extra_id_9> <extra_id_47> <extra_id_36> <extra_id_17> <extra_id_69> <extra_id_31> <extra_id_5> <extra_id_11> <extra_id_67> <extra_id_31> <extra_id_42> <extra_id_15> <extra_id_42> <extra_id_85> <extra_id_83> <extra_id_47> <extra_id_4> <extra_id_50> <extra_id_52> <extra_id_47> <extra_id_32> <extra_id_3> <extra_id_42> <extra_id_50> <extra_id_6> <extra_id_51> <extra_id_85> <extra_id_47> <extra_id_18> <extra_id_22> <extra_id_47> <extra_id_18> <extra_id_42> <extra_id_67> <extra_id_67> <extra_id_36> <extra_id_11> <extra_id_13> <extra_id_33> <extra_id_47> <extra_id_70> <extra_id_69> <extra_id_30> <extra_id_0> <extra_id_34>
1: <pad> <extra_id_63> <extra_id_47> <extra_id_33> <extra_id_11> <extra_id_69> <extra_id_70> <extra_id_15> <extra_id_22> <extra_id_15> <extra_id_35> <extra_id_84> <extra_id_30> <extra_id_34> <extra_id_21> <extra_id_8> <extra_id_15> <extra_id_65> <extra_id_78> <extra_id_18> <extra_id_77> <extra_id_34> <extra_id_69> <extra_id_34> <extra_id_65> <extra_id_84> <extra_id_34> <extra_id_20> <extra_id_31> <extra_id_38> <extra_id_34> <extra_id_61> <extra_id_8> <extra_id_49> <extra_id_31> <extra_id_5> <extra_id_34> <extra_id_8> <extra_id_61> <extra_id_80> <extra_id_75> <extra_id_35> <extra_id_11> <extra_id_54> <extra_id_78> <extra_id_70> <extra_id_76> <extra_id_6> <extra_id_34> <extra_id_11> RTIME
2: <pad> <extra_id_5> <extra_id_33> <extra_id_70> <extra_id_5> <extra_id_4> ром <extra_id_59> <extra_id_33> <extra_id_75> <extra_id_54> <extra_id_18> <extra_id_31> <extra_id_43> <extra_id_47> <extra_id_36> <extra_id_32> <extra_id_0> <extra_id_62> <extra_id_66> <extra_id_72> <extra_id_33> <extra_id_31> <extra_id_38> <extra_id_5> <extra_id_85> <extra_id_62> <extra_id_2> <extra_id_84> <extra_id_26> <extra_id_6> <extra_id_34> <extra_id_31> <extra_id_35> <extra_id_84> <extra_id_19> <extra_id_34> <extra_id_54> <extra_id_71> <extra_id_75> <extra_id_51> <extra_id_0> <extra_id_3> <extra_id_67> <extra_id_2> <extra_id_20> <extra_id_65> <extra_id_83> <extra_id_36> <extra_id_85> <extra_id_4>
3: <pad> <extra_id_34> <extra_id_76> <extra_id_20> <extra_id_86> <extra_id_3> <extra_id_51> <extra_id_34> <extra_id_36> <extra_id_34> <extra_id_22> <extra_id_34> <extra_id_33> <extra_id_84> <extra_id_34> <extra_id_54> <extra_id_67> <extra_id_75> <extra_id_8> <extra_id_34> <extra_id_59> <extra_id_15> <extra_id_42> <extra_id_8> <extra_id_34> <extra_id_50> <extra_id_34> <extra_id_11> <extra_id_22> <extra_id_80> <extra_id_36> <extra_id_33> <extra_id_5> <extra_id_22> <extra_id_38> <extra_id_54> <extra_id_3> <extra_id_33> <extra_id_67> <extra_id_36> <extra_id_78> <extra_id_85> <extra_id_69> <extra_id_73> <extra_id_85> </s>
4: <pad> <extra_id_75> <extra_id_0> <extra_id_30> <extra_id_20> <extra_id_65> <extra_id_51> <extra_id_42> <extra_id_15> <extra_id_20> <extra_id_85> <extra_id_18> <extra_id_51> <extra_id_47> <extra_id_15> <extra_id_78> <extra_id_34> <extra_id_15> <extra_id_32> <extra_id_36> tiful <extra_id_8> <extra_id_62> <extra_id_67> <extra_id_11> <extra_id_41> <extra_id_43> <extra_id_75> <extra_id_12> <extra_id_86> <extra_id_70> <extra_id_54> <extra_id_78> <extra_id_15> <extra_id_47> <extra_id_34> <extra_id_48> <extra_id_67> <extra_id_31> <extra_id_15> <extra_id_38> <extra_id_31> <extra_id_33> <extra_id_19> <extra_id_5> <extra_id_6> <extra_id_47> <extra_id_85> <extra_id_21> <extra_id_71> <extra_id_67>

As we can see, the sentinel tokens are not kept intact; instead they are broken down into several pieces as if they were regular words, even though the tokenizer recognises them as special tokens.
Is this behavior expected?
I've read the MT5 paper several times, and it does make use of sentinel tokens during pretraining, so it seems to me they should be available as special tokens in the vocabulary, which doesn't appear to be the case. Any help with this would be greatly appreciated.
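
For what it's worth, judging from the tokenize output above the splitting only seems to happen when a sentinel is glued to the preceding word; a minimal check that isolates this (using the MT5 tokenizer loaded above):

print(tokenizer.tokenize("really <extra_id_3>"))  # preceded by a space, the sentinel stays intact
print(tokenizer.tokenize("really<extra_id_3>"))   # glued to the word, it gets split into '<', 'extra', '_', 'id', '_', '3>'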

Google org

Hello! If you print the tokenizer's added_tokens_decoder you will see that the tokens are set to have "single_word = True".
Good catch, it's because of this line:

        # for legacy purpose, we keep this. Will be removed and tests updated. (when `added_tokens_decoder` is not passed as kwargs)
        self._added_tokens_decoder = {}
        for i in range(len(extra_tokens)):
            self._added_tokens_decoder[len(self.sp_model) - 1 + extra_ids - i] = AddedToken(
                f"<extra_id_{i}>", single_word=False, lstrip=True, rstrip=True, special=True
            )
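
To see this, you can print the added tokens decoder (a small sketch; added_tokens_decoder is available on recent transformers versions):

from transformers import MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
# each entry is an AddedToken whose repr shows the single_word / lstrip / rstrip flags
for token_id, added_token in tokenizer.added_tokens_decoder.items():
    print(token_id, added_token)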

Quick fix: use the fast tokenizer. I'll update this today.

Hey, thank you so much for your reply and your help!
Just wanted to add that I also checked using MT5TokenizerFast and T5TokenizerFast and got the same issue, so maybe there's something else wrong? It might be worth checking those out too.

Many thanks for your work!

@ArthurZ Hi, thank you so much for your help. However, it seems the proposed fix doesn't address the main issue with the tokenizer; to me, the tokenizer configuration files seem to be outdated. I did some testing, and adding the corresponding special tokens to the JSON files without passing extra_ids (which should be the normal behavior, since MT5 ships with 100 sentinel tokens in its vocabulary) seems to fix the problem, so I'm making a contribution to the repository. I'm not completely sure this is the correct way to fix the issue, or whether other errors will arise from it, so any comments would be greatly appreciated.

Related PR: https://huggingface.co/google/mt5-small/discussions/10
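
For reference, this is roughly the check I used (a minimal sketch, assuming the updated tokenizer files from the PR above are in place):

from transformers import MT5TokenizerFast

# no extra_ids passed; the sentinels should already be registered in the tokenizer files
tokenizer = MT5TokenizerFast.from_pretrained("google/mt5-base")
print(tokenizer.tokenize("The <extra_id_0> walks in<extra_id_1> park"))
# expected: each <extra_id_*> shows up as a single piece instead of '<', 'extra', '_', 'id', ...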
