model_max_length effectively infinite
My workflow, at Stanford as part of our Stanza software, has been to load transformers & tokenizers in a generic manner so that they can be used for various tasks in a variety of languages. Part of that workflow is figuring out how long the input can be by checking `model_max_length`. Unfortunately, the config file has an extremely large value set for that field. Can we get this updated? Thanks.
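To illustrate (a minimal sketch, not Stanza's actual code):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/muril-base-cased")

# Decide how much text can go into the model from the tokenizer alone.
# For this checkpoint the value comes back as a huge sentinel (on the
# order of 1e30) rather than the model's real limit.
print(tok.model_max_length)
```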
Great remark @AngledLuffa!

Since this would be a backwards-breaking change, I think it'll be difficult to change `model_max_length` in the config, as users would then silently get shorter `input_ids`.
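For example, here is a hypothetical sketch of the concern:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/muril-base-cased")

# Today this returns ~900 ids, since model_max_length is effectively
# infinite and nothing gets truncated.
ids = tok("word " * 900, truncation=True)["input_ids"]

# With model_max_length=512 in the config, the same call would quietly
# come back with at most 512 ids.
```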
What do you think @lysandre?
Note that you can, however, just set this variable in your script as follows:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/muril-base-cased", model_max_length=250)
```

Would this work for you?
I don't believe there's a circumstance under which the `input_ids` would be shorter. If the input goes beyond the maximum length of the model, an exception is actually thrown. That's how I first noticed the problem: I had given it an input of ~900 tokens, and `transformers` reported an error with the position encoding.
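Roughly what I was doing (a simplified sketch, not our actual pipeline):

```python
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModel.from_pretrained("google/muril-base-cased")

# ~900 tokens; nothing is truncated because model_max_length is huge.
inputs = tok("word " * 900, return_tensors="pt")

# Fails inside the position embedding lookup once positions exceed the
# table size, e.g. "IndexError: index out of range in self".
outputs = model(**inputs)
```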
I think updating the tokenizer max length to the model max length makes sense in this situation, given that the model cannot handle a sequence longer than that number of tokens.
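For reference, the change would presumably just be setting the field in this checkpoint's tokenizer_config.json on the Hub, along these lines (512 is assumed here from the BERT-style position embedding size):

```json
{
  "model_max_length": 512
}
```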
Cool! @AngledLuffa would you like to open a PR to do this, or would you like us to help you with it? :-)
I am happy to defer to the experts on this model :)