model_max_length effectively infinite

#1
by AngledLuffa - opened

My workflow, at Stanford as part of our Stanza software, has been to load transformers & tokenizers in a generic manner so that they can be used for various tasks in a variety of languages. Part of that workflow is figuring out how long the input is by checking model_max_length. Unfortunately, the config file has an extremely large value set for that field. Can we get this updated? Thanks.
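For context, a minimal sketch of the workaround such a generic loader can use: treat an absurdly large model_max_length as "unset" and fall back to a default. The helper name and the sentinel threshold are assumptions here; the threshold mirrors the VERY_LARGE_INTEGER convention (int(1e30)) that transformers uses when no max length is configured.

```python
# Hypothetical helper: treat an effectively infinite model_max_length as unset.
# The > 1e29 sentinel check is an assumption based on transformers'
# VERY_LARGE_INTEGER (int(1e30)) placeholder, not a documented API contract.

def effective_max_length(model_max_length: int, fallback: int = 512) -> int:
    """Return a usable sequence length, substituting a fallback when the
    tokenizer reports an effectively infinite model_max_length."""
    if model_max_length > int(1e29):
        return fallback
    return model_max_length

print(effective_max_length(512))        # sane value passes through -> 512
print(effective_max_length(int(1e30)))  # sentinel falls back -> 512
```

A downstream loader could call this on tokenizer.model_max_length before chunking its input.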

Great remark @AngledLuffa !

Due to backwards compatibility, I think it'll be difficult to change model_max_length in the config, as users would silently get shorter input_ids.
What do you think @lysandre ?

Note that you can, however, just set this variable in your script as follows:

tok = AutoTokenizer.from_pretrained("google/muril-base-cased", model_max_length=250)

Would this work for you?

I don't believe there's a circumstance under which the input_ids would be shorter. If the input goes beyond the maximum length of the model, an exception is actually thrown. That's how I first noticed the problem: I had given it an input of ~900 tokens, and transformers reported an error with the position encoding.

Gotcha - that's a very valid reason then to update the tokenizer to the model's max length.

@lysandre and @sgugger what do you think? Update the tokenizer max length to the model max length at least here?

Google org

I think updating the tokenizer max length to the model max length makes sense in this situation given the model cannot handle a sequence longer than that number of tokens.

Cool! @AngledLuffa would you like to open a PR to do this or would you like us to help you on it? :-)

I am happy to defer to the experts on this model :)
