model_max_length effectively infinite
My workflow, at Stanford as part of our Stanza software, has been to load transformers & tokenizers in a generic manner so that they can be used for various tasks in a variety of languages. Part of that workflow is figuring out how long the input can be by checking `model_max_length`. Unfortunately, the config file has an extremely large value set for that field. Can we get this updated? Thanks.
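To illustrate (a minimal sketch, not Stanza's actual code):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/muril-base-cased")

# Decide how much text can go into the model from the tokenizer alone.
# For this checkpoint the value comes back as a huge sentinel (on the
# order of 1e30) rather than the model's real limit.
print(tok.model_max_length)
```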
Great remark @AngledLuffa!

Since this would be a backwards-breaking change, I think it'll be difficult to change `model_max_length` in the config, as users would then silently get shorter `input_ids`.
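For example, here is a hypothetical sketch of the concern:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/muril-base-cased")

# Today this returns ~900 ids, since model_max_length is effectively
# infinite and nothing gets truncated.
ids = tok("word " * 900, truncation=True)["input_ids"]

# With model_max_length=512 in the config, the same call would quietly
# come back with at most 512 ids.
```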
What do you think @lysandre?
Note that you can, however, just set this variable in your script as follows:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/muril-base-cased", model_max_length=250)
```

Would this work for you?
I don't believe there's a circumstance under which the `input_ids` would be shorter. If the input goes beyond the maximum length of the model, an exception is actually thrown. That's how I first noticed the problem: I had given it an input of ~900 tokens, and `transformers` reported an error with the position encoding.
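Roughly what I was doing (a simplified sketch, not our actual pipeline):

```python
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModel.from_pretrained("google/muril-base-cased")

# ~900 tokens; nothing is truncated because model_max_length is huge.
inputs = tok("word " * 900, return_tensors="pt")

# Fails inside the position embedding lookup once positions exceed the
# table size, e.g. "IndexError: index out of range in self".
outputs = model(**inputs)
```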
I think updating the tokenizer max length to the model max length makes sense in this situation, given that the model cannot handle a sequence longer than that number of tokens.
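For reference, the change would presumably just be setting the field in this checkpoint's tokenizer_config.json on the Hub, along these lines (512 is assumed here from the BERT-style position embedding size):

```json
{
  "model_max_length": 512
}
```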
Cool! @AngledLuffa would you like to open a PR to do this, or would you like us to help you with it? :-)
I am happy to defer to the experts on this model :)