Model max length #2

by victor-roris - opened

I tried your code with long sentences, and the automatic truncation to the model's max length fails:

encoded_input = tokenizer(really_long_sentence, truncation=True, max_length=None, return_tensors='pt') 

It raises an error about the tensor dimensions:

RuntimeError: The expanded size of the tensor (5227) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 5227].  Tensor sizes: [1, 514]

So I tried setting the max length to 514:

encoded_input = tokenizer(really_long_sentence, truncation=True, max_length=514, return_tensors='pt') 

But it still fails:

   2041         # remove once script supports set_grad_enabled
   2042         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2043     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2044 
   2045 

IndexError: index out of range in self

Can you tell me if there is some way to obtain the appropriate model max length from the model/tokenizer configuration?
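
In case it helps, the tokenizer itself exposes the limit as `model_max_length`, and you can compare it against the model's position-embedding table. A minimal sketch of reading both (the `roberta-base` checkpoint is a placeholder; substitute the actual model from this repo):

```python
from transformers import AutoConfig, AutoTokenizer

checkpoint = "roberta-base"  # placeholder; use the checkpoint from this repo

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
config = AutoConfig.from_pretrained(checkpoint)

# The limit the tokenizer truncates to: 512 for RoBERTa-style models.
print(tokenizer.model_max_length)

# The size of the position-embedding table: 514 here, which is NOT the
# same number (more on that gap below).
print(config.max_position_embeddings)

# Passing the tokenizer's limit explicitly keeps every position id in range.
really_long_sentence = "lorem ipsum " * 3000
encoded_input = tokenizer(
    really_long_sentence,
    truncation=True,
    max_length=tokenizer.model_max_length,
    return_tensors="pt",
)
print(encoded_input["input_ids"].shape)  # torch.Size([1, 512])
```

One more note: if a checkpoint never sets `model_max_length`, the tokenizer falls back to a huge sentinel value (on the order of 1e30), in which case `truncation=True` with `max_length=None` silently truncates nothing. That would explain how the 5227-token input above reached the model unclipped.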

I had a similar issue. I found (through trial and error) that if you set the max_length to 511, it seems to work.

I'd like to understand why that's the case, though.
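
If this is a RoBERTa-style model (the 514 in the first error message suggests it is), the gap comes from how RoBERTa builds position ids: they start at `padding_idx + 1`, i.e. at 2, so the first two rows of the 514-row position-embedding table are never occupied by real tokens. That leaves 512 usable positions: `max_length=514` indexes past the table (hence the IndexError), while anything up to 512 fits, which is why 511 happened to work (512 should, too). A sketch of the arithmetic, assuming a `roberta-base`-style config:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("roberta-base")  # placeholder checkpoint

# Position ids start at pad_token_id + 1 (= 2 for RoBERTa), so two slots of
# the position-embedding table can never be reached by real tokens.
usable_length = config.max_position_embeddings - config.pad_token_id - 1
print(config.max_position_embeddings)  # 514
print(usable_length)                   # 512
```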

I had the same issue. If you choose a max_length well below the existing size of 514, it solves the problem. Try max_length=500 or max_length=400 and see if it works.
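
For what it's worth, you shouldn't need to go that far below the limit: under the RoBERTa assumption above, anything up to 512 passes, and 513 or more reproduces the IndexError. A quick check (again with `roberta-base` as a stand-in for the actual checkpoint):

```python
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "roberta-base"  # placeholder; use the checkpoint from this repo
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

really_long_sentence = "lorem ipsum " * 3000

# All of these lengths stay within the 512 usable positions.
for max_len in (400, 500, 512):
    enc = tokenizer(really_long_sentence, truncation=True,
                    max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    print(max_len, out.last_hidden_state.shape)  # e.g. 512 torch.Size([1, 512, 768])
```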