Model max length #2

by victor-roris - opened

I tried your code with long sentences, and the automatic truncation to the model's max length fails:

encoded_input = tokenizer(really_long_sentence, truncation=True, max_length=None, return_tensors='pt') 

It raises an error about the tensor dimensions:

RuntimeError: The expanded size of the tensor (5227) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 5227].  Tensor sizes: [1, 514]

So I tried setting the max length to 514:

encoded_input = tokenizer(really_long_sentence, truncation=True, max_length=514, return_tensors='pt') 

But it still fails:

   2041         # remove once script supports set_grad_enabled
   2042         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2043     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2044 
   2045 

IndexError: index out of range in self

Can you tell me if there is some way to obtain the appropriate model max length from the model/tokenizer configuration?
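
In case it helps, the tokenizer itself exposes the limit as `model_max_length`, and you can compare it against the model's position-embedding table. A minimal sketch of reading both (the `roberta-base` checkpoint is a placeholder; substitute the actual model from this repo):

```python
from transformers import AutoConfig, AutoTokenizer

checkpoint = "roberta-base"  # placeholder; use the checkpoint from this repo

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
config = AutoConfig.from_pretrained(checkpoint)

# The limit the tokenizer truncates to: 512 for RoBERTa-style models.
print(tokenizer.model_max_length)

# The size of the position-embedding table: 514 here, which is NOT the
# same number (more on that gap below).
print(config.max_position_embeddings)

# Passing the tokenizer's limit explicitly keeps every position id in range.
really_long_sentence = "lorem ipsum " * 3000
encoded_input = tokenizer(
    really_long_sentence,
    truncation=True,
    max_length=tokenizer.model_max_length,
    return_tensors="pt",
)
print(encoded_input["input_ids"].shape)  # torch.Size([1, 512])
```

One more note: if a checkpoint never sets `model_max_length`, the tokenizer falls back to a huge sentinel value (on the order of 1e30), in which case `truncation=True` with `max_length=None` silently truncates nothing. That would explain how the 5227-token input above reached the model unclipped.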

I had a similar issue. I found (through trial and error) that if you set the max_length to 511, it seems to work.

I'd like to understand why that's the case, though.
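
If this is a RoBERTa-style model (the 514 in the first error message suggests it is), the gap comes from how RoBERTa builds position ids: they start at `padding_idx + 1`, i.e. at 2, so the first two rows of the 514-row position-embedding table are never occupied by real tokens. That leaves 512 usable positions: `max_length=514` indexes past the table (hence the IndexError), while anything up to 512 fits, which is why 511 happened to work (512 should, too). A sketch of the arithmetic, assuming a `roberta-base`-style config:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("roberta-base")  # placeholder checkpoint

# Position ids start at pad_token_id + 1 (= 2 for RoBERTa), so two slots of
# the position-embedding table can never be reached by real tokens.
usable_length = config.max_position_embeddings - config.pad_token_id - 1
print(config.max_position_embeddings)  # 514
print(usable_length)                   # 512
```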

I had the same issue. If you choose a max_length well below the existing size of 514, it solves the problem. Try max_length=500 or max_length=400 and see if it works.
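
For what it's worth, you shouldn't need to go that far below the limit: under the RoBERTa assumption above, anything up to 512 passes, and 513 or more reproduces the IndexError. A quick check (again with `roberta-base` as a stand-in for the actual checkpoint):

```python
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "roberta-base"  # placeholder; use the checkpoint from this repo
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

really_long_sentence = "lorem ipsum " * 3000

# All of these lengths stay within the 512 usable positions.
for max_len in (400, 500, 512):
    enc = tokenizer(really_long_sentence, truncation=True,
                    max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    print(max_len, out.last_hidden_state.shape)  # e.g. 512 torch.Size([1, 512, 768])
```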