Max position embeddings cause an error when the input exceeds 512 tokens.

#3

When I run:

from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vesteinn/DanskBERT")
model = AutoModelForMaskedLM.from_pretrained("vesteinn/DanskBERT")

text = "very long text "*1000

input_ids = tokenizer(text, return_tensors="pt")
input_ids["input_ids"].shape
# truncate to 514 tokens (the model's max_position_embeddings)
input_ids = {k: v[:, :514] for k, v in input_ids.items()}

input_ids["input_ids"].shape

outputs = model(**input_ids)

I get:

...
   2208     # remove once script supports set_grad_enabled
   2209     _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2210 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

IndexError: index out of range in self

Hi Kenneth!

This runs fine if you change 514 to 512 in your example, but I'm guessing you know that.
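If you prefer not to slice the tensors by hand, you can also let the tokenizer do the truncation. A minimal sketch, assuming the same model and tokenizer as in your example:

from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vesteinn/DanskBERT")
model = AutoModelForMaskedLM.from_pretrained("vesteinn/DanskBERT")

text = "very long text " * 1000

# Let the tokenizer cut the input down to the model's usable length of 512 tokens
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
inputs["input_ids"].shape  # torch.Size([1, 512])

outputs = model(**inputs)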

I was also confused by max_position_embeddings being set to 514, but these issues might shed some light on it: https://github.com/huggingface/transformers/issues/1363 and https://github.com/facebookresearch/fairseq/issues/1187. The model was trained with fairseq and then ported to Hugging Face.
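As a rough illustration of what those issues describe: RoBERTa-style models ported from fairseq compute position ids starting at padding_idx + 1 rather than 0, so two rows of the position embedding table never correspond to real token positions. A sketch of the arithmetic, assuming pad_token_id is 1 as in XLM-RoBERTa:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("vesteinn/DanskBERT")
print(config.max_position_embeddings)  # 514
print(config.pad_token_id)             # 1 (assumed, as in XLM-RoBERTa)

# Position ids start at pad_token_id + 1 = 2, so the usable sequence length is
# max_position_embeddings - pad_token_id - 1 = 514 - 1 - 1 = 512.
# Feeding 514 tokens produces position ids up to 515, which is out of range
# for the 514-row embedding table, hence the IndexError.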
