Max position embeddings cause an error when the input exceeds 512 tokens.

#3

When I run:

from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vesteinn/DanskBERT")
model = AutoModelForMaskedLM.from_pretrained("vesteinn/DanskBERT")

text = "very long text "*1000

input_ids = tokenizer(text, return_tensors="pt")
input_ids["input_ids"].shape
# truncate to 514 tokens (the model's max_position_embeddings)
input_ids = {k: v[:, :514] for k, v in input_ids.items()}

input_ids["input_ids"].shape

outputs = model(**input_ids)

I get:

...
   2208     # remove once script supports set_grad_enabled
   2209     _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2210 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

IndexError: index out of range in self

Hi Kenneth!

This runs fine if you change 514 to 512 in your example, but I'm guessing you know that.
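If you prefer not to slice the tensors by hand, you can also let the tokenizer do the truncation. A minimal sketch, assuming the same model and tokenizer as in your example:

from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vesteinn/DanskBERT")
model = AutoModelForMaskedLM.from_pretrained("vesteinn/DanskBERT")

text = "very long text " * 1000

# Let the tokenizer cut the input down to the model's usable length of 512 tokens
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
inputs["input_ids"].shape  # torch.Size([1, 512])

outputs = model(**inputs)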

I was also confused by max_position_embeddings being set to 514, but these issues might shed some light on it: https://github.com/huggingface/transformers/issues/1363 and https://github.com/facebookresearch/fairseq/issues/1187. The model was trained with fairseq and then ported to Hugging Face.
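As a rough illustration of what those issues describe: RoBERTa-style models ported from fairseq compute position ids starting at padding_idx + 1 rather than 0, so two rows of the position embedding table never correspond to real token positions. A sketch of the arithmetic, assuming pad_token_id is 1 as in XLM-RoBERTa:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("vesteinn/DanskBERT")
print(config.max_position_embeddings)  # 514
print(config.pad_token_id)             # 1 (assumed, as in XLM-RoBERTa)

# Position ids start at pad_token_id + 1 = 2, so the usable sequence length is
# max_position_embeddings - pad_token_id - 1 = 514 - 1 - 1 = 512.
# Feeding 514 tokens produces position ids up to 515, which is out of range
# for the 514-row embedding table, hence the IndexError.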
