IndexError: index out of range in self

#2
by phosseini - opened

Hi, I'm running the provided example in the model card and I'm getting the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-8-5181f41be2fc> in <cell line: 1>()
----> 1 torch_outs = model(
      2     tokens_ids,
      3     attention_mask=attention_mask,
      4     encoder_attention_mask=attention_mask,
      5     output_hidden_states=True

8 frames
/usr/local/lib/python3.9/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2208         # remove once script supports set_grad_enabled
   2209         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2210     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2211 
   2212 

IndexError: index out of range in self

Looking at the model's and the tokenizer's vocab sizes, I see a mismatch. Could that be the problem, or am I missing something else?

model.config.vocab_size
> 4105

tokenizer.vocab_size
> 4107
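
For reference, here is a quick check I put together (a rough sketch; the checkpoint name is just an example, substitute whichever model the card example loads) that shows whether any input id falls outside the model's embedding table, which is what trips torch.embedding:

from transformers import AutoTokenizer, AutoModelForMaskedLM

# Rough sketch: any input id >= model.config.vocab_size makes torch.embedding raise
# "IndexError: index out of range in self". Checkpoint name is illustrative.
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-1000g")
model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-1000g")

tokens_ids = tokenizer(["ATTCTGGTTTAAAGCAATGACAGAAGATGGA"], return_tensors="pt")["input_ids"]
print(model.config.vocab_size)      # size of the model's embedding table
print(tokenizer.vocab_size)         # size of the tokenizer vocabulary
print(tokens_ids.max().item())      # must stay below model.config.vocab_size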

Hi @phosseini - this error is fixed by a PR we pushed to transformers, but which is unfortunately only available on main right now. Please try installing from main with pip install --upgrade git+https://github.com/huggingface/transformers.git and see if that fixes your issue!
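
(If it helps, you can confirm the notebook is actually picking up the source install with something like:)

import transformers
print(transformers.__version__)   # an install from main reports a ".dev0" version, e.g. 4.30.0.dev0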

Hi @Rocketknight1 ,

I've been tooling around trying to fine-tune this model into a classifier. I was able to get the model through the Trainer, but when I run inference I get the same error listed above.
The model was trained with this config:

{
  "architectures": ["EsmForSequenceClassification"],
  "attention_probs_dropout_prob": 0,
  "emb_layer_norm_before": false,
  "esmfold_config": null,
  "hidden_dropout_prob": 0,
  "hidden_size": 1280,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6"
  },
  "initializer_range": 0.02,
  "intermediate_size": 5120,
  "is_folding_model": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6
  },
  "layer_norm_eps": 1e-12,
  "mask_token_id": 2,
  "max_position_embeddings": 1002,
  "model_type": "esm",
  "num_attention_heads": 20,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "tie_word_embeddings": false,
  "token_dropout": true,
  "torch_dtype": "float32",
  "transformers_version": "4.30.0.dev0",
  "use_cache": false,
  "vocab_list": null,
  "vocab_size": 4105
}

I am using the InstaDeepAI/nucleotide-transformer-500m-1000g tokenizer and have the same transformers version (4.30.0.dev0) loaded in my notebook, where I can run the fill-mask example from the model card successfully.
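
For context, the classifier was created roughly like this (a minimal sketch; the Trainer and dataset code are omitted, and num_labels=7 just mirrors the id2label entries in the config above):

from transformers import AutoModelForSequenceClassification

# Rough sketch of how the classification model was set up before fine-tuning;
# num_labels=7 matches the seven LABEL_* entries in the config above.
model = AutoModelForSequenceClassification.from_pretrained(
    "InstaDeepAI/nucleotide-transformer-500m-1000g",
    num_labels=7,
)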

Any ideas would be appreciated.

Hi @esko2213 , can you send me some code to reproduce the issue?

Thanks for the quick response.

I had been deploying the model through SageMaker and calling it like this:

predictor = huggingface_estimator.deploy(1, "ml.m5.xlarge")
input_sequence= {"inputs":"CAGCATTTTGAATTTGAATACCAGACCAAAGTGGATGGTGAGATAATCCTTCATCTTTATGACAAAGGAGGAATTGAGCAAACAATTTGTATGTTGGATGGTGTGTTTGCATTTGTTTTACTGGATACTGCCAATAAGAAAGTGTTCCTGGGTAGAGATACATATGGAGTCAGACCTTTGTTTAAAGCAATGACAGAAGATGGATTTTTGGCTGTATGTTCAGAAGCTAAAGGTCTTGTTACATTGAAGCACTCCGCGACTCCCTTTTTAAAAGTGGAGCCTTTTCTTCCTGGACACTATGAAGTTTTGGATTTAAAGCCAAATGGCAAAGTTGCATCCGTGGAAATGGTTAAATATCATCACTGTCGGGATGTACCCCTGCACGCCCTCTATGACAATGTGGAGAAACTCTTTCCAGGTTTTGAGATAGAAACTGTGAAGAACAACCTCAGGATCCTTTTTAATAATGCTGTAAAGAAACGTTTGATGACAGACAGAAGGATTGGCTGCCTTTTATCAGGGGGCTTGGACTCCAGCTTGGTTGCTGCCACTCTGTTGAAGCAGCTGAAAGAAGCCCAAGTACAGTATCCTCTCCAGACATTTGCAATTGGCATGGAAGACAGCCCCGATTTACTGGCTGCTAGAAAGGTGGCAGATCATATTGGAAGTGAACATTATGAAGTCCTTTTTAACTCTGAGGAAGGCATTCAGGCTCTGGATGAAGTCATATTTTCCTTGGAAACTTATGACATTACAACAGTTCGTGCTTCAGTAGGTATGTATTTAATTTCCAAGTATATTCGGAAGAACACAGATAGCGTGGTGATCTTCTCTGGAGAAGGATCAGATGAACTTACGCAGGGTTACATATATTTTCACAAGGCTCCTTCTCCTGAAAAAGCCGAGGAGGAGAGTGAGAGGCTTCTGAGGGAACTCTATTTGTTTGATGTTCTCCGCGCAGATCGAACTACTGCTGCCCATGGTCTTGAACTGAGAGTCCCATTTCTAGATCATCGATTTTCTTCCTATTACTTGTCTCTGCCACCAGAAATGAGAATTCCAAAGAATGGGATAGAAAAACATCTCCTGAGAGAGACGTTTGAGGATTCCAATCTGATACCCAAAGAGATTCTCTGGCGACCAAAAGAAGCCTTCAGTGATGGAATAACTTCAGTTAAGAATTCCTGGTTTAAGATTTTACAGGAATACGTTGAACATCAGGTTGATGATGCAATGATGGCAAATGCAGCCCAGAAATTTCCCTTCAATACTCCTAAAACCAAAGAAGGATATTACTACCGTCAAGTCTTTGAACGCCATTACCCAGGCCGGGCTGACTGGCTGAGCCATTACTGGATGCCCAAGTGGATCAATGCCACTGACCCTTCTGCCCGCACGCTGACCCACTACAAGTCAGCTGTCAAAGCTTAG"}
predictor.predict(input_sequence)

IndexError: index out of range in self

This produced the same error reported above. I've tried it with a single string and with an array of strings as input; both throw the error.

I have not pushed the model to the Hub yet, but I wanted to see what happens if I run it locally first.

Running it locally via a text-classification pipeline:


from transformers import pipeline

classifier = pipeline(task="text-classification", model="./nucl_class_model/")
for x in classifier("CAGCATTTTGAATTTGAATACCAGACCAAAGTGGATGGTGAGATAATCCTTCATCTTTATGACAAAGGAGGAATTGAGCAAACAATTTGTATGTTGGATGGTGTGTTTGCATTTGTTTTACTGGATACTGCCAATAAGAAAGTGTTCCTGGGTAGAGATACATATGGAGTCAGACCTTTGTTTAAAGCAATGACAGAAGATGGATTTTTGGCTGTATGTTCAGAAGCTAAAGGTCTTGTTACATTGAAGCACTCCGCGACTCCCTTTTTAAAAGTGGAGCCTTTTCTTCCTGGACACTATGAAGTTTTGGATTTAAAGCCAAATGGCAAAGTTGCATCCGTGGAAATGGTTAAA"):
    print(x)


IndexError: index out of range in self
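
(One variable I have not fully ruled out is the tokenizer saved alongside the local checkpoint; here is a sketch of forcing the nucleotide-transformer tokenizer into the pipeline, in case that makes a difference:)

from transformers import AutoTokenizer, pipeline

# Sketch: pass the nucleotide-transformer tokenizer explicitly instead of relying on
# whatever tokenizer files were written next to the fine-tuned checkpoint.
nt_tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-1000g")
classifier = pipeline(task="text-classification", model="./nucl_class_model/", tokenizer=nt_tokenizer)
print(classifier("CAGCATTTTGAATTTGAATACCAGACCAAAGTGGATGGTGAG"))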

However, when I run it via AutoModelForSequenceClassification and remove the encoder_attention_mask argument:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-1000g")
model = AutoModelForSequenceClassification.from_pretrained("./nucl_class_model/", local_files_only=True)

sequences = ['ATGCCCCAACTAAATACTACCGTATGGCCCACCATAATTACCCCCATACTCCTTACACTATTCCTCATCACCCAACTAAAAATATTAAACACAAACTACCACCTACCTCCCTCACCAAAGCCCATAAAAATAAAAAATTATAACAAACCCTGAGAACCAAAATGAACGAAAATCTGTTCGCTTCATTCATTGCCCCCACAATCCTAG']
tokens_ids = tokenizer.batch_encode_plus(sequences, return_tensors="pt")["input_ids"]
attention_mask = tokens_ids != tokenizer.mask_token_id

torch_outs = model(
    tokens_ids,
    attention_mask=attention_mask,
    #encoder_attention_mask=attention_mask,
    output_hidden_states=True
)

probs = torch.softmax(torch_outs.logits, dim=1)

I can get probabilities:

tensor([[2.8926e-04, 2.9352e-04, 3.5966e-04, 1.9120e-04, 1.0060e-03, 4.8940e-05,
         9.9781e-01]], grad_fn=<SoftmaxBackward0>)
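
For what it's worth, mapping that back to a label name (a small sketch continuing the code above, using the id2label entries from the config):

# The highest probability above is at index 6, which should correspond to "LABEL_6".
pred_id = probs.argmax(dim=1).item()
print(model.config.id2label[pred_id])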

Let me know if you want me to push up the model.
