IndexError: index out of range in self

#2
by phosseini - opened

Hi, I'm running the provided example in the model card and I'm getting the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-8-5181f41be2fc> in <cell line: 1>()
----> 1 torch_outs = model(
      2     tokens_ids,
      3     attention_mask=attention_mask,
      4     encoder_attention_mask=attention_mask,
      5     output_hidden_states=True

8 frames
/usr/local/lib/python3.9/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2208         # remove once script supports set_grad_enabled
   2209         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2210     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2211 
   2212 

IndexError: index out of range in self

Looking at the model's and the tokenizer's vocab sizes, I see a mismatch. Could that be the problem, or am I missing something else?

model.config.vocab_size
> 4105

tokenizer.vocab_size
> 4107
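
For reference, here is a quick check I put together (a rough sketch; the checkpoint name is just an example, substitute whichever model the card example loads) that shows whether any input id falls outside the model's embedding table, which is what trips torch.embedding:

from transformers import AutoTokenizer, AutoModelForMaskedLM

# Rough sketch: any input id >= model.config.vocab_size makes torch.embedding raise
# "IndexError: index out of range in self". Checkpoint name is illustrative.
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-1000g")
model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-1000g")

tokens_ids = tokenizer(["ATTCTGGTTTAAAGCAATGACAGAAGATGGA"], return_tensors="pt")["input_ids"]
print(model.config.vocab_size)      # size of the model's embedding table
print(tokenizer.vocab_size)         # size of the tokenizer vocabulary
print(tokens_ids.max().item())      # must stay below model.config.vocab_size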

Hi @phosseini - this error is fixed by a PR we pushed to transformers, but which is unfortunately only available on main right now. Please try installing from main with pip install --upgrade git+https://github.com/huggingface/transformers.git and see if that fixes your issue!
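
(If it helps, you can confirm the notebook is actually picking up the source install with something like:)

import transformers
print(transformers.__version__)   # an install from main reports a ".dev0" version, e.g. 4.30.0.dev0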

Hi @Rocketknight1 ,

I've been tooling around trying to fine-tune this model into a classifier. I was able to get the model through the Trainer, but when I run inference I get the same error listed above.
The model was trained with this config:

{
  "architectures": ["EsmForSequenceClassification"],
  "attention_probs_dropout_prob": 0,
  "emb_layer_norm_before": false,
  "esmfold_config": null,
  "hidden_dropout_prob": 0,
  "hidden_size": 1280,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6"
  },
  "initializer_range": 0.02,
  "intermediate_size": 5120,
  "is_folding_model": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6
  },
  "layer_norm_eps": 1e-12,
  "mask_token_id": 2,
  "max_position_embeddings": 1002,
  "model_type": "esm",
  "num_attention_heads": 20,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "tie_word_embeddings": false,
  "token_dropout": true,
  "torch_dtype": "float32",
  "transformers_version": "4.30.0.dev0",
  "use_cache": false,
  "vocab_list": null,
  "vocab_size": 4105
}

I am using the InstaDeepAI/nucleotide-transformer-500m-1000g tokenizer and have the same transformers version (4.30.0.dev0) loaded in my notebook, where I can run the fill-mask example from the model card successfully.
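
For context, the classifier was created roughly like this (a minimal sketch; the Trainer and dataset code are omitted, and num_labels=7 just mirrors the id2label entries in the config above):

from transformers import AutoModelForSequenceClassification

# Rough sketch of how the classification model was set up before fine-tuning;
# num_labels=7 matches the seven LABEL_* entries in the config above.
model = AutoModelForSequenceClassification.from_pretrained(
    "InstaDeepAI/nucleotide-transformer-500m-1000g",
    num_labels=7,
)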

Any ideas would be appreciated.

Hi @esko2213 , can you send me some code to reproduce the issue?

Thanks for the quick response.

I had been deploying the model through SageMaker and calling it like this:

predictor = huggingface_estimator.deploy(1, "ml.m5.xlarge")
input_sequence= {"inputs":"CAGCATTTTGAATTTGAATACCAGACCAAAGTGGATGGTGAGATAATCCTTCATCTTTATGACAAAGGAGGAATTGAGCAAACAATTTGTATGTTGGATGGTGTGTTTGCATTTGTTTTACTGGATACTGCCAATAAGAAAGTGTTCCTGGGTAGAGATACATATGGAGTCAGACCTTTGTTTAAAGCAATGACAGAAGATGGATTTTTGGCTGTATGTTCAGAAGCTAAAGGTCTTGTTACATTGAAGCACTCCGCGACTCCCTTTTTAAAAGTGGAGCCTTTTCTTCCTGGACACTATGAAGTTTTGGATTTAAAGCCAAATGGCAAAGTTGCATCCGTGGAAATGGTTAAATATCATCACTGTCGGGATGTACCCCTGCACGCCCTCTATGACAATGTGGAGAAACTCTTTCCAGGTTTTGAGATAGAAACTGTGAAGAACAACCTCAGGATCCTTTTTAATAATGCTGTAAAGAAACGTTTGATGACAGACAGAAGGATTGGCTGCCTTTTATCAGGGGGCTTGGACTCCAGCTTGGTTGCTGCCACTCTGTTGAAGCAGCTGAAAGAAGCCCAAGTACAGTATCCTCTCCAGACATTTGCAATTGGCATGGAAGACAGCCCCGATTTACTGGCTGCTAGAAAGGTGGCAGATCATATTGGAAGTGAACATTATGAAGTCCTTTTTAACTCTGAGGAAGGCATTCAGGCTCTGGATGAAGTCATATTTTCCTTGGAAACTTATGACATTACAACAGTTCGTGCTTCAGTAGGTATGTATTTAATTTCCAAGTATATTCGGAAGAACACAGATAGCGTGGTGATCTTCTCTGGAGAAGGATCAGATGAACTTACGCAGGGTTACATATATTTTCACAAGGCTCCTTCTCCTGAAAAAGCCGAGGAGGAGAGTGAGAGGCTTCTGAGGGAACTCTATTTGTTTGATGTTCTCCGCGCAGATCGAACTACTGCTGCCCATGGTCTTGAACTGAGAGTCCCATTTCTAGATCATCGATTTTCTTCCTATTACTTGTCTCTGCCACCAGAAATGAGAATTCCAAAGAATGGGATAGAAAAACATCTCCTGAGAGAGACGTTTGAGGATTCCAATCTGATACCCAAAGAGATTCTCTGGCGACCAAAAGAAGCCTTCAGTGATGGAATAACTTCAGTTAAGAATTCCTGGTTTAAGATTTTACAGGAATACGTTGAACATCAGGTTGATGATGCAATGATGGCAAATGCAGCCCAGAAATTTCCCTTCAATACTCCTAAAACCAAAGAAGGATATTACTACCGTCAAGTCTTTGAACGCCATTACCCAGGCCGGGCTGACTGGCTGAGCCATTACTGGATGCCCAAGTGGATCAATGCCACTGACCCTTCTGCCCGCACGCTGACCCACTACAAGTCAGCTGTCAAAGCTTAG"}
predictor.predict(input_sequence)

IndexError: index out of range in self

This produced the same error reported above. I've tried it with a single string and with an array of strings as input; both throw the error.

I have not pushed the model to the Hub yet, but I wanted to see what happens if I run it locally first.

Running it locally via a text-classification pipeline:


from transformers import pipeline

classifier = pipeline(task="text-classification", model="./nucl_class_model/")
for x in classifier("CAGCATTTTGAATTTGAATACCAGACCAAAGTGGATGGTGAGATAATCCTTCATCTTTATGACAAAGGAGGAATTGAGCAAACAATTTGTATGTTGGATGGTGTGTTTGCATTTGTTTTACTGGATACTGCCAATAAGAAAGTGTTCCTGGGTAGAGATACATATGGAGTCAGACCTTTGTTTAAAGCAATGACAGAAGATGGATTTTTGGCTGTATGTTCAGAAGCTAAAGGTCTTGTTACATTGAAGCACTCCGCGACTCCCTTTTTAAAAGTGGAGCCTTTTCTTCCTGGACACTATGAAGTTTTGGATTTAAAGCCAAATGGCAAAGTTGCATCCGTGGAAATGGTTAAA"):
    print(x)


IndexError: index out of range in self
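
(One variable I have not fully ruled out is the tokenizer saved alongside the local checkpoint; here is a sketch of forcing the nucleotide-transformer tokenizer into the pipeline, in case that makes a difference:)

from transformers import AutoTokenizer, pipeline

# Sketch: pass the nucleotide-transformer tokenizer explicitly instead of relying on
# whatever tokenizer files were written next to the fine-tuned checkpoint.
nt_tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-1000g")
classifier = pipeline(task="text-classification", model="./nucl_class_model/", tokenizer=nt_tokenizer)
print(classifier("CAGCATTTTGAATTTGAATACCAGACCAAAGTGGATGGTGAG"))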

However, when I run it via AutoModelForSequenceClassification and remove the encoder_attention_mask argument:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-1000g")
model = AutoModelForSequenceClassification.from_pretrained("./nucl_class_model/", local_files_only=True)

sequences = ['ATGCCCCAACTAAATACTACCGTATGGCCCACCATAATTACCCCCATACTCCTTACACTATTCCTCATCACCCAACTAAAAATATTAAACACAAACTACCACCTACCTCCCTCACCAAAGCCCATAAAAATAAAAAATTATAACAAACCCTGAGAACCAAAATGAACGAAAATCTGTTCGCTTCATTCATTGCCCCCACAATCCTAG']
tokens_ids = tokenizer.batch_encode_plus(sequences, return_tensors="pt")["input_ids"]
attention_mask = tokens_ids != tokenizer.mask_token_id

torch_outs = model(
    tokens_ids,
    attention_mask=attention_mask,
    #encoder_attention_mask=attention_mask,
    output_hidden_states=True
)

probs = torch.softmax(torch_outs.logits, dim=1)

I can get probabilities:

tensor([[2.8926e-04, 2.9352e-04, 3.5966e-04, 1.9120e-04, 1.0060e-03, 4.8940e-05,
         9.9781e-01]], grad_fn=<SoftmaxBackward0>)
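
For what it's worth, mapping that back to a label name (a small sketch continuing the code above, using the id2label entries from the config):

# The highest probability above is at index 6, which should correspond to "LABEL_6".
pred_id = probs.argmax(dim=1).item()
print(model.config.id2label[pred_id])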

Let me know if you want me to push up the model.
