Embedding sequences

#12
by flpgrz - opened

Hello,

Thanks for making this model available!

I have been trying to embed sequences (of different lengths), by using the following code:

inputs = tokenizer(['CASSPRAGGITDTQYF', 'CASSLLQPFGTEAFF'], return_tensors="pt", padding=True)
outputs = model(**inputs, output_hidden_states=True)
embeddings = outputs.hidden_states[-1]  # last hidden state, i.e. the embedding before the final fc layer (hidden_states[0] is just the input embedding layer)

The two example sequences have different lengths and therefore produce different numbers of tokens, hence the need for padding (padding=True).

However, I get the following error:
ValueError: Asking to pad but the tokenizer does not have a padding token.

This makes me think that padding was not used at training time, as the tokenizer does not have a padding token.
How did you concatenate proteins of different lengths to create a batch at training time without padding?

Thanks for your help.

Hi flpgrz,

I did not pad in ProtGPT2 because the sequences were concatenated and truncated across fixed-length groups at training time, so no padding token was needed. This is something I did not like and changed in ZymCTRL, which does have a padding token.
In any case, I think you can add a padding token on the fly; could you try this?

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
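
If a brand-new [PAD] token is added this way, the model's embedding matrix would also need to be resized so the new token id has an entry to look up. A sketch of the same snippet with that extra step (untested on this model):

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))  # add a (randomly initialised) embedding row for the new [PAD] id

The [PAD] embedding itself stays untrained, but as long as the attention mask returned by the tokenizer is passed to the model, the padded positions should not affect the embeddings of the real residues.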

I understand. Thanks for clarifying.

I might be wrong, but I think adding the padding token only at the tokeniser step might not work, because the model does not know how to process the new token. But I should try it first.

What I have done so far is to embed one sequence at a time and zero-pad the embeddings afterwards to account for the different lengths.
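
Roughly like this, reusing the tokenizer and model objects from my first snippet (just a sketch of the idea):

import torch

sequences = ['CASSPRAGGITDTQYF', 'CASSLLQPFGTEAFF']

# embed each sequence separately, so no padding token is needed at tokenisation time
per_sequence = []
for seq in sequences:
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    per_sequence.append(outputs.hidden_states[-1][0])  # (seq_len, hidden_dim)

# zero-pad afterwards so all embeddings share the same length and can be stacked
max_len = max(e.shape[0] for e in per_sequence)
embeddings = torch.stack(
    [torch.nn.functional.pad(e, (0, 0, 0, max_len - e.shape[0])) for e in per_sequence]
)  # (batch, max_len, hidden_dim)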

Yes, I think you are right. I believe this is discussed in the GitHub issue I sent: https://github.com/huggingface/transformers/issues/3021
But I haven't tested it myself. Let me know if it works!
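
For what it is worth, a workaround often used with GPT-2-style tokenizers is to reuse the existing end-of-text token as the padding token, so no new embedding has to be learned at all; the attention mask then keeps the padded positions from leaking into the real ones. A rough sketch (again untested on this model, and reusing the tokenizer and model from above):

import torch

tokenizer.pad_token = tokenizer.eos_token  # reuse the end-of-text token for padding instead of adding a new one

inputs = tokenizer(['CASSPRAGGITDTQYF', 'CASSLLQPFGTEAFF'], return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# zero out the padded positions so they do not contribute to any downstream pooling
embeddings = outputs.hidden_states[-1] * inputs['attention_mask'].unsqueeze(-1)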
