End-to-end example for AA sequence vectorization

#1 opened by ptynecki

Hello,

Thank you for the wonderful research.
Would you be willing to share an end-to-end example of AA sequence vectorization using ProtGPT2?

Including ProtGPT2 model and tokenizer loading and execution, something like:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nferruz/ProtGPT2"

# Load the pretrained model and tokenizer; move the model to the GPU
model = AutoModelForCausalLM.from_pretrained(model_name).to('cuda:0')
tokenizer = AutoTokenizer.from_pretrained(model_name)

protein_sequences = [
    "MINDLLDISRIISGKMTLDRAEVNLTAIARQVVEEQRQAAEAKSIQLLCSTPDTNHYVFG",
    (...)
]

# Tokenize the sequences and move the resulting tensors to the same device as the model
input_ids = tokenizer(protein_sequences, return_tensors="pt").to('cuda:0')

# Forward pass through the model
outputs = model(**input_ids)
(...)

Furthermore, it would be perfect to show how to handle the output to get a fixed-length numeric vector for each protein.

This example could help other researchers compare the vector space of the proteins with other embeddings such as ProtTrans or ESM.

Regards,
Piotr

Hi Piotr,

Thanks a lot for reaching out!

I haven't explored sequence embedding myself yet, since I trained ProtGPT2 with protein design in mind, but I'd love to see how it performs.
But, based on the original GPT paper (although this wasn't explored in the GPT-2 and GPT-3 papers), it should be possible to embed the sequences by taking the attention heads as a vector.

Following your code, you could do:

outputs = model(**input_ids, output_attentions=True)

This returns a dictionary with the keys loss, logits, past_key_values, and attentions.

Each attention tensor (there is one per layer) will have the shape (batch_size, num_heads, sequence_length, sequence_length),
in your case: torch.Size([1, 20, 21, 21]).
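
As a minimal sketch of that call (my assumptions here: a single sequence and CPU execution, just for illustration; adapt the device handling to your setup):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nferruz/ProtGPT2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

sequence = "MINDLLDISRIISGKMTLDRAEVNLTAIARQVVEEQRQAAEAKSIQLLCSTPDTNHYVFG"
encoded = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoded, output_attentions=True)

# attentions is a tuple with one tensor per layer; each tensor has shape
# (batch_size, num_heads, sequence_length, sequence_length)
print(len(outputs.attentions))
print(outputs.attentions[-1].shape)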

It is 21 because your 60-amino-acid-long sequence gets converted to a 21-token-long sequence.
This will be a problem if you want vectors of the same length after mutating a single amino acid, because the number of tokens could change.
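
If you do want one fixed-length vector per sequence regardless of the token count, a common workaround (this is a generic transformer trick, not something I have validated for ProtGPT2) is to request the hidden states instead and mean-pool them over the token dimension:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nferruz/ProtGPT2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

sequence = "MINDLLDISRIISGKMTLDRAEVNLTAIARQVVEEQRQAAEAKSIQLLCSTPDTNHYVFG"
encoded = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoded, output_hidden_states=True)

# hidden_states is a tuple with one tensor per layer (plus the embedding layer),
# each of shape (batch_size, sequence_length, hidden_size)
last_hidden = outputs.hidden_states[-1]

# Averaging over the token dimension yields one fixed-length vector per sequence,
# independent of how many tokens the sequence was split into
embedding = last_hidden.mean(dim=1)   # (batch_size, hidden_size)

The resulting vector has the same dimensionality for every sequence, so single-point mutants stay comparable even when their token counts differ.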

I hope this helps for now, but in the meantime I'm going to read up on how to do this with autoregressive models in HuggingFace. I'll get back to you; sorry that I don't have hands-on experience to show!

Noelia
