Inquiry Regarding ProtGPT2 Prompt and Output Length

#26
by littleworth - opened

@nferruz

Dear Noelia,

I'd like to express my appreciation for your efforts in developing the ProtGPT2 model. It's an excellent resource for the research community, and I'm excited to work with it.

I have a few questions regarding the correct usage of prompts and output length when working with ProtGPT2. In your provided example, I noticed that you use an <|endoftext|> as the prompt:

from transformers import pipeline
protgpt2 = pipeline('text-generation', model="nferruz/ProtGPT2")
sequences = protgpt2("<|endoftext|>", max_length=100, do_sample=True, top_k=950, repetition_penalty=1.2, num_return_sequences=10, eos_token_id=0)
for seq in sequences:
    print(seq)

Is it necessary to include <|endoftext|> as the prompt? I experimented with using "MKK" directly as my prompt, and the model returned results without any errors. However, I'm concerned about the accuracy of the results with this approach. Could you please clarify whether the <|endoftext|> token is required for accurate output? For instance, is "<|endoftext|>MKK" the more correct approach?

Additionally, does ProtGPT2 default to starting with "M" as the beginning of the generated sequence when provided with an <|endoftext|> prompt only?

My second question pertains to the max_length parameter. In your example, you set max_length=100, but the generated output can exceed this length. Is this an expected behavior of the model?

Your guidance on these matters would be greatly appreciated. I look forward to hearing from you and gaining a deeper understanding of the ProtGPT2 model.

Sincerely,
Littleworth

Hi Littleworth,

I put <|endoftext|> in the documentation as an example so that the model starts a de novo sequence. But you can start with any seed sequence you like, such as the one you mention, or you could also leave the prompt empty.
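A minimal sketch of the three prompting styles discussed here (de novo, a raw seed, and a token-prefixed seed). The helper name is my own, not part of ProtGPT2 or transformers; it only builds the prompt string that would be passed to the pipeline:

```python
def make_prompt(seed: str = "", prepend_eos: bool = True) -> str:
    """Build a ProtGPT2 prompt: optionally prefix the <|endoftext|>
    token so generation starts a fresh de novo sequence."""
    token = "<|endoftext|>"
    if prepend_eos and not seed.startswith(token):
        return token + seed
    return seed

# De novo generation (the documented example):
print(make_prompt())                          # "<|endoftext|>"
# Seeding with the first residues of a target protein:
print(make_prompt("MKK"))                     # "<|endoftext|>MKK"
# Raw seed with no special token, as tried above:
print(make_prompt("MKK", prepend_eos=False))  # "MKK"
```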

The model will often generate an 'M' after <|endoftext|>, but not always. It reproduces the distribution shown in the training set, and since some natural sequences don't start with 'M', sometimes it generates other amino acids too.

The max_length param refers to the number of tokens, not amino acids. Each token has an average length of 4 amino acids, so I'd expect a max_length of 100 to give sequences anywhere from 0 to about 500 amino acids.
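Since max_length counts tokens rather than residues, one simple way to enforce a residue-length budget is to over-generate and filter afterwards. A sketch, assuming the pipeline's usual output format (a list of dicts with a 'generated_text' key) and that outputs may contain newlines, since ProtGPT2's training sequences were FASTA-style wrapped; both helper names are my own:

```python
def residue_length(generated_text: str) -> int:
    """Count amino acids in a ProtGPT2 generation, ignoring the
    <|endoftext|> token and any newlines in the output."""
    seq = generated_text.replace("<|endoftext|>", "")
    return len(seq.replace("\n", ""))

def filter_by_length(outputs, min_res=50, max_res=300):
    """Keep only generations whose residue count falls in range.
    `outputs` is the pipeline's list of {'generated_text': ...} dicts."""
    return [o for o in outputs
            if min_res <= residue_length(o["generated_text"]) <= max_res]

# With max_length=100 tokens (~4 residues per token on average),
# generations can span roughly 0 to several hundred residues,
# so post-hoc filtering is an easy way to enforce a budget.
# Dummy outputs stand in for real pipeline results here:
fake_outputs = [{"generated_text": "M" + "A" * 119},   # 120 residues
                {"generated_text": "M" + "A" * 399}]   # 400 residues
print([residue_length(o["generated_text"]) for o in fake_outputs])  # [120, 400]
print(len(filter_by_length(fake_outputs, 50, 300)))                 # 1
```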

Hope this helps,
Noelia

@nferruz Thanks so much for your clarification.