Predicting end of sequence

#31
by pipparichter - opened

Hello! I am trying to use the model to predict whether a protein sequence is complete (as opposed to erroneously truncated). It seems that the ProtGPT2 model tends to extend already-complete protein sequences (i.e., it rarely predicts 0, the end-of-text token id, as the next token). I wanted to double-check these results using the API in the browser, which led to the following observation: it never predicts an <|endoftext|> token -- it simply shows a warning saying "no text generated." Does this mean that the model has concluded the input sequence is complete?

Thank you!
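One way to probe this locally, rather than through the browser widget, is to read off the probability the model assigns to the end-of-text token at the last position. Below is a minimal sketch: the helper is generic softmax arithmetic, the dummy logits are illustrative only, and in practice the logits would come from scoring the sequence with the model (e.g. `model(input_ids).logits[0, -1].tolist()` via the `transformers` library; the thread suggests the end-of-text token has id 0 in this tokenizer).

```python
import math

def eos_probability(last_logits, eos_id=0):
    """Softmax probability of the end-of-sequence token, given the
    logits at the final position (a plain list of floats)."""
    m = max(last_logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in last_logits]
    return exps[eos_id] / sum(exps)

# Dummy logits for illustration only; with ProtGPT2 these would come from
# the model's output at the last position of the input sequence.
dummy = [3.0] + [0.0] * 9
print(eos_probability(dummy, eos_id=0))  # high value: the model "wants" to stop here
```

A probability near zero at the end of a known-complete sequence would confirm the behavior described above (the model preferring to keep extending).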

Hi!
The model terminates sequences by producing an <|endoftext|> token, but it won't appear in the generated output because special tokens are not displayed unless you choose to show them. Bear in mind that this model tends to generate sequences somewhat longer than natural ones on average (it likes to talk a lot), so it may continue sequences that are already complete. Other models we've trained since do not show this behavior.

On a different note, what do you mean by the API in the browser? Do you mean here on HF? If so, I would not rely on that generation, because it is produced automatically by HF without the optimal generation parameters, and it does not filter by perplexity. In any case, the HF API would not show the special token <|endoftext|>, so I suppose the behaviour would be 'no text generated', as you said (but having no experience generating from the browser, I'd do it locally instead :)).
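The perplexity filtering mentioned above is easy to apply locally: perplexity is the exponential of the negative mean token log-probability, and lower values indicate sequences the model finds more natural. A hedged sketch follows; `token_logprobs` is assumed to hold the per-token log-probabilities obtained by scoring a generated sequence with the model.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean log-probability of the tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Sanity check: if every token had probability 0.5, perplexity would be 2.
print(perplexity([math.log(0.5)] * 4))  # ≈ 2.0
```

Generated sequences can then be ranked by this value and the lowest-perplexity candidates kept.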

Hope this helps,
Noelia
