Clarification question: sequence with invalid proteins

#14
by skr3178 - opened

image.png

If I feed the model an invalid sequence, it still predicts a sequence that continues from this list of amino acid letters.

That is probably due to the base model coming from ChatGPT.
This is probably never a situation that anyone will use the model for, but I wanted to hear any thoughts on it.

Similarly, I ran the sequence on ColabFold but got a non-converging sequence error. For reference:

image.png

Hi skr3178,

Thanks for writing!
The model will always predict a sequence after an input. It is autoregressive and chooses the next token based on its context. In your case, the context is 'BJOUX'. It has no biological meaning, but it still corresponds to a set of tokens. Hence, the model can compute the associated probabilities for the tokens that come after it. I would expect the perplexities of those sequences to be fairly high, though. If you want to avoid specific tokens during generation, you could use the bad_words parameter (bad_words_ids in the generate() method of Hugging Face Transformers).
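To make the three points above concrete, here is a toy sketch in plain Python. It is not the actual model: `toy_logits` is a hypothetical scorer that simply prefers the 20 standard residues, and the vocabulary, function names, and scores are all assumptions made for illustration. It shows that any context yields a next-token distribution, that a biologically meaningless input tends to have higher perplexity under such a scorer, and that banning tokens amounts to zeroing their probability before sampling.

```python
import math

# Toy vocabulary (an assumption, not the real tokenizer): 20 standard
# amino acids plus letters that are rare or ambiguous in protein sequences.
STANDARD = "ACDEFGHIKLMNPQRSTVWY"
NON_STANDARD = "BJOUXZ"
VOCAB = STANDARD + NON_STANDARD


def toy_logits(context):
    """Hypothetical scorer standing in for the trained network."""
    logits = {}
    for tok in VOCAB:
        score = 2.0 if tok in STANDARD else -1.0
        if context and tok == context[-1]:
            score += 1.0  # arbitrary dependence on the previous token
        logits[tok] = score
    return logits


def next_token_probs(context, banned=()):
    """Softmax over the vocabulary; banned tokens get probability 0."""
    logits = toy_logits(context)
    for tok in banned:
        logits[tok] = -math.inf
    top = max(logits.values())
    exps = {tok: math.exp(score - top) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def perplexity(sequence):
    """exp of the mean negative log-probability of each token given its prefix."""
    nll = 0.0
    for i, tok in enumerate(sequence):
        nll -= math.log(next_token_probs(sequence[:i])[tok])
    return math.exp(nll / len(sequence))


# Any context, meaningful or not, still produces a valid distribution.
probs = next_token_probs("BJOUX")
print(sum(probs.values()))

# The meaningless input scores a higher perplexity than a plausible one.
print(perplexity("MKVLA"), perplexity("BJOUX"))

# Banning the non-standard letters removes them from generation.
filtered = next_token_probs("BJOUX", banned=NON_STANDARD)
print(filtered["X"])
```

With a real model the same idea applies at the token-ID level: the banned entries are given as lists of token IDs rather than single characters, so they would need to be produced with the model's own tokenizer.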
I'm not an expert on ColabFold, but what sequence did you try?
Thanks!
noelia

Hi Noelia,
Thank you for the clarification and for sharing this work.
I find it very interesting and learnt a lot :-)
I tried a random combination on ColabFold (not a real sequence).
Thanks!

skr3178 changed discussion status to closed
