New line characters in generated sequences

#20
by emrecicekyurt - opened

Hello,

Firstly, thank you @nferruz for providing a freely accessible and well-documented model.

I am wondering how the model inserts newline (\n) characters into the generated sequences. It seems that a newline is inserted after every 60 characters, mimicking the line width of a typical FASTA file. However, in some cases the model inserts a newline before the 60-character mark.
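For reference, this is roughly how I generate sequences; a minimal sketch using the Hugging Face transformers pipeline, where the model id and the sampling parameters are just my own setup, not something prescribed here:

```python
# Sketch of a generation setup (model id and sampling values are
# illustrative choices, not mandated by this discussion).
from transformers import pipeline

protgpt2 = pipeline("text-generation", model="nferruz/ProtGPT2")

# The generated text comes back with '\n' characters inserted,
# usually after every 60 residues.
outputs = protgpt2(
    "<|endoftext|>",
    max_length=100,
    do_sample=True,
    repetition_penalty=1.2,
    num_return_sequences=5,
    eos_token_id=0,
)
for out in outputs:
    print(repr(out["generated_text"]))  # repr() makes the \n visible
```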

I couldn't find a proper answer to these questions:

  • Are there any criteria for inserting newline characters into sequences?
  • Is "\n" used only for formatting, and can it be removed to obtain the actual amino acid sequence?

Thanks in advance,

Emre

Hello!

Thanks for your post. Yes, those newline tokens are an artifact of the way I trained the model. I didn't notice them at the time of training, but of course, following the FASTA format, they appeared after every 60 characters. We trained several models after ProtGPT2, and I made sure those didn't produce newline characters, as they only make generation more complicated.
In any case, for this model, I'd ignore all sequences where the model generates a newline character within the first 60 amino acids; those are bad sequences. For the rest of the sequences, you can remove the newline characters to get the final string, although I'd leave them in if you are computing perplexity values, since the model expects them every 60 characters. Also, it has never happened to me, but if a newline character appeared at a position that is not a multiple of 60, I would discard that sequence too.
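In code, the filtering I describe could look something like the sketch below. The helper names are mine and nothing here is part of the ProtGPT2 codebase; the model id in the usage comment is assumed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def is_valid_generation(seq: str) -> bool:
    """Keep a sequence only if every newline falls exactly on a
    60-residue boundary. A newline within the first 60 amino acids
    (i.e. a short first line) also fails this check."""
    lines = seq.rstrip("\n").split("\n")
    return all(len(line) == 60 for line in lines[:-1])


def clean_sequence(seq: str) -> str:
    """Strip the formatting newlines to recover the raw amino acid string."""
    return seq.replace("\n", "")


def perplexity(seq: str, model, tokenizer) -> float:
    """Score the sequence *with* its newlines kept, since the model
    saw them every 60 characters during training."""
    ids = tokenizer(seq, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()


# Example usage (model id assumed):
# tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
# model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")
# if is_valid_generation(generated_text):
#     aa_seq = clean_sequence(generated_text)            # for downstream use
#     ppl = perplexity(generated_text, model, tokenizer) # newlines kept
```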

Let me know if questions remain.
Noelia
