There is an 'X' in generated sequences.

#23
by SweetGAN - opened

Dear Noelia,

When I analyze sequences generated with ProtGPT2, some of them contain 'X' as an amino acid (either at the beginning or in the middle). Could you please let me know how I should interpret this character?

Thanks!

Hi SweetGAN,

The UniRef database contains sequences with 'X' as an amino acid (it actually appears fairly frequently). By the usual sequence-file convention, 'X' stands for an unknown or unspecified residue. The model has therefore learned when this token tends to appear in the training set, and it generates sequences that resemble that distribution.
I recommend always computing the perplexity and keeping only the best 5-10% of each generation batch, meaning the sequences with the lowest perplexity (or being even more restrictive if you can). This way, you keep the best possible sequences from the model. If I am not mistaken, those should also contain a lower proportion of the 'X' amino acid.
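
As a rough sketch of that filtering step, the snippet below assumes you already have a perplexity value for every generated sequence (for a causal language model this is typically the exponential of the mean per-token cross-entropy loss). The function names `perplexity_from_loss`, `select_lowest_perplexity` and `x_fraction` are hypothetical helpers made up for illustration, not part of ProtGPT2 or any library, and the 10% cutoff is just one value from the 5-10% range suggested above.

```python
import math


def perplexity_from_loss(mean_token_loss: float) -> float:
    """Convert a mean per-token cross-entropy (natural log) into perplexity."""
    return math.exp(mean_token_loss)


def select_lowest_perplexity(scored, keep_fraction=0.10):
    """Keep the best `keep_fraction` of a batch, lowest perplexity first.

    `scored` is a list of (sequence, perplexity) pairs. At least one
    sequence is kept whenever the batch is non-empty.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(scored, key=lambda pair: pair[1])
    n_keep = max(1, math.ceil(len(ranked) * keep_fraction))
    return ranked[:n_keep]


def x_fraction(sequences):
    """Fraction of all residues in `sequences` that are the 'X' character."""
    total = sum(len(seq) for seq in sequences)
    if total == 0:
        return 0.0
    return sum(seq.count("X") for seq in sequences) / total


# Toy batch with made-up perplexity values, for illustration only.
batch = [("MKVL", 5.0), ("MXXA", 20.0), ("MKLA", 3.0), ("XAAA", 12.0)]
kept = select_lowest_perplexity(batch, keep_fraction=0.5)
kept_sequences = [seq for seq, _ in kept]
```

Comparing `x_fraction` on the whole batch and on `kept_sequences` is a quick way to check, on your own generations, whether the low-perplexity subset really contains fewer 'X' residues.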
