Many generated sequences are highly similar to WT sequences

#10
by ShanGao - opened

Dear Authors,

I used the model to generate sequences for some randomly chosen BRENDA enzyme classes. Many of the generated sequences are over 90% identical to sequences in BRENDA, and some are 100% identical. Is this expected? I used the parameters (top_p, top_k, temperature) recommended in your manuscript.

AI for protein design org

Hi Shangao,

It is only expected in BRENDA classes with high redundancy. For example, if a BRENDA class contains only 10 sequences, but they fall into 10 different clusters at 50% identity, ZymCTRL will generate sequences at that distance. To decrease the identity, you could fine-tune the model on a less redundant dataset.
Best wishes
Noelia
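
As a rough illustration of the two ideas in the reply above (measuring how close a generated sequence is to the training sequences, and building a less redundant dataset for fine-tuning), here is a minimal sketch. It is not the authors' pipeline. The function names (`identity`, `greedy_reduce`, `max_identity_to_reference`) are hypothetical, and identity is approximated as longest-common-subsequence length divided by the longer sequence length, which is only a crude stand-in for alignment-based identity. For real datasets a dedicated clustering tool such as MMseqs2 or CD-HIT would be the usual choice.

```python
from typing import List


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def identity(a: str, b: str) -> float:
    """Crude identity proxy (an assumption): LCS length over the longer length."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def greedy_reduce(seqs: List[str], threshold: float = 0.5) -> List[str]:
    """Keep a sequence only if it is below `threshold` identity to every
    representative kept so far (longest sequences are considered first)."""
    representatives: List[str] = []
    for s in sorted(seqs, key=len, reverse=True):
        if all(identity(s, r) < threshold for r in representatives):
            representatives.append(s)
    return representatives


def max_identity_to_reference(generated: str, reference: List[str]) -> float:
    """Highest identity between one generated sequence and any reference sequence."""
    return max(identity(generated, r) for r in reference)


if __name__ == "__main__":
    # Toy sequences, made up purely for illustration.
    toy = ["MKTAYIAK", "MKTAYIAR", "GGGGGGGG"]
    print(greedy_reduce(toy, threshold=0.8))
    print(max_identity_to_reference("MKTAYIAK", ["MKTAYIAR", "GGGGGGGG"]))
```

Running `max_identity_to_reference` over each generated sequence against the sequences of the same BRENDA class gives a quick view of how many generations are near-copies, and `greedy_reduce` shows the general shape of a redundancy filter that could be applied before fine-tuning.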
