Generation of sequences based on one input sequence

#8 by 0jj0

Hi Noelia,
thanks a lot for creating ProtGPT2. It is my first experience with protein language models and deep learning, so I have been learning quite a lot while playing with ProtGPT2. One noob question maybe: is it possible to use ProtGPT2 to generate a set of new proteins (enzymes) based on a single input sequence of an enzyme that was characterized before? From what I understand so far, you need to provide a part of the sequence (does it have to be the beginning of the sequence?) and ProtGPT2 will "fill in the blank" after the initial input.
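For reference, this is roughly how I have been generating so far (a minimal sketch with the Transformers pipeline API; the seed fragment is a placeholder of my own, and the sampling values are the ones I picked up from the model card, if I read it correctly):

```python
# Minimal generation sketch: ProtGPT2 continues ("fills in") after the seed.
# The seed fragment is a placeholder; sampling values follow the model card.
from transformers import pipeline

protgpt2 = pipeline("text-generation", model="nferruz/ProtGPT2")

seed = "MKTAYIAKQR"  # hypothetical N-terminal fragment of the target enzyme
sequences = protgpt2(
    seed,
    max_length=100,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=5,
    eos_token_id=0,
)
for s in sequences:
    print(s["generated_text"])
```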
I saw that there is a way to retrain the model on a set of sequences, but I guess a single example sequence is not enough to retrain it?
One small comment: if possible, you could include a short instruction on how to set up Transformers for someone who is very inexperienced with the field. I'm familiar with conda and co., so it's not a real issue for me, and a Colab notebook (https://colab.research.google.com/drive/15ucZMtrAeFE_YOBQ9FdrWlAngvljJ4ss?usp=sharing&pli=1#scrollTo=rZjPq1f5y9Op) has been very useful for me to set up ProtGPT2. Also, in my setup, installing transformers with conda from the conda-forge channel seemed to work better than the huggingface channel suggested on the Hugging Face installation page (pytorch and the other dependencies were installed at the same time when using conda-forge).
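In case it helps other beginners, a quick sanity check after the conda install (nothing model-specific, just verifying that the packages import):

```python
# Verify that transformers and its PyTorch backend are usable.
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```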
Thanks again and looking forward to your feedback,
Best regards,
Mr. Curiosity

Hi kurioscity, thanks a lot for writing!
Yes, as you mention, with one single sequence it will be hard to fine-tune the model (I don't expect the weights to update much), and yes, the only other option to condition the output is to provide the context (i.e., the leftmost part of the sequence), which is not helpful in your case.
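(For later, if you do gather a set of related sequences: fine-tuning is standard causal-LM training. The Transformers run_clm.py example script covers it, or a minimal Trainer-based sketch like this one; the file name and hyperparameters below are placeholders, not recommendations:)

```python
# Minimal causal-LM fine-tuning sketch; training.txt (one sequence per line)
# and all hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")

dataset = load_dataset("text", data_files={"train": "training.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="protgpt2-finetuned", num_train_epochs=3),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```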
But maybe there's another option: I recently trained another model, this time only on enzymes - you can find it under the name ZymCTRL. We haven't uploaded the preprint yet, but it's about to come out. If you know your enzyme's EC number, fine-tuning this model might give you better results than ProtGPT2.
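Even without fine-tuning, ZymCTRL lets you condition generation on the EC number directly. A sketch of what that would look like (the prompt format and sampling values here are assumptions on my part - please double-check the model card for the exact input format):

```python
# Sketch: EC-conditioned generation with ZymCTRL.
# Prompt format and sampling parameters are assumptions; see the model card.
from transformers import pipeline

zymctrl = pipeline("text-generation", model="nferruz/ZymCTRL")

ec_number = "1.1.1.1"  # placeholder EC class of interest
outputs = zymctrl(
    ec_number,
    max_length=400,
    do_sample=True,
    top_k=50,
    repetition_penalty=1.2,
    num_return_sequences=5,
)
for out in outputs:
    print(out["generated_text"])
```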

You're right about an intro to the Transformers module - I will try to make the docs more accessible on Monday!

Let me know if you try to fine-tune the models and questions arise,
Noelia

Hi Noelia,
thanks a lot for your reply. I did see the ZymCTRL project and wanted to try it out, as it looks very interesting (I work a lot with enzyme discovery and characterization). I will try it out and let you know if I have any questions. As mentioned, I did try to use ProtGPT2 to create new sequences of an enzyme whose sequence is pretty new in terms of sequence identity (the enzyme class is not novel but rather underexplored), using the first 25 amino acids as the starting point. The output sequences are very different from the original whole sequence, but I have not really looked into whether they will have the same function. One approach I am considering for my purpose is to let the network create, say, 100,000 sequences and then use, for example, InterProScan to predict the putative function and select the ones that are predicted to belong to my enzyme classes of interest.
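In sketch form, the pipeline I have in mind would be something like this (seed, counts, and file names are placeholders; InterProScan would then run on the resulting FASTA file):

```python
# Sketch: generate many candidate sequences and write them as FASTA,
# to be screened afterwards with InterProScan. All values are placeholders.
from transformers import pipeline

protgpt2 = pipeline("text-generation", model="nferruz/ProtGPT2")
seed = "MKTAYIAKQR"  # first residues of the characterized enzyme (placeholder)

with open("candidates.fasta", "w") as fasta:
    n_written = 0
    while n_written < 100_000:
        batch = protgpt2(
            seed,
            max_length=300,
            do_sample=True,
            num_return_sequences=50,
        )
        for out in batch:
            # ProtGPT2 outputs FASTA-style newlines inside sequences; strip them.
            seq = out["generated_text"].replace("\n", "")
            n_written += 1
            fasta.write(f">candidate_{n_written}\n{seq}\n")
```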
Thank you also for considering improving the documentation. And now I'm pretty curious to try ZymCTRL. I also saw that you have a preprint exploring the use of ProtGPT2 for Biosynthetic Gene Cluster applications (also one of my research interests). I think there is a lot of potential there, maybe training a network on the genes of MIBiG BGCs together with their corresponding reactions (when available) and using it to create a totally novel BGC that might produce a new compound.
Best regards,
Mr. Curiosity

Hi Kurioscity,
your project sounds really interesting to me! Indeed, with the first 25 amino acids, the generated sequences will most likely diverge a lot. Still, I would do as you say and pass them through InterProScan to see if they perform the function you want. For this purpose, I'd suggest you also have a look at ProteInfer (https://google-research.github.io/proteinfer/), since it directly outputs EC classes. You can also submit a job and retrieve the result with a URL request instead of using the web interface.

The preprint that you mention actually followed a similar protocol: we generated 100,000 sequences and, as an example, chose sequences that could produce molecular glues. Now that you say it, you could use the same pipeline we used and search for your specific function: https://github.com/hefeda/PGP.
What you mention is a good idea! All these models tend to focus on a single sequence or reaction, but we haven't yet explored training networks that exploit the interconnection of many to create enzymatic cascades or BGCs.
