license: apache-2.0

SilkomeGPT: Generative strategies for modeling, design and analysis of spider silk protein sequences for enhanced mechanical properties

Generative strategies for modeling, design and analysis of silk protein sequences for enhanced mechanical properties

Wei Lu, David L. Kaplan, Markus J. Buehler

Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, MA 02139, USA

Contact email: mbuehler@mit.edu

Abstract: Spider silks are remarkable materials characterized by superb mechanical properties such as strength, extensibility and lightweightedness. Yet, to date, limited models are available to fully explore sequence-property relationships for analysis and design. Here a custom generative large-language model is proposed to enable design of novel spider silk protein sequences to meet complex combinations of target mechanical properties. The model, pretrained on a large set of protein sequences, is fine-tuned on ~1,000 major ampullate spidroin (MaSp) sequences for which associated fiber-level mechanical properties exist, to yield an end-to-end forward and inverse generative approach that is applied in a multi-agent strategy. Performance is assessed through: (1) a novelty analysis and protein type classification for generated spidroin sequences through Basic Local Alignment Search Tool (BLAST) searches, (2) property evaluation and comparison with similar sequences, (3) comparison of molecular structures, and (4) detailed sequence motif analyses. This work generates silk sequences with property combinations that do not exist in nature, and develops a deep understanding of the mechanistic roles of sequence patterns in achieving overarching key mechanical properties (elastic modulus, strength, toughness, failure strain). The model provides an efficient approach to expand the silkome dataset, facilitating further sequence-structure analyses of silks, and establishes a foundation for synthetic silk design and optimization. This work not only shows the capacity of generative transformer models to design complex materials, but also illustrates an effective use of agentic modeling for self-improving design solutions.

Keywords: biomaterials; deep learning; generative autoregressive transformer; hierarchical; multiscale modeling; spider silk; spidroin

GitHub (more code, notebooks, etc.): https://github.com/lamm-mit/SilkomeGPT

Trained model and inference

This model is a GPT-style autoregressive transformer, pretrained on a large number of silk and other protein sequences. The pretraining task is defined as "Sequence<...>", where ... is an amino acid sequence.
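As a minimal sketch of the task format described above, the helper below wraps an amino acid sequence in the "Sequence<...>" pattern. The function name `format_sequence_task` and the check against the 20 standard one-letter amino acid codes are illustrative assumptions and are not taken from the released code.

```python
# Hypothetical helper (not part of the SilkomeGPT repository): wraps an amino
# acid sequence in the "Sequence<...>" task pattern described in this README.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def format_sequence_task(sequence: str) -> str:
    """Return 'Sequence<SEQ>' after normalizing case and validating letters."""
    sequence = sequence.strip().upper()
    unknown = set(sequence) - STANDARD_AMINO_ACIDS
    if not sequence or unknown:
        # Assumption: only the 20 standard one-letter codes are accepted here.
        raise ValueError(f"Empty sequence or unsupported letters: {sorted(unknown)}")
    return f"Sequence<{sequence}>"


print(format_sequence_task("GGAGQGGYGGLGSQG"))
```

Whether the model expects additional whitespace or special tokens around this pattern is not stated here, so the exact string should be checked against the notebooks in the GitHub repository.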

Load pretrained model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use a GPU when one is available; otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

trained_model_name='lamm-mit/SilkomeGPT'

tokenizer = AutoTokenizer.from_pretrained(trained_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    trained_model_name,
    trust_remote_code=True
).to(device)

model.config.use_cache = False

Sample inference using the "GenerateSilkContent<...>" task, in which the model produces a silk sequence intended to meet the list of requested properties:

prompt = "GenerateSilkContent<0.177,0.222,0.082,0.065,0.225,0.241,0.266,0.515>"
generated = torch.tensor(tokenizer.encode(prompt, add_special_tokens = False)).unsqueeze(0).to(device)
print(generated.shape, generated)

sample_outputs = model.generate(
    inputs=generated,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_k=500,
    max_length=300,
    top_p=0.9,
    num_return_sequences=3,
    temperature=1,
)

for i, sample_output in enumerate(sample_outputs):
    print("{}: {}\n\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output (here, three candidate sequences):

torch.Size([1, 66]) tensor([[ 43, 299,  73,  86,  69,  88,  73,  55,  77,  80,  79,  39,  83,  82,
          88, 299,  88,  32,  20,  18,  21,  27,  27,  16,  20,  18,  22,  22,
          22,  16,  20,  18,  20,  28,  22,  16,  20,  18,  20,  26,  25,  16,
          20,  18,  22,  22,  25,  16,  20,  18,  22,  24,  21,  16,  20,  18,
          22,  26,  26,  16,  20,  18,  25,  21,  25,  34]], device='cuda:0')
0: GenerateSilkContent<0.177,0.222,0.082,0.065,0.225,0.241,0.266,0.515> [AAAAGGSGGSGGYGPGGYGPGGSGDAAAAAAAAGGSGGAGGYGPGGYGPGGFGPGGSGDAAAAAAAAAGGSGGSGGYGPGGYGPGGSGDAAAAAAAAGGSGGPGGYGPGGYGPGGFGLSGSGDAAAAAAAAAGGSGGSEGYGPGGYGPGGSGDAAAAAAAAAGGSGGPGGYGPGGYGPGGYGPGGSGDAAAAAAAAAGGSGGSGGYGPGGYGPGGSGDAAAAAAAAGGSGGPGGYGPGGYGPGGFGPGGSGDAAAAAAAAAGGSGGSGGYGPGGYGPGGSGAAVAAASAAGGSGGSGGYGPGGYGPGGSGAAAASAAASAISSPASTSRISFVASRLVSGGTANVSNLSNTIGTVMSQVRAGNPGASECEVVIQTLIELLAALIHILGSASIGNVNYGSTAQSAAVVSESFQSAFQ]
1: GenerateSilkContent<0.177,0.222,0.082,0.065,0.225,0.241,0.266,0.515> [MTLTIRLALSLLVAICTQSMFALGQSVSPWSSPDMAENFMSVFTDSLSQSGAFSYDQMDDISSIGDSIRSGVEKMARSGKTSANKLQAMNMAFASAVAEIAISEGGGQSAQVKTNAVADALSTAFLQTTGVVNTQFVNEIRSLISMFAQANSVSSSSASVSASAGGAGGYGPQAQGAAAVVAGGYGPGSQGPQSYGPGPQAQSSAVAVSAGSQGPQSYGPGPQGPGPQGPGPQGSGPQGPGPQGPGSQGPQSYGPGPQGPSSPGQSSYQYSVSITSQSGSQGTSGGLGSQGAGGADQGGYGNGQGGSGSAAAAAAAGGAGGAGQGGLGAGGAGQGYGAGLGRQGGSGQGGAAAAAAAAGGLGGQGGYGGQDSQGAGQGGYGSGQGGSGAAAAAAAAGGAGRGGLGSGGAGQGYGAGLGGQGGSGQGGQGGQQPGQSGYGRQGQGSGGAGQGGLGSGGAGQGYGAGLGGQGGSGQGGAAAAAAAAGGLGRQGPGSGGAGQGYGAGLGGQGGSGQGGAAAAAAAAGGLGGQGGYGGQGSQGAGQGGYGSGQGGSGAAAAAAAAGGAGQGGYGGQGSQGAGQGGYGSGQGGSGQGGAAAAAAAAGGLGGQGGYGGQGSQGAGQGGYGSGQGGSGQGGAAAAAAAAGGLGGQGGYGGQGSQGAGQGGYGSGQGGSGAAAAAAAAGGAGGAGRG]
2: GenerateSilkContent<0.177,0.222,0.082,0.065,0.225,0.241,0.266,0.515> [MNWSIRLALLGLVVLSTQTTFAFGQAATPWENTALAEAFINSFLDSIGRTGAFSLSQQDDMSTIGDTLKSAMEKMAQSRKSSKSKLQALNMAFASSMAEIAVAEEGGLSIQAKTEAIASSLSSAFLQTTGVVNYQFVNEIKSLIYMIAQATTNEVASSEASAGGGGGSGQGRYVSSSAAGTYGSAPQSTGENRPAPQGPPQQGPTYGPSAAVLVSAVGGYGQGPAAPSQQGPTGPSQQRQANQGPYGLSVQQEPESQGSYGPETNAAAAAAGGYGPGAVGQQGLGAGGQQGPGGQRP]
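
The prompt and output formats shown above can be handled with a few small helpers. The sketch below is illustrative only: the names `build_property_prompt`, `extract_sequence`, and `residue_fractions` are hypothetical, and it assumes that a prompt carries eight comma-separated values written with three decimals and that the generated sequence appears between square brackets, as in the sample output. The meaning and order of the eight values are not specified in this README, so the helper treats them as opaque numbers.

```python
from collections import Counter
from typing import Dict, Optional, Sequence

# Assumption: generated silk sequences use the 20 standard one-letter codes.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def build_property_prompt(values: Sequence[float], expected_count: int = 8) -> str:
    """Format target values as 'GenerateSilkContent<v1,...,vN>' (three decimals)."""
    if len(values) != expected_count:
        raise ValueError(f"Expected {expected_count} values, got {len(values)}")
    formatted = ",".join(f"{v:.3f}" for v in values)
    return f"GenerateSilkContent<{formatted}>"


def extract_sequence(decoded: str) -> Optional[str]:
    """Return the text after the first '[' up to the next ']'.

    If generation stopped before the closing bracket (for example, when
    max_length is reached), the text up to the end of the string is returned.
    Returns None when no opening bracket is present.
    """
    start = decoded.find("[")
    if start == -1:
        return None
    end = decoded.find("]", start)
    body = decoded[start + 1:] if end == -1 else decoded[start + 1:end]
    return body.strip()


def residue_fractions(sequence: str) -> Dict[str, float]:
    """Fraction of each residue in the sequence, most frequent first."""
    counts = Counter(sequence)
    return {aa: n / len(sequence) for aa, n in counts.most_common()}


prompt = build_property_prompt([0.177, 0.222, 0.082, 0.065, 0.225, 0.241, 0.266, 0.515])
# Stand-in for one decoded model output; a real string would come from tokenizer.decode(...)
decoded = prompt + " [GGAGQGGYGGLGSQGAAAAAA]"

sequence = extract_sequence(decoded)
print(prompt)
print(sequence, set(sequence) <= STANDARD_AMINO_ACIDS)
print(residue_fractions(sequence))
```

In practice, `extract_sequence` would be applied to each decoded entry of `sample_outputs`, and the resulting sequences could then be passed to downstream checks such as the BLAST searches mentioned in the abstract.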

Citation

To cite this work:

@article{WeiKaplanBuehler_2023,
    title   = {Generative Modeling, Design, and Analysis of Spider Silk Protein Sequences for Enhanced Mechanical Properties},
    author  = {W. Lu and D. L. Kaplan and M. J. Buehler},
    journal = {Adv. Funct. Mater.},
    year    = {2023},
    volume  = {},
    pages   = {},
    url     = {https://doi.org/10.1002/adfm.202311324}
}