---
license: apache-2.0
---
# ProteinForceGPT: Generative strategies for modeling, design and analysis of protein mechanics


### Basic information

ProteinForceGPT is a 454M-parameter, GPT-style autoregressive transformer protein language model, trained to analyze and predict the mechanical properties of a large number of protein sequences. It supports both forward tasks (predicting mechanical properties from a sequence) and inverse tasks: for instance, using the generate tasks, the model can design novel proteins that meet one or more mechanical constraints.

This protein language foundation model is based on the GPT-NeoX architecture and uses rotary positional embeddings (RoPE). It has 16 attention heads, 36 hidden layers, a hidden size of 1024, and an intermediate size of 4096, and it uses a GELU activation function.
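
As a rough sanity check on the stated model size, the per-layer parameter count can be estimated from the hyperparameters above. This is only a sketch: it assumes a standard transformer block layout (fused Q/K/V projection, output projection, two-layer feed-forward network, two LayerNorms, all with bias terms) and leaves out the embedding matrices, because the vocabulary size is not stated here.

```python
# Back-of-the-envelope parameter count from the hyperparameters quoted above.
# Assumption: a standard transformer block with bias terms; embeddings are
# excluded because the vocabulary size is not given in this card.
hidden_size = 1024
intermediate_size = 4096
num_layers = 36

# Attention: Q/K/V projections plus output projection (weights + biases)
attention = 4 * hidden_size * hidden_size + 4 * hidden_size
# Feed-forward: up- and down-projection (weights + biases)
mlp = 2 * hidden_size * intermediate_size + intermediate_size + hidden_size
# Two LayerNorms per block, each with a scale and a shift vector
layer_norms = 2 * 2 * hidden_size

per_layer = attention + mlp + layer_norms
total = num_layers * per_layer

print(f"parameters per layer: {per_layer:,}")
print(f"parameters in all {num_layers} layers: {total:,}")
```

Under these assumptions the transformer blocks alone account for roughly 453 million parameters, which is in line with the 454M figure quoted above.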

The pretraining task is defined as "Sequence<...>" where ... is an amino acid sequence.

Pretraining dataset: https://huggingface.co/datasets/lamm-mit/GPTProteinPretrained

Pretrained model: https://huggingface.co/lamm-mit/GPTProteinPretrained

In this fine-tuned model, the mechanics-related forward ("Calculate...") and inverse ("Generate...") tasks are:

```raw
CalculateForce<GEECDCGSPSNP...>
CalculateEnergy<GEECDCGSPSNP...>
CalculateForceEnergy<GEECDCGSPSNP...>
CalculateForceHistory<GEECDCGSPSNP...>
GenerateForce<0.262>
GenerateForce<0.220>
GenerateForceEnergy<0.262,0.220>
GenerateForceHistory<0.004,0.034,0.125,0.142,0.159,0.102,0.079,0.073,0.131,0.105,0.071,0.058,0.072,0.060,0.049,0.114,0.122,0.108,0.173,0.192,0.208,0.153,0.212,0.222,0.244>
```
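
All of the tasks above share the pattern `TaskName<payload>`, where the payload is either an amino acid sequence or a comma-separated list of numbers. A minimal sketch of a prompt builder follows; `build_prompt` and `KNOWN_TASKS` are hypothetical helpers, not part of the model's API, and the three-decimal number formatting is an assumption based on the examples shown.

```python
# Hypothetical helper for assembling task prompts; not part of the model's API.
KNOWN_TASKS = {
    "Sequence",
    "CalculateForce", "CalculateEnergy",
    "CalculateForceEnergy", "CalculateForceHistory",
    "GenerateForce", "GenerateForceEnergy", "GenerateForceHistory",
}

def build_prompt(task, payload):
    """Return a prompt of the form Task<payload>.

    payload is an amino acid sequence (str), a single number, or a list
    of numbers. Numbers are written with three decimals (an assumption
    taken from the examples in this card).
    """
    if task not in KNOWN_TASKS:
        raise ValueError(f"unknown task: {task}")
    if isinstance(payload, str):
        body = payload
    else:
        if isinstance(payload, (int, float)):
            payload = [payload]
        body = ",".join(f"{value:.3f}" for value in payload)
    return f"{task}<{body}>"

print(build_prompt("CalculateForce", "GEECDCGSPSNP"))
print(build_prompt("GenerateForceEnergy", [0.262, 0.22]))
```

Validating the task name up front catches typos before a prompt is sent to the model, which would otherwise simply produce an unconstrained continuation.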

### Load model

You can load the model using this code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use a GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

ForceGPT_model_name = 'lamm-mit/ProteinForceGPT'

tokenizer = AutoTokenizer.from_pretrained(ForceGPT_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    ForceGPT_model_name,
    trust_remote_code=True
).to(device)

model.config.use_cache = False
```

### Inference

Sample inference using the "Sequence<...>" task. Here the model simply autocompletes a sequence that starts with "GEECDC":

```python
import torch

# Encode the prompt without special tokens so the model continues the open "Sequence<" tag
prompt = "Sequence<GEECDC"
generated = torch.tensor(tokenizer.encode(prompt, add_special_tokens=False)).unsqueeze(0).to(device)
print(generated.shape, generated)

# Sample one completion using top-k and nucleus (top-p) sampling
sample_outputs = model.generate(
    inputs=generated,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_k=500,
    max_length=300,
    top_p=0.9,
    num_return_sequences=1,
    temperature=1,
)

for i, sample_output in enumerate(sample_outputs):
    print("{}: {}\n\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
```

Sample inference using the "CalculateForce<...>" task, in which the model predicts the maximum unfolding force of a given sequence:

```python
# The prompt must start directly with the task name (no stray quote character)
prompt = "CalculateForce<GEECDCGSPSNPCCDAATCKLRPGAQCADGLCCDQCRFKKKRTICRIARGDFPDDRCTGQSADCPRWN>"
generated = torch.tensor(tokenizer.encode(prompt, add_special_tokens=False)).unsqueeze(0).to(device)

# Sample three completions; each one contains the predicted force
sample_outputs = model.generate(
    inputs=generated,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_k=500,
    max_length=300,
    top_p=0.9,
    num_return_sequences=3,
    temperature=1,
)

for i, sample_output in enumerate(sample_outputs):
    print("{}: {}\n\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
```
Output:
```raw
0: CalculateForce<GEECDCGSPSNPCCDAATCKLRPGAQCADGLCCDQCRFKKKRTICRIARGDFPDDRCTGQSADCPRWN> [0.262]
```
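
To use the prediction programmatically, the bracketed value can be pulled out of the decoded text. The sketch below assumes the output format shown above (the prompt followed by the result in square brackets) and, as a further assumption, that multi-valued results are comma-separated inside the brackets; `parse_prediction` is a hypothetical helper name.

```python
import re

def parse_prediction(text):
    """Return the numbers inside the last [...] group of a decoded output.

    Assumes the format shown above, e.g. "CalculateForce<...> [0.262]".
    Comma-separated values inside the brackets (an assumption for
    multi-valued tasks) are returned as a list of floats.
    """
    groups = re.findall(r"\[([^\]]*)\]", text)
    if not groups:
        raise ValueError("no bracketed prediction found in model output")
    return [float(value) for value in groups[-1].split(",") if value.strip()]

decoded = "CalculateForce<GEECDCGSPSNP> [0.262]"
print(parse_prediction(decoded))
```

Because sampling is enabled in the generation call, repeated runs may return slightly different values; averaging the parsed results over several returned sequences is one way to smooth this out.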

### Citations
To cite this work:
```
@article{GhafarollahiBuehler_2024,
    title   = {ProtAgents: Protein discovery via large language model multi-agent collaborations combining physics and machine learning},
    author  = {A. Ghafarollahi and M.J. Buehler},
    journal = {},
    year    = {2024},
    volume  = {},
    pages   = {},
    url     = {}
}
```

The dataset used to fine-tune the model is described in:

```
@article{NiKaplanBuehler_2024,
    title   = {ForceGen: End-to-end de novo protein generation based on nonlinear mechanical unfolding responses using a protein language diffusion model},
    author  = {B. Ni and D.L. Kaplan and M.J. Buehler},
    journal = {Science Advances},
    year    = {2024},
    volume  = {},
    pages   = {},
    url     = {}
}
```