---
license: apache-2.0
language:
- en
tags:
- Protein_Language_Model
- MSA Generation
---
# MSAGPT

<table>
  <tr>
    <td>
      <h2>MSAGPT</h2>
      <p>📖 Paper: <a href="xxx">MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training</a></p>
      <p><b>MSAGPT</b> is a protein language model (PLM) with 3 billion parameters. It is released in three versions, MSAGPT, MSAGPT-SFT, and MSAGPT-DPO, <b>supporting zero-shot and few-shot MSA generation</b>.</p>
      <p><b>MSAGPT achieves state-of-the-art structure prediction performance in scenarios where natural MSAs are scarce</b>.</p>
    </td>
  </tr>
</table>


## Overall Framework
<p align="center">
<img src="resources/overall_frame.png" alt="Overall framework of MSAGPT" style="display: block; margin: auto; width: 90%;">
</p>

## Visualized Cases
Visualization of improved structure prediction compared with natural MSA.
<font color=orange>Yellow</font>: Ground truth; 
<font color=purple>Purple</font>: Predictions based on MSA generated by MSAGPT; 
<font color=cyan>Cyan</font>: Predictions based on natural MSA.

<p align="center">
<img src="resources/app_case.png" alt="Visualized structure prediction cases" style="display: block; margin: auto; width: 90%;">
</p>


## Get Started

### Option 1: Deploy MSAGPT yourself

We support a GUI for model inference.

First, install the dependencies.

```bash
# CUDA >= 11.8
pip install -r requirements.txt
```

#### Model List
You can download the necessary weights manually, then unzip them and put them into the **checkpoints** folder.

| Model      | Type | Seq Length | Download |
|------------|------|------------|----------|
| MSAGPT     | Base | 16K        | [🤗 Huggingface](https://huggingface.co/THUDM/MSAGPT)  [🔨 SwissArmyTransformer](https://cloud.tsinghua.edu.cn/f/ebfc954a4cd24cef9243/?dl=1) |
| MSAGPT-SFT | Sft  | 16K        | [🤗 Huggingface](https://huggingface.co/THUDM/MSAGPT)  [🔨 SwissArmyTransformer](https://cloud.tsinghua.edu.cn/f/32da3eadf6e042aab2fa/?dl=1) |
| MSAGPT-DPO | Rlhf | 16K        | [🤗 Huggingface](https://huggingface.co/THUDM/MSAGPT)  [🔨 SwissArmyTransformer](https://cloud.tsinghua.edu.cn/f/ebfc954a4cd24cef9243/?dl=1) |


#### Situation 1.1 CLI (SAT version)

Run CLI demo via:

```bash
# Online Chat
bash scripts/cli_sat.sh --from_pretrained ./checkpoints/MSAGPT-DPO --input-source chat --stream_chat --max-gen-length 1024
```

The program runs interactively in the command line. Enter the protein sequence for which you want to generate virtual MSAs, optionally followed by a few MSAs as a prompt, joined by "\<M\>", and press Enter. For example, in "PEGKQGDPGIPGEPGPPGPPGPQGARGPPG\<M\>VTVEFVNSCLIGDMGVDGPPGQQGQPGPPG", "PEGKQGDPGIPGEPGPPGPPGPQGARGPPG" is the main sequence and "VTVEFVNSCLIGDMGVDGPPGQQGQPGPPG" is an MSA prompt. Enter `stop` to exit the program. The chat CLI looks like:
<p align="center">
<img src="resources/demo.gif" alt="Demo of the chat CLI" style="display: block; margin: auto; width: 90%;">
</p>
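
The input format described above (a main sequence, optionally followed by MSA prompts joined with `<M>`) can be assembled programmatically. The following is a minimal sketch, not part of the MSAGPT code base: the helper names `build_prompt` and `split_prompt` are hypothetical, and the only detail taken from this README is the `<M>` separator.

```python
# Hypothetical helpers for illustration; they are not shipped with MSAGPT.
MSA_SEPARATOR = "<M>"  # separator between sequences, as described in this README


def build_prompt(query, msa_prompts=()):
    """Join the main sequence and optional MSA prompts with the separator."""
    sequences = [seq.strip() for seq in (query, *msa_prompts)]
    for seq in sequences:
        if not seq:
            raise ValueError("sequences must not be empty")
        if MSA_SEPARATOR in seq:
            raise ValueError(f"sequence already contains {MSA_SEPARATOR!r}: {seq!r}")
    return MSA_SEPARATOR.join(sequences)


def split_prompt(prompt):
    """Split a prompt back into (main sequence, list of MSA prompts)."""
    query, *msa_prompts = prompt.split(MSA_SEPARATOR)
    return query, msa_prompts


# Zero-shot: only the main sequence.
zero_shot = build_prompt("PEGKQGDPGIPGEPGPPGPPGPQGARGPPG")

# Few-shot: the main sequence followed by one MSA prompt.
few_shot = build_prompt(
    "PEGKQGDPGIPGEPGPPGPPGPQGARGPPG",
    ["VTVEFVNSCLIGDMGVDGPPGQQGQPGPPG"],
)
print(few_shot)
```

The printed string can be pasted into the chat CLI as a single line of input.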


You can also run offline generation by setting **--input-source \<your input file\>** and **--output-path \<your output path\>**.
An example input file is provided: *msa_input*. 
```bash
# Offline Generation
bash scripts/cli_sat.sh --from_pretrained ./checkpoints/MSAGPT-DPO --input-source <your input file> --output-path <your output path> --max-gen-length 1024
```

#### Situation 1.2 CLI (Huggingface version)
(TODO)

#### Situation 1.3 Web Demo
(TODO)

### Option 2: Finetuning MSAGPT

(TODO)

### Hardware requirement

* Model Inference:

  For BF16: 1 * A100(80G)

* Finetuning:

  For BF16: 4 * A100(80G) *[Recommended]*.


## License

The code in this repository is open source under the [Apache-2.0 license](./LICENSE).

## Citation

If you find our work helpful, please consider citing our paper:

```
@article{chen2024msagpt,
  title={MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training},
  author={Chen, Bo and Bei, Zhilei and Cheng, Xingyi and Li, Pan and Tang, Jie and Song, Le},
  journal={arXiv preprint arXiv:2406.05347},
  year={2024}
}
```