---
license: apache-2.0
language:
- en
tags:
- Protein_Language_Model
- MSA Generation
---
# MSAGPT
<table>
<tr>
<td>
<h2>MSAGPT</h2>
<p>📖 Paper: <a href="https://arxiv.org/abs/2406.05347">MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training</a></p>
<p><b>MSAGPT</b> is a protein language model (PLM) with 3 billion parameters. It is released in three versions, MSAGPT, MSAGPT-Sft, and MSAGPT-Dpo, <b>supporting zero-shot and few-shot MSA generation</b>.</p>
<p><b>MSAGPT achieves state-of-the-art structure prediction performance in natural MSA-scarce scenarios</b>.</p>
</td>
</tr>
</table>
## Overall Framework
<p align="center">
<img src="resources/overall_frame.png" alt="Overall framework of MSAGPT" style="display: block; margin: auto; width: 90%;">
</p>
## Visualized Cases
Visualization of improved structure prediction compared with natural MSA.
<font color=orange>Yellow</font>: Ground truth;
<font color=purple>Purple</font>: Predictions based on MSA generated by MSAGPT;
<font color=cyan>Cyan</font>: Predictions based on natural MSA.
<p align="center">
<img src="resources/app_case.png" alt="Visualized structure prediction cases" style="display: block; margin: auto; width: 90%;">
</p>
## Get Started
### Option 1: Deploy MSAGPT by yourself
We support command-line (CLI) model inference.
First, install the dependencies.
```bash
# CUDA >= 11.8
pip install -r requirements.txt
```
#### Model List
You can manually download the weights you need from the table below. Then unzip the archive and put its contents into the **checkpoints** folder.
| Model | Type | Seq Length | Download |
|------------------|------|------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| MSAGPT | Base | 16K | [🤗 Huggingface](https://huggingface.co/THUDM/MSAGPT) [🔨 SwissArmyTransformer](https://cloud.tsinghua.edu.cn/f/ebfc954a4cd24cef9243/?dl=1) |
| MSAGPT-SFT | Sft | 16K | [🤗 Huggingface](https://huggingface.co/THUDM/MSAGPT) [🔨 SwissArmyTransformer](https://cloud.tsinghua.edu.cn/f/32da3eadf6e042aab2fa/?dl=1) |
| MSAGPT-DPO | Rlhf | 16K | [🤗 Huggingface](https://huggingface.co/THUDM/MSAGPT) [🔨 SwissArmyTransformer](https://cloud.tsinghua.edu.cn/f/ebfc954a4cd24cef9243/?dl=1) |
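
The unzip step can also be scripted. The following is a minimal sketch, assuming the downloaded file is a standard zip archive; the helper name `unpack_checkpoint` and any archive file name passed to it are hypothetical and not part of this repository.

```python
import zipfile
from pathlib import Path


def unpack_checkpoint(archive_path, checkpoints_dir="checkpoints"):
    """Extract a downloaded weight archive into the checkpoints folder.

    Returns the top-level entries of the archive so the extracted folder
    name can be compared with the path later given to --from_pretrained.
    """
    target = Path(checkpoints_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        top_level = sorted({name.split("/")[0] for name in archive.namelist()})
    return top_level
```

For example, `unpack_checkpoint("<your downloaded archive>.zip")` extracts into `./checkpoints` and returns the names of the extracted top-level folders or files.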
#### Situation 1.1 CLI (SAT version)
Run CLI demo via:
```bash
# Online Chat
bash scripts/cli_sat.sh --from_pretrained ./checkpoints/MSAGPT-DPO --input-source chat --stream_chat --max-gen-length 1024
```
The program then runs interactively in the command line. Enter the protein sequence for which you want to generate virtual MSAs and press Enter. Optionally, add a few MSAs as a prompt, joined by "\<M\>". For example, in "PEGKQGDPGIPGEPGPPGPPGPQGARGPPG\<M\>VTVEFVNSCLIGDMGVDGPPGQQGQPGPPG", "PEGKQGDPGIPGEPGPPGPPGPQGARGPPG" is the main sequence and "VTVEFVNSCLIGDMGVDGPPGQQGQPGPPG" is the MSA prompt. Enter `stop` to stop the program. The chat CLI looks like:
<p align="center">
<img src="resources/demo.gif" alt="Chat CLI demo" style="display: block; margin: auto; width: 90%;">
</p>
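
The input format described above (a main sequence, optionally followed by MSA prompts joined by `<M>`) can be assembled programmatically. This is a small sketch for illustration only; `build_prompt` and `split_prompt` are hypothetical helper names, not functions shipped with MSAGPT, and the checks they perform are assumptions rather than the model's own validation.

```python
MSA_SEPARATOR = "<M>"


def build_prompt(main_sequence, msa_prompts=()):
    """Join the main sequence and optional MSA prompts with "<M>"."""
    cleaned = []
    for seq in [main_sequence, *msa_prompts]:
        seq = seq.strip()
        # Reject inputs that would produce an ambiguous prompt string.
        if not seq or MSA_SEPARATOR in seq:
            raise ValueError(f"invalid sequence: {seq!r}")
        cleaned.append(seq)
    return MSA_SEPARATOR.join(cleaned)


def split_prompt(prompt):
    """Split a prompt back into (main_sequence, [msa_prompts])."""
    main_sequence, *msa_prompts = prompt.split(MSA_SEPARATOR)
    return main_sequence, msa_prompts
```

Calling `build_prompt("PEGKQGDPGIPGEPGPPGPPGPQGARGPPG", ["VTVEFVNSCLIGDMGVDGPPGQQGQPGPPG"])` reproduces the example string shown above, and calling it with only a main sequence gives the prompt-free input.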
You can also run offline generation by setting **--input-source \<your input file\>** and **--output-path \<your output path\>**.
An example input file is provided: *msa_input*.
```bash
# Offline Generation
bash scripts/cli_sat.sh --from_pretrained ./checkpoints/MSAGPT-DPO --input-source <your input file> --output-path <your output path> --max-gen-length 1024
```
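
When offline generation is driven from a Python script, the same command can be assembled as an argument list. The sketch below uses only the flags shown in this README; the helper name `build_offline_command` and the output path `./output` are assumptions made for illustration.

```python
import shlex


def build_offline_command(input_file, output_path,
                          checkpoint="./checkpoints/MSAGPT-DPO",
                          max_gen_length=1024):
    """Return the offline-generation command as an argument list."""
    return [
        "bash", "scripts/cli_sat.sh",
        "--from_pretrained", checkpoint,
        "--input-source", str(input_file),
        "--output-path", str(output_path),
        "--max-gen-length", str(max_gen_length),
    ]


# "./output" is a placeholder path chosen for this example.
command = build_offline_command("msa_input", "./output")
print(shlex.join(command))
```

The list can be passed to `subprocess.run(command, check=True)` from the repository root, so that `scripts/cli_sat.sh` resolves correctly.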
#### Situation 1.2 CLI (Huggingface version)
(TODO)
#### Situation 1.3 Web Demo
(TODO)
### Option 2: Finetuning MSAGPT
(TODO)
### Hardware requirement
* Model Inference:
For BF16: 1 * A100(80G)
* Finetuning:
For BF16: 4 * A100(80G) *[Recommended]*.
## License
The code in this repository is open source under the [Apache-2.0 license](./LICENSE).
If you find our work helpful, please consider citing our paper:
```
@article{chen2024msagpt,
title={MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training},
author={Chen, Bo and Bei, Zhilei and Cheng, Xingyi and Li, Pan and Tang, Jie and Song, Le},
journal={arXiv preprint arXiv:2406.05347},
year={2024}
}
```