EdwardoSunny committed on
Commit 08c490f • 1 Parent(s): 85ab89d

README UPDATE

Files changed (1)
  1. README.md +9 -110
README.md CHANGED
@@ -1,110 +1,9 @@
1
- # ProteinChat: Towards Enabling ChatGPT-Like Capabilities on Protein 3D Structures
2
-
3
- This repository holds the code and data of ProteinChat: Towards Enabling ChatGPT-Like Capabilities on Protein 3D Structures.
4
-
5
- ## Technical report is available [here](https://www.techrxiv.org/articles/preprint/ProteinChat_Towards_Achieving_ChatGPT-Like_Functionalities_on_Protein_3D_Structures/23120606)
6
-
7
- ## Examples
8
-
9
- ![Eg1](fig/protein-eg.png)
10
-
11
-
12
- ## Introduction
13
- - In this work, we make an initial attempt towards enabling ChatGPT-like capabilities on protein 3D structures, by developing a prototype system ProteinChat.
14
- - ProteinChat works similarly to ChatGPT. Users upload a protein 3D structure and ask questions about the protein, and ProteinChat answers them in a multi-turn, interactive manner.
15
- - The ProteinChat system consists of a protein 3D structure encoder (based on [ESM inverse folding](https://github.com/facebookresearch/esm/tree/main/examples/inverse_folding)), a large language model (LLM), and an adaptor. The protein encoder takes a protein 3D structure as input and learns a representation for this protein. The adaptor transforms the protein representation produced by the protein encoder into another representation that is acceptable to the LLM. The LLM takes the representation transformed by the adaptor and users' questions about this protein as inputs and generates answers. All these components are trained end-to-end.
16
- - To train ProteinChat, we collected instruction tuning datasets which contain 143508 proteins and 143508 instructions.
17
-
18
-
19
- ![overview](fig/proteinchat_overview.png)
20
-
21
- ## Datasets
22
-
23
- The dataset contains 143508 proteins (represented using 3D structures) with 143508 instructions.
24
- The instruction set is available at [this link](https://drive.google.com/file/d/1iMgPyiIzpvXdKiNsXnRKn2YpmP92Xyub/view?usp=share_link).
25
- The processed protein files (83G in total) are available at [this link](https://drive.google.com/file/d/1AeJW5BY5C-d8mKJjAULTax6WA4hzWS0N/view?usp=share_link).
26
- The data is curated from the [Protein Data Bank](https://www.rcsb.org/). More details can be found [here](data/README.md).
27
-
28
- ## Getting Started
29
- ### Installation
30
- These instructions largely follow those in MiniGPT-4.
31
-
32
- **1. Prepare the code and the environment**
33
-
34
- Clone our repository, create a Python environment, and activate it with the following commands:
35
-
36
- ```bash
37
- git clone https://github.com/UCSD-AI4H/proteinchat
38
- cd proteinchat
39
- conda env create -f environment.yml
40
- conda activate proteinchat
41
- pip install einops
42
- ```
43
-
44
- Verify the installation of `torch` and `torchvision` is successful by running `python -c "import torchvision; print(torchvision.__version__)"`. If it outputs the version number without any warnings or errors, then you are good to go. __If it outputs any warnings or errors__, try to uninstall `torch` by `conda uninstall pytorch torchvision torchaudio cudatoolkit` and then reinstall them following [here](https://pytorch.org/get-started/previous-versions/#v1121). You need to find the correct command according to the CUDA version your GPU driver supports (check `nvidia-smi`).
45
-
46
- **2. Prepare the pretrained Vicuna weights**
47
-
48
- The current version of ProteinChat is built on the v0 version of Vicuna-13B.
49
- Please refer to our instructions [here](PrepareVicuna.md)
50
- to prepare the Vicuna weights.
51
- The final weights should be in a single folder with a structure similar to the following:
52
-
53
- ```
54
- vicuna_weights
55
- ├── config.json
56
- ├── generation_config.json
57
- ├── pytorch_model.bin.index.json
58
- ├── pytorch_model-00001-of-00003.bin
59
- ...
60
- ```
61
-
62
- Then, set the path to the Vicuna weights in the model config file
63
- [here](minigpt4/configs/models/minigpt4.yaml#L16) at Line 16.
64
-
65
- ### Training
66
- **You need roughly 45 GB GPU memory for the training.**
67
-
68
- The training configuration file is [configs/train_instruction_tuning.yaml](configs/train_instruction_tuning.yaml). In addition, you may want to change the number of epochs and other hyper-parameters there, such as `max_epoch`, `init_lr`, `min_lr`,`warmup_steps`, `batch_size_train`. Please adjust `iters_per_epoch` so that `iters_per_epoch` * `batch_size_train` = your training set size. Due to the GPU consumption, we set `batch_size_train=1`.
69
-
70
- Start training the LLaMA-based model on the protein dataset by running [finetune.sh](finetune.sh): `bash finetune.sh`.
71
-
72
- **It takes around 24 GB GPU memory for the demo.**
73
-
74
- Find the checkpoint you saved during training, which is located under the folder `minigpt4/output/minigpt4_stage2_esm/` by default. Copy it to the folder `ckpt` by running `cp minigpt4/output/minigpt4_stage2_esm/.../checkpoint_xxx.pth ckpt/`, and set the `ckpt` entry in [configs/evaluation.yaml](configs/evaluation.yaml) to the location of your checkpoint.
75
-
76
- Now we launch the `demo.py` in our original environment. Then, start the demo [demo.sh](demo.sh) on your local machine by running `bash demo.sh`. Then, open the URL created by the demo and try it out!
77
-
78
-
79
- ## Acknowledgement
80
-
81
- + [ProteinChat](https://github.com/UCSD-AI4H/proteinchat)
82
- + [MiniGPT-4](https://minigpt-4.github.io/)
83
- + [Lavis](https://github.com/salesforce/LAVIS)
84
- + [Vicuna](https://github.com/lm-sys/FastChat)
85
- + [ESM-IF1](https://github.com/facebookresearch/esm/tree/main/examples/inverse_folding)
86
-
87
-
88
-
89
- ## License
90
- This repository is under [BSD 3-Clause License](LICENSE.md).
91
- Much of the code is based on [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4) with BSD 3-Clause License [here](LICENSE_MiniGPT4.md), which is based on [Lavis](https://github.com/salesforce/LAVIS) with
92
- BSD 3-Clause License [here](LICENSE_Lavis.md).
93
-
94
-
95
- ## Disclaimer
96
-
97
- This is a prototype system that has not been systematically and comprehensively validated by biologists yet. Please use with caution.
98
-
99
- Trained models and demo websites will be released after we thoroughly validate the system with biologists.
100
-
101
-
102
- ## Citation
103
-
104
- If you're using ProteinChat in your research or applications, please cite using this BibTeX:
105
- ```bibtex
106
- @article{guo2023proteinchat,
107
- title={ProteinChat: Towards Enabling ChatGPT-Like Capabilities on Protein 3D Structures},
108
- author={Guo, Han and Huo, Mingjia and Xie, Pengtao},
109
- year={2023}
110
- }
 
1
+ title: ProteinGPT
2
+ emoji: 🚀
3
+ colorFrom: purple
4
+ colorTo: gray
5
+ sdk: gradio
6
+ sdk_version: 4.43.0
7
+ app_file: app.py
8
+ pinned: false
9
+ license: other