jyz-mbzuai
committed on
Commit
·
e889e43
1
Parent(s):
460a338
update readme
Browse files- README.md +76 -4
- assets/images/architecture.png +0 -0
README.md
CHANGED
@@ -10,10 +10,82 @@ license: other
|
|
10 |
|
11 |
# AIDO.StructureDecoder
|
12 |
|
13 |
-
AIDO.StructureDecoder is the decoder-only component of [AIDO.StructureTokenizer](https://huggingface.co/genbio-ai/AIDO.StructureTokenizer) for
14 |
|
15 |
## How to Use
|
16 |
-
Please see `experiments/AIDO.StructureTokenizer` in [Model Generator](https://github.com/genbio-ai/modelgenerator)
17 |
|
18 |
# Citation
|
19 |
Please cite AIDO.StructureTokenizer using the following BibTeX code:
|
@@ -25,6 +97,6 @@ Please cite AIDO.StructureTokenizer using the following BibTex code:
|
|
25 |
publisher = {bioRxiv},
|
26 |
author = {Zhang, Jiayou and Meynard-Piganeau, Barthelemy and Gong, James and Cheng, Xingyi and Luo, Yingtao and Ly, Hugo and Song, Le and Xing, Eric},
|
27 |
year = {2024},
|
28 |
-
|
29 |
}
|
30 |
-
```
|
|
|
10 |
|
11 |
# AIDO.StructureDecoder
|
12 |
|
13 |
+
AIDO.StructureDecoder is the decoder-only component of [AIDO.StructureTokenizer](https://huggingface.co/genbio-ai/AIDO.StructureTokenizer) for tokenization of protein structures.
|
14 |
+
|
15 |
+
## Model Description
|
16 |
+
|
17 |
+
![Model Architecture](./assets/images/architecture.png)
|
18 |
+
|
19 |
+
**AIDO.StructureTokenizer** is built on a Vector Quantized Variational Autoencoder (VQ-VAE) architecture with the following components:
|
20 |
+
- Equivariant Encoder (6M): Encodes backbone structures into a latent space that maintains rotational and translational symmetries using the Equiformer architecture.
|
21 |
+
- Discrete Codebook: Maps continuous latent vectors into 512 discrete structural tokens.
|
22 |
+
- Invariant Decoder (300M): Reconstructs full 3D structures, including side chains, from the structural tokens using an architecture adapted from ESMFold.
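The codebook step listed above can be illustrated with a minimal sketch of nearest-neighbor vector quantization, the usual lookup in a VQ-VAE. This is not the AIDO.StructureTokenizer implementation: the function name `quantize`, the toy 2-D codebook, and the choice of squared Euclidean distance are assumptions made for illustration, and the real codebook has 512 entries.

```python
from typing import List, Sequence


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    tokens = []
    for z in latents:
        # Squared Euclidean distance from this latent to every codebook vector.
        dists = [sum((a - b) ** 2 for a, b in zip(z, e)) for e in codebook]
        # The structure token is the index of the closest entry.
        tokens.append(min(range(len(codebook)), key=dists.__getitem__))
    return tokens


# Toy example: 4 codebook entries in 2-D (hypothetical values).
toy_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]
tokens = quantize(toy_latents, toy_codebook)

# A decoder works from the embeddings selected by the tokens.
embeddings = [toy_codebook[t] for t in tokens]
```

Only the token indices need to be stored or exchanged; the codebook turns them back into embeddings, which is why decoding requires both the tokens and a codebook file.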
|
23 |
+
|
24 |
+
The model balances reconstruction fidelity against structural locality, which makes it well suited to downstream tasks such as structure prediction, homology detection, and multimodal protein modeling.
|
25 |
+
|
26 |
+
### Key Features
|
27 |
+
|
28 |
+
- Encoding Structures into Tokens (See [genbio-ai/AIDO.StructureEncoder](https://huggingface.co/genbio-ai/AIDO.StructureEncoder))
|
29 |
+
- Decoding Tokens into Structures (See [below](#how-to-use))
|
30 |
+
- Reconstructing Structures (See [genbio-ai/AIDO.StructureTokenizer](https://huggingface.co/genbio-ai/AIDO.StructureTokenizer))
|
31 |
+
- Structure Prediction (See [this section](https://huggingface.co/genbio-ai/AIDO.Protein2StructureToken-16B/blob/main/README.md#structure-prediction) in genbio-ai/AIDO.Protein2StructureToken-16B)
|
32 |
+
|
33 |
|
34 |
## How to Use
|
35 |
+
Please see `experiments/AIDO.StructureTokenizer` in [Model Generator](https://github.com/genbio-ai/modelgenerator) for more details.
|
36 |
+
|
37 |
+
### Setup
|
38 |
+
Install [Model Generator](https://github.com/genbio-ai/modelgenerator).
|
39 |
+
|
40 |
+
### Decoding Structure Tokens from AIDO.StructureEncoder
|
41 |
+
|
42 |
+
If you ran the encoding task in [genbio-ai/AIDO.StructureEncoder](https://huggingface.co/genbio-ai/AIDO.StructureEncoder) with the default `encode.yaml`, the default `decode.yaml` is already configured to decode the resulting tokens, so no changes to the configuration file are needed. Run the decoding task with the following command:
|
43 |
+
|
44 |
+
```bash
|
45 |
+
CUDA_VISIBLE_DEVICES=0 mgen predict --config=experiments/AIDO.StructureTokenizer/decode.yaml
|
46 |
+
```
|
47 |
+
|
48 |
+
### Decoding Customized Structure Tokens
|
49 |
+
To decode protein structures, you will need the structure tokens in `.pt` format and a corresponding codebook file (`codebook.pt`). For ease of use, we recommend preparing the structure tokens in TSV format and then converting them to `.pt` format using the provided script.
|
50 |
+
|
51 |
+
The TSV file should include the following columns (an example file is available at `experiments/AIDO.StructureTokenizer/decode_example_input.tsv`):
|
52 |
+
- `uid`: A unique identifier for the protein sequence.
|
53 |
+
- `sequences`: The amino acid sequence (e.g., "LRTPTT").
|
54 |
+
- `predictions`: The structure tokens to be decoded, provided as a list (e.g., "[164, 287, 119, ...]"). The list length must match the length of the amino acid sequence.
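As a rough sketch of preparing such a file, the snippet below writes the three columns described above and checks that each token list matches its sequence length. The helper name `write_token_tsv`, the record values, and the exact rendering of the list (Python's `str(list)`) are assumptions; compare the result with `decode_example_input.tsv` before relying on it.

```python
import csv
import io
from typing import Iterable, List, Tuple


def write_token_tsv(records: Iterable[Tuple[str, str, List[int]]], handle) -> None:
    """Write (uid, sequence, tokens) records as a tab-separated table."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["uid", "sequences", "predictions"])
    for uid, sequence, tokens in records:
        # One structure token per residue is required.
        if len(tokens) != len(sequence):
            raise ValueError(
                f"{uid}: {len(tokens)} tokens for {len(sequence)} residues"
            )
        # str(list) renders the tokens as "[164, 287, ...]" (assumed format).
        writer.writerow([uid, sequence, str(list(tokens))])


# Hypothetical record; the token values are placeholders, not model output.
buffer = io.StringIO()
write_token_tsv([("example_1", "LRTPTT", [164, 287, 119, 5, 42, 7])], buffer)
tsv_text = buffer.getvalue()
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")`.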
|
55 |
+
|
56 |
+
|
57 |
+
After preparing the TSV file, convert it to `.pt` format with the following command:
|
58 |
+
```bash
|
59 |
+
python experiments/AIDO.StructureTokenizer/struct_token_format_conversion.py your_tsv_file.tsv your_output_pt_file.pt
|
60 |
+
```
|
61 |
+
|
62 |
+
You also need a codebook file (`codebook.pt`) that contains the embedding of each token. It can be extracted with this command:
|
63 |
+
```bash
|
64 |
+
python experiments/AIDO.StructureTokenizer/extract_structure_tokenizer_codebook.py --output_path your_output_codebook.pt
|
65 |
+
```
|
66 |
+
|
67 |
+
Then update `struct_tokens_path` and `codebook_path` in the `decode.yaml` configuration file to point to your structure tokens and codebook files. Alternatively, override these parameters on the command line:
|
68 |
+
```bash
|
69 |
+
CUDA_VISIBLE_DEVICES=0 mgen predict --config experiments/AIDO.StructureTokenizer/decode.yaml \
|
70 |
+
--data.init_args.config.struct_tokens_datasets_configs.name="your_dataset_name" \
|
71 |
+
--data.init_args.config.struct_tokens_datasets_configs.struct_tokens_path="your_structure_tokens.pt" \
|
72 |
+
--data.init_args.config.struct_tokens_datasets_configs.codebook_path="your_codebook.pt" \
|
73 |
+
--trainer.callbacks.dict_kwargs.dirpath="your_output_dir"
|
74 |
+
```
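When decoding several datasets from a script, the same override flags can be assembled programmatically. The sketch below only builds the argument list shown above; the helper name `build_decode_command` is hypothetical, and the keys are copied from the command rather than checked against the Model Generator configuration schema.

```python
import shlex
from typing import Dict, List

CONFIG = "experiments/AIDO.StructureTokenizer/decode.yaml"
DATASET_PREFIX = "data.init_args.config.struct_tokens_datasets_configs"


def build_decode_command(overrides: Dict[str, str], config: str = CONFIG) -> List[str]:
    """Return the argv for `mgen predict` with key=value overrides appended."""
    argv = ["mgen", "predict", "--config", config]
    for key, value in overrides.items():
        argv.append(f"--{key}={value}")
    return argv


command = build_decode_command({
    f"{DATASET_PREFIX}.name": "your_dataset_name",
    f"{DATASET_PREFIX}.struct_tokens_path": "your_structure_tokens.pt",
    f"{DATASET_PREFIX}.codebook_path": "your_codebook.pt",
    "trainer.callbacks.dict_kwargs.dirpath": "your_output_dir",
})

# Shell-quoted form, useful for logging before the command is run.
printable = shlex.join(command)
```

The list can be passed to `subprocess.run(command, check=True)`, with `CUDA_VISIBLE_DEVICES` set in the environment as in the shell example.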
|
75 |
+
|
76 |
+
**Input:**
|
77 |
+
- The encoded tokens saved in `.pt` format.
|
78 |
+
- The codebook file (`codebook.pt`) that contains the embedding of each token.
|
79 |
+
|
80 |
+
**Output:**
|
81 |
+
- The decoded protein structures are saved in the output directory specified in the configuration file. By default, this is `logs/protstruct_decode/`.
|
82 |
+
- The decoded structures are saved as PDB files.
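One quick sanity check on a decoded file is to count its residues and compare the count with the length of the input sequence. The sketch below does this from the fixed-column PDB `ATOM` records using only the standard library; the function name `count_residues` and the sample coordinates are hypothetical, and nothing here is part of Model Generator.

```python
from typing import Set, Tuple


def count_residues(pdb_text: str) -> int:
    """Count distinct residues that have an alpha-carbon (CA) ATOM record."""
    seen: Set[Tuple[str, str]] = set()
    for line in pdb_text.splitlines():
        # PDB fixed columns: atom name 13-16, chain ID 22,
        # residue number plus insertion code 23-27.
        if (line.startswith("ATOM") and len(line) >= 27
                and line[12:16].strip() == "CA"):
            seen.add((line[21], line[22:27]))
    return len(seen)


# Hypothetical three-atom fragment covering two residues.
sample = "\n".join([
    "ATOM      1  N   LEU A   1      11.104   6.134  -6.504  1.00  0.00           N",
    "ATOM      2  CA  LEU A   1      11.639   6.071  -5.147  1.00  0.00           C",
    "ATOM      3  CA  ARG A   2      13.559   8.590  -3.089  1.00  0.00           C",
])
n_residues = count_residues(sample)
```

For real output, read each file with `pathlib.Path(...).read_text()`; the file extension and directory layout under the output directory are not specified here, so check them before globbing.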
|
83 |
+
|
84 |
+
**Notes:**
|
85 |
+
- Decoding can take a long time, even on a GPU.
|
86 |
+
- Currently, decoding supports only single-GPU inference because of the file-saving mechanism. We plan to support multi-GPU inference in the future.
|
87 |
+
- Specifying the residue index in the TSV format is not currently supported (we may add this in the future). If you need to specify it, either modify the `struct_token_format_conversion.py` script to read the residue index from the TSV file, or provide the `.pt` file directly with the desired residue index.
|
89 |
|
90 |
# Citation
|
91 |
Please cite AIDO.StructureTokenizer using the following BibTeX code:
|
|
|
97 |
publisher = {bioRxiv},
|
98 |
author = {Zhang, Jiayou and Meynard-Piganeau, Barthelemy and Gong, James and Cheng, Xingyi and Luo, Yingtao and Ly, Hugo and Song, Le and Xing, Eric},
|
99 |
year = {2024},
|
100 |
+
booktitle={NeurIPS 2024 Workshop on AI for New Drug Modalities},
|
101 |
}
|
102 |
+
```
|
assets/images/architecture.png
ADDED