jyz-mbzuai committed on
Commit e889e43 · 1 Parent(s): 460a338

update readme

Files changed (2)
  1. README.md +76 -4
  2. assets/images/architecture.png +0 -0
README.md CHANGED
@@ -10,10 +10,82 @@ license: other
10
 
11
  # AIDO.StructureDecoder
12
 
13
- AIDO.StructureDecoder is the decoder-only component of [AIDO.StructureTokenizer](https://huggingface.co/genbio-ai/AIDO.StructureTokenizer) for structure prediction from structure tokens.
14
 
15
  ## How to Use
16
- Please see `experiments/AIDO.StructureTokenizer` in [Model Generator](https://github.com/genbio-ai/modelgenerator)
17
 
18
  # Citation
19
  Please cite AIDO.StructureTokenizer using the following BibTeX code:
@@ -25,6 +97,6 @@ Please cite AIDO.StructureTokenizer using the following BibTex code:
25
  publisher = {bioRxiv},
26
  author = {Zhang, Jiayou and Meynard-Piganeau, Barthelemy and Gong, James and Cheng, Xingyi and Luo, Yingtao and Ly, Hugo and Song, Le and Xing, Eric},
27
  year = {2024},
28
- booktitle={NeurIPS 2024 Workshop on AI for New Drug Modalities},
29
  }
30
- ```
 
10
 
11
  # AIDO.StructureDecoder
12
 
13
+ AIDO.StructureDecoder is the decoder-only component of [AIDO.StructureTokenizer](https://huggingface.co/genbio-ai/AIDO.StructureTokenizer), a model for tokenizing protein structures. It decodes structure tokens back into 3D structures.
14
+
15
+ ## Model Description
16
+
17
+ ![Model Architecture](./assets/images/architecture.png)
18
+
19
+ **AIDO.StructureTokenizer** is built on a Vector Quantized Variational Autoencoder (VQ-VAE) architecture with the following components:
20
+ - Equivariant Encoder (6M): Encodes backbone structures into a latent space that maintains rotational and translational symmetries using the Equiformer architecture.
21
+ - Discrete Codebook: Maps continuous latent vectors into 512 discrete structural tokens.
22
+ - Invariant Decoder (300M): Reconstructs full 3D structures, including side chains, from the structural tokens using an architecture adapted from ESMFold.
23
+
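In a generic VQ-VAE, the codebook step listed above amounts to a nearest-neighbour lookup. The following is a minimal sketch of that idea in plain Python, not the AIDO implementation: the toy 2-D codebook, the helper names `quantize` and `lookup`, and the squared Euclidean distance are illustrative assumptions (the real codebook has 512 entries with a learned embedding dimension).

```python
from typing import List, Sequence


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    tokens = []
    for z in latents:
        # Squared Euclidean distance to every codebook vector (assumed metric).
        dists = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def lookup(tokens: Sequence[int],
           codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Inverse step used before decoding: token id -> codebook embedding."""
    return [list(codebook[t]) for t in tokens]


# Toy codebook with 4 entries instead of 512 (hypothetical values).
toy_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]
toy_tokens = quantize(toy_latents, toy_codebook)
toy_embeddings = lookup(toy_tokens, toy_codebook)
```

The decoder only ever sees the output of the `lookup` step, which is why decoding needs both the token file and the codebook file.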
24
+ This model balances reconstruction fidelity against structural locality, which makes it well suited to downstream tasks such as structure prediction, homology detection, and multimodal protein modeling.
25
+
26
+ ### Key Features
27
+
28
+ - Encoding Structures into Tokens (See [genbio-ai/AIDO.StructureEncoder](https://huggingface.co/genbio-ai/AIDO.StructureEncoder))
29
+ - Decoding Tokens into Structures (See [below](#how-to-use))
30
+ - Reconstructing Structures (See [genbio-ai/AIDO.StructureTokenizer](https://huggingface.co/genbio-ai/AIDO.StructureTokenizer))
31
+ - Structure Prediction (See [this section](https://huggingface.co/genbio-ai/AIDO.Protein2StructureToken-16B/blob/main/README.md#structure-prediction) in genbio-ai/AIDO.Protein2StructureToken-16B)
32
+
33
 
34
  ## How to Use
35
+ Please see `experiments/AIDO.StructureTokenizer` in [Model Generator](https://github.com/genbio-ai/modelgenerator) for more details.
36
+
37
+ ### Setup
38
+ Install [Model Generator](https://github.com/genbio-ai/modelgenerator)
39
+
40
+ ### Decoding Structure Tokens from AIDO.StructureEncoder
41
+
42
+ If you ran the encoding task in [genbio-ai/AIDO.StructureEncoder](https://huggingface.co/genbio-ai/AIDO.StructureEncoder) with the default `encode.yaml`, the default `decode.yaml` configuration file is already set up to decode the resulting tokens, so no configuration changes are needed. Run the decoding task with the following command:
43
+
44
+ ```bash
45
+ CUDA_VISIBLE_DEVICES=0 mgen predict --config=experiments/AIDO.StructureTokenizer/decode.yaml
46
+ ```
47
+
48
+ ### Decoding Customized Structure Tokens
49
+ To decode protein structures, you will need the structure tokens in `.pt` format and a corresponding codebook file (`codebook.pt`). For ease of use, we recommend preparing the structure tokens in TSV format and then converting them to `.pt` format using the provided script.
50
+
51
+ The TSV file should include the following columns (an example file is available at `experiments/AIDO.StructureTokenizer/decode_example_input.tsv`):
52
+ - `uid`: A unique identifier for the protein sequence.
53
+ - `sequences`: The amino acid sequence (e.g., "LRTPTT").
54
+ - `predictions`: The structure tokens to be decoded, provided as a list (e.g., "[164, 287, 119, ...]"). The list length must match the length of the amino acid sequence.
55
+
56
+
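The three columns described above can be written with Python's standard `csv` module. This is a hedged sketch rather than part of Model Generator: the helper `write_token_tsv`, the header row, the assumption that token ids are zero-based indices below 512, and the example token values are all illustrative. Compare the result against `decode_example_input.tsv` before relying on it.

```python
import csv
import json
import os
import tempfile
from typing import Iterable, List, Tuple

NUM_TOKENS = 512  # codebook size stated in the model description


def write_token_tsv(path: str, rows: Iterable[Tuple[str, str, List[int]]]) -> int:
    """Write (uid, sequence, tokens) rows to a TSV file and return the row count."""
    count = 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["uid", "sequences", "predictions"])
        for uid, sequence, tokens in rows:
            # The README requires one structure token per residue.
            if len(tokens) != len(sequence):
                raise ValueError(
                    f"{uid}: {len(tokens)} tokens for {len(sequence)} residues")
            # Assumption: token ids are zero-based indices into the codebook.
            if any(not 0 <= t < NUM_TOKENS for t in tokens):
                raise ValueError(f"{uid}: token id outside [0, {NUM_TOKENS})")
            # json.dumps renders the list as "[164, 287, ...]".
            writer.writerow([uid, sequence, json.dumps(tokens)])
            count += 1
    return count


# Example with made-up token values, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "my_tokens.tsv")
demo_rows = [("example_1", "LRTPTT", [164, 287, 119, 5, 42, 300])]
n_written = write_token_tsv(demo_path, demo_rows)
```

Validating the lengths before conversion catches mismatched rows early, instead of partway through a long decoding run.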
57
+ After preparing the TSV file, convert it to `.pt` format with the following command:
58
+ ```bash
59
+ python experiments/AIDO.StructureTokenizer/struct_token_format_conversion.py your_tsv_file.tsv your_output_pt_file.pt
60
+ ```
61
+
62
+ You also need a codebook file (`codebook.pt`) that contains the embedding of each token. The codebook can be extracted with this command:
63
+ ```bash
64
+ python experiments/AIDO.StructureTokenizer/extract_structure_tokenizer_codebook.py --output_path your_output_codebook.pt
65
+ ```
66
+
67
+ Then update `struct_tokens_path` and `codebook_path` in the `decode.yaml` configuration file to point to your structure tokens and codebook files. Alternatively, you can override these parameters when running the command:
68
+ ```bash
69
+ CUDA_VISIBLE_DEVICES=0 mgen predict --config experiments/AIDO.StructureTokenizer/decode.yaml \
70
+ --data.init_args.config.struct_tokens_datasets_configs.name="your_dataset_name" \
71
+ --data.init_args.config.struct_tokens_datasets_configs.struct_tokens_path="your_structure_tokens.pt" \
72
+ --data.init_args.config.struct_tokens_datasets_configs.codebook_path="your_codebook.pt" \
73
+ --trainer.callbacks.dict_kwargs.dirpath="your_output_dir"
74
+ ```
75
+
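When decoding several token files, it can help to assemble the override command programmatically. The sketch below only builds the argument list from the flags shown above; the helper `build_decode_command` and the placeholder file names are hypothetical, and nothing here is part of the `mgen` CLI itself.

```python
import shlex
from typing import List

CONFIG = "experiments/AIDO.StructureTokenizer/decode.yaml"
DATASET_KEY = "data.init_args.config.struct_tokens_datasets_configs"


def build_decode_command(name: str, tokens_path: str,
                         codebook_path: str, output_dir: str) -> List[str]:
    """Build the `mgen predict` argument list with the README's overrides."""
    return [
        "mgen", "predict", "--config", CONFIG,
        f"--{DATASET_KEY}.name={name}",
        f"--{DATASET_KEY}.struct_tokens_path={tokens_path}",
        f"--{DATASET_KEY}.codebook_path={codebook_path}",
        f"--trainer.callbacks.dict_kwargs.dirpath={output_dir}",
    ]


cmd = build_decode_command(
    "my_dataset", "my_tokens.pt", "my_codebook.pt", "my_output_dir")
# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

The list can then be passed to `subprocess.run(cmd, check=True)` with `CUDA_VISIBLE_DEVICES` set in the environment, one run at a time, since only single-GPU inference is supported.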
76
+ **Input:**
77
+ - The encoded tokens saved in `.pt` format.
78
+ - The codebook file (`codebook.pt`) that contains the embedding of each token.
79
+
80
+ **Output:**
81
+ - The decoded protein structures are saved in the output directory specified in the configuration file. By default, this is `logs/protstruct_decode/`.
82
+ - The decoded structures are saved as PDB files.
83
+
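A quick sanity check on the output is to count residues in each decoded PDB file and compare the count with the input sequence length. The following sketch assumes the files carry a `.pdb` extension and contain a single model with one C-alpha atom per residue; the helper names are illustrative and not part of Model Generator.

```python
from pathlib import Path
from typing import Dict


def count_residues(pdb_path: Path) -> int:
    """Count residues in a PDB file by counting C-alpha ATOM records."""
    count = 0
    with open(pdb_path) as handle:
        for line in handle:
            # Columns 13-16 of an ATOM record hold the atom name.
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                count += 1
    return count


def summarize_outputs(output_dir: str) -> Dict[str, int]:
    """Return {file name: residue count} for every .pdb file under output_dir."""
    pdb_files = sorted(Path(output_dir).rglob("*.pdb"))
    return {p.name: count_residues(p) for p in pdb_files}
```

For example, `summarize_outputs("logs/protstruct_decode/")` would report one count per decoded structure; a count that differs from the sequence length suggests a problem with the input tokens.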
84
+ **Notes:**
85
+ - Decoding the structures can take a long time, even on a GPU.
86
+ - Currently, this function only supports single GPU inference due to the file saving mechanism. We plan to support multi-GPU inference in the future.
87
+ - Currently, the residue index cannot be specified in TSV format. If you need to specify it, either modify the
88
+ `struct_token_format_conversion.py` script so that it reads the residue index from the TSV file (we may support this feature in the future), or provide a `.pt` file that already contains the desired residue index.
89
 
90
  # Citation
91
  Please cite AIDO.StructureTokenizer using the following BibTeX code:
 
97
  publisher = {bioRxiv},
98
  author = {Zhang, Jiayou and Meynard-Piganeau, Barthelemy and Gong, James and Cheng, Xingyi and Luo, Yingtao and Ly, Hugo and Song, Le and Xing, Eric},
99
  year = {2024},
100
+ booktitle={NeurIPS 2024 Workshop on AI for New Drug Modalities},
101
  }
102
+ ```
assets/images/architecture.png ADDED