---
license: mit
datasets:
- adrianhenkel/lucidprots_full_data
pipeline_tag: translation
tags:
- biology
---
# Model Card for ProstT5

<!-- Provide a quick summary of what the model is/does. -->

ProstT5 is a protein language model (pLM) that can translate between protein sequence and structure.

![ProstT5 pre-training and inference](./prostt5_sketch2.png)

## Model Details

### Model Description

ProstT5 (Protein structure-sequence T5) is based on [ProtT5-XL-U50](https://huggingface.co/Rostlab/prot_t5_xl_uniref50), a T5 model trained to encode protein sequences using span corruption applied to billions of protein sequences.
ProstT5 finetunes [ProtT5-XL-U50](https://huggingface.co/Rostlab/prot_t5_xl_uniref50) to translate between protein sequence and structure, using 17M proteins with high-quality 3D structure predictions from the AlphaFoldDB.
Protein structure is converted from 3D to 1D using the 3Di tokens introduced by [Foldseek](https://github.com/steineggerlab/foldseek).
In a first step, ProstT5 learned to represent the newly introduced 3Di tokens by continuing the original span-denoising objective, now applied to both 3Di and amino acid (AA) sequences.
Only in a second step was ProstT5 trained to translate between the two modalities.
The direction of the translation is indicated by two special tokens ("\<fold2AA>" for translating from 3Di to AAs, "\<AA2fold>" for translating from AAs to 3Di).
To avoid clashes with AA tokens, 3Di tokens were cast to lower-case (the alphabets are otherwise identical).
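For illustration, a preprocessed model input looks like this (hypothetical example sequences; the exact preprocessing steps are shown in the code further below):

```python
# Hypothetical, illustrative inputs only (see "How to Get Started with the Model"):
aa_input  = "<AA2fold> M K T A Y I A K"   # amino acids: upper-case, white-space separated
tdi_input = "<fold2AA> d p v q l v l d"   # 3Di tokens: lower-case, white-space separated
```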

- **Developed by:** Michael Heinzinger (GitHub [@mheinzinger](https://github.com/mheinzinger); Twitter [@HeinzingerM](https://twitter.com/HeinzingerM))
- **Model type:** Encoder-decoder (T5)
- **Language(s) (NLP):** Protein sequence and structure
- **License:** MIT
- **Finetuned from model:** [ProtT5-XL-U50](https://huggingface.co/Rostlab/prot_t5_xl_uniref50)

## Uses

1. The model can be used for traditional feature extraction. For this, we recommend using only the [encoder](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5EncoderModel) in half-precision (fp16) together with batching. Examples (currently only for the original [ProtT5-XL-U50](https://huggingface.co/Rostlab/prot_t5_xl_uniref50), but replacing the repository links and adding the prefixes works): [script](https://github.com/agemagician/ProtTrans/blob/master/Embedding/prott5_embedder.py) and [colab](https://colab.research.google.com/drive/1h7F5v5xkE_ly-1bTQSu-1xaLtTP2TnLF?usp=sharing).
2. While the original [ProtT5-XL-U50](https://huggingface.co/Rostlab/prot_t5_xl_uniref50) could only embed AA sequences, ProstT5 can now also embed 3D structures represented by 3Di tokens. 3Di tokens can either be derived from 3D structures using Foldseek or predicted from AA sequences by ProstT5.
3. "Folding": Translation from sequence (AAs) to structure (3Di). The resulting 3Di strings can be used together with [Foldseek](https://github.com/steineggerlab/foldseek) for remote homology detection while avoiding the explicit computation of 3D structures.
4. "Inverse folding": Translation from structure (3Di) to sequence (AA).

## How to Get Started with the Model

Feature extraction:
```python
from transformers import T5Tokenizer, T5EncoderModel
import torch
import re

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('Rostlab/ProstT5', do_lower_case=False)

# Load the encoder-only model
model = T5EncoderModel.from_pretrained("Rostlab/ProstT5").to(device)

# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)
model.float() if device.type == 'cpu' else model.half()

# prepare your protein sequences/structures as a list.
# Amino acid sequences are expected to be upper-case ("PRTEINO" below)
# while 3Di sequences need to be lower-case ("strct" below).
sequence_examples = ["PRTEINO", "strct"]

# replace all rare/ambiguous amino acids by X (3Di sequences do not have those) and introduce white-space between all tokens (AAs and 3Di)
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

# add prefixes accordingly (this already expects 3Di sequences to be lower-case)
# if you go from AAs to 3Di (or if you want to embed AAs), you need to prepend "<AA2fold>"
# if you go from 3Di to AAs (or if you want to embed 3Di), you need to prepend "<fold2AA>"
sequence_examples = ["<AA2fold>" + " " + s if s.isupper() else "<fold2AA>" + " " + s
                     for s in sequence_examples]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest", return_tensors='pt').to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(
        ids.input_ids,
        attention_mask=ids.attention_mask
    )

# extract residue embeddings for the first ([0,:]) sequence in the batch and remove padded & special tokens, incl. prefix ([0,1:8])
emb_0 = embedding_repr.last_hidden_state[0, 1:8]  # shape (7 x 1024)
# same for the second ([1,:]) sequence but taking into account its different sequence length ([1,1:6])
emb_1 = embedding_repr.last_hidden_state[1, 1:6]  # shape (5 x 1024)

# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0)  # shape (1024)
```
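If your batch mixes sequence lengths, the hard-coded slices above can be replaced by a generic loop. A minimal sketch, assuming the token layout produced by the tokenizer call above (prefix token, one token per residue, then the end-of-sequence token):

```python
# Sketch: derive per-protein embeddings for every sequence in the batch.
per_protein_embs = []
for i in range(embedding_repr.last_hidden_state.shape[0]):
    seq_len = int(ids.attention_mask[i].sum())                        # prefix + residues + </s>
    residue_emb = embedding_repr.last_hidden_state[i, 1:seq_len - 1]  # drop prefix and </s>
    per_protein_embs.append(residue_emb.mean(dim=0))                  # mean-pool over residues
```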

Translation ("folding", i.e., AA to 3Di):
```python
from transformers import T5Tokenizer, AutoModelForSeq2SeqLM
import torch
import re

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('Rostlab/ProstT5', do_lower_case=False)

# Load the model
model = AutoModelForSeq2SeqLM.from_pretrained("Rostlab/ProstT5").to(device)

# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)
model.float() if device.type == 'cpu' else model.half()

# prepare your amino acid sequences as a list. Amino acid sequences are expected to be upper-case ("PRTEINO" below)
folding_example = ["PRTEINO", "SEQWENCE"]
min_len = min([len(s) for s in folding_example])
max_len = max([len(s) for s in folding_example])

# replace all rare/ambiguous amino acids by X and introduce white-space between all residues
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in folding_example]

# add the prefix. For the translation from AAs to 3Di, you need to prepend "<AA2fold>"
sequence_examples = ["<AA2fold>" + " " + s for s in sequence_examples]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest", return_tensors='pt').to(device)

# Generation configuration
gen_kwargs_aa2fold = {
    "do_sample": True,
    "num_beams": 3,
    "top_p": 0.95,
    "temperature": 1.2,
    "top_k": 6,
    "repetition_penalty": 1.2,
}

# translate from AA to 3Di
with torch.no_grad():
    target = model.generate(
        ids.input_ids,
        attention_mask=ids.attention_mask,
        max_length=max_len,       # max length of generated text
        min_length=min_len,       # minimum length of the generated text
        early_stopping=True,      # stop early if end-of-text token is generated
        num_return_sequences=1,   # return only a single sequence per input
        **gen_kwargs_aa2fold
    )

# Decode and remove white-spaces between tokens
t_strings = tokenizer.batch_decode(target, skip_special_tokens=True)
t_strings = ["".join(ts.split(" ")) for ts in t_strings]
```
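The decoded strings are lower-case 3Di sequences whose lengths are bounded by the min_length/max_length settings above. For example, to inspect them next to their inputs:

```python
# Pair each input AA sequence with its predicted 3Di string
for aa, tdi in zip(folding_example, t_strings):
    print(f"{aa} -> {tdi}")
```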

Translation ("inverse folding", i.e., 3Di to AA):
```python
from transformers import T5Tokenizer, AutoModelForSeq2SeqLM
import torch

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('Rostlab/ProstT5', do_lower_case=False)

# Load the model
model = AutoModelForSeq2SeqLM.from_pretrained("Rostlab/ProstT5").to(device)

# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)
model.float() if device.type == 'cpu' else model.half()

# prepare your 3Di strings as a list. 3Di sequences need to be lower-case ("prtein" and "strctr" below)
folding_example = ["prtein", "strctr"]
min_len = min([len(s) for s in folding_example])
max_len = max([len(s) for s in folding_example])

# introduce white-space between all 3Di tokens (3Di sequences do not contain rare/ambiguous characters, so no replacement is needed)
sequence_examples = [" ".join(list(sequence)) for sequence in folding_example]

# add the prefix. For the translation from 3Di to AAs, you need to prepend "<fold2AA>"
sequence_examples = ["<fold2AA>" + " " + s for s in sequence_examples]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest", return_tensors='pt').to(device)

# Generation configuration
gen_kwargs_fold2AA = {
    "do_sample": True,
    "top_p": 0.90,
    "temperature": 1.1,
    "top_k": 6,
    "repetition_penalty": 1.2,
}

# translate from 3Di to AA
with torch.no_grad():
    target = model.generate(
        ids.input_ids,
        attention_mask=ids.attention_mask,
        max_length=max_len,       # max length of generated text
        min_length=min_len,       # minimum length of the generated text
        early_stopping=True,      # stop early if end-of-text token is generated
        num_return_sequences=1,   # return only a single sequence per input
        **gen_kwargs_fold2AA
    )

# Decode and remove white-spaces between tokens
t_strings = tokenizer.batch_decode(target, skip_special_tokens=True)
t_strings = ["".join(ts.split(" ")) for ts in t_strings]
```
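As an optional follow-up (an illustrative sketch, not part of the original workflow), the back-translated AA sequences can be written to a FASTA file for downstream tools; the file and record names below are arbitrary:

```python
# Write the generated AA sequences to a FASTA file
with open("prostt5_inverse_folding.fasta", "w") as handle:
    for idx, aa_seq in enumerate(t_strings):
        handle.write(f">prediction_{idx}\n{aa_seq}\n")
```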

## Training Details

### Training Data

[Pre-training data (3Di+AA sequences for 17M proteins)](https://huggingface.co/datasets/adrianhenkel/lucidprots_full_data)

### Training Procedure

The first phase of pre-training continued span-based denoising on 3Di and AA sequences, using this [script](https://github.com/huggingface/transformers/blob/main/examples/flax/language-modeling/run_t5_mlm_flax.py).
For the second phase of pre-training (the actual translation from 3Di to AA sequences and vice versa), we used this [script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/summarization/run_summarization_no_trainer.py).

#### Training Hyperparameters

- **Training regime:** we used DeepSpeed (stage 2), gradient accumulation (5 steps), mixed half-precision (bf16), and PyTorch 2.0's TorchInductor compiler
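
For orientation only (the exact configuration files are not part of this card), the settings listed above roughly correspond to a DeepSpeed/PyTorch setup like the following sketch; everything beyond the listed values is an assumption:

```python
# Illustrative DeepSpeed (ZeRO stage-2 style) configuration matching the settings above
ds_config = {
    "bf16": {"enabled": True},          # mixed half-precision (bf16)
    "zero_optimization": {"stage": 2},  # DeepSpeed stage 2
    "gradient_accumulation_steps": 5,   # 5 gradient accumulation steps
}

# PyTorch 2.0's TorchInductor compiler is enabled via torch.compile, e.g.:
# model = torch.compile(model)
```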

#### Speed

Generating embeddings for the human proteome from the Pro(s)tT5 encoder requires around 35 minutes, i.e., roughly 0.1 s per protein, using batch-processing and half-precision (fp16) on a single RTX A6000 GPU with 48 GB vRAM.
Translation is comparatively slow (0.6-2.5 s per protein at average lengths of 135 and 406, respectively) due to the sequential nature of the decoding process, which has to generate left-to-right, token by token.
We only used batch-processing with half-precision without further optimization.