Create README.md
README.md
<h1>Model descriptions</h1>

PAIR ([paper](https://www.biorxiv.org/content/10.1101/2024.07.22.604688)) is a flexible fine-tuning framework that improves the quality of protein representations for function prediction. PAIR uses a text decoder to guide the fine-tuning of a protein encoder, so that the learned representations capture the information contained in the diverse set of annotations in Swiss-Prot. This model is ESM2-650M ([repo](https://huggingface.co/facebook/esm2_t33_650M_UR50D)) fine-tuned with PAIR.
<h1>Intended use</h1>

The model can be used for feature extraction in protein function prediction tasks.
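One simple way to use extracted features for function prediction is nearest-neighbour annotation transfer: a query protein takes the label of the annotated protein whose embedding is most similar. The sketch below is illustrative only. The 3-dimensional vectors stand in for real embeddings, and the labels and the helper names `cosine_similarity` and `predict_label` are hypothetical, not part of PAIR or `transformers`.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def predict_label(query, reference_embeddings, reference_labels):
    """Return the label of the reference embedding closest to the query."""
    scores = [cosine_similarity(query, ref) for ref in reference_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return reference_labels[best]

# Toy stand-ins for embeddings of annotated proteins (hypothetical values).
reference_embeddings = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
reference_labels = ["kinase", "transporter"]

# Toy stand-in for the embedding of an unannotated protein.
query_embedding = [0.8, 0.2, 0.1]

predicted = predict_label(query_embedding, reference_embeddings, reference_labels)
print(predicted)
```

In practice the vectors would come from the feature extraction function described later in this README, and a trained classifier could replace the nearest-neighbour lookup.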
<h1>How to load the model?</h1>

```python
from transformers import AutoTokenizer, AutoModel

# The tokenizer comes from the base ESM2-650M checkpoint.
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
# The encoder weights are the PAIR fine-tuned ones.
model = AutoModel.from_pretrained("h4duan/PAIR-esm2")
```
<h1>How to extract the features?</h1>

```python
import torch

proteins = ["AETCZAO", "SKTZP"]

def extract_features(proteins, hidden_layer=-1):
    # Uses the `tokenizer` and `model` loaded in the previous section.
    # The original snippet does not say which layer to use;
    # hidden_layer=-1 (the last layer) is an assumption.
    device = next(model.parameters()).device
    ids = tokenizer(proteins, return_tensors="pt", padding=True, max_length=1024, truncation=True, return_attention_mask=True)
    input_ids = ids["input_ids"].to(device)
    attention_mask = ids["attention_mask"].to(device)
    with torch.no_grad():
        embedding_repr = model(output_hidden_states=True, input_ids=input_ids, attention_mask=attention_mask).hidden_states
    embedding_repr = embedding_repr[hidden_layer]
    # Mean-pool over the positions where the attention mask is 1 (no padding).
    attention_mask = attention_mask.unsqueeze(-1)
    attention_mask = attention_mask.expand(-1, -1, embedding_repr.size(-1))
    masked_embedding_repr = embedding_repr * attention_mask
    sum_embedding_repr = masked_embedding_repr.sum(dim=1)
    non_zero_count = attention_mask.sum(dim=1)
    mean_embedding_repr = sum_embedding_repr / non_zero_count
    return mean_embedding_repr
```
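The pooling step in `extract_features` averages token embeddings over the positions where the attention mask is 1, so padding does not dilute the result. The toy example below reproduces that arithmetic in plain Python on hand-written numbers. The function name `masked_mean` and the values are illustrative, not part of the model's API.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average the embedding vectors at positions where the mask is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    return [t / count for t in total]

# One sequence padded to length 4; the last position is padding.
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

pooled = masked_mean(token_embeddings, attention_mask)
print(pooled)  # [3.0, 4.0]
```

The padded position `[9.0, 9.0]` has no effect on the result. Note that in `extract_features` the mask also covers the special tokens added by the tokenizer, so those positions are included in the average.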