h4duan committed on
Commit
d462c50
1 Parent(s): 0a9c0dd

Create README.md

Files changed (1)
  1. README.md +35 -0
README.md ADDED
@@ -0,0 +1,35 @@
+ <h1>Model description</h1>
+
+ PAIR ([paper](https://www.biorxiv.org/content/10.1101/2024.07.22.604688)) is a flexible fine-tuning framework that improves the quality of protein representations for function prediction. PAIR uses a text decoder to guide the fine-tuning of a protein encoder, so that the learned representations capture the information contained in the diverse annotations in Swiss-Prot. This model is ESM2-650M ([repo](https://huggingface.co/facebook/esm2_t33_650M_UR50D)) fine-tuned with PAIR.
+
+ <h1>Intended use</h1>
+
+ The model can be used for feature extraction in protein function prediction tasks.
+
+ <h1>How to load the model?</h1>
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
+ model = AutoModel.from_pretrained("h4duan/PAIR-esm2")
+ ```
+
+ <h1>How to extract the features?</h1>
+
+ ```python
+ import torch
+ proteins = ["AETCZAO", "SKTZP"]
+ def extract_features(proteins, hidden_layer=-1):  # hidden_layer was unspecified in the original snippet; -1 (last layer) is an assumed default
+     # Tokenize the batch, padding to the longest sequence and truncating at 1024 tokens
+     inputs = tokenizer(proteins, return_tensors="pt", padding=True, max_length=1024, truncation=True, return_attention_mask=True)
+     input_ids = inputs["input_ids"].to(model.device)
+     attention_mask = inputs["attention_mask"].to(model.device)
+     with torch.no_grad():
+         hidden_states = model(output_hidden_states=True, input_ids=input_ids, attention_mask=attention_mask).hidden_states
+     embedding_repr = hidden_states[hidden_layer]
+     # Mean-pool over the hidden states, ignoring padding positions
+     mask = attention_mask.unsqueeze(-1)  # (batch, length, 1); broadcasts over the hidden dimension
+     sum_embedding_repr = (embedding_repr * mask).sum(dim=1)
+     non_zero_count = mask.sum(dim=1)
+     return sum_embedding_repr / non_zero_count
+ ```
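
The masked mean pooling at the end of `extract_features` can be checked by hand on a toy batch, without loading the model. The sketch below is plain Python rather than PyTorch, and the helper `masked_mean` and all numbers in it are made up for illustration; it only mirrors the arithmetic (multiply by the attention mask, sum over positions, divide by the number of unmasked positions).

```python
def masked_mean(embeddings, mask):
    """Average token vectors per sequence, skipping positions where mask is 0."""
    pooled = []
    for seq_vectors, seq_mask in zip(embeddings, mask):
        dim = len(seq_vectors[0])
        total = [0.0] * dim
        for vec, keep in zip(seq_vectors, seq_mask):
            for j in range(dim):
                total[j] += vec[j] * keep  # padded positions contribute 0
        count = sum(seq_mask)  # number of unmasked positions
        pooled.append([t / count for t in total])
    return pooled


# Toy batch: 2 sequences, 3 token positions, hidden size 2 (hypothetical values).
embeddings = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]],  # last position is padding
]
mask = [[1, 1, 1], [1, 1, 0]]

pooled = masked_mean(embeddings, mask)
print(pooled)  # [[3.0, 4.0], [3.0, 1.0]]
```

The padded vector `[9.0, 9.0]` does not affect the second result, which is why padding a batch to a common length leaves each protein's pooled feature unchanged.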