<h1>Model description</h1>
PAIR ([paper](https://www.biorxiv.org/content/10.1101/2024.07.22.604688)) is a flexible fine-tuning framework that improves the quality of protein representations for function prediction. PAIR uses a text decoder to guide the fine-tuning of a protein encoder, so that the learned representations capture the information contained in the diverse set of annotations in Swiss-Prot. This model is ESM2-650M ([repo](https://huggingface.co/facebook/esm2_t33_650M_UR50D)) fine-tuned with PAIR.
<h1>Intended use</h1>
The model can be used for feature extraction in protein function prediction tasks.
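One common way to use extracted features for function prediction is to train a small classifier on top of the frozen embeddings. The sketch below is only an illustration under stated assumptions: the features and labels are random placeholders rather than real PAIR outputs, the number of classes is hypothetical, and the hidden size of 1280 is that of ESM2-650M. It is not the evaluation protocol from the paper.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Placeholder data (assumption): in practice `features` would be the pooled
# embeddings produced by the PAIR encoder, and `labels` real function classes.
num_proteins, hidden_size, num_classes = 32, 1280, 4
features = torch.randn(num_proteins, hidden_size)
labels = torch.randint(0, num_classes, (num_proteins,))

# A linear probe: a single linear layer trained on frozen features.
probe = nn.Linear(hidden_size, num_classes)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

losses = []
for _ in range(20):
    optimizer.zero_grad()
    loss = loss_fn(probe(features), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Predicted class index for each protein.
predictions = probe(features).argmax(dim=1)
```

Because the encoder stays frozen, the features can be computed once and cached, and only the small probe needs training.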
<h1>How to load the model for feature extraction?</h1>
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# PAIR fine-tunes ESM2-650M, so it reuses the ESM2 tokenizer.
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = AutoModel.from_pretrained("h4duan/PAIR-esm2").to(device)

protein = ["AETCZAO"]

def extract_feature(protein):
    # return_tensors="pt" already yields PyTorch tensors, so no torch.tensor() wrapping is needed.
    ids = tokenizer(protein, return_tensors="pt", padding=True, max_length=1024, truncation=True, return_attention_mask=True)
    input_ids = ids["input_ids"].to(device)
    attention_mask = ids["attention_mask"].to(device)
    with torch.no_grad():
        embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
    # Average over the sequence dimension to get one vector per protein.
    return torch.mean(embedding_repr, dim=1)

feature = extract_feature(protein)
```
<h1>How to extract features in batches?</h1>
```python
import torch

# Reuses `tokenizer` and `model` from the loading snippet.
proteins = ["AETCZAO", "SKTZP"]

def extract_features_batch(proteins):
    ids = tokenizer(proteins, return_tensors="pt", padding=True, max_length=1024, truncation=True, return_attention_mask=True)
    # Move the inputs to whichever device the model lives on.
    input_ids = ids["input_ids"].to(model.device)
    attention_mask = ids["attention_mask"].to(model.device)
    with torch.no_grad():
        embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
    # Expand the mask to the hidden size so padded positions are zeroed out.
    attention_mask = attention_mask.unsqueeze(-1)
    attention_mask = attention_mask.expand(-1, -1, embedding_repr.size(-1))
    masked_embedding_repr = embedding_repr * attention_mask
    # Sum over real tokens only, then divide by the number of real tokens.
    sum_embedding_repr = masked_embedding_repr.sum(dim=1)
    non_zero_count = attention_mask.sum(dim=1)
    mean_embedding_repr = sum_embedding_repr / non_zero_count
    return mean_embedding_repr

feature = extract_features_batch(proteins)
```
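The batch version masks padded positions before averaging because sequences of different lengths are padded to a common length, and a plain mean would mix padding vectors into the shorter sequences. The toy example below illustrates this with hand-written tensors standing in for `last_hidden_state`. The helper name `masked_mean` and the values are made up for illustration, and no model is loaded.

```python
import torch

def masked_mean(hidden, attention_mask):
    """Average hidden states over real tokens only (mask == 1)."""
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (batch, seq, 1)
    summed = (hidden * mask).sum(dim=1)                   # padded positions contribute 0
    counts = mask.sum(dim=1).clamp(min=1)                 # number of real tokens per sequence
    return summed / counts

# Toy stand-in for last_hidden_state: 2 sequences, 3 positions, hidden size 2.
hidden = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],  # last position is padding
])
attention_mask = torch.tensor([[1, 1, 1], [1, 1, 0]])

pooled = masked_mean(hidden, attention_mask)  # ignores the padded position
naive = hidden.mean(dim=1)                    # averages padding in as well

print(pooled)
print(naive)
```

For the first sequence, which has no padding, both results agree. For the second, only the masked mean excludes the padding vector. This is also why a plain `torch.mean` is sufficient when a single sequence is encoded on its own.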