Model description
PAIR (paper) is a flexible fine-tuning framework that improves the quality of protein representations for function prediction. PAIR uses a text decoder to guide the fine-tuning of a protein encoder so that the learned representations capture the information contained in the diverse set of annotations in Swiss-Prot. This model is Prot-T5 (repo) fine-tuned with PAIR.
Intended use
The model can be used for feature extraction in protein function prediction tasks.
How to load the model?
import re
import torch
from transformers import AutoModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50")
model = AutoModel.from_pretrained("h4duan/PAIR-prott5").to("cuda")
protein = ["AETCZAO"]

def extract_feature(protein):
    # Map rare residues (U, Z, O, B) to X and separate residues with spaces
    protein = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in protein]
    ids = tokenizer(protein, return_tensors="pt", padding=True, max_length=1024, truncation=True, return_attention_mask=True)
    # The tokenizer already returns tensors, so move them to the GPU directly
    input_ids = ids['input_ids'].to("cuda")
    attention_mask = ids['attention_mask'].to("cuda")
    with torch.no_grad():
        embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
    # Unmasked mean over the sequence dimension; a single sequence has no padding
    return torch.mean(embedding_repr, dim=1)

feature = extract_feature(protein)
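The preprocessing step used in the snippet above can be checked on its own, without loading the model. The following is a minimal sketch; the helper name `preprocess` is hypothetical and not part of the PAIR or Prot-T5 code. It applies the same regular expression as the snippet, replacing the residues U, Z, O, and B with X and separating residues with spaces, which is the input format the snippet passes to the tokenizer.

```python
import re

def preprocess(sequence: str) -> str:
    """Replace U, Z, O, B with X and put a space between residues.

    `preprocess` is a hypothetical helper that mirrors the one-line
    expression used in the model card's snippets.
    """
    return " ".join(re.sub(r"[UZOB]", "X", sequence))

print(preprocess("AETCZAO"))  # A E T C X A X
print(preprocess("SKTZP"))    # S K T X P
```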
How to extract the features in batch?
# Reuses the tokenizer and model loaded above (re and torch must be imported)
proteins = ["AETCZAO", "SKTZP"]

def extract_features(proteins):
    sequences = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in proteins]
    ids = tokenizer.batch_encode_plus(sequences, add_special_tokens=True, padding='max_length',
                                      max_length=1024, truncation=True)
    input_ids = torch.tensor(ids['input_ids']).to("cuda")
    attention_mask = torch.tensor(ids['attention_mask']).to("cuda")
    with torch.no_grad():
        embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
    # Zero out padded positions so they do not contribute to the mean
    attention_mask = attention_mask.unsqueeze(-1)
    attention_mask = attention_mask.expand(-1, -1, embedding_repr.size(-1))
    masked_embedding_repr = embedding_repr * attention_mask
    sum_embedding_repr = masked_embedding_repr.sum(dim=1)
    # Number of real (non-padding) tokens per sequence
    non_zero_count = attention_mask.sum(dim=1)
    mean_embedding_repr = sum_embedding_repr / non_zero_count
    return mean_embedding_repr

features = extract_features(proteins)
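The batch function above averages only over real tokens, because sequences of different lengths are padded to a common length and the padded positions would otherwise skew the mean. The toy example below is a sketch of that masked mean on hand-made tensors, run on CPU without the model. The function name `masked_mean` and the `clamp(min=1)` guard against empty masks are assumptions added for illustration and do not appear in the original snippet.

```python
import torch

def masked_mean(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean over the sequence dimension, ignoring positions where mask == 0.

    hidden: (batch, seq_len, dim) token embeddings
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding
    `masked_mean` is a hypothetical helper, not part of the PAIR code.
    """
    mask = mask.unsqueeze(-1).to(hidden.dtype)   # (batch, seq_len, 1), broadcasts over dim
    summed = (hidden * mask).sum(dim=1)          # sum of real-token embeddings
    counts = mask.sum(dim=1).clamp(min=1)        # assumption: avoid division by zero
    return summed / counts

# One sequence with two real tokens and one padded position
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

pooled = masked_mean(hidden, mask)   # averages only the first two positions
naive = hidden.mean(dim=1)           # also averages the padded position

print(pooled)
print(naive)
```

Here the masked result is the average of the two real tokens, while the plain `mean` is pulled far away by the padded position.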