Model descriptions

PAIR ([paper](https://www.biorxiv.org/content/10.1101/2024.07.22.604688)) is a flexible fine-tuning framework to improve the quality of protein representations for function predictions. PAIR uses a text decoder to guide the fine-tuning process of a protein encoder so that the learned representations can extract information contained within the diverse set of annotations in Swiss-Prot. This model fine-tunes ESM-1b ([repo](https://huggingface.co/facebook/esm1b_t33_650M_UR50S)) with PAIR.

Intended use

The model can be used for feature extractions in protein function prediction tasks.

How to load the model for feature extractions?

```python from transformers import AutoTokenizer, AutoModel tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D") model = AutoModel.from_pretrained("h4duan/PAIR-esm1b").to("cuda") protein = ["AETCZAO"] def extract_feature(protein): ids = tokenizer(protein, return_tensors="pt", padding=True, max_length=1024, truncation=True, return_attention_mask=True) input_ids = torch.tensor(ids['input_ids']).to("cuda") attention_mask = torch.tensor(ids['attention_mask']).to("cuda") with torch.no_grad(): embedding_repr = model(input_ids=input_ids,attention_mask=attention_mask).last_hidden_state return torch.mean(embedding_repr, dim=1) feature = extract_feature(protein) ```

How to extract the features in batch?

```python proteins = ["AETCZAO","SKTZP"] def extract_features_batch(proteins): ids = tokenizer(proteins, return_tensors="pt", padding=True, max_length=1024, truncation=True, return_attention_mask=True) input_ids = torch.tensor(ids['input_ids']).to("cuda") attention_mask = torch.tensor(ids['attention_mask']).to("cuda") with torch.no_grad(): embedding_repr = model(input_ids=input_ids,attention_mask=attention_mask).last_hidden_state attention_mask = attention_mask.unsqueeze(-1) attention_mask = attention_mask.expand(-1, -1, embedding_repr.size(-1)) masked_embedding_repr = embedding_repr * attention_mask sum_embedding_repr = masked_embedding_repr.sum(dim=1) non_zero_count = attention_mask.sum(dim=1) mean_embedding_repr = sum_embedding_repr / non_zero_count return mean_embedding_repr feature = extract_feature_batch(proteins) ```