---
license: mit
tags:
- biology
- protein
---
# PLTNUM-ESM2-HeLa
PLTNUM is a protein language model trained to predict protein half-lives from their amino acid sequences.  
This model is based on [facebook/esm2_t33_650M_UR50D](https://huggingface.co/facebook/esm2_t33_650M_UR50D) and was trained on a protein half-life dataset from the HeLa human cell line ([paper link](https://pubmed.ncbi.nlm.nih.gov/29414762/)).

### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** https://github.com/sagawatatsuya/PLTNUM
- **Paper:** [Prediction of Protein Half-lives from Amino Acid Sequences by Protein Language Models](https://www.biorxiv.org/content/10.1101/2024.09.10.612367v1)
- **Demo:** https://huggingface.co/spaces/sagawa/PLTNUM

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoConfig, PreTrainedModel, AutoTokenizer


class PLTNUM_PreTrainedModel(PreTrainedModel):
    config_class = AutoConfig

    def __init__(self, config):
        super().__init__(config)
        # Backbone protein language model named in the config.
        self.model = AutoModel.from_pretrained(self.config._name_or_path)

        # Two dropout rates feed one shared linear head (averaged in forward).
        self.fc_dropout1 = nn.Dropout(0.8)
        self.fc_dropout2 = nn.Dropout(0.4)
        self.fc = nn.Linear(self.config.hidden_size, 1)
        self._init_weights(self.fc)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                nn.init.constant_(module.bias, 0)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)
            if module.padding_idx is not None:
                nn.init.constant_(module.weight[module.padding_idx], 0.0)
        elif isinstance(module, nn.LayerNorm):
            nn.init.constant_(module.bias, 0)
            nn.init.constant_(module.weight, 1.0)

    def forward(self, inputs):
        outputs = self.model(**inputs)
        # Hidden state of the first token, used as the sequence representation.
        last_hidden_state = outputs.last_hidden_state[:, 0]
        # Average the head output over the two dropout branches.
        output = (
            self.fc(self.fc_dropout1(last_hidden_state))
            + self.fc(self.fc_dropout2(last_hidden_state))
        ) / 2
        return output

    def create_embedding(self, inputs):
        # Return the first-token hidden state without applying the head.
        outputs = self.model(**inputs)
        last_hidden_state = outputs.last_hidden_state[:, 0]
        return last_hidden_state


model = PLTNUM_PreTrainedModel.from_pretrained("sagawa/PLTNUM-ESM2-HeLa")
tokenizer = AutoTokenizer.from_pretrained("sagawa/PLTNUM-ESM2-HeLa")
model.eval()  # make sure dropout is disabled for inference

seq = "MSGRGKQGGKARAKAKTRSSRAGLQFPVGRVHRLLRKGNYSERVGAGAPVYLAAVLEYLTAEILELAGNAARDNKKTRIIPRHLQLAIRNDEELNKLLGRVTIAQGGVLPNIQAVLLPKKTESHHKPKGK"
# Named `inputs` to avoid shadowing the built-in `input`.
inputs = tokenizer(
    [seq],
    add_special_tokens=True,
    max_length=512,
    padding="max_length",
    truncation=True,
    return_offsets_mapping=False,
    return_attention_mask=True,
    return_tensors="pt",
)
# No gradients are needed for prediction; sigmoid maps the raw output to (0, 1).
with torch.no_grad():
    print(torch.sigmoid(model(inputs)))
```
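
To make the last step of the snippet above easier to follow, here is a dependency-free sketch of what the prediction head computes at inference time. It assumes dropout acts as the identity in evaluation mode, so both branches of `forward` see the same hidden state and their average equals a single pass through the linear layer. The helper names (`linear`, `head_logit`) and the toy three-dimensional hidden state, weights, and bias are made up for illustration and are not values from the trained model.

```python
import math


def linear(hidden, weight, bias):
    """Dot product plus bias: a one-output linear layer (hypothetical helper)."""
    return sum(h * w for h, w in zip(hidden, weight)) + bias


def head_logit(hidden, weight, bias):
    """Average of two branches through one shared linear layer.

    Assumption: in evaluation mode dropout is the identity, so both
    branches receive the same hidden state.
    """
    branch1 = linear(hidden, weight, bias)
    branch2 = linear(hidden, weight, bias)
    return (branch1 + branch2) / 2


def sigmoid(x):
    """Map a raw output to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Toy values for illustration only, not taken from the trained model.
hidden = [0.5, -1.0, 2.0]
weight = [0.2, 0.4, -0.1]
bias = 0.1

logit = head_logit(hidden, weight, bias)
score = sigmoid(logit)
print(logit, score)
```

With dropout inactive, the averaged output is the same as applying the linear layer once, and the sigmoid turns that raw value into a score between 0 and 1.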

## Citation
Prediction of Protein Half-lives from Amino Acid Sequences by Protein Language Models  
Tatsuya Sagawa, Eisuke Kanao, Kosuke Ogata, Koshi Imami, Yasushi Ishihama  
bioRxiv 2024.09.10.612367; doi: https://doi.org/10.1101/2024.09.10.612367
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->