---
license: mit
metrics:
- accuracy
tags:
- biology
pipeline_tag: text-classification
---
# Model description
In biology, targeting peptides (also called targeting signals, targeting sequences, signal peptides, or signal sequences) are short amino acid sequences located at the N-terminus or C-terminus of a protein. They direct the protein to a specific location within the cell, such as the mitochondria, chloroplasts and other plastids, or the endoplasmic reticulum. Targeting peptides act as sorting signals during protein synthesis, ensuring that each protein reaches its intended cellular destination.  

**TarPepSubLoc-ESM2** (TarPepSubLoc, Targeting Peptide Subcellular Localization) is a protein language model fine-tuned from the pretrained [**ESM2**](https://github.com/facebookresearch/esm) model [(***facebook/esm2_t36_3B_UR50D***)](https://huggingface.co/facebook/esm2_t36_3B_UR50D) on a targeting peptide subcellular localization dataset with five classes.  

**TarPepSubLoc-ESM2** achieved the following results:  
Train Loss: 0.0385  
Train Accuracy: 0.9881  
Validation Loss: 0.0566  
Validation Accuracy: 0.9812  
Epoch: 20 
# The dataset for training **TarPepSubLoc-ESM2**
The full dataset contains 13,005 protein sequences in five classes: SP (2,697), MT (499), CH (227), TH (45), and Other (9,537).
The highly imbalanced class sizes pose a significant challenge for classification. The class labels are:  
- "SP" for signal peptide,
- "MT" for mitochondrial transit peptide (mTP),
- "CH" for chloroplast transit peptide (cTP),
- "TH" for thylakoidal lumen composite transit peptide (lTP),
- "Other" for no targeting peptide (in this case, the length is given as 0).

The dataset was downloaded from the website at [**TargetP - 2.0**](https://services.healthtech.dtu.dk/services/TargetP-2.0/).  
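
To make the class imbalance concrete, the sketch below computes each class's share of the dataset and a set of inverse-frequency ("balanced") class weights from the counts listed above. This is only an illustration: the model card does not state how imbalance was handled during training, and the weighting formula here is an assumption, not the author's method.

```python
# Class counts as reported in the dataset description above.
counts = {"SP": 2697, "MT": 499, "CH": 227, "TH": 45, "Other": 9537}
total = sum(counts.values())

# Share of each class in the full dataset.
shares = {label: n / total for label, n in counts.items()}

# Inverse-frequency weights: total / (num_classes * count).
# Assumption for illustration only; not taken from the training code.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(f"{label:>5}: n={counts[label]:>5}  share={shares[label]:.2%}  weight={weights[label]:.2f}")
```

Weights of this kind could, for example, be passed as the `weight` argument of `torch.nn.CrossEntropyLoss` so that errors on rare classes such as TH count for more.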
# Model training code at GitHub
https://github.com/pengsihua2023/TarPepSubLoc-ESM2  

# How to use **TarPepSubLoc-ESM2**
### An example
PyTorch and the Transformers library must be installed on your system.  
### Install pytorch
```
pip install torch torchvision torchaudio

```
### Install transformers
```
pip install transformers

```
### Run the following code
```
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer from Hugging Face
model_name = "sihuapeng/TarPepSubLoc-ESM2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Define the amino acid sequence
sequence = "MNSLLMITACLALVGTVWAKEGYLVNSYTGCKFECFKLGDNDYCLRECRQQYGKGSGGYCYAFGCWCTHLYEQAVVWPLPNKTCNGK"

# Tokenize the sequence
inputs = tokenizer(sequence, return_tensors="pt")

# Make the prediction
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    # Pick the highest-scoring class along the class dimension
    predicted_class_id = logits.argmax(dim=-1).item()

# Define the ID to Label mapping
id2label = {0: 'CH', 1: 'MT', 2: 'Other', 3: 'SP', 4: 'TH'}

# Get the predicted label
predicted_label = id2label[predicted_class_id]

print(f"The predicted class for the sequence is: {predicted_label}")

```
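
The example above reports only the top class. If per-class probabilities are also of interest, a softmax over the logits gives them. The following is a minimal sketch in plain Python; the helper name `rank_classes` and the example logits are hypothetical and serve only as illustration, while the `id2label` mapping is copied from the example above.

```python
import math

# Mapping copied from the example above.
id2label = {0: "CH", 1: "MT", 2: "Other", 3: "SP", 4: "TH"}

def rank_classes(logits_row, id2label):
    """Return (label, probability) pairs sorted from most to least likely."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits_row)
    exps = [math.exp(x - peak) for x in logits_row]
    norm = sum(exps)
    pairs = [(id2label[i], e / norm) for i, e in enumerate(exps)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Made-up logits for illustration only, not real model output.
example_logits = [-1.2, 0.3, 0.8, 4.1, -2.0]
ranking = rank_classes(example_logits, id2label)
for label, prob in ranking:
    print(f"{label}: {prob:.3f}")
```

With the model loaded as in the example above, the same helper can be called as `rank_classes(outputs.logits[0].tolist(), id2label)`.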

## Funding
This project was funded by the CDC through an award to Justin Bahl (BAA 75D301-21-R-71738).  
### Model architecture, coding and implementation
Sihua Peng  
## Group, Department and Institution  
### Lab: [Justin Bahl](https://bahl-lab.github.io/)  
### Department: [College of Veterinary Medicine Department of Infectious Diseases](https://vet.uga.edu/education/academic-departments/infectious-diseases/)  
### Institution: [The University of Georgia](https://www.uga.edu/)  

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c56e2d2d07296c7e35994f/2rlokZM1FBTxibqrM8ERs.png)