ligeti committed on
Commit
a25312e
1 Parent(s): 5a914a1

Update README.md

Files changed (1)
  1. README.md +167 -0
README.md CHANGED
@@ -0,0 +1,167 @@
1
+ ---
2
+ license: cc-by-nc-4.0
3
+ tags:
4
+ - prokbert
5
+ - bioinformatics
6
+ - genomics
7
+ - sequence embedding
8
+ - genomic language models
9
+ - nucleotide
10
+ - dna-sequence
11
+ - promoter-prediction
12
+ - phage
13
+ ---
14
+ ## ProkBERT-mini-phage Model
15
+
16
+ This fine-tuned model is designed for phage sequence identification and is based on the [ProkBERT-mini model](https://huggingface.co/neuralbioinfo/prokbert-mini).
17
+
18
+ For more details, refer to the [phage dataset description](https://huggingface.co/datasets/neuralbioinfo/phage-test-10k) used for training and evaluating this model.
19
+
20
+ ### Example Usage
21
+
22
+ For practical examples on how to use this model, see the following Jupyter notebooks:
23
+
24
+ - [Training Notebook](https://colab.research.google.com/github/nbrg-ppcu/prokbert/blob/main/examples/Finetuning.ipynb): A guide to fine-tuning the ProkBERT-mini model for promoter identification tasks.
25
+ - [Evaluation Notebook](https://colab.research.google.com/github/nbrg-ppcu/prokbert/blob/main/examples/Inference.ipynb): Demonstrates how to evaluate the finetuned ProkBERT-mini-promoter model on test datasets.
26
+
27
+ ### Model Application
28
+
29
+ The model was trained for binary classification to distinguish between phage and non-phage (bacterial) sequences. The non-phage sequences were sampled randomly from the genomes of the phages' hosts.
30
+
31
+
32
+
33
+ ## Simple Usage Example
34
+
35
+ The following example demonstrates how to use the ProkBERT-mini-phage model to classify a DNA sequence:
36
+
37
+ ```python
38
+ from prokbert.prokbert_tokenizer import ProkBERTTokenizer
39
+ from transformers import MegatronBertForSequenceClassification
40
+ finetuned_model = "neuralbioinfo/prokbert-mini-phage"
41
+ kmer = 6
42
+ shift = 1
43
+
44
+ tok_params = {'kmer': kmer,
45
+               'shift': shift}
46
+ tokenizer = ProkBERTTokenizer(tokenization_params=tok_params)
47
+ model = MegatronBertForSequenceClassification.from_pretrained(finetuned_model)
48
+ sequence = 'CACCGCATGGAGATCGGCACCTACTTCGACAAGCTGGAGGCGCTGCTGAAGGAGTGGTACGAGGCGCGCGGGGGTGAGGCATGACGGACTGGCAAGAGGAGCAGCGTCAGCGC'
49
+ inputs = tokenizer(sequence, return_tensors="pt")
50
+ # Ensure that inputs have a batch dimension
51
+ inputs = {key: value.unsqueeze(0) for key, value in inputs.items()}
52
+ # Generate outputs from the model
53
+ outputs = model(**inputs)
54
+ print(outputs)
55
+ ```
56
+
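The printed `outputs` object is not very readable on its own. The sketch below shows one way to turn a pair of class scores into probabilities and a predicted label. It assumes the model returns an object with a `logits` field holding two scores per sequence, as Hugging Face sequence classification heads usually do. The example values are hypothetical, and which index corresponds to the phage class is an assumption that should be checked against the dataset description.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtract the maximum before exponentiating for numerical stability
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical scores for one sequence, NOT real model output.
# With the real model these would come from something like outputs.logits[0].tolist()
example_logits = [-1.2, 2.3]
probs = softmax(example_logits)
predicted_class = probs.index(max(probs))  # index of the most probable class
print(probs, predicted_class)
```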
57
+ ### Model Details
58
+
59
+ **Developed by:** Neural Bioinformatics Research Group
60
+
61
+ **Architecture:**
62
+
63
+ ...
64
+ **Tokenizer:** The model uses a 6-mer tokenizer with a shift of 1 (k6s1), specifically designed to handle DNA sequences efficiently.
65
+
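To illustrate what "6-mer with a shift of 1" means, here is a minimal sketch of overlapping k-mer splitting in plain Python. The function `kmer_tokenize` is a hypothetical helper written for this illustration. It does not reproduce `ProkBERTTokenizer` itself, which also handles special tokens and maps k-mers to vocabulary ids.

```python
def kmer_tokenize(sequence: str, kmer: int = 6, shift: int = 1) -> list[str]:
    """Split a DNA sequence into k-mers, moving `shift` bases between tokens."""
    sequence = sequence.upper()
    last_start = len(sequence) - kmer  # last position where a full k-mer fits
    return [sequence[i:i + kmer] for i in range(0, last_start + 1, shift)]

tokens = kmer_tokenize("CACCGCATGGAG")
print(tokens)
# ['CACCGC', 'ACCGCA', 'CCGCAT', 'CGCATG', 'GCATGG', 'CATGGA', 'ATGGAG']
```

With a shift of 1, a sequence of n bases yields n - 6 + 1 overlapping tokens, so neighbouring tokens share five bases.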
66
+ **Parameters:**
67
+
68
+ | Parameter | Description |
69
+ |----------------------|--------------------------------------|
70
+ | Model Size | 20.6 million parameters |
71
+ | Max. Context Size | 1024 bp |
72
+ | Training Data | 206.65 billion nucleotides |
73
+ | Layers | 6 |
74
+ | Attention Heads | 6 |
75
+
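Because the maximum context size is limited, longer sequences have to be cut into segments before classification. The following is a minimal sketch of non-overlapping segmentation, assuming a segment length of 1024 bp taken from the table above. The helper `split_into_segments` and its `min_length` cutoff are hypothetical and are not part of the ProkBERT package.

```python
def split_into_segments(sequence: str, segment_length: int = 1024,
                        min_length: int = 6) -> list[str]:
    """Cut a sequence into consecutive, non-overlapping segments."""
    segments = [sequence[start:start + segment_length]
                for start in range(0, len(sequence), segment_length)]
    # Drop a trailing piece that is too short to form even one 6-mer
    return [segment for segment in segments if len(segment) >= min_length]

genome_fragment = "ACGT" * 625  # 2500 bp toy sequence
segments = split_into_segments(genome_fragment)
print([len(segment) for segment in segments])  # [1024, 1024, 452]
```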
76
+ ### Intended Use
77
+
78
+ **Intended Use Cases:** ProkBERT-mini-phage is intended for bioinformatics researchers and practitioners focusing on genomic sequence analysis, including:
79
+ - Sequence classification tasks
80
+ - Exploration of genomic patterns and features
81
+
82
+
83
+ ### Installation of ProkBERT (if needed)
84
+
85
+ To set up ProkBERT in a Jupyter or Colab notebook, run the following cell, which installs the package only if it is not already available:
86
+
87
+ ```python
88
+ try:
89
+     import prokbert
90
+     print("ProkBERT is already installed.")
91
+ except ImportError:
92
+     !pip install prokbert
93
+     print("Installed ProkBERT.")
94
+ ```
95
+
96
+
97
+ ### Evaluation on phage recognition benchmark dataset
98
+
99
+ | method | L | auc_class1 | acc | f1 | mcc | recall | sensitivity | specificity | tn | fp | fn | tp | Np | Nn | eval_time |
100
+ |:--------------|-----:|-------------:|---------:|---------:|---------:|---------:|--------------:|--------------:|-----:|-----:|-----:|-----:|------:|------:|------------:|
101
+ | DeepVirFinder | 256 | 0.734914 | 0.627163 | 0.481213 | 0.309049 | 0.345317 | 0.345317 | 0.909856 | 4542 | 450 | 3278 | 1729 | 5007 | 4992 | 7580 |
102
+ | DeepVirFinder | 512 | 0.791423 | 0.708 | 0.637717 | 0.443065 | 0.521192 | 0.521192 | 0.889722 | 4510 | 559 | 2361 | 2570 | 4931 | 5069 | 2637 |
103
+ | DeepVirFinder | 1024 | 0.826255 | 0.7424 | 0.702678 | 0.505333 | 0.605651 | 0.605651 | 0.880579 | 4380 | 594 | 1982 | 3044 | 5026 | 4974 | 1294 |
104
+ | DeepVirFinder | 2048 | 0.853098 | 0.7717 | 0.743339 | 0.557177 | 0.6612 | 0.6612 | 0.8822 | 4411 | 589 | 1694 | 3306 | 5000 | 5000 | 1351 |
105
+ | INHERIT | 256 | 0.75982 | 0.6943 | 0.67012 | 0.393179 | 0.620008 | 0.620008 | 0.76883 | 3838 | 1154 | 1903 | 3105 | 5008 | 4992 | 2131 |
106
+ | INHERIT | 512 | 0.816326 | 0.7228 | 0.651408 | 0.479323 | 0.525248 | 0.525248 | 0.914973 | 4638 | 431 | 2341 | 2590 | 4931 | 5069 | 2920 |
107
+ | INHERIT | 1024 | 0.846547 | 0.7264 | 0.659447 | 0.495935 | 0.527059 | 0.527059 | 0.927825 | 4615 | 359 | 2377 | 2649 | 5026 | 4974 | 3055 |
108
+ | INHERIT | 2048 | 0.864122 | 0.7365 | 0.668595 | 0.518541 | 0.5316 | 0.5316 | 0.9414 | 4707 | 293 | 2342 | 2658 | 5000 | 5000 | 3225 |
109
+ | MINI | 256 | 0.846745 | 0.7755 | 0.766462 | 0.552855 | 0.735623 | 0.735623 | 0.815505 | 4071 | 921 | 1324 | 3684 | 5008 | 4992 | 6.68888 |
110
+ | MINI | 512 | 0.924973 | 0.8657 | 0.859121 | 0.732696 | 0.83046 | 0.83046 | 0.89998 | 4562 | 507 | 836 | 4095 | 4931 | 5069 | 16.3681 |
111
+ | MINI | 1024 | 0.956432 | 0.9138 | 0.911189 | 0.829645 | 0.879825 | 0.879825 | 0.94813 | 4716 | 258 | 604 | 4422 | 5026 | 4974 | 51.3319 |
112
+ | MINI-C | 256 | 0.827635 | 0.7512 | 0.7207 | 0.51538 | 0.640974 | 0.640974 | 0.861779 | 4302 | 690 | 1798 | 3210 | 5008 | 4992 | 7.33697 |
113
+ | MINI-C | 512 | 0.913378 | 0.8466 | 0.834876 | 0.69725 | 0.786453 | 0.786453 | 0.905109 | 4588 | 481 | 1053 | 3878 | 4931 | 5069 | 17.6749 |
114
+ | MINI-C | 1024 | 0.94644 | 0.8937 | 0.891564 | 0.788427 | 0.869479 | 0.869479 | 0.918175 | 4567 | 407 | 656 | 4370 | 5026 | 4974 | 54.204 |
115
+ | MINI-LONG | 256 | 0.777697 | 0.71495 | 0.686224 | 0.437727 | 0.622404 | 0.622404 | 0.807792 | 8065 | 1919 | 3782 | 6234 | 10016 | 9984 | 6.10304 |
116
+ | MINI-LONG | 512 | 0.880831 | 0.81405 | 0.798001 | 0.632855 | 0.744879 | 0.744879 | 0.881338 | 8935 | 1203 | 2516 | 7346 | 9862 | 10138 | 12.1307 |
117
+ | MINI-LONG | 1024 | 0.9413 | 0.88925 | 0.884917 | 0.781465 | 0.847195 | 0.847195 | 0.931745 | 9269 | 679 | 1536 | 8516 | 10052 | 9948 | 30.5088 |
118
+ | MINI-LONG | 2048 | 0.964551 | 0.929 | 0.927455 | 0.85878 | 0.9077 | 0.9077 | 0.9503 | 9503 | 497 | 923 | 9077 | 10000 | 10000 | 94.404 |
119
+ | Virsorter2 | 512 | 0.620782 | 0.6259 | 0.394954 | 0.364831 | 0.247617 | 0.247617 | 0.993884 | 5038 | 31 | 3710 | 1221 | 4931 | 5069 | 2057 |
120
+ | Virsorter2 | 1024 | 0.719898 | 0.7178 | 0.621919 | 0.51036 | 0.461799 | 0.461799 | 0.976478 | 4857 | 117 | 2705 | 2321 | 5026 | 4974 | 3258 |
121
+ | Virsorter2 | 2048 | 0.816142 | 0.8103 | 0.778724 | 0.647532 | 0.6676 | 0.6676 | 0.953 | 4765 | 235 | 1662 | 3338 | 5000 | 5000 | 5737 |
122
+
123
+
124
+ ### Column Descriptions
125
+
126
+ - **method**: The algorithm or method used for prediction (e.g., DeepVirFinder, INHERIT).
127
+ - **L**: Length of the genomic segment.
128
+ - **auc_class1**: Area under the ROC curve for class 1, indicating the model's ability to distinguish between classes.
129
+ - **acc**: Accuracy of the prediction, representing the proportion of true results (both true positives and true negatives) among the total number of cases examined.
130
+ - **f1**: The F1 score, the harmonic mean of precision and recall.
131
+ - **mcc**: Matthews correlation coefficient, a quality measure for binary (two-class) classifications.
132
+ - **recall**: The recall, or true positive rate, measures the proportion of actual positives that are correctly identified.
133
+ - **sensitivity**: Sensitivity or true positive rate; identical to recall.
134
+ - **specificity**: The specificity, or true negative rate, measures the proportion of actual negatives that are correctly identified.
135
+ - **fp**: The number of false positives, indicating how many negative class samples were incorrectly identified as positive.
136
+ - **tp**: The number of true positives, indicating how many positive class samples were correctly identified.
137
+ - **eval_time**: The time taken to evaluate the model or method, usually in seconds.
138
+
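The metric columns can be recomputed from the four confusion-matrix counts. In the table, `tn` and `fn` are the true and false negatives, and the counts are consistent with `Np = tp + fn` and `Nn = tn + fp`. The sketch below uses the standard textbook formulas; `binary_metrics` is a hypothetical helper, not the evaluation code used to produce the table. It is checked here against the MINI row for L = 1024.

```python
import math

def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Compute common binary classification metrics from confusion-matrix counts."""
    n_pos = tp + fn  # Np: number of positive samples
    n_neg = tn + fp  # Nn: number of negative samples
    precision = tp / (tp + fp)
    recall = tp / n_pos        # same as sensitivity
    specificity = tn / n_neg
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "acc": (tp + tn) / (n_pos + n_neg),
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
        "recall": recall,
        "specificity": specificity,
    }

# Counts from the MINI row with L = 1024
mini_1024 = binary_metrics(tn=4716, fp=258, fn=604, tp=4422)
for name, value in mini_1024.items():
    print(f"{name}: {value:.6f}")
```

The printed values agree with that row of the table to the reported precision (for example, acc 0.9138 and mcc 0.829645).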
139
+
140
+ ### Ethical Considerations and Limitations
141
+
142
+ Testing and evaluation have been conducted within specific genomic contexts, and the model's outputs in other scenarios are not guaranteed. Users should exercise caution and perform additional testing as necessary for their specific use cases.
143
+
144
+ ### Reporting Issues
145
+
146
+ Please report any issues with the model or its outputs to the Neural Bioinformatics Research Group through the following means:
147
+
148
+ - **Model issues:** [GitHub repository link](https://github.com/nbrg-ppcu/prokbert)
149
+ - **Feedback and inquiries:** [obalasz@gmail.com](mailto:obalasz@gmail.com)
150
+
151
+ ## Reference
152
+ If you use ProkBERT in your research, please cite the following paper:
153
+
154
+
155
+ ```
156
+ @ARTICLE{10.3389/fmicb.2023.1331233,
157
+ AUTHOR={Ligeti, Balázs and Szepesi-Nagy, István and Bodnár, Babett and Ligeti-Nagy, Noémi and Juhász, János},
158
+ TITLE={ProkBERT family: genomic language models for microbiome applications},
159
+ JOURNAL={Frontiers in Microbiology},
160
+ VOLUME={14},
161
+ YEAR={2024},
162
+ URL={https://www.frontiersin.org/articles/10.3389/fmicb.2023.1331233},
163
+ DOI={10.3389/fmicb.2023.1331233},
164
+ ISSN={1664-302X},
165
+ ABSTRACT={...}
166
+ }
167
+ ```