AmelieSchreiber
/

esm2_t12_35M_lora_binding_sites_v2_cp1

Token Classification

protein language model

AmelieSchreiber committed on Sep 15, 2023

Commit

cd1f4b4

·

1 Parent(s): a29fcb1

Update README.md

Files changed (1)

README.md +13 -10

@@ -22,7 +22,7 @@ pipeline_tag: token-classification
 # ESM-2 for Binding Site Prediction
-This model is a finetuned version of the 35M parameter `esm2_t12_35M_UR50D` ([see here](https://huggingface.co/facebook/esm2_t12_35M_UR50D)
 and [here](https://huggingface.co/docs/transformers/model_doc/esm) for more details). The model was finetuned with LoRA for
 the binary token classification task of predicting binding sites (and active sites) of protein sequences based on sequence alone.
 The model may be underfit and undertrained, however it still achieved better performance on the test set in terms of loss, accuracy,
@@ -38,15 +38,18 @@ This model was finetuned on ~549K protein sequences from the UniProt database. T
 the following test metrics:
 ```
-Test: (Epoch 1)
- {'Training Loss': 0.037400,
-  'Validation Loss': 0.301413,
-  'accuracy': 0.939431,
-  'precision': 0.366282,
-  'recall': 0.833003,
-  'f1': 0.508826,
-  'auc': 0.888300,
-  'mcc': 0.528311})
 ```
 The dataset size increase from ~209K protein sequences to ~549K clearly improved performance in terms of test metric.

 # ESM-2 for Binding Site Prediction
+**This model is overfit (see below).** This model is a finetuned version of the 35M parameter `esm2_t12_35M_UR50D` ([see here](https://huggingface.co/facebook/esm2_t12_35M_UR50D)
 and [here](https://huggingface.co/docs/transformers/model_doc/esm) for more details). The model was finetuned with LoRA for
 the binary token classification task of predicting binding sites (and active sites) of protein sequences based on sequence alone.
 The model may be underfit and undertrained, however it still achieved better performance on the test set in terms of loss, accuracy,
 the following test metrics:
 ```
+({'accuracy': 0.9905461579981686,
+  'precision': 0.7695765003685506,
+  'recall': 0.9841352974610041,
+  'f1': 0.8637307441810476,
+  'auc': 0.9874413786006525,
+  'mcc': 0.8658850560635515},
+ {'accuracy': 0.9394282959813123,
+  'precision': 0.3662722265170941,
+  'recall': 0.8330231316088238,
+  'f1': 0.5088208423175958,
+  'auc': 0.8883078682492643,
+  'mcc': 0.5283098562376193})
 ```
 The dataset size increase from ~209K protein sequences to ~549K clearly improved performance in terms of test metric.
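
The two metric dictionaries added in this commit can be sanity-checked with a few lines of Python. The sketch below recomputes F1 from the reported precision and recall and prints the per-metric gap between the two dictionaries. It assumes (the diff does not say so explicitly) that the first dictionary holds training-set metrics and the second holds test-set metrics; the names `train_metrics`, `test_metrics`, `f1_from`, and `metric_gaps` are illustrative and not part of the repository.

```python
# Values copied from the updated README. Labeling the first dict as "train"
# and the second as "test" is an assumption, not something the diff states.
train_metrics = {
    "accuracy": 0.9905461579981686,
    "precision": 0.7695765003685506,
    "recall": 0.9841352974610041,
    "f1": 0.8637307441810476,
    "auc": 0.9874413786006525,
    "mcc": 0.8658850560635515,
}
test_metrics = {
    "accuracy": 0.9394282959813123,
    "precision": 0.3662722265170941,
    "recall": 0.8330231316088238,
    "f1": 0.5088208423175958,
    "auc": 0.8883078682492643,
    "mcc": 0.5283098562376193,
}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def metric_gaps(first: dict, second: dict) -> dict:
    """Per-metric difference (first minus second) over the shared keys."""
    return {name: first[name] - second[name] for name in first if name in second}


if __name__ == "__main__":
    for label, metrics in (("train", train_metrics), ("test", test_metrics)):
        recomputed = f1_from(metrics["precision"], metrics["recall"])
        print(f"{label}: reported f1={metrics['f1']:.4f}, recomputed f1={recomputed:.4f}")
    for name, gap in metric_gaps(train_metrics, test_metrics).items():
        print(f"{name}: gap={gap:+.4f}")
```

Under that assumption, a large positive gap in precision, F1, and MCC between the first and second dictionary is the kind of pattern the "overfit" note in the updated README refers to.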