AmelieSchreiber committed on
Commit
d4048c3
1 Parent(s): 17b2397

Update README.md

Files changed (1)
  1. README.md +63 -2
README.md CHANGED
@@ -38,13 +38,13 @@ This model was finetuned on ~549K protein sequences from the UniProt database. T
 the following test metrics:
 
 ```python
- ({'accuracy': 0.9905461579981686,
+ Train: ({'accuracy': 0.9905461579981686,
  'precision': 0.7695765003685506,
  'recall': 0.9841352974610041,
  'f1': 0.8637307441810476,
  'auc': 0.9874413786006525,
  'mcc': 0.8658850560635515},
- {'accuracy': 0.9394282959813123,
+ Test: {'accuracy': 0.9394282959813123,
 'precision': 0.3662722265170941,
 'recall': 0.8330231316088238,
 'f1': 0.5088208423175958,
@@ -52,6 +52,67 @@ the following test metrics:
 'mcc': 0.5283098562376193})
  ```
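
For reference, here is a minimal sketch of how metrics like these could be computed with scikit-learn, assuming flattened binary labels and predicted positive-class probabilities; the `labels`/`probs` arrays and the 0.5 threshold are illustrative placeholders, not taken from the actual evaluation script:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    matthews_corrcoef,
)

def compute_metrics(labels: np.ndarray, probs: np.ndarray, threshold: float = 0.5) -> dict:
    """Compute the metrics reported above from flattened binary labels
    and predicted positive-class probabilities."""
    preds = (probs >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision_score(labels, preds),
        "recall": recall_score(labels, preds),
        "f1": f1_score(labels, preds),
        "auc": roc_auc_score(labels, probs),  # AUC uses the raw scores, not thresholded predictions
        "mcc": matthews_corrcoef(labels, preds),
    }
```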
 
+ To analyze the train and test metrics, we will consider each metric individually and then offer a comprehensive view of the
+ model's performance.
+
+ ### **1. Accuracy**
+ - **Train**: 99.05%
+ - **Test**: 93.94%
+
+ The accuracy is quite high on both the training and test sets, indicating that the model correctly identifies the positive
+ and negative classes most of the time.
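
In terms of the confusion-matrix counts (TP, TN, FP, FN), accuracy is simply the fraction of correct predictions, so it can stay high when positives are rare even if the positive class is handled poorly, which is why the remaining metrics are more informative here:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$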
+
+ ### **2. Precision**
+ - **Train**: 76.96%
+ - **Test**: 36.63%
+
+ Precision, which measures the proportion of true positive predictions among all positive predictions, drops significantly on
+ the test set. This suggests that the model produces many false positives when applied to unseen data.
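
In the same notation, precision is the fraction of predicted positives that are truly positive:

$$\text{precision} = \frac{TP}{TP + FP}$$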
+
+ ### **3. Recall**
+ - **Train**: 98.41%
+ - **Test**: 83.30%
+
+ The recall, which indicates the proportion of actual positives correctly identified, remains quite high in the test set, although
+ lower than in the training set. This suggests the model is quite sensitive and is able to identify most of the positive cases.
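
Recall (sensitivity) is the fraction of actual positives that are recovered:

$$\text{recall} = \frac{TP}{TP + FN}$$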
+
+ ### **4. F1-Score**
+ - **Train**: 86.37%
+ - **Test**: 50.88%
+
+ The F1-score is the harmonic mean of precision and recall. The significant drop in the F1-score from training to testing indicates
+ that the balance between precision and recall has worsened in the test set, which is primarily due to the lower precision.
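
Plugging the test-set precision and recall into the definition reproduces the reported value:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2 \cdot 0.3663 \cdot 0.8330}{0.3663 + 0.8330} \approx 0.5088$$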
+
+ ### **5. AUC (Area Under the ROC Curve)**
+ - **Train**: 98.74%
+ - **Test**: 88.83%
+
+ The AUC is high in both training and testing, but it decreases on the test set. A high AUC indicates that the model has a good measure
+ of separability and is able to distinguish between the positive and negative classes well.
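
Equivalently, for continuous scores, the AUC can be read as the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative one:

$$\text{AUC} = \Pr\left(s^{+} > s^{-}\right)$$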
+
+ ### **6. MCC (Matthews Correlation Coefficient)**
+ - **Train**: 86.59%
+ - **Test**: 52.83%
+
+ MCC is a balanced metric that considers true and false positives and negatives. The decline in MCC from training to testing
+ indicates a decrease in the quality of binary classifications.
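
For reference, MCC is computed from all four confusion-matrix counts, so it penalizes both kinds of errors even under class imbalance:

$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$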
+
+ ### **Overall Analysis**
+
+ - **Overfitting**: The significant drop in metrics such as precision, F1-score, and MCC from the training set to the test set suggests that the model may be overfitting to the training data, i.e., it may not generalize well to unseen data.
+
+ - **High Recall, Low Precision**: The model has high recall but low precision on the test set, indicating that it labels too many cases as positive, including many that are actually negative (false positives). This may reflect a model that is biased towards predicting the positive class.
+
+ - **Improvement Suggestions**:
+   - **Data Augmentation**: Consider data augmentation strategies to make the model more robust.
+   - **Class Weights**: If there is a class imbalance in the dataset, adjusting the class weights during training might help (see the sketch after this list).
+   - **Hyperparameter Tuning**: Experiment with different hyperparameters, including the learning rate and batch size, to see if the model's performance on the test set improves.
+   - **Feature Engineering**: Consider revisiting the features used to train the model. Introducing new features or removing irrelevant ones can sometimes improve performance.
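
To illustrate the class-weighting suggestion above, here is a minimal sketch of a weighted cross-entropy loss in PyTorch; the class counts, weighting scheme, and tensor shapes are illustrative placeholders rather than values from this project:

```python
import torch
import torch.nn as nn

# Hypothetical counts for an imbalanced binary classification task.
num_negative, num_positive = 900_000, 100_000
total = num_negative + num_positive

# Inverse-frequency weights: the rarer positive class gets a larger weight.
class_weights = torch.tensor(
    [total / (2 * num_negative), total / (2 * num_positive)], dtype=torch.float
)

# ignore_index can be used to mask out padding/special-token positions.
loss_fn = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)

# logits: (num_examples, 2), labels: (num_examples,)
logits = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
loss = loss_fn(logits, labels)
```

In a Hugging Face `Trainer` setup, one common way to apply such a loss is to override `Trainer.compute_loss`.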
+
+ In conclusion, while the model performs excellently on the training set, its performance drops on the test set, suggesting that there
+ is room for improvement in making the model more generalizable to unseen data. It would be beneficial to look into strategies that reduce
+ overfitting and improve precision without significantly sacrificing recall.
+
 The dataset size increase from ~209K protein sequences to ~549K clearly improved performance in terms of the test metrics.
 We used Hugging Face's parameter-efficient finetuning (PEFT) library to finetune with Low Rank Adaptation (LoRA). We decided
 to use a rank of 2 for the LoRA, as this was shown to slightly improve the test metrics compared to rank 8 and rank 16 on the