wgetdd committed on
Commit 5bfda01
1 Parent(s): a451ab8

Update README.md

Files changed (1): README.md (+76 −77)
README.md CHANGED
@@ -16,14 +16,20 @@ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-
 
  # Training Procedure
 
- 1. The model is trained on Tesla T4 (15GB GPU memory)
- 2. The training is completed in two phases
- 3. The first phase contains 20 epochs and second phase contains another 20 epochs
- 4. In the first training we see loss dropping correctly but in the second training it drops less
- 5. We run our two training loops separately and do not run any kind of validation on them, except for validation loss
- 6. Later we evaluate the model and get the numbers
- 7. The lightning generally saves the model as .ckpt format, so we convert it to torch format by saving state dict as .pt format
- 8. For doing this we use these two lines of code
 
  ```
  best_model = torch.load(weights_path)
@@ -35,12 +41,14 @@ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-
  ```
 
 
- 8. The model starts overfitting on the dataset after 30 epochs
- 9. Future Improvements
  1. Train the model in 1 shot instead of two different phases
  2. Keep a better batch size (Basically earn more money and buy a good GPU)
  3. Data transformation also plays a vital role here
  4. OneCycle LR range needs to be appropriately modified for a better LR
 
  # Data Transformation
 
@@ -51,83 +59,74 @@ Along with the transforms mentioned in the [config file](https://github.com/deep
  # Accuracy Report
 
  ```
- Class accuracy is: 82.999725%
- No obj accuracy is: 96.828300%
- Obj accuracy is: 76.898473%
 
- MAP: 0.29939851760864258
 
  ```
 
- # [Training Logs](https://github.com/deepanshudashora/ERAV1/blob/master/session13/lightning_version/merged_logs.csv)
 
  #### For faster execution we run validation only once, after epoch 20, during the first 20 epochs of training, and then every 5 epochs until epoch 40
 
  ```
- Unnamed: 0 lr-Adam step train_loss epoch val_loss
- 6576 6576 NaN 164299 4.186745 39.0 NaN
- 6577 6577 0.000132 164349 NaN NaN NaN
- 6578 6578 NaN 164349 2.936086 39.0 NaN
- 6579 6579 0.000132 164399 NaN NaN NaN
- 6580 6580 NaN 164399 4.777130 39.0 NaN
- 6581 6581 0.000132 164449 NaN NaN NaN
- 6582 6582 NaN 164449 3.139145 39.0 NaN
- 6583 6583 0.000132 164499 NaN NaN NaN
- 6584 6584 NaN 164499 4.596097 39.0 NaN
- 6585 6585 0.000132 164549 NaN NaN NaN
- 6586 6586 NaN 164549 5.587294 39.0 NaN
- 6587 6587 0.000132 164599 NaN NaN NaN
- 6588 6588 NaN 164599 4.592830 39.0 NaN
- 6589 6589 0.000132 164649 NaN NaN NaN
- 6590 6590 NaN 164649 3.914468 39.0 NaN
- 6591 6591 0.000132 164699 NaN NaN NaN
- 6592 6592 NaN 164699 3.180615 39.0 NaN
- 6593 6593 0.000132 164749 NaN NaN NaN
- 6594 6594 NaN 164749 5.772174 39.0 NaN
- 6595 6595 0.000132 164799 NaN NaN NaN
- 6596 6596 NaN 164799 2.894014 39.0 NaN
- 6597 6597 0.000132 164849 NaN NaN NaN
- 6598 6598 NaN 164849 4.473828 39.0 NaN
- 6599 6599 0.000132 164899 NaN NaN NaN
- 6600 6600 NaN 164899 6.397766 39.0 NaN
- 6601 6601 0.000132 164949 NaN NaN NaN
- 6602 6602 NaN 164949 3.789242 39.0 NaN
- 6603 6603 0.000132 164999 NaN NaN NaN
- 6604 6604 NaN 164999 5.182691 39.0 NaN
- 6605 6605 0.000132 165049 NaN NaN NaN
- 6606 6606 NaN 165049 4.845749 39.0 NaN
- 6607 6607 0.000132 165099 NaN NaN NaN
- 6608 6608 NaN 165099 3.672542 39.0 NaN
- 6609 6609 0.000132 165149 NaN NaN NaN
- 6610 6610 NaN 165149 4.230726 39.0 NaN
- 6611 6611 0.000132 165199 NaN NaN NaN
- 6612 6612 NaN 165199 4.625024 39.0 NaN
- 6613 6613 0.000132 165249 NaN NaN NaN
- 6614 6614 NaN 165249 4.549682 39.0 NaN
- 6615 6615 0.000132 165299 NaN NaN NaN
- 6616 6616 NaN 165299 4.040627 39.0 NaN
- 6617 6617 0.000132 165349 NaN NaN NaN
- 6618 6618 NaN 165349 4.857126 39.0 NaN
- 6619 6619 0.000132 165399 NaN NaN NaN
- 6620 6620 NaN 165399 3.081895 39.0 NaN
- 6621 6621 0.000132 165449 NaN NaN NaN
- 6622 6622 NaN 165449 3.945353 39.0 NaN
- 6623 6623 0.000132 165499 NaN NaN NaN
- 6624 6624 NaN 165499 3.203420 39.0 NaN
- 6625 6625 NaN 165519 NaN 39.0 3.081895
-
-
 
  ```
 
  # Results
 
- ## For epochs 0 to 19
- ![train_logs_1.png](images/train_logs_1.png)
-
- ## From 19 to 20
- ![train_logs_2.png](images/train_logs_2.png)
-
- ## Full training logs for loss
-
- ![full_training.png](images/full_training.png)
 
 
  # Training Procedure
 
+ #### [Experiment 1](https://github.com/deepanshudashora/ERAV1/tree/master/session13/lightning_version/Experiments)
+ 1. The model is trained on a Tesla T4 (15 GB GPU memory)
+ 2. The training is completed in two phases
+ 3. The first phase contains 20 epochs and the second phase contains another 20 epochs
+ 4. In the first phase the loss drops steadily, but in the second phase it drops much more slowly
+ 5. We run the two training loops separately and do not run any evaluation during training, apart from computing the validation loss
+
+ #### [Experiment 2](https://github.com/deepanshudashora/ERAV1/tree/master/session13/lightning_version)
+ 1. The model is trained on 2 Tesla T4 GPUs, with distributed training via PyTorch Lightning
+ 2. For distributed training we use the ```ddp_notebook_find_unused_parameters_true``` strategy
+
+ * Afterwards we evaluate the model and report the numbers
+ * Lightning saves checkpoints in .ckpt format, so we convert to plain torch format by saving the state dict as a .pt file
+ * For this we use these two lines of code
 
  ```
  best_model = torch.load(weights_path)
 
  ```
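
The snippet above is truncated by the diff. A self-contained sketch of the same .ckpt → .pt conversion, with a toy `nn.Linear` standing in for the repository's actual network and temporary file paths in place of the real ones:

```python
import os
import tempfile

import torch
import torch.nn as nn

# Toy stand-in for the trained model; the real architecture lives in the repo
model = nn.Linear(4, 2)

tmp = tempfile.mkdtemp()
ckpt_path = os.path.join(tmp, "best_model.ckpt")
pt_path = os.path.join(tmp, "model.pt")

# A Lightning .ckpt is a dict that wraps the weights under the "state_dict"
# key, alongside training metadata (epoch, optimizer state, ...)
torch.save({"state_dict": model.state_dict(), "epoch": 39}, ckpt_path)

# The two-line conversion: load the checkpoint, re-save only the state dict
best_model = torch.load(ckpt_path)
torch.save(best_model["state_dict"], pt_path)

# The .pt file now loads with plain torch, no Lightning required
weights = torch.load(pt_path)
```

The resulting .pt file can be fed straight to `model.load_state_dict(weights)` on a freshly constructed model.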
 
 
+ * The model starts overfitting on the dataset after 30 epochs
+ * Future Improvements
  1. Train the model in 1 shot instead of two different phases
  2. Keep a better batch size (Basically earn more money and buy a good GPU)
  3. Data transformation also plays a vital role here
  4. OneCycle LR range needs to be appropriately modified for a better LR
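
The last improvement can be prototyped directly with torch's built-in scheduler. A minimal sketch of sweeping a OneCycle LR range; `max_lr`, `total_steps`, and `pct_start` here are illustrative values, not the repository's config:

```python
import torch

# Toy parameter; the real model/optimizer come from the training code
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.Adam(params, lr=1e-5)  # lr is overridden by the scheduler
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, total_steps=100, pct_start=0.3
)

lrs = []
for _ in range(100):
    optimizer.step()   # normally preceded by forward/backward on a batch
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])

# The LR warms up to max_lr around step pct_start * total_steps,
# then anneals far below the starting LR by the final step
```

Plotting `lrs` for a few candidate `max_lr` values is a cheap way to pick the range before committing to a full 40-epoch run.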
50
+
51
+
52
 
53
  # Data Transformation
54
 
 
  # Accuracy Report
 
  ```
+ Class accuracy is: 85.015236%
+ No obj accuracy is: 98.522491%
+ Obj accuracy is: 65.760597%
 
+ MAP: 0.4661380648612976
 
  ```
 
+ # [Training Logs](https://github.com/deepanshudashora/ERAV1/blob/master/session13/lightning_version/training_logs/csv_training_logs/lightning_logs/version_0/metrics.csv)
 
 
  #### For faster execution we run validation only once, after epoch 20, during the first 20 epochs of training, and then every 5 epochs until epoch 40
 
  ```
+ lr-Adam step train_loss epoch val_loss
+ 786 NaN 19499 4.653981 37.0 NaN
+ 787 0.000160 19549 NaN NaN NaN
+ 788 NaN 19549 4.864988 37.0 NaN
+ 789 0.000160 19599 NaN NaN NaN
+ 790 NaN 19599 5.241925 37.0 NaN
+ 791 0.000160 19649 NaN NaN NaN
+ 792 NaN 19649 5.020171 37.0 NaN
+ 793 0.000161 19699 NaN NaN NaN
+ 794 NaN 19699 4.245292 38.0 NaN
+ 795 0.000161 19749 NaN NaN NaN
+ 796 NaN 19749 4.541957 38.0 NaN
+ 797 0.000161 19799 NaN NaN NaN
+ 798 NaN 19799 3.837740 38.0 NaN
+ 799 0.000161 19849 NaN NaN NaN
+ 800 NaN 19849 4.239679 38.0 NaN
+ 801 0.000161 19899 NaN NaN NaN
+ 802 NaN 19899 4.034101 38.0 NaN
+ 803 0.000161 19949 NaN NaN NaN
+ 804 NaN 19949 5.010788 38.0 NaN
+ 805 0.000161 19999 NaN NaN NaN
+ 806 NaN 19999 3.980245 38.0 NaN
+ 807 0.000161 20049 NaN NaN NaN
+ 808 NaN 20049 4.641729 38.0 NaN
+ 809 0.000161 20099 NaN NaN NaN
+ 810 NaN 20099 4.563717 38.0 NaN
+ 811 0.000161 20149 NaN NaN NaN
+ 812 NaN 20149 4.422552 38.0 NaN
+ 813 0.000161 20199 NaN NaN NaN
+ 814 NaN 20199 4.925357 38.0 NaN
+ 815 0.000161 20249 NaN NaN NaN
+ 816 NaN 20249 4.788391 39.0 NaN
+ 817 0.000161 20299 NaN NaN NaN
+ 818 NaN 20299 4.478580 39.0 NaN
+ 819 0.000161 20349 NaN NaN NaN
+ 820 NaN 20349 4.624731 39.0 NaN
+ 821 0.000161 20399 NaN NaN NaN
+ 822 NaN 20399 4.425498 39.0 NaN
+ 823 0.000161 20449 NaN NaN NaN
+ 824 NaN 20449 4.361921 39.0 NaN
+ 825 0.000161 20499 NaN NaN NaN
+ 826 NaN 20499 4.318252 39.0 NaN
+ 827 0.000161 20549 NaN NaN NaN
+ 828 NaN 20549 4.013813 39.0 NaN
+ 829 0.000161 20599 NaN NaN NaN
+ 830 NaN 20599 4.476331 39.0 NaN
+ 831 0.000161 20649 NaN NaN NaN
+ 832 NaN 20649 4.192605 39.0 NaN
+ 833 0.000161 20699 NaN NaN NaN
+ 834 NaN 20699 4.065756 39.0 NaN
+ 835 NaN 20719 NaN 39.0 4.348697
 
  ```
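
Lightning's CSV logger writes one row per logged event, which is why most cells above are NaN. A small sketch of recovering a single metric from such a log; the inline sample mimics the excerpt above, and in practice you would `pd.read_csv` the metrics.csv path linked earlier:

```python
import io

import pandas as pd

# A few rows in the same shape as the metrics.csv excerpt above
sample = io.StringIO(
    "lr-Adam,step,train_loss,epoch,val_loss\n"
    ",19499,4.653981,37.0,\n"
    "0.000160,19549,,,\n"
    ",20719,,39.0,4.348697\n"
)
# In practice: pd.read_csv(".../lightning_logs/version_0/metrics.csv")
df = pd.read_csv(sample)

# Each metric lives in its own rows; drop the NaNs to isolate it
val_loss = df.dropna(subset=["val_loss"])[["epoch", "val_loss"]]
```

The same `dropna` pattern isolates `train_loss` or `lr-Adam` for plotting the curves shown below.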
 
  # Results
 
+ ## Loss Curve
+ ![accuracy_curve.png](images/accuracy_curve.png)