Figure 2: Training loss closeup. We mark the two hot-swap points, where the training corpus was changed.
In Figure 2, we perform two ablations:

- (a) After the first hot swap, we continued training on corpus #1 for a while. Result: the test loss is slightly better, which signifies the slight difference between the distributions of corpus #1 and corpus #2.

- (b) At step 94,000, the training loss stopped decreasing, increased, and around step 120,000 (near hot swap #2) started decreasing again. To ablate whether this was an effect of the hot swap, we resumed training from step 93,000 using corpus #3, with the optimizer states reinitialized (see the sketch after this list). Result: neither corpus #3 nor optimizer-state reinitialization mitigates the local divergence at step 94,000.
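
The actual training script is not shown in this README, so the following is a minimal PyTorch sketch of the ablation (b) setup. Everything here is a hypothetical placeholder rather than this repository's real API: the model class `TinyLM`, the loader `load_corpus`, the checkpoint path `checkpoints/step_93000.pt`, the checkpoint keys, and all hyperparameters.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


class TinyLM(nn.Module):
    """Hypothetical stand-in for the actual model architecture."""

    def __init__(self, vocab_size: int = 256, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):
        return self.head(self.embed(x))


def load_corpus(name: str) -> TensorDataset:
    """Hypothetical corpus loader; returns random next-token pairs."""
    tokens = torch.randint(0, 256, (1024, 129))
    return TensorDataset(tokens[:, :-1], tokens[:, 1:])


# Restore the model weights saved at step 93,000 ...
ckpt = torch.load("checkpoints/step_93000.pt", map_location="cpu")
model = TinyLM()
model.load_state_dict(ckpt["model"])

# ... but deliberately do NOT load ckpt["optimizer"]: constructing a fresh
# AdamW restarts Adam's running moment estimates from zero.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Hot swap the data: continue training on corpus #3 instead of corpus #2.
loader = DataLoader(load_corpus("corpus3"), batch_size=32, shuffle=True)

loss_fn = nn.CrossEntropyLoss()
for step, (x, y) in enumerate(loader, start=ckpt["step"] + 1):
    logits = model(x)                                   # (batch, seq, vocab)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The point of constructing a fresh `AdamW` is that the weights continue from step 93,000 while the optimizer's moment estimates do not, which is exactly the "optimizer states were reinitialized" condition of ablation (b).
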
<img src="figures/vloss_closeup.png" width="900"/>