Update README.md
README.md CHANGED
@@ -68,7 +68,7 @@ Consider the following chat interaction:
The model must predict the bolded parts. So, we randomly mask tokens from the bolded parts, and run the model once on the masked sequence and once on the full sequence.

-We then compute a distance loss `D(p_masked, p_full)` between the two predictions. This approach resembles self-distillation, and MSE tends to perform better than KL Divergence for distillation, along with being easier to tune, so I went with MSE.
+We then compute a distance loss `D(p_masked, p_full)` between the two predictions. This approach resembles self-distillation, and MSE tends to perform better than KL Divergence for distillation, along with being easier to tune, so I went with MSE (note that R-TeaFor uses a mix of reverse and forward KL divergence).

Finally, we add this loss to the standard cross-entropy language modeling losses from each prediction, with a weighting value:

```
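As a rough illustration of the objective described in the changed paragraphs (two cross-entropy losses plus a weighted MSE distance between the two predictions), a PyTorch sketch might look like the following. The function name `masked_consistency_loss`, the `alpha` weighting parameter, and the choice to compare softmax probabilities are assumptions for illustration, not code from the repository:

```python
import torch
import torch.nn.functional as F

def masked_consistency_loss(logits_masked, logits_full, labels, alpha=1.0):
    # Hypothetical sketch: names and the alpha default are assumptions,
    # not taken from the actual repository.
    vocab = logits_full.size(-1)

    # Standard cross-entropy LM loss from each of the two forward passes
    ce_masked = F.cross_entropy(logits_masked.view(-1, vocab),
                                labels.view(-1), ignore_index=-100)
    ce_full = F.cross_entropy(logits_full.view(-1, vocab),
                              labels.view(-1), ignore_index=-100)

    # Distance loss D(p_masked, p_full): MSE between the two predicted distributions
    p_masked = F.softmax(logits_masked, dim=-1)
    p_full = F.softmax(logits_full, dim=-1)
    distance = F.mse_loss(p_masked, p_full)

    # Combine with a weighting value on the distance term
    return ce_masked + ce_full + alpha * distance
```

Here `logits_masked` would come from a forward pass on a copy of the input in which a random subset of the response ("bolded") tokens has been masked, and `logits_full` from a pass on the unmodified sequence.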