Update README.md
README.md CHANGED
@@ -68,7 +68,7 @@ Consider the following chat interaction:
The model must predict the bolded parts. So, we randomly mask tokens from the bolded parts, and run the model once on the masked sequence and once on the full sequence.

-We then compute a distance loss `D(p_masked, p_full)` between the two predictions. This approach resembles self-distillation, and MSE tends to perform better than KL Divergence for distillation, along with being easier to tune, so I went with MSE.
+We then compute a distance loss `D(p_masked, p_full)` between the two predictions. This approach resembles self-distillation, and MSE tends to perform better than KL Divergence for distillation, along with being easier to tune, so I went with MSE (note that R-TeaFor uses a mix of reverse and forward KL divergence).

Finally, we add this loss to the standard cross-entropy language modeling losses from each prediction, with a weighting value:

```
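As a rough illustration of the objective described in the changed paragraphs (two cross-entropy losses plus a weighted MSE distance between the two predictions), a PyTorch sketch might look like the following. The function name `masked_consistency_loss`, the `alpha` weighting parameter, and the choice to compare softmax probabilities are assumptions for illustration, not code from the repository:

```python
import torch
import torch.nn.functional as F

def masked_consistency_loss(logits_masked, logits_full, labels, alpha=1.0):
    # Hypothetical sketch: names and the alpha default are assumptions,
    # not taken from the actual repository.
    vocab = logits_full.size(-1)

    # Standard cross-entropy LM loss from each of the two forward passes
    ce_masked = F.cross_entropy(logits_masked.view(-1, vocab),
                                labels.view(-1), ignore_index=-100)
    ce_full = F.cross_entropy(logits_full.view(-1, vocab),
                              labels.view(-1), ignore_index=-100)

    # Distance loss D(p_masked, p_full): MSE between the two predicted distributions
    p_masked = F.softmax(logits_masked, dim=-1)
    p_full = F.softmax(logits_full, dim=-1)
    distance = F.mse_loss(p_masked, p_full)

    # Combine with a weighting value on the distance term
    return ce_masked + ce_full + alpha * distance
```

Here `logits_masked` would come from a forward pass on a copy of the input in which a random subset of the response ("bolded") tokens has been masked, and `logits_full` from a pass on the unmodified sequence.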