euclaise committed (verified)
Commit febd59e · 1 Parent(s): c404867

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -68,7 +68,7 @@ Consider the following chat interaction:
 
 The model must predict the bolded parts. So, we randomly mask tokens from the bolded parts, and run the model once on the masked sequence and once on the full sequence.
 
- We then compute a distance loss `D(p_masked, p_full)` between the two predictions. This approach resembles self-distillation, and MSE tends to perform better than KL Divergence for distillation, along with being easier to tune, so I went with MSE.
+ We then compute a distance loss `D(p_masked, p_full)` between the two predictions. This approach resembles self-distillation, and MSE tends to perform better than KL Divergence for distillation, along with being easier to tune, so I went with MSE (note that R-TeaFor uses a mix of reverse and forward KL divergence).
 
 Finally, we add this loss to the standard cross-entropy language modeling losses from each prediction, with a weighting value:
 ```
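
For readers landing on this commit, here is a minimal PyTorch sketch of the objective the changed paragraph describes. Everything below is an illustrative assumption rather than the repo's training code: the function name, the `alpha` weighting value, `mask_prob`, `mask_token_id`, the HF-style `model(...).logits` interface, and the choice to compare softmax probabilities rather than raw logits.

```python
import torch
import torch.nn.functional as F

def masked_consistency_loss(model, input_ids, labels, mask_token_id,
                            alpha=0.5, mask_prob=0.15):
    """Hypothetical sketch: two cross-entropy losses plus an MSE distance
    between the masked-sequence and full-sequence predictions."""
    # Run once on the full sequence.
    full_logits = model(input_ids).logits

    # Randomly mask tokens, but only in positions the model must predict
    # (following the HF convention that labels == -100 marks ignored
    # prompt/context tokens; the "bolded parts" are the rest).
    predictable = labels != -100
    rand = torch.rand(input_ids.shape, device=input_ids.device)
    mask = predictable & (rand < mask_prob)
    masked_ids = input_ids.masked_fill(mask, mask_token_id)

    # Run once on the masked sequence.
    masked_logits = model(masked_ids).logits

    # Distance loss D(p_masked, p_full): MSE between the two predicted
    # distributions (an assumed concretization of "distance loss").
    dist = F.mse_loss(masked_logits.softmax(dim=-1),
                      full_logits.softmax(dim=-1))

    # Standard cross-entropy LM loss from each prediction, with the usual
    # causal shift (predict token t+1 from position t).
    vocab = full_logits.size(-1)
    ce_full = F.cross_entropy(full_logits[:, :-1].reshape(-1, vocab),
                              labels[:, 1:].reshape(-1), ignore_index=-100)
    ce_masked = F.cross_entropy(masked_logits[:, :-1].reshape(-1, vocab),
                                labels[:, 1:].reshape(-1), ignore_index=-100)

    # Add the distance loss to both CE losses with a weighting value.
    return ce_full + ce_masked + alpha * dist
```

Whether to stop gradients on the full-sequence prediction (treating it as a fixed teacher, as some self-distillation setups do) is left open by the README text; the sketch above trains through both forward passes.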