Update README.md
README.md
CHANGED
@@ -21,6 +21,7 @@ The training for the distilled model (student model) is designed to be the close
 * CosineLoss: and finally a cosine embedding loss. This loss function is applied on the last hidden layers of student and teacher models to guarantee a collinearity between them.
 
 The final loss function is a combination of these three loss functions. We use the following ponderation:
+
 *Loss = 0.5 DistilLoss + 0.2 MLMLoss + 0.3 CosineLoss*
 
 Dataset
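
For readers of this hunk, the weighting in the README formula can be sketched in plain Python. This is only an illustration, not the project's training code: the names `WEIGHTS`, `cosine_loss`, and `combined_loss` are hypothetical, and the cosine term is assumed to take the form `1 - cos(student, teacher)`, which is how a cosine embedding loss behaves when the two inputs are meant to be similar.

```python
import math

# Weights copied from the README formula:
# Loss = 0.5 DistilLoss + 0.2 MLMLoss + 0.3 CosineLoss
WEIGHTS = {"distil": 0.5, "mlm": 0.2, "cosine": 0.3}


def cosine_loss(student_vec, teacher_vec):
    """Return 1 - cos(student, teacher) for two non-zero vectors.

    Assumed form of the cosine term: it is 0 when the vectors are
    collinear and point the same way, and grows as they diverge.
    """
    dot = sum(s * t for s, t in zip(student_vec, teacher_vec))
    norm_s = math.sqrt(sum(s * s for s in student_vec))
    norm_t = math.sqrt(sum(t * t for t in teacher_vec))
    return 1.0 - dot / (norm_s * norm_t)


def combined_loss(distil_loss, mlm_loss, cos_loss):
    """Weighted sum of the three loss terms, using the README weights."""
    return (
        WEIGHTS["distil"] * distil_loss
        + WEIGHTS["mlm"] * mlm_loss
        + WEIGHTS["cosine"] * cos_loss
    )


if __name__ == "__main__":
    # Hypothetical hidden states: the teacher is a scaled copy of the student,
    # so the cosine term vanishes.
    cos_term = cosine_loss([1.0, 2.0], [2.0, 4.0])
    print(combined_loss(distil_loss=2.0, mlm_loss=1.0, cos_loss=cos_term))
```

Because the three weights sum to 1, the combined value stays on the same scale as the individual terms, assuming those terms are themselves comparable in magnitude.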