Commit f140f84, committed by Cyrile
1 Parent(s): 3141ee4

Update README.md

  1. README.md +5 -5
README.md CHANGED
@@ -17,12 +17,12 @@ Loss function
 
  The distilled model (student model) is trained to stay as close as possible to the original model (teacher model). To achieve this, the loss function is composed of three parts:
  * DistilLoss: a distillation loss that measures the similarity between the output probabilities of the student and teacher models, using a cross-entropy loss on the MLM task;
- * MLMLoss: a Masked Language Modeling (MLM) task loss that trains the student model on the original task of the teacher model;
- * CosineLoss: and finally a cosine embedding loss, applied to the last hidden layers of the student and teacher models to encourage collinearity between them.
 
  The final loss function is a combination of these three loss functions, weighted as follows:
 
- $$Loss = 0.5 \times DistilLoss + 0.2 \times MLMLoss + 0.3 \times CosineLoss$$
 
  Dataset
  -------
@@ -41,8 +41,8 @@ Evaluation results
  | :----------: | :------: |
  | [FLUE](https://huggingface.co/datasets/flue) CLS | 83% |
  | [FLUE](https://huggingface.co/datasets/flue) PAWS-X | 77% |
- | [FLUE](https://huggingface.co/datasets/flue) XNLI | 68% |
- | [wikiner_fr](https://huggingface.co/datasets/Jean-Baptiste/wikiner_fr) NER | 92% |
 
  How to use DistilCamemBERT
  --------------------------
 
 
  The distilled model (student model) is trained to stay as close as possible to the original model (teacher model). To achieve this, the loss function is composed of three parts:
  * DistilLoss: a distillation loss that measures the similarity between the output probabilities of the student and teacher models, using a cross-entropy loss on the MLM task;
+ * CosineLoss: a cosine embedding loss, applied to the last hidden layers of the student and teacher models to encourage collinearity between them;
+ * MLMLoss: and finally a Masked Language Modeling (MLM) task loss that trains the student model on the original task of the teacher model.
 
  The final loss function is a combination of these three loss functions, weighted as follows:
 
+ $$Loss = 0.5 \times DistilLoss + 0.3 \times CosineLoss + 0.2 \times MLMLoss$$
 
  Dataset
  -------
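
Reviewer note on the hunk above: the weighted combination of the three losses can be illustrated with a small, dependency-free sketch. This is not the training code behind the model. The function names, the single-token toy inputs, and the use of `1 - cos` as the cosine embedding loss for a positive pair are assumptions made for illustration. Only the weights 0.5, 0.3 and 0.2 come from the README.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def distil_loss(student_logits, teacher_logits):
    """Cross-entropy of the student distribution against the teacher's (assumed form)."""
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits)]
    student_log_probs = log_softmax(student_logits)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))


def cosine_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity: zero when both vectors point in the same direction."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def mlm_loss(student_logits, target_index):
    """Negative log-likelihood of the true masked token under the student."""
    return -log_softmax(student_logits)[target_index]


def total_loss(student_logits, teacher_logits,
               student_hidden, teacher_hidden, target_index):
    """Weighted sum using the weights stated in the README."""
    return (0.5 * distil_loss(student_logits, teacher_logits)
            + 0.3 * cosine_loss(student_hidden, teacher_hidden)
            + 0.2 * mlm_loss(student_logits, target_index))


if __name__ == "__main__":
    # Toy values for one masked position over a 3-token vocabulary (hypothetical).
    print(total_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.8],
                     [0.2, 0.9, -0.4], [0.25, 0.8, -0.5], target_index=0))
```

In a real training loop these quantities would be computed over batches of tensors, and the distillation term is often softened with a temperature. Neither detail is stated in the README, so both are left out here.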
 
  | :----------: | :------: |
  | [FLUE](https://huggingface.co/datasets/flue) CLS | 83% |
  | [FLUE](https://huggingface.co/datasets/flue) PAWS-X | 77% |
+ | [FLUE](https://huggingface.co/datasets/flue) XNLI | 77% |
+ | [wikiner_fr](https://huggingface.co/datasets/Jean-Baptiste/wikiner_fr) NER | 98% |
 
  How to use DistilCamemBERT
  --------------------------