Cyrile committed on
Commit 043ba77
1 Parent(s): bc2328d

Upload README.md

Files changed (1)
  1. README.md +67 -0
README.md ADDED
---
language: fr
license: mit
datasets:
- oscar
---

DistilCamemBERT
===============

We present DistilCamemBERT, a distilled version of the well-known [CamemBERT](https://huggingface.co/camembert-base), a French RoBERTa model. The aim of distillation is to drastically reduce the complexity of the model while preserving its performance. The proof of concept is shown in the DistilBERT paper, and the training code is inspired by the [DistilBERT](https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation) code.

Loss function
-------------

The training of the distilled (student) model is designed to be as close as possible to that of the original (teacher) model. To achieve this, the loss function is composed of 3 parts:
* DistilLoss: a distillation loss that measures the similarity between the student and teacher output probabilities, using a cross-entropy loss on the MLM task;
* MLMLoss: a Masked Language Modeling (MLM) loss that trains the student model on the original task of the teacher model;
* CosineLoss: finally, a cosine embedding loss. This loss is applied to the last hidden layers of the student and teacher models to guarantee their collinearity.

The final loss function is a combination of these three loss functions. We use the following weighting:

Loss = 0.5 DistilLoss + 0.2 MLMLoss + 0.3 CosineLoss

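For illustration, here is a minimal PyTorch sketch of such a combined objective (the function and argument names are ours, not taken from the training code; the DistilBERT reference implementation additionally uses a softmax temperature and restricts the distillation term to selected token positions, which is omitted here):

```python
import torch
import torch.nn.functional as F

def combined_loss(student_logits, teacher_logits, mlm_labels,
                  student_hidden, teacher_hidden):
    """Combine the three terms with the 0.5 / 0.2 / 0.3 weighting above (sketch)."""
    # DistilLoss: make the student vocabulary distribution match the teacher's
    # (soft cross-entropy, written here as a KL divergence on the softmax outputs).
    distil_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )

    # MLMLoss: standard masked language modeling loss of the student on the
    # teacher's original task (non-masked positions carry the label -100).
    mlm_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,
    )

    # CosineLoss: push the last hidden states of student and teacher to be collinear.
    s = student_hidden.view(-1, student_hidden.size(-1))
    t = teacher_hidden.view(-1, teacher_hidden.size(-1))
    cosine_loss = F.cosine_embedding_loss(s, t, torch.ones(s.size(0), device=s.device))

    return 0.5 * distil_loss + 0.2 * mlm_loss + 0.3 * cosine_loss
```
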
Dataset
-------

To limit the bias between the student and teacher models, the dataset used for the DistilCamemBERT training is the same as the one used to train camembert-base: OSCAR. The French part of this dataset represents approximately 140 GB on disk.

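As a pointer, the French part of OSCAR can be streamed from the Hugging Face Hub with the `datasets` library; the configuration name below is a plausible choice, not necessarily the exact subset used for training:

```python
from datasets import load_dataset

# Stream the French split of OSCAR instead of downloading ~140 GB to disk.
oscar_fr = load_dataset("oscar", "unshuffled_deduplicated_fr",
                        split="train", streaming=True)

# Peek at a few documents.
for i, example in enumerate(oscar_fr):
    print(example["text"][:100])
    if i == 2:
        break
```
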
Training
--------

We pre-trained the model on an NVIDIA Titan RTX for 18 days.

Evaluation results
------------------

| Dataset name | f1-score |
| :----------: | :------: |
| [FLUE](https://huggingface.co/datasets/flue) CLS | 83% |
| [FLUE](https://huggingface.co/datasets/flue) PAWS-X | 77% |
| [FLUE](https://huggingface.co/datasets/flue) XNLI | 64% |
| [wikiner_fr](https://huggingface.co/datasets/Jean-Baptiste/wikiner_fr) NER | 92% |

How to use DistilCamemBERT
--------------------------

Load DistilCamemBERT and its sub-word tokenizer:
```python
from transformers import CamembertModel, CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("Arkea/distilcamembert-base")
model = CamembertModel.from_pretrained("Arkea/distilcamembert-base")
model.eval()
```

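For example, once the model and tokenizer are loaded, a sentence embedding can be obtained with a simple forward pass; the mean pooling below is an illustrative choice rather than part of the original model card:

```python
import torch

# Tokenize a French sentence and run it through DistilCamemBERT.
inputs = tokenizer("J'aime le camembert !", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Average the last hidden states over the token dimension as a simple sentence embedding.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # (1, hidden_size)
```
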
Filling masks using the pipeline:
```python
from transformers import pipeline

model_fill_mask = pipeline("fill-mask", model="Arkea/distilcamembert-base", tokenizer="Arkea/distilcamembert-base")
result = model_fill_mask("Le camembert est <mask> :)")
# results
# [{'sequence': '<s> Le camembert est délicieux :)</s>', 'score': 0.3878222405910492, 'token': 7200},
#  {'sequence': '<s> Le camembert est excellent :)</s>', 'score': 0.06469205021858215, 'token': 2183},
#  {'sequence': '<s> Le camembert est parfait :)</s>', 'score': 0.04534877464175224, 'token': 1654},
#  {'sequence': '<s> Le camembert est succulent :)</s>', 'score': 0.04128391295671463, 'token': 26202},
#  {'sequence': '<s> Le camembert est magnifique :)</s>', 'score': 0.02425697259604931, 'token': 1509}]
```