bvantuan committed on
Commit db2bd59 · 1 Parent(s): ecfecf0

Update README.md

Files changed (1): README.md (+139 −37)
---
language: fr
license: mit
datasets:
- Sequoia
widget:
- text: Aucun financement politique occulte n'a pu être mis en évidence.
- text: L'excrétion de l'acide zolédronique dans le lait maternel n'est pas connue.
---

# Multiword Expression Recognition

A multiword expression (MWE) is a combination of words that exhibits lexical, morphosyntactic, semantic, pragmatic and/or statistical idiosyncrasies (Baldwin and Kim, 2010). The goal of multiword expression recognition (MWER) is to automatically identify these MWEs.

## Model description

`camembert-mwer` is a token classification model fine-tuned from [CamemBERT](https://huggingface.co/camembert-base) on the [Sequoia](http://deep-sequoia.inria.fr/) dataset for the MWER task.

## How to use

You can use this model directly with a token classification pipeline:

```python
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("bvantuan/camembert-mwer")
>>> model = AutoModelForTokenClassification.from_pretrained("bvantuan/camembert-mwer")
>>> mwe_classifier = pipeline('token-classification', model=model, tokenizer=tokenizer)
>>> sentence = "Pour ce premier rendez-vous, l'animateur a pu faire partager sa passion et présenter quelques oeuvres pour mettre en bouche les participants."
>>> mwes = mwe_classifier(sentence)
>>> mwes
[{'entity': 'B-MWE',
  'score': 0.99492574,
  'index': 4,
  'word': '▁rendez',
  'start': 15,
  'end': 22},
 {'entity': 'I-MWE',
  'score': 0.9344883,
  'index': 5,
  'word': '-',
  'start': 22,
  'end': 23},
 {'entity': 'I-MWE',
  'score': 0.99398583,
  'index': 6,
  'word': 'vous',
  'start': 23,
  'end': 27},
 {'entity': 'B-VID',
  'score': 0.9827843,
  'index': 22,
  'word': '▁mettre',
  'start': 106,
  'end': 113},
 {'entity': 'I-VID',
  'score': 0.9835186,
  'index': 23,
  'word': '▁en',
  'start': 113,
  'end': 116},
 {'entity': 'I-VID',
  'score': 0.98324823,
  'index': 24,
  'word': '▁bouche',
  'start': 116,
  'end': 123}]

>>> mwe_classifier.group_entities(mwes)
[{'entity_group': 'MWE',
  'score': 0.9744666,
  'word': 'rendez-vous',
  'start': 15,
  'end': 27},
 {'entity_group': 'VID',
  'score': 0.9831837,
  'word': 'mettre en bouche',
  'start': 106,
  'end': 123}]
```
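The `group_entities` call above merges consecutive `B-`/`I-` tokens of the same category into full expressions using their character offsets. As a rough illustration of that merging logic, here is a self-contained sketch (the helper `group_mwes` is ours, not part of the library) applied to the predictions shown above, with scores omitted:

```python
def group_mwes(sentence, tokens):
    """Merge token-level IOB2 predictions into MWE spans (illustrative sketch)."""
    spans = []
    for tok in tokens:
        prefix, _, category = tok["entity"].partition("-")
        if prefix == "B" or not spans or spans[-1]["entity_group"] != category:
            # B- tag (or a category change) opens a new span
            spans.append({"entity_group": category,
                          "start": tok["start"], "end": tok["end"]})
        else:
            # I- tag extends the current span
            spans[-1]["end"] = tok["end"]
    for span in spans:
        # recover the surface form from the original sentence via offsets
        span["word"] = sentence[span["start"]:span["end"]].strip()
    return spans

sentence = "Pour ce premier rendez-vous, l'animateur a pu faire partager sa passion et présenter quelques oeuvres pour mettre en bouche les participants."
# Token-level predictions as returned by the pipeline above (scores omitted):
tokens = [
    {"entity": "B-MWE", "start": 15, "end": 22},
    {"entity": "I-MWE", "start": 22, "end": 23},
    {"entity": "I-MWE", "start": 23, "end": 27},
    {"entity": "B-VID", "start": 106, "end": 113},
    {"entity": "I-VID", "start": 113, "end": 116},
    {"entity": "I-VID", "start": 116, "end": 123},
]
print(group_mwes(sentence, tokens))
# [{'entity_group': 'MWE', 'start': 15, 'end': 27, 'word': 'rendez-vous'},
#  {'entity_group': 'VID', 'start': 106, 'end': 123, 'word': 'mettre en bouche'}]
```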

## Training data

The Sequoia dataset is divided into train/dev/test sets:

|  | Sequoia | train | dev | test |
| :----: | :---: | :----: | :---: | :----: |
| #sentences | 3099 | 1955 | 273 | 871 |
| #MWEs | 3450 | 2170 | 306 | 974 |
| #Unseen MWEs | _ | _ | 100 | 300 |

The dataset annotates 6 distinct categories of MWE:
* MWE: Non-verbal MWE (e.g. **à peu près**)
* IRV: Inherently reflexive verb (e.g. **s'occuper**)
* LVC.cause: Causative light-verb construction (e.g. **causer** le **bouleversement**)
* LVC.full: Light-verb construction (e.g. **avoir pour but** de)
* MVC: Multi-verb construction (e.g. **faire remarquer**)
* VID: Verbal idiom (e.g. **voir le jour**)

## Training procedure

### Preprocessing

The sequential labeling scheme used for this task is inside-outside-beginning (IOB2): the first token of each MWE is tagged `B-<category>`, subsequent tokens are tagged `I-<category>`, and tokens outside any MWE are tagged `O`.
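For illustration, the IOB2 encoding of a toy annotated sentence (a hypothetical example built around the VID **voir le jour**; `mwe_spans` holds token-index ranges) can be sketched as:

```python
tokens = ["Il", "a", "vu", "le", "jour", "en", "2020"]
# Hypothetical gold annotation: "vu le jour" is a verbal idiom (VID).
mwe_spans = [(2, 5, "VID")]  # (start, end) token indices, end exclusive, plus category

# IOB2: first token of the MWE gets B-<cat>, the rest get I-<cat>, others O.
labels = ["O"] * len(tokens)
for start, end, cat in mwe_spans:
    labels[start] = f"B-{cat}"
    for i in range(start + 1, end):
        labels[i] = f"I-{cat}"

print(list(zip(tokens, labels)))
# [('Il', 'O'), ('a', 'O'), ('vu', 'B-VID'), ('le', 'I-VID'),
#  ('jour', 'I-VID'), ('en', 'O'), ('2020', 'O')]
```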

### Fine-tuning

The model was fine-tuned on the combined train and dev sets with a learning rate of $3 \times 10^{-5}$ and a batch size of 10, over 15 epochs.
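The exact training script is not included in this card; assuming a standard 🤗 `Trainer` setup, the stated hyperparameters might map onto a `transformers.TrainingArguments` configuration along these lines (illustrative sketch only, with dataset loading and the `Trainer` call omitted):

```python
from transformers import TrainingArguments

# Hypothetical configuration mirroring the hyperparameters stated above.
training_args = TrainingArguments(
    output_dir="camembert-mwer",      # assumed output path
    learning_rate=3e-5,               # stated learning rate
    per_device_train_batch_size=10,   # stated batch size
    num_train_epochs=15,              # stated number of epochs
)
```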

### Evaluation results

On the test set, this model achieves the following results:

<table>
  <tr>
    <th colspan="3">Global MWE-based</th>
    <th colspan="3">Unseen MWE-based</th>
  </tr>
  <tr>
    <td>Precision</td><td>Recall</td><td>F1</td>
    <td>Precision</td><td>Recall</td><td>F1</td>
  </tr>
  <tr>
    <td>83.78</td><td>83.78</td><td>83.78</td>
    <td>57.05</td><td>60.67</td><td>58.80</td>
  </tr>
</table>
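MWE-based scores are typically computed over full expressions, counting a prediction as correct only when its complete span and category match the gold annotation. A minimal sketch of such scoring (our illustrative helper on a toy example, not the official evaluation tool):

```python
def mwe_prf(gold, pred):
    """Exact-match precision/recall/F1 over sets of
    (sentence_id, start, end, category) tuples."""
    tp = len(gold & pred)  # predictions matching a gold MWE exactly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: 2 of 3 predicted MWEs match the gold annotation exactly.
gold = {(0, 15, 27, "MWE"), (0, 106, 123, "VID"), (1, 0, 10, "IRV")}
pred = {(0, 15, 27, "MWE"), (0, 106, 123, "VID"), (1, 3, 10, "IRV")}
p, r, f = mwe_prf(gold, pred)
print(round(p, 2), round(r, 2), round(f, 2))
# 0.67 0.67 0.67
```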

### BibTeX entry and citation info

```bibtex
@article{martin2019camembert,
  title={CamemBERT: a tasty French language model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de La Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  journal={arXiv preprint arXiv:1911.03894},
  year={2019}
}

@article{candito2020french,
  title={A French corpus annotated for multiword expressions and named entities},
  author={Candito, Marie and Constant, Mathieu and Ramisch, Carlos and Savary, Agata and Guillaume, Bruno and Parmentier, Yannick and Cordeiro, Silvio Ricardo},
  journal={Journal of Language Modelling},
  volume={8},
  number={2},
  year={2020},
  publisher={Polska Akademia Nauk. Instytut Podstaw Informatyki PAN}
}
```