---
language: fr
license: mit
datasets:
- Sequoia
widget:
- text: Aucun financement politique occulte n'a pu être mis en évidence.
- text: L'excrétion de l'acide zolédronique dans le lait maternel n'est pas connue.
---

# Multiword expression recognition

A multiword expression (MWE) is a combination of words which exhibits lexical, morphosyntactic, semantic, pragmatic and/or statistical idiosyncrasies (Baldwin and Kim, 2010). The objective of multiword expression recognition (MWER) is to automate the identification of these MWEs.

## Model description

`camembert-mwer` was fine-tuned from [CamemBERT](https://huggingface.co/camembert-base) on the [Sequoia](http://deep-sequoia.inria.fr/) dataset, treating MWER as a token classification task.

## How to use

You can use this model directly with a pipeline for token classification:

```python
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("bvantuan/camembert-mwer")
>>> model = AutoModelForTokenClassification.from_pretrained("bvantuan/camembert-mwer")
>>> mwe_classifier = pipeline('token-classification', model=model, tokenizer=tokenizer)
>>> sentence = "Pour ce premier rendez-vous, l'animateur a pu faire partager sa passion et présenter quelques oeuvres pour mettre en bouche les participants."
>>> mwes = mwe_classifier(sentence)
>>> mwes
[{'entity': 'B-MWE',
  'score': 0.99492574,
  'index': 4,
  'word': '▁rendez',
  'start': 15,
  'end': 22},
 {'entity': 'I-MWE',
  'score': 0.9344883,
  'index': 5,
  'word': '-',
  'start': 22,
  'end': 23},
 {'entity': 'I-MWE',
  'score': 0.99398583,
  'index': 6,
  'word': 'vous',
  'start': 23,
  'end': 27},
 {'entity': 'B-VID',
  'score': 0.9827843,
  'index': 22,
  'word': '▁mettre',
  'start': 106,
  'end': 113},
 {'entity': 'I-VID',
  'score': 0.9835186,
  'index': 23,
  'word': '▁en',
  'start': 113,
  'end': 116},
 {'entity': 'I-VID',
  'score': 0.98324823,
  'index': 24,
  'word': '▁bouche',
  'start': 116,
  'end': 123}]

>>> mwe_classifier.group_entities(mwes)
[{'entity_group': 'MWE',
  'score': 0.9744666,
  'word': 'rendez-vous',
  'start': 15,
  'end': 27},
 {'entity_group': 'VID',
  'score': 0.9831837,
  'word': 'mettre en bouche',
  'start': 106,
  'end': 123}]
```
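
Alternatively, the pipeline can merge sub-word tokens itself. The sketch below assumes a `transformers` version that supports the `aggregation_strategy` argument; it should yield the same grouped spans as calling `group_entities` afterwards:

```python
>>> mwe_classifier = pipeline('token-classification', model=model, tokenizer=tokenizer,
...                           aggregation_strategy="simple")
>>> mwe_classifier(sentence)  # grouped spans, e.g. 'rendez-vous' (MWE) and 'mettre en bouche' (VID)
```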

## Training data

The Sequoia dataset is divided into train/dev/test sets:

|              | Sequoia | train | dev | test |
| :----------: | :-----: | :---: | :-: | :--: |
| #sentences   | 3099    | 1955  | 273 | 871  |
| #MWEs        | 3450    | 2170  | 306 | 974  |
| #Unseen MWEs | _       | _     | 100 | 300  |

This dataset has 6 distinct categories:

* MWE: Non-verbal MWEs (e.g. **à peu près**)
* IRV: Inherently reflexive verb (e.g. **s'occuper**)
* LVC.cause: Causative light-verb construction (e.g. **causer** le **bouleversement**)
* LVC.full: Light-verb construction (e.g. **avoir pour but** de)
* MVC: Multi-verb construction (e.g. **faire remarquer**)
* VID: Verbal idiom (e.g. **voir le jour**)

## Training procedure

### Preprocessing

The MWE annotations are converted to the Inside–Outside–Beginning (IOB2) sequence labeling scheme: the first token of an expression is tagged `B-<category>`, the following tokens `I-<category>`, and all other tokens `O`.
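
As an illustration (not code from this repository), the snippet below shows how one MWE from the example sentence above would be encoded under IOB2; the token list and span are hand-made for the sketch:

```python
# Hypothetical illustration of the IOB2 encoding used for MWER:
# the first token of an MWE gets B-<category>, the rest get I-<category>.
tokens = ["mettre", "en", "bouche", "les", "participants"]
mwe_spans = [((0, 3), "VID")]  # "mettre en bouche" is a verbal idiom (VID)

tags = ["O"] * len(tokens)
for (start, end), category in mwe_spans:
    tags[start] = f"B-{category}"
    for i in range(start + 1, end):
        tags[i] = f"I-{category}"

print(list(zip(tokens, tags)))
# [('mettre', 'B-VID'), ('en', 'I-VID'), ('bouche', 'I-VID'), ('les', 'O'), ('participants', 'O')]
```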

### Fine-tuning

The model was fine-tuned on the train+dev sets with a learning rate of $3 \times 10^{-5}$ and a batch size of 10 for 15 epochs.
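
The training script itself is not part of this card; as a rough sketch, the hyperparameters above map onto the Hugging Face `Trainer` API roughly as follows (`train_dataset` is a placeholder for the IOB2-encoded train+dev split, everything not stated above is an assumption):

```python
from transformers import Trainer, TrainingArguments

# Hyperparameters taken from the description above; all other settings are defaults/assumptions.
training_args = TrainingArguments(
    output_dir="camembert-mwer",
    learning_rate=3e-5,
    per_device_train_batch_size=10,
    num_train_epochs=15,
)

# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_dataset)  # hypothetical IOB2-encoded train+dev split
# trainer.train()
```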

### Evaluation results

On the test set, this model achieves the following results:

<table>
  <tr>
    <td colspan="3">Global MWE-based</td>
    <td colspan="3">Unseen MWE-based</td>
  </tr>
  <tr>
    <td>Precision</td><td>Recall</td><td>F1</td>
    <td>Precision</td><td>Recall</td><td>F1</td>
  </tr>
  <tr>
    <td>83.78</td><td>83.78</td><td>83.78</td>
    <td>57.05</td><td>60.67</td><td>58.80</td>
  </tr>
</table>
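
Here, "MWE-based" means each expression is scored as a whole unit rather than token by token. A minimal, unofficial sketch of such a computation, assuming exact matching of (token span, category) pairs:

```python
def mwe_based_scores(gold, pred):
    """Precision/recall/F1 over whole MWEs, where each MWE is a hashable
    (token_span, category) pair; only exact matches count as true positives."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: one gold MWE, one correct prediction and one spurious one.
gold = [((22, 25), "VID")]
pred = [((22, 25), "VID"), ((3, 5), "MWE")]
print(mwe_based_scores(gold, pred))  # (0.5, 1.0, 0.666...)
```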

### BibTeX entry and citation info

```bibtex
@article{martin2019camembert,
  title={CamemBERT: a tasty French language model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de La Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  journal={arXiv preprint arXiv:1911.03894},
  year={2019}
}

@article{candito2020french,
  title={A French corpus annotated for multiword expressions and named entities},
  author={Candito, Marie and Constant, Mathieu and Ramisch, Carlos and Savary, Agata and Guillaume, Bruno and Parmentier, Yannick and Cordeiro, Silvio Ricardo},
  journal={Journal of Language Modelling},
  volume={8},
  number={2},
  year={2020},
  publisher={Polska Akademia Nauk. Instytut Podstaw Informatyki PAN}
}
```