---
language: fr
license: mit
datasets:
- Sequoia
widget:
- text: Aucun financement politique occulte n'a pu être mis en évidence.
- text: L'excrétion de l'acide zolédronique dans le lait maternel n'est pas connue.
---

# Multiword expression recognition

A multiword expression (MWE) is a combination of words which exhibits lexical, morphosyntactic, semantic, pragmatic and/or statistical idiosyncrasies (Baldwin and Kim, 2010). The objective of multiword expression recognition (MWER) is to automate the identification of these MWEs.

## Model description

`camembert-mwer` was fine-tuned from [CamemBERT](https://huggingface.co/camembert-base) on the [Sequoia](http://deep-sequoia.inria.fr/) dataset, treating MWER as a token classification task.

## How to use

You can use this model directly with a pipeline for token classification:

```python
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("bvantuan/camembert-mwer")
>>> model = AutoModelForTokenClassification.from_pretrained("bvantuan/camembert-mwer")
>>> mwe_classifier = pipeline('token-classification', model=model, tokenizer=tokenizer)
>>> sentence = "Pour ce premier rendez-vous, l'animateur a pu faire partager sa passion et présenter quelques oeuvres pour mettre en bouche les participants."
>>> mwes = mwe_classifier(sentence)
>>> mwes
[{'entity': 'B-MWE',
  'score': 0.99492574,
  'index': 4,
  'word': '▁rendez',
  'start': 15,
  'end': 22},
 {'entity': 'I-MWE',
  'score': 0.9344883,
  'index': 5,
  'word': '-',
  'start': 22,
  'end': 23},
 {'entity': 'I-MWE',
  'score': 0.99398583,
  'index': 6,
  'word': 'vous',
  'start': 23,
  'end': 27},
 {'entity': 'B-VID',
  'score': 0.9827843,
  'index': 22,
  'word': '▁mettre',
  'start': 106,
  'end': 113},
 {'entity': 'I-VID',
  'score': 0.9835186,
  'index': 23,
  'word': '▁en',
  'start': 113,
  'end': 116},
 {'entity': 'I-VID',
  'score': 0.98324823,
  'index': 24,
  'word': '▁bouche',
  'start': 116,
  'end': 123}]

>>> mwe_classifier.group_entities(mwes)
[{'entity_group': 'MWE',
  'score': 0.9744666,
  'word': 'rendez-vous',
  'start': 15,
  'end': 27},
 {'entity_group': 'VID',
  'score': 0.9831837,
  'word': 'mettre en bouche',
  'start': 106,
  'end': 123}]
```
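
Alternatively, the pipeline can merge sub-word tokens itself. The sketch below assumes a `transformers` version that supports the `aggregation_strategy` argument; it should yield the same grouped spans as calling `group_entities` afterwards:

```python
>>> mwe_classifier = pipeline('token-classification', model=model, tokenizer=tokenizer,
...                           aggregation_strategy="simple")
>>> mwe_classifier(sentence)  # grouped spans, e.g. 'rendez-vous' (MWE) and 'mettre en bouche' (VID)
```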

## Training data

The Sequoia dataset is divided into train/dev/test sets:

|              | Sequoia | train | dev | test |
| :----------: | :-----: | :---: | :-: | :--: |
| #sentences   | 3099    | 1955  | 273 | 871  |
| #MWEs        | 3450    | 2170  | 306 | 974  |
| #Unseen MWEs | _       | _     | 100 | 300  |

This dataset has 6 distinct categories:

* MWE: Non-verbal MWEs (e.g. **à peu près**)
* IRV: Inherently reflexive verb (e.g. **s'occuper**)
* LVC.cause: Causative light-verb construction (e.g. **causer** le **bouleversement**)
* LVC.full: Light-verb construction (e.g. **avoir pour but** de)
* MVC: Multi-verb construction (e.g. **faire remarquer**)
* VID: Verbal idiom (e.g. **voir le jour**)

## Training procedure

### Preprocessing

The MWE annotations are converted to the Inside–Outside–Beginning (IOB2) sequence labeling scheme: the first token of an expression is tagged `B-<category>`, the following tokens `I-<category>`, and all other tokens `O`.
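
As an illustration (not code from this repository), the snippet below shows how one MWE from the example sentence above would be encoded under IOB2; the token list and span are hand-made for the sketch:

```python
# Hypothetical illustration of the IOB2 encoding used for MWER:
# the first token of an MWE gets B-<category>, the rest get I-<category>.
tokens = ["mettre", "en", "bouche", "les", "participants"]
mwe_spans = [((0, 3), "VID")]  # "mettre en bouche" is a verbal idiom (VID)

tags = ["O"] * len(tokens)
for (start, end), category in mwe_spans:
    tags[start] = f"B-{category}"
    for i in range(start + 1, end):
        tags[i] = f"I-{category}"

print(list(zip(tokens, tags)))
# [('mettre', 'B-VID'), ('en', 'I-VID'), ('bouche', 'I-VID'), ('les', 'O'), ('participants', 'O')]
```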

### Fine-tuning

The model was fine-tuned on the train+dev sets with a learning rate of $3 \times 10^{-5}$ and a batch size of 10 for 15 epochs.
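
The training script itself is not part of this card; as a rough sketch, the hyperparameters above map onto the Hugging Face `Trainer` API roughly as follows (`train_dataset` is a placeholder for the IOB2-encoded train+dev split, everything not stated above is an assumption):

```python
from transformers import Trainer, TrainingArguments

# Hyperparameters taken from the description above; all other settings are defaults/assumptions.
training_args = TrainingArguments(
    output_dir="camembert-mwer",
    learning_rate=3e-5,
    per_device_train_batch_size=10,
    num_train_epochs=15,
)

# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_dataset)  # hypothetical IOB2-encoded train+dev split
# trainer.train()
```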

### Evaluation results

On the test set, this model achieves the following results:

<table>
  <tr>
    <td colspan="3">Global MWE-based</td>
    <td colspan="3">Unseen MWE-based</td>
  </tr>
  <tr>
    <td>Precision</td><td>Recall</td><td>F1</td>
    <td>Precision</td><td>Recall</td><td>F1</td>
  </tr>
  <tr>
    <td>83.78</td><td>83.78</td><td>83.78</td>
    <td>57.05</td><td>60.67</td><td>58.80</td>
  </tr>
</table>
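
Here, "MWE-based" means each expression is scored as a whole unit rather than token by token. A minimal, unofficial sketch of such a computation, assuming exact matching of (token span, category) pairs:

```python
def mwe_based_scores(gold, pred):
    """Precision/recall/F1 over whole MWEs, where each MWE is a hashable
    (token_span, category) pair; only exact matches count as true positives."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: one gold MWE, one correct prediction and one spurious one.
gold = [((22, 25), "VID")]
pred = [((22, 25), "VID"), ((3, 5), "MWE")]
print(mwe_based_scores(gold, pred))  # (0.5, 1.0, 0.666...)
```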

### BibTeX entry and citation info

```bibtex
@article{martin2019camembert,
  title={CamemBERT: a tasty French language model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de La Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  journal={arXiv preprint arXiv:1911.03894},
  year={2019}
}

@article{candito2020french,
  title={A French corpus annotated for multiword expressions and named entities},
  author={Candito, Marie and Constant, Mathieu and Ramisch, Carlos and Savary, Agata and Guillaume, Bruno and Parmentier, Yannick and Cordeiro, Silvio Ricardo},
  journal={Journal of Language Modelling},
  volume={8},
  number={2},
  year={2020},
  publisher={Polska Akademia Nauk. Instytut Podstaw Informatyki PAN}
}
```