Multiword expression recognition
A multiword expression (MWE) is a combination of words which exhibits lexical, morphosyntactic, semantic, pragmatic and/or statistical idiosyncrasies (Baldwin and Kim, 2010). The objective of Multiword Expression Recognition (MWER) is to automate the identification of these MWEs.
Model description
camembert-mwer is a token classification model fine-tuned from CamemBERT on the Sequoia dataset for the MWER task.
How to use
You can use this model directly with a pipeline for token classification:
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("bvantuan/camembert-mwer")
>>> model = AutoModelForTokenClassification.from_pretrained("bvantuan/camembert-mwer")
>>> mwe_classifier = pipeline('token-classification', model=model, tokenizer=tokenizer)
>>> sentence = "Pour ce premier rendez-vous, l'animateur a pu faire partager sa passion et présenter quelques oeuvres pour mettre en bouche les participants."
>>> mwes = mwe_classifier(sentence)
>>> mwes
[{'entity': 'B-MWE',
'score': 0.99492574,
'index': 4,
'word': '▁rendez',
'start': 15,
'end': 22},
{'entity': 'I-MWE',
'score': 0.9344883,
'index': 5,
'word': '-',
'start': 22,
'end': 23},
{'entity': 'I-MWE',
'score': 0.99398583,
'index': 6,
'word': 'vous',
'start': 23,
'end': 27},
{'entity': 'B-VID',
'score': 0.9827843,
'index': 22,
'word': '▁mettre',
'start': 106,
'end': 113},
{'entity': 'I-VID',
'score': 0.9835186,
'index': 23,
'word': '▁en',
'start': 113,
'end': 116},
{'entity': 'I-VID',
'score': 0.98324823,
'index': 24,
'word': '▁bouche',
'start': 116,
'end': 123}]
>>> mwe_classifier.group_entities(mwes)
[{'entity_group': 'MWE',
'score': 0.9744666,
'word': 'rendez-vous',
'start': 15,
'end': 27},
{'entity_group': 'VID',
'score': 0.9831837,
'word': 'mettre en bouche',
'start': 106,
'end': 123}]
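The `start` and `end` fields are character offsets into the input sentence. In the output above, each span begins at the space preceding the expression, so the slice needs to be stripped. The following is a minimal sketch of mapping grouped predictions back to the text; the helper name `extract_mwes` is hypothetical and not part of the model or of `transformers`, and the `grouped` list simply copies the offsets shown above.

```python
from typing import Dict, List, Tuple

sentence = (
    "Pour ce premier rendez-vous, l'animateur a pu faire partager sa passion "
    "et présenter quelques oeuvres pour mettre en bouche les participants."
)

# Offsets copied from the grouped output shown above (scores omitted).
grouped = [
    {"entity_group": "MWE", "start": 15, "end": 27},
    {"entity_group": "VID", "start": 106, "end": 123},
]


def extract_mwes(text: str, entities: List[Dict]) -> List[Tuple[str, str]]:
    """Return (category, surface form) pairs by slicing the text with the offsets.

    The slice is stripped because the offsets may include the leading space
    that the tokenizer attaches to the first subword of a span.
    """
    return [
        (entity["entity_group"], text[entity["start"]:entity["end"]].strip())
        for entity in entities
    ]


for category, surface in extract_mwes(sentence, grouped):
    print(f"{category}: {surface}")
```

Slicing the original sentence, rather than relying on the `word` field, preserves the original casing and spacing of the expression.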
Training data
The Sequoia dataset is divided into train/dev/test sets. The Sequoia column gives the totals over the three sets:
| | Sequoia | train | dev | test |
---|---|---|---|---|
#sentences | 3099 | 1955 | 273 | 871 |
#MWEs | 3450 | 2170 | 306 | 974 |
#Unseen MWEs | _ | _ | 100 | 300 |
This dataset has 6 distinct categories:
- MWE: Non-verbal MWEs (e.g. à peu près)
- IRV: Inherently reflexive verb (e.g. s'occuper)
- LVC.cause: Causative light-verb construction (e.g. causer le bouleversement)
- LVC.full: Light-verb construction (e.g. avoir pour but de)
- MVC: Multi-verb construction (e.g. faire remarquer)
- VID: Verbal idiom (e.g. voir le jour)
Training procedure
Preprocessing
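The task uses the inside-outside-beginning (IOB2) sequence labeling scheme: the first token of an MWE of category X is tagged B-X, each following token of the same MWE is tagged I-X, and tokens outside any MWE are tagged O.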
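To make the scheme concrete, here is a minimal sketch of decoding a sequence of IOB2 tags into spans. The function `iob2_to_spans` is a hypothetical helper written for illustration, not the preprocessing code used to train this model, and it assumes contiguous expressions only.

```python
from typing import List, Optional, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Decode IOB2 tags into (category, start, end) spans, with end exclusive."""
    spans: List[Tuple[str, int, int]] = []
    current: Optional[Tuple[str, int]] = None  # (category, start) of the open span
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span, closing any span still open.
            if current is not None:
                spans.append((current[0], current[1], i))
            current = (tag[2:], i)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            # An I- tag of the same category extends the open span.
            continue
        else:
            # "O", or an I- tag that does not continue the open span.
            if current is not None:
                spans.append((current[0], current[1], i))
            current = None
    if current is not None:
        spans.append((current[0], current[1], len(tags)))
    return spans


tokens = ["mettre", "en", "bouche", "les", "participants"]
tags = ["B-VID", "I-VID", "I-VID", "O", "O"]
for category, start, end in iob2_to_spans(tags):
    print(category, tokens[start:end])
```

In this sketch an I- tag with no matching open span is ignored rather than promoted to a new span; other decoders make the opposite choice.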
Fine-tuning
The model was trained on the train+dev sets with a learning rate of $3 \times 10^{-5}$ and a batch size of 10, for 15 epochs.
Evaluation results
On the test set, this model achieves the following results:
Global MWE-based | | | Unseen MWE-based | | |
---|---|---|---|---|---|
Precision | Recall | F1 | Precision | Recall | F1 |
83.78 | 83.78 | 83.78 | 57.05 | 60.67 | 58.80 |
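As an illustration of how span-level scores like these can be computed, the sketch below assumes that "MWE-based" means a predicted expression counts as correct only when its category and all of its token positions exactly match a gold expression. That definition is an assumption here, and this is not the evaluation script used to produce the numbers above; the `Span` layout and the toy data are made up for the example.

```python
from typing import Set, Tuple

# (sentence id, category, start token, end token); an assumed layout for illustration.
Span = Tuple[int, str, int, int]


def precision_recall_f1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over sets of spans."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: one prediction matches exactly, the other has a wrong boundary.
gold = {(0, "MWE", 3, 6), (0, "VID", 21, 24), (1, "IRV", 2, 4)}
pred = {(0, "MWE", 3, 6), (0, "VID", 21, 23)}

p, r, f = precision_recall_f1(gold, pred)
print(f"P={p:.2%} R={r:.2%} F1={f:.2%}")
```

Under the same assumption, the unseen MWE-based scores would restrict the gold and predicted sets to expressions that do not occur in the training data.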
BibTeX entry and citation info
@article{martin2019camembert,
title={CamemBERT: a tasty French language model},
author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de La Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
journal={arXiv preprint arXiv:1911.03894},
year={2019}
}
@article{candito2020french,
title={A French corpus annotated for multiword expressions and named entities},
author={Candito, Marie and Constant, Mathieu and Ramisch, Carlos and Savary, Agata and Guillaume, Bruno and Parmentier, Yannick and Cordeiro, Silvio Ricardo},
journal={Journal of Language Modelling},
volume={8},
number={2},
year={2020},
publisher={Polska Akademia Nauk. Instytut Podstaw Informatyki PAN}
}