
Multiword expression recognition for French.

A multiword expression (MWE) is a combination of words which exhibits lexical, morphosyntactic, semantic, pragmatic and/or statistical idiosyncrasies (Baldwin and Kim, 2010). The objective of Multiword Expression Recognition (MWER) is to automate the identification of these MWEs.

Model description

camembert-mwer is a CamemBERT model fine-tuned for token classification on the Sequoia dataset for the MWER task.

How to use

You can use this model directly with a pipeline for token classification:

>>> from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("bvantuan/camembert-mwer")
>>> model = AutoModelForTokenClassification.from_pretrained("bvantuan/camembert-mwer")
>>> mwe_classifier = pipeline('token-classification', model=model, tokenizer=tokenizer)
>>> sentence = "Pour ce premier rendez-vous, l'animateur a pu faire partager sa passion et présenter quelques oeuvres pour mettre en bouche les participants."
>>> mwes = mwe_classifier(sentence)
>>> mwes

[{'entity': 'B-MWE',
  'score': 0.99492574,
  'index': 4,
  'word': '▁rendez',
  'start': 15,
  'end': 22},
 {'entity': 'I-MWE',
  'score': 0.9344883,
  'index': 5,
  'word': '-',
  'start': 22,
  'end': 23},
 {'entity': 'I-MWE',
  'score': 0.99398583,
  'index': 6,
  'word': 'vous',
  'start': 23,
  'end': 27},
 {'entity': 'B-VID',
  'score': 0.9827843,
  'index': 22,
  'word': '▁mettre',
  'start': 106,
  'end': 113},
 {'entity': 'I-VID',
  'score': 0.9835186,
  'index': 23,
  'word': '▁en',
  'start': 113,
  'end': 116},
 {'entity': 'I-VID',
  'score': 0.98324823,
  'index': 24,
  'word': '▁bouche',
  'start': 116,
  'end': 123}]

>>> mwe_classifier.group_entities(mwes)

[{'entity_group': 'MWE',
  'score': 0.9744666,
  'word': 'rendez-vous',
  'start': 15,
  'end': 27},
 {'entity_group': 'VID',
  'score': 0.9831837,
  'word': 'mettre en bouche',
  'start': 106,
  'end': 123}]
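
Equivalently, grouping can be requested when the pipeline is built, via the standard aggregation_strategy argument of the token-classification pipeline, so the spans come back already merged (a minimal sketch):

>>> mwe_classifier = pipeline('token-classification', model=model, tokenizer=tokenizer,
...                           aggregation_strategy='simple')
>>> mwe_classifier(sentence)  # spans come back already grouped, as in the output above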

Training data

The Sequoia dataset is divided into train/dev/test sets:

              Sequoia (total)   train   dev   test
#sentences               3099    1955   273    871
#MWEs                    3450    2170   306    974
#Unseen MWEs                –       –   100    300

This dataset has 6 distinct categories:

  • MWE: Non-verbal MWEs (e.g. à peu près)
  • IRV: Inherently reflexive verb (e.g. s'occuper)
  • LVC.cause: Causative light-verb construction (e.g. causer le bouleversement)
  • LVC.full: Light-verb construction (e.g. avoir pour but de)
  • MVC: Multi-verb construction (e.g. faire remarquer)
  • VID: Verbal idiom (e.g. voir le jour)
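
The full IOB2 tag inventory corresponding to these categories can be inspected from the loaded model's configuration; the exact set comes from the checkpoint itself (expected to be B-/I- variants of the categories above, plus O):

>>> model.config.id2label  # mapping from class index to IOB2 tag; exact inventory comes from the checkpoint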

Training procedure

Preprocessing

The sequence labeling scheme used for this task is Inside–Outside–Beginning (IOB2) tagging.
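
For instance, in the example sentence above, the verbal idiom mettre en bouche receives one B- tag followed by I- tags, while tokens outside any expression are tagged O:

>>> tokens = ["mettre", "en",    "bouche", "les", "participants"]
>>> labels = ["B-VID",  "I-VID", "I-VID",  "O",   "O"]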

Fine-tuning

The model was fine-tuned on the combined train and dev sets with a learning rate of $3 \times 10^{-5}$, a batch size of 10, and 15 training epochs.
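
The exact training script is not reproduced here; the sketch below shows an equivalent setup with the transformers Trainer under those hyperparameters. The base checkpoint name, the label count, and the prepared train_plus_dev dataset are assumptions for illustration.

>>> from transformers import (AutoModelForTokenClassification, DataCollatorForTokenClassification,
...                           Trainer, TrainingArguments)
>>> model = AutoModelForTokenClassification.from_pretrained(
...     "camembert/camembert-large",  # assumed base checkpoint
...     num_labels=13,                # assumed: B-/I- tags for the 6 categories, plus O
... )
>>> args = TrainingArguments(
...     output_dir="camembert-mwer",
...     learning_rate=3e-5,            # 3 × 10^-5, as stated above
...     per_device_train_batch_size=10,
...     num_train_epochs=15,
... )
>>> trainer = Trainer(
...     model=model,
...     args=args,
...     train_dataset=train_plus_dev,  # hypothetical: tokenized Sequoia train+dev with aligned IOB2 labels
...     data_collator=DataCollatorForTokenClassification(tokenizer),
... )
>>> trainer.train()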

Evaluation results

On the test set, this model achieves the following results:

                    Precision   Recall      F1
Global MWE-based        83.78    83.78   83.78
Unseen MWE-based        57.05    60.67   58.80
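
MWE-based scores of this kind are entity-level metrics over IOB2 tag sequences; they can be computed, for example, with the seqeval library (a toy sketch, not the official evaluation script):

>>> from seqeval.metrics import precision_score, recall_score, f1_score
>>> gold = [["B-VID", "I-VID", "I-VID", "O", "O"]]   # one gold VID expression
>>> pred = [["B-VID", "I-VID", "I-VID", "O", "O"]]   # prediction matches the full span
>>> precision_score(gold, pred), recall_score(gold, pred), f1_score(gold, pred)
(1.0, 1.0, 1.0)

An expression counts as correct only when both its span and its category match exactly, so partially recognized MWEs contribute to neither precision nor recall.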

BibTeX entry and citation info

@article{martin2019camembert,
  title={CamemBERT: a tasty French language model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de La Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  journal={arXiv preprint arXiv:1911.03894},
  year={2019}
}

@article{candito2020french,
  title={A French corpus annotated for multiword expressions and named entities},
  author={Candito, Marie and Constant, Mathieu and Ramisch, Carlos and Savary, Agata and Guillaume, Bruno and Parmentier, Yannick and Cordeiro, Silvio Ricardo},
  journal={Journal of Language Modelling},
  volume={8},
  number={2},
  year={2020},
  publisher={Polska Akademia Nauk. Instytut Podstaw Informatyki PAN}
}