|
--- |
|
language: fr |
|
datasets: |
|
- Jean-Baptiste/wikiner_fr |
|
widget: |
|
- text: "Je m'appelle jean-baptiste et j'habite à montréal depuis fevr 2012" |
|
license: mit |
|
--- |
|
|
|
# camembert-ner: model fine-tuned from camemBERT for NER task (including DATE tag). |
|
|
|
## Introduction |
|
|
|
[camembert-ner-with-dates] is an extension of french camembert-ner model with an additionnal tag for dates. |
|
Model was trained on enriched version of wikiner-fr dataset (~170 634 sentences). |
|
|
|
On my test data (mix of chat and email), this model got an f1 score of ~83% (in comparison dateparser was ~70%). |
|
Dateparser library can still be be used on the output of this model in order to convert text to python datetime object |
|
(https://dateparser.readthedocs.io/en/latest/). |
|
|
|
|
|
## How to use camembert-ner-with-dates with HuggingFace |
|
|
|
##### Load camembert-ner-with-dates and its sub-word tokenizer : |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForTokenClassification |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/camembert-ner-with-dates") |
|
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/camembert-ner-with-dates") |
|
|
|
|
|
##### Process text sample (from wikipedia) |
|
|
|
from transformers import pipeline |
|
|
|
nlp = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple") |
|
nlp("Apple est créée le 1er avril 1976 dans le garage de la maison d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak et Ronald Wayne14, puis constituée sous forme de société le 3 janvier 1977 à l'origine sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification de ses produits, le mot « computer » est retiré le 9 janvier 2015.") |
|
|
|
|
|
[{'entity_group': 'ORG', |
|
'score': 0.9776379466056824, |
|
'word': 'Apple', |
|
'start': 0, |
|
'end': 5}, |
|
{'entity_group': 'DATE', |
|
'score': 0.9793774570737567, |
|
'word': 'le 1er avril 1976 dans le', |
|
'start': 15, |
|
'end': 41}, |
|
{'entity_group': 'PER', |
|
'score': 0.9958226680755615, |
|
'word': 'Steve Jobs', |
|
'start': 74, |
|
'end': 85}, |
|
{'entity_group': 'LOC', |
|
'score': 0.995087186495463, |
|
'word': 'Los Altos', |
|
'start': 87, |
|
'end': 97}, |
|
{'entity_group': 'LOC', |
|
'score': 0.9953305125236511, |
|
'word': 'Californie', |
|
'start': 100, |
|
'end': 111}, |
|
{'entity_group': 'PER', |
|
'score': 0.9961076378822327, |
|
'word': 'Steve Jobs', |
|
'start': 115, |
|
'end': 126}, |
|
{'entity_group': 'PER', |
|
'score': 0.9960325956344604, |
|
'word': 'Steve Wozniak', |
|
'start': 127, |
|
'end': 141}, |
|
{'entity_group': 'PER', |
|
'score': 0.9957776467005411, |
|
'word': 'Ronald Wayne', |
|
'start': 144, |
|
'end': 157}, |
|
{'entity_group': 'DATE', |
|
'score': 0.994030773639679, |
|
'word': 'le 3 janvier 1977 à', |
|
'start': 198, |
|
'end': 218}, |
|
{'entity_group': 'ORG', |
|
'score': 0.9720810294151306, |
|
'word': "d'Apple Computer", |
|
'start': 240, |
|
'end': 257}, |
|
{'entity_group': 'DATE', |
|
'score': 0.9924157659212748, |
|
'word': '30 ans et', |
|
'start': 272, |
|
'end': 282}, |
|
{'entity_group': 'DATE', |
|
'score': 0.9934852868318558, |
|
'word': 'le 9 janvier 2015.', |
|
'start': 363, |
|
'end': 382}] |
|
|
|
``` |
|
|
|
|
|
## Model performances (metric: seqeval) |
|
|
|
Global |
|
``` |
|
'precision': 0.928 |
|
'recall': 0.928 |
|
'f1': 0.928 |
|
``` |
|
|
|
By entity |
|
``` |
|
Label LOC: (precision:0.929, recall:0.932, f1:0.931, support:9510) |
|
Label PER: (precision:0.952, recall:0.965, f1:0.959, support:9399) |
|
Label MISC: (precision:0.878, recall:0.844, f1:0.860, support:5364) |
|
Label ORG: (precision:0.848, recall:0.883, f1:0.865, support:2299) |
|
Label DATE: Not relevant because of method used to add date tag on wikiner dataset (estimated f1 ~90%) |
|
|
|
|
|
``` |
|
|
|
|