---
language: fr
datasets:
- Jean-Baptiste/wikiner_fr
widget:
- text: "Je m'appelle jean-baptiste et je vis à montréal"
- text: "george washington est allé à washington"
license: mit
---

# camembert-ner: model fine-tuned from CamemBERT for the NER task

## Introduction

camembert-ner is an NER model fine-tuned from CamemBERT on the wikiner-fr dataset (~170,634 sentences).

The model was validated on email and chat data, where it outperformed other models. In particular, it seems to work better on entities that do not start with an uppercase letter.

## Training data

The training data was labeled as follows:

Abbreviation|Description
-|-
O|Outside of a named entity
MISC|Miscellaneous entity
PER|Person’s name
ORG|Organization
LOC|Location
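
To make the tag set concrete, here is a minimal sketch of how per-token tags could be merged into entity spans. The function name `group_tags` and the hand-written tags are hypothetical; this only illustrates the labeling scheme and is not the code used to build wikiner-fr or the model's own post-processing.

```python
from itertools import groupby


def group_tags(tokens, tags):
    """Merge consecutive tokens that share the same non-O tag into one entity."""
    entities = []
    position = 0  # index of the first token of the current run
    for tag, run in groupby(zip(tokens, tags), key=lambda pair: pair[1]):
        words = [token for token, _ in run]
        if tag != "O":
            entities.append(
                {"entity_group": tag, "word": " ".join(words), "start_token": position}
            )
        position += len(words)
    return entities


# Hypothetical, hand-labeled example (not model output).
tokens = ["george", "washington", "est", "allé", "à", "washington"]
tags = ["PER", "PER", "O", "O", "O", "LOC"]
entities = group_tags(tokens, tags)
print(entities)
```

Note that this simple grouping would merge two adjacent entities of the same type into one span, so it is only a rough approximation.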

## How to use camembert-ner with HuggingFace

##### Load camembert-ner and its sub-word tokenizer:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/camembert-ner")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/camembert-ner")


# Process text sample (from Wikipedia)

from transformers import pipeline

# aggregation_strategy="simple" merges sub-word tokens into whole entities
nlp = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
nlp("Apple est créée le 1er avril 1976 dans le garage de la maison d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak et Ronald Wayne14, puis constituée sous forme de société le 3 janvier 1977 à l'origine sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification de ses produits, le mot « computer » est retiré le 9 janvier 2015.")


# Output:
[{'entity_group': 'ORG',
  'score': 0.9472818374633789,
  'word': 'Apple',
  'start': 0,
  'end': 5},
 {'entity_group': 'PER',
  'score': 0.9838564991950989,
  'word': 'Steve Jobs',
  'start': 74,
  'end': 85},
 {'entity_group': 'LOC',
  'score': 0.9831605950991312,
  'word': 'Los Altos',
  'start': 87,
  'end': 97},
 {'entity_group': 'LOC',
  'score': 0.9834540486335754,
  'word': 'Californie',
  'start': 100,
  'end': 111},
 {'entity_group': 'PER',
  'score': 0.9841555754343668,
  'word': 'Steve Jobs',
  'start': 115,
  'end': 126},
 {'entity_group': 'PER',
  'score': 0.9843501806259155,
  'word': 'Steve Wozniak',
  'start': 127,
  'end': 141},
 {'entity_group': 'PER',
  'score': 0.9841533899307251,
  'word': 'Ronald Wayne',
  'start': 144,
  'end': 157},
 {'entity_group': 'ORG',
  'score': 0.9468960364659628,
  'word': 'Apple Computer',
  'start': 243,
  'end': 257}]
```
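
The pipeline returns a list of dictionaries with the keys `entity_group`, `score`, `word`, `start` and `end`, as in the output above. As a minimal sketch of one possible post-processing step, the snippet below keeps entities above a score threshold and groups them by type. The helper name `entities_by_type` and the 0.95 threshold are assumptions made for illustration, not part of the model or of the `transformers` API.

```python
from collections import defaultdict


def entities_by_type(results, min_score=0.0):
    """Group entity words by entity type, keeping only scores >= min_score."""
    grouped = defaultdict(list)
    for item in results:
        if item["score"] >= min_score:
            grouped[item["entity_group"]].append(item["word"])
    return dict(grouped)


# A few entries copied from the pipeline output shown above.
sample = [
    {'entity_group': 'ORG', 'score': 0.9472818374633789, 'word': 'Apple', 'start': 0, 'end': 5},
    {'entity_group': 'PER', 'score': 0.9838564991950989, 'word': 'Steve Jobs', 'start': 74, 'end': 85},
    {'entity_group': 'LOC', 'score': 0.9831605950991312, 'word': 'Los Altos', 'start': 87, 'end': 97},
]

all_entities = entities_by_type(sample)
confident = entities_by_type(sample, min_score=0.95)  # arbitrary threshold
print(all_entities)
print(confident)
```

With this sample, the ORG entry scores just under 0.95 and is dropped from `confident`, while the PER and LOC entries are kept.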

## Model performance (metric: seqeval)

Overall

precision|recall|f1
-|-|-
0.8859|0.8971|0.8914

By entity

entity|precision|recall|f1
-|-|-|-
PER|0.9372|0.9598|0.9483
ORG|0.8099|0.8265|0.8181
LOC|0.8905|0.9005|0.8955
MISC|0.8175|0.8117|0.8146
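
The F1 score is the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes F1 from the precision and recall values in the tables above; it does not call seqeval, and the names `f1_score` and `reported` are introduced here only for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1) copied from the tables above.
reported = {
    "overall": (0.8859, 0.8971, 0.8914),
    "PER": (0.9372, 0.9598, 0.9483),
    "ORG": (0.8099, 0.8265, 0.8181),
    "LOC": (0.8905, 0.9005, 0.8955),
    "MISC": (0.8175, 0.8117, 0.8146),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed f1 = {f1_score(p, r):.4f}, reported f1 = {f1:.4f}")
```

The recomputed values should agree with the reported ones up to rounding of the four-decimal inputs.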

For those who may be interested, here is a short article on how I used the results of this model to train an LSTM model for signature detection in emails:
https://medium.com/@jean-baptiste.polle/lstm-model-for-email-signature-detection-8e990384fefa