---
language: fr
datasets:
- Jean-Baptiste/wikiner_fr
widget:
- text: "Je m'appelle jean-baptiste et je vis à montréal"
---

# camembert-ner: a model fine-tuned from camemBERT for named entity recognition (NER)

## Introduction

[camembert-ner] is a named entity recognition (NER) model fine-tuned from camemBERT on the wikiner-fr dataset (~170,634 sentences).
The model was validated on email and chat data, where it outperformed other models.
In particular, it seems to work better on entities that do not start with an uppercase letter.


## How to use camembert-ner with HuggingFace

##### Load camembert-ner and its sub-word tokenizer:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/camembert-ner")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/camembert-ner")
```

##### Process a text sample (from Wikipedia):

```python
from transformers import pipeline

# aggregation_strategy="simple" groups sub-word tokens that belong to the same entity
nlp = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
nlp("Apple est créée le 1er avril 1976 dans le garage de la maison d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak et Ronald Wayne14, puis constituée sous forme de société le 3 janvier 1977 à l'origine sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification de ses produits, le mot « computer » est retiré le 9 janvier 2015.")
```

Output:

```python
[{'entity_group': 'ORG',
  'score': 0.9472818374633789,
  'word': 'Apple',
  'start': 0,
  'end': 5},
 {'entity_group': 'PER',
  'score': 0.9838564991950989,
  'word': 'Steve Jobs',
  'start': 74,
  'end': 85},
 {'entity_group': 'LOC',
  'score': 0.9831605950991312,
  'word': 'Los Altos',
  'start': 87,
  'end': 97},
 {'entity_group': 'LOC',
  'score': 0.9834540486335754,
  'word': 'Californie',
  'start': 100,
  'end': 111},
 {'entity_group': 'PER',
  'score': 0.9841555754343668,
  'word': 'Steve Jobs',
  'start': 115,
  'end': 126},
 {'entity_group': 'PER',
  'score': 0.9843501806259155,
  'word': 'Steve Wozniak',
  'start': 127,
  'end': 141},
 {'entity_group': 'PER',
  'score': 0.9841533899307251,
  'word': 'Ronald Wayne',
  'start': 144,
  'end': 157},
 {'entity_group': 'ORG',
  'score': 0.9468960364659628,
  'word': 'Apple Computer',
  'start': 243,
  'end': 257}]

```
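
The list returned by the pipeline can be post-processed with plain Python. The sketch below is only an illustration: it hard-codes the first four entities from the output above together with a shortened prefix of the sample sentence, and the helper names (`filter_by_score`, `group_by_type`) and the 0.95 threshold are assumptions, not part of the model or of `transformers`. It also shows that, in this sample output, the `start` offset of some entities includes the preceding space, so the raw slice is stripped before comparing it with `word`.

```python
from collections import defaultdict

# Shortened prefix of the sample sentence used above.
text = ("Apple est créée le 1er avril 1976 dans le garage de la maison "
        "d'enfance de Steve Jobs à Los Altos en Californie")

# First four entities copied from the pipeline output shown above.
entities = [
    {'entity_group': 'ORG', 'score': 0.9472818374633789, 'word': 'Apple', 'start': 0, 'end': 5},
    {'entity_group': 'PER', 'score': 0.9838564991950989, 'word': 'Steve Jobs', 'start': 74, 'end': 85},
    {'entity_group': 'LOC', 'score': 0.9831605950991312, 'word': 'Los Altos', 'start': 87, 'end': 97},
    {'entity_group': 'LOC', 'score': 0.9834540486335754, 'word': 'Californie', 'start': 100, 'end': 111},
]


def filter_by_score(items, min_score):
    """Keep only the entities whose confidence is at least min_score."""
    return [item for item in items if item['score'] >= min_score]


def group_by_type(items):
    """Map each entity type (PER, LOC, ...) to the list of detected words."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item['entity_group']].append(item['word'])
    return dict(grouped)


# The raw slice may start with a space, so strip it before comparing.
surface_forms = [text[e['start']:e['end']].strip() for e in entities]

# 0.95 is an arbitrary threshold chosen for this illustration.
confident = filter_by_score(entities, min_score=0.95)
by_type = group_by_type(confident)
print(by_type)
```

With this threshold the first `Apple` entity (score below 0.95) is dropped; a suitable cut-off depends on the application.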


## Model performances (metric: seqeval)

Global
```
'precision': 0.8859
'recall': 0.8971
'f1': 0.8914
```

By entity
```
'LOC':  {'precision': 0.8905576596578294,
         'recall': 0.900554675118859,
         'f1': 0.8955282684352223},
'MISC': {'precision': 0.8175627240143369,
         'recall': 0.8117437722419929,
         'f1': 0.8146428571428571},
'ORG':  {'precision': 0.8099480326651819,
         'recall': 0.8265151515151515,
         'f1': 0.8181477315335584},
'PER':  {'precision': 0.9372509960159362,
         'recall': 0.959812321501428,
         'f1': 0.9483975005039308}
```
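
As a rough illustration of what an entity-level metric such as seqeval measures, the sketch below counts a prediction as correct only when its type and boundaries both match a gold entity, then derives precision, recall and F1. This is a simplified stand-in, not seqeval's implementation: the `Span` type, the function names and the toy gold/predicted sets are hypothetical. The last part checks that each reported F1 above is the harmonic mean of the reported precision and recall.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """One entity mention: type plus character offsets (hypothetical helper type)."""
    label: str
    start: int
    end: int


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def entity_scores(gold: set, pred: set) -> dict:
    """Exact-match scoring: a predicted span counts only if it is in the gold set."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return {'precision': precision, 'recall': recall,
            'f1': f1_score(precision, recall)}


# Toy example: the third prediction has the right boundaries but the wrong type.
gold = {Span('ORG', 0, 5), Span('PER', 75, 85), Span('LOC', 88, 97)}
pred = {Span('LOC', 0, 5), Span('PER', 75, 85), Span('LOC', 88, 97)}
toy_scores = entity_scores(gold, pred)

# Per-entity figures copied from the table above.
reported = {
    'LOC': (0.8905576596578294, 0.900554675118859, 0.8955282684352223),
    'MISC': (0.8175627240143369, 0.8117437722419929, 0.8146428571428571),
    'ORG': (0.8099480326651819, 0.8265151515151515, 0.8181477315335584),
    'PER': (0.9372509960159362, 0.959812321501428, 0.9483975005039308),
}
# Absolute difference between the recomputed and the reported F1 per entity type.
f1_gaps = {label: abs(f1_score(p, r) - f1) for label, (p, r, f1) in reported.items()}
print(toy_scores, f1_gaps)
```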

A short article on how I used the results of this model to train an LSTM model for signature detection in emails:  
https://medium.com/@jean-baptiste.polle/lstm-model-for-email-signature-detection-8e990384fefa