---
language: en
datasets:
- conll2003
widget:
- text: "My name is jean-baptiste and I live in montreal"
- text: "My name is clara and I live in berkeley, california."
- text: "My name is wolfgang and I live in berlin"

---

# roberta-large-ner: a roberta-large model fine-tuned for NER

## Introduction

roberta-large-ner is an NER model fine-tuned from roberta-large on the conll2003 dataset.
The model was validated on email/chat data and outperformed other models on this type of data specifically.
In particular, it seems to handle entities that do not start with an upper-case letter better.


## Training data

Training data was labeled with the following classes:

Abbreviation|Description
-|-
O| Outside of a named entity
MISC | Miscellaneous entity
PER  | Person’s name
ORG  | Organization
LOC  | Location

To simplify the label set, the B- and I- prefixes from the original conll2003 tags were removed.
I used the train and test splits of the original conll2003 for training and the "validation" split for validation. This resulted in the following dataset sizes:

Split | Number of sentences
-|-
Train | 17494
Validation | 3250
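
For illustration only (this is not the exact training script), the label simplification described above can be reproduced with the `datasets` library by stripping the B-/I- prefix from each conll2003 tag:

```python
from datasets import load_dataset

# Sketch of the label simplification described above; not the exact training code.
raw = load_dataset("conll2003")
tag_names = raw["train"].features["ner_tags"].feature.names  # ['O', 'B-PER', 'I-PER', ...]

def simplify(tag_id):
    # 'B-PER' / 'I-PER' -> 'PER'; 'O' stays 'O'
    return tag_names[tag_id].split("-")[-1]

example = raw["train"][0]
print(example["tokens"])
print([simplify(t) for t in example["ner_tags"]])
```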

## How to use roberta-large-ner with HuggingFace

##### Load roberta-large-ner and its sub-word tokenizer:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/roberta-large-ner")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/roberta-large-ner")
```

##### Process text sample (from wikipedia)

```python
from transformers import pipeline

nlp = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
nlp("Apple was founded in 1976 by Steve Jobs, Steve Wozniak and Ronald Wayne to develop and sell Wozniak's Apple I personal computer")
```

```
[{'entity_group': 'ORG',
  'score': 0.99381506,
  'word': ' Apple',
  'start': 0,
  'end': 5},
 {'entity_group': 'PER',
  'score': 0.99970853,
  'word': ' Steve Jobs',
  'start': 29,
  'end': 39},
 {'entity_group': 'PER',
  'score': 0.99981767,
  'word': ' Steve Wozniak',
  'start': 41,
  'end': 54},
 {'entity_group': 'PER',
  'score': 0.99956465,
  'word': ' Ronald Wayne',
  'start': 59,
  'end': 71},
 {'entity_group': 'PER',
  'score': 0.9997918,
  'word': ' Wozniak',
  'start': 92,
  'end': 99},
 {'entity_group': 'MISC',
  'score': 0.99956393,
  'word': ' Apple I',
  'start': 102,
  'end': 109}]
```
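
To try the lower-case behavior mentioned in the introduction, the same pipeline can be run on one of the widget sentences from this card (output not shown; scores depend on your transformers version):

```python
# Same pipeline as above, applied to a sentence whose entities are not capitalized.
nlp("My name is wolfgang and I live in berlin")
```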


## Model performance

Model performance computed on the conll2003 validation dataset (token-level predictions):

entity | precision | recall | f1
- | - | - | -
PER | 0.9914 | 0.9927 | 0.9920
ORG | 0.9627 | 0.9661 | 0.9644
LOC | 0.9795 | 0.9862 | 0.9828
MISC | 0.9292 | 0.9262 | 0.9277
Overall | 0.9740 | 0.9766 | 0.9753
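
As an indication only (this is not the author's evaluation script), token-level precision/recall/F1 of this kind can be computed with scikit-learn once gold and predicted tags are flattened into per-token labels; the `gold` and `pred` lists below are placeholder data:

```python
from sklearn.metrics import classification_report

# Placeholder example: flat per-token gold and predicted labels using the
# simplified tag set from this card ('PER', 'ORG', 'LOC', 'MISC', 'O').
gold = ["O", "PER", "PER", "O", "O", "ORG", "O", "LOC"]
pred = ["O", "PER", "PER", "O", "O", "ORG", "O", "O"]

# Per-entity precision/recall/f1; 'O' is excluded via the labels argument.
print(classification_report(gold, pred,
                            labels=["PER", "ORG", "LOC", "MISC"],
                            digits=4, zero_division=0))
```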

On a private dataset (emails, chat, informal discussion), computed on word-level predictions:

entity | precision | recall | f1
- | - | - | -
PER | 0.8823 | 0.9116 | 0.8967
ORG | 0.7694 | 0.7292 | 0.7487
LOC | 0.8619 | 0.7768 | 0.8171

For comparison, spaCy (en_core_web_trf-3.2.0) on the same private dataset gave:

entity | precision | recall | f1
- | - | - | -
PER | 0.9146 | 0.8287 | 0.8695
ORG | 0.7655 | 0.6437 | 0.6993
LOC | 0.8727 | 0.6180 | 0.7236