---
language: fr
license: mit
datasets:
- Jean-Baptiste/wikiner_fr
widget:
- text: "Boulanger, habitant à Boulanger et travaillant dans le magasin Boulanger situé dans la ville de Boulanger. Boulanger a écrit notamment le très célèbre livre intitulé Boulanger édité par la maison d'édition Boulanger."
---
DistilCamemBERT-NER
===================
We present DistilCamemBERT-NER, which is [DistilCamemBERT](https://huggingface.co/cmarkea/distilcamembert-base) fine-tuned for the NER (Named Entity Recognition) task in French. The work is inspired by [Jean-Baptiste/camembert-ner](https://huggingface.co/Jean-Baptiste/camembert-ner), which is based on the [CamemBERT](https://huggingface.co/camembert-base) model. The problem with CamemBERT-based models arises at scaling time, for example in the production phase, where inference cost can become a technological issue. To counteract this effect, we propose this model, which **divides the inference time by 2** at the same power consumption thanks to [DistilCamemBERT](https://huggingface.co/cmarkea/distilcamembert-base).
Dataset
-------
The dataset used is [wikiner_fr](https://huggingface.co/datasets/Jean-Baptiste/wikiner_fr), which contains ~170k sentences labeled with 5 categories:
* PER: person;
* LOC: location;
* ORG: organization;
* MISC: miscellaneous entities (movie titles, books, etc.);
* O: background (outside any entity).
Evaluation results
------------------
| **class** | **precision (%)** | **recall (%)** | **f1 (%)** | **support (#sub-word)** |
| :------------: | :---------------: | :------------: | :--------: | :---------------------: |
| **global** | 98.35 | 98.36 | 98.35 | 492'243 |
| **PER** | 96.22 | 97.41 | 96.81 | 27'842 |
| **LOC** | 93.93 | 93.50 | 93.72 | 31'431 |
| **ORG** | 85.13 | 87.08 | 86.10 | 7'662 |
| **MISC** | 88.55 | 81.84 | 85.06 | 13'553 |
| **O** | 99.40 | 99.55 | 99.47 | 411'755 |
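
The f1 column is the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes it from the rounded precision and recall values in the table. The helper name `f1_from_pr` is ours and purely illustrative; the exact evaluation script used for the table is not given here.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) copied from the evaluation table.
reported = {
    "PER": (96.22, 97.41),
    "LOC": (93.93, 93.50),
    "ORG": (85.13, 87.08),
    "MISC": (88.55, 81.84),
    "O": (99.40, 99.55),
}

# Recompute f1 per class from the rounded table values.
recomputed = {
    label: round(f1_from_pr(p, r), 2) for label, (p, r) in reported.items()
}
print(recomputed)
```

The recomputed values agree with the f1 column to within 0.01; the small differences for LOC and ORG are expected if the published scores were derived from unrounded precision and recall.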
Benchmark
---------
The performance of this model is compared to 2 reference models (see below) using the [MCC (Matthews Correlation Coefficient)](https://en.wikipedia.org/wiki/Phi_coefficient) metric. Scores are multiplied by 100, and the relative difference with respect to DistilCamemBERT-NER, taken as the reference, is given in parentheses. The mean inference time was measured on an AMD Ryzen 5 4500U @ 2.3GHz with 6 cores:
| **model** | **time (ms)** | **PER** | **LOC** | **ORG** | **MISC** | **O** |
| :---------------------------------------------------------------------------------------------------------------: | :----------------: | :--------------: | :--------------: | :--------------: | :--------------: | :------------- : |
| [cmarkea/distilcamembert-base-ner](https://huggingface.co/cmarkea/distilcamembert-base-ner) | **43.44** | **93.91** | **88.26** | **84.03** | **82.74** | **91.45** |
| [Davlan/bert-base-multilingual-cased-ner-hrl](https://huggingface.co/Davlan/bert-base-multilingual-cased-ner-hrl) | 87.56<br/>(+102%) | 79.93<br/>(-15%) | 70.39<br/>(-22%) | 60.26<br/>(-28%) | n/a<br/>(n/a%) | 69.95<br/>(-24%) |
| [flair/ner-french](https://huggingface.co/flair/ner-french) | 314.96<br/>(+625%) | 80.18<br/>(-15%) | 72.11<br/>(-18%) | 67.29<br/>(-20%) | 72.39<br/>(-17%) | 74.34<br/>(-19%) |
<!--- | [Jean-Baptiste/camembert-ner](https://huggingface.co/Jean-Baptiste/camembert-ner) | 83.70<br/>(+93%) | 95.20<br/>(+1%) | 90.85<br/>(+3%) | 89.50<br/>(+6%) | 89.02<br/>(+8%) | 92.86<br/>(+2%) | overfitting issue, since there is no way to know which observations were used for evaluation --->
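
To make the benchmark columns concrete, here is a minimal sketch of a per-class MCC and of the relative difference shown in parentheses. It assumes each class is scored one-vs-rest on token-level labels, which this card does not state explicitly; the function names and the toy labels are hypothetical and are not taken from the benchmark.

```python
import math
from typing import Sequence


def mcc_one_vs_rest(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """Matthews correlation coefficient of `label` against all other labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == label and truth == label:
            tp += 1
        elif pred == label:
            fp += 1
        elif truth == label:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when the coefficient is undefined.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def relative_delta(value: float, reference: float) -> int:
    """Relative difference to the reference, as a rounded percentage."""
    return round(100 * (value - reference) / reference)


# Toy labels, made up for illustration only.
y_true = ["PER", "O", "LOC", "O", "PER", "O"]
y_pred = ["PER", "O", "O", "O", "PER", "LOC"]
print(100 * mcc_one_vs_rest(y_true, y_pred, "PER"))  # score with the x100 factor

# Inference times from the table: 87.56 ms against the 43.44 ms reference.
print(relative_delta(87.56, 43.44))
```

With the times from the table, `relative_delta(87.56, 43.44)` gives 102, matching the (+102%) entry, and `relative_delta(79.93, 93.91)` gives -15, matching the PER entry of the same row.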
How to use DistilCamemBERT-NER
------------------------------
```python
from transformers import pipeline

# Load the fine-tuned model; "simple" aggregation merges sub-word
# tokens into whole entities.
ner = pipeline(
    task='ner',
    model="cmarkea/distilcamembert-base-ner",
    tokenizer="cmarkea/distilcamembert-base-ner",
    aggregation_strategy="simple"
)

# Run NER on a French paragraph (adjacent string literals are concatenated).
result = ner(
    "Le Crédit Mutuel Arkéa est une banque Française, elle comprend le CMB "
    "qui est une banque située en Bretagne et le CMSO qui est une banque "
    "qui se situe principalement en Aquitaine. C'est sous la présidence de "
    "Louis Lichou, dans les années 1980 que différentes filiales sont créées "
    "au sein du CMB et forment les principales filiales du groupe qui "
    "existent encore aujourd'hui (Federal Finance, Suravenir, Financo, etc.)."
)

# Each entry holds the entity type, confidence score, text and character offsets.
result
[{'entity_group': 'ORG',
  'score': 0.9974479,
  'word': 'Crédit Mutuel Arkéa',
  'start': 3,
  'end': 22},
 {'entity_group': 'LOC',
  'score': 0.9000358,
  'word': 'Française',
  'start': 38,
  'end': 47},
 {'entity_group': 'ORG',
  'score': 0.9788757,
  'word': 'CMB',
  'start': 66,
  'end': 69},
 {'entity_group': 'LOC',
  'score': 0.99919766,
  'word': 'Bretagne',
  'start': 99,
  'end': 107},
 {'entity_group': 'ORG',
  'score': 0.9594884,
  'word': 'CMSO',
  'start': 114,
  'end': 118},
 {'entity_group': 'LOC',
  'score': 0.99935514,
  'word': 'Aquitaine',
  'start': 169,
  'end': 178},
 {'entity_group': 'PER',
  'score': 0.99911094,
  'word': 'Louis Lichou',
  'start': 208,
  'end': 220},
 {'entity_group': 'ORG',
  'score': 0.96226394,
  'word': 'CMB',
  'start': 291,
  'end': 294},
 {'entity_group': 'ORG',
  'score': 0.9983959,
  'word': 'Federal Finance',
  'start': 374,
  'end': 389},
 {'entity_group': 'ORG',
  'score': 0.9984454,
  'word': 'Suravenir',
  'start': 391,
  'end': 400},
 {'entity_group': 'ORG',
  'score': 0.9985084,
  'word': 'Financo',
  'start': 402,
  'end': 409}]
```
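
The pipeline returns a flat list of dictionaries, so downstream code often regroups it by entity type. The sketch below shows one possible way to do this with a confidence filter; the helper `group_entities` and the 0.95 threshold are illustrative assumptions, not part of the `transformers` API, and the sample entries are copied from the output above so the snippet runs without loading the model.

```python
from collections import defaultdict

# A few entries copied from the pipeline output shown above.
sample = [
    {"entity_group": "ORG", "score": 0.9974479, "word": "Crédit Mutuel Arkéa", "start": 3, "end": 22},
    {"entity_group": "LOC", "score": 0.9000358, "word": "Française", "start": 38, "end": 47},
    {"entity_group": "PER", "score": 0.99911094, "word": "Louis Lichou", "start": 208, "end": 220},
    {"entity_group": "ORG", "score": 0.9594884, "word": "CMSO", "start": 114, "end": 118},
]


def group_entities(entities, min_score=0.0):
    """Group entity texts by type, keeping only those scored >= min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# 0.95 is an arbitrary threshold chosen for this example.
grouped = group_entities(sample, min_score=0.95)
print(grouped)
# {'ORG': ['Crédit Mutuel Arkéa', 'CMSO'], 'PER': ['Louis Lichou']}
```

With this threshold, the lower-confidence 'Française' entry (score 0.90) is dropped. The right cutoff depends on the application and should be tuned on held-out data.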