language: fr
license: mit
datasets:
  - Jean-Baptiste/wikiner_fr
widget:
  - text: >-
      Boulanger, habitant à Boulanger et travaillant dans le magasin Boulanger
      situé dans la ville de Boulanger. Boulanger a écrit notamment le très
      célèbre livre intitulé Boulanger édité par la maison d'édition Boulanger.

DistilCamemBERT-NER

We present DistilCamemBERT-NER, which is DistilCamemBERT fine-tuned for the NER (Named Entity Recognition) task in French. The work is inspired by Jean-Baptiste/camembert-ner, which is based on the CamemBERT model. The problem with CamemBERT-based models arises at scaling time, for example in the production phase, where inference cost can become a technological issue. To counteract this effect, we propose this model, which, thanks to DistilCamemBERT, halves the inference time for the same power consumption.

Dataset

The dataset used is wikiner_fr, which contains ~170k sentences labeled with 5 categories:

* PER: personality;
* LOC: location;
* ORG: organization;
* MISC: miscellaneous entities (movie titles, books, etc.);
* O: background (outside any entity).

Evaluation results

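| class | precision (%) | recall (%) | f1 (%) | support (#sub-word) |
| :---: | :---: | :---: | :---: | :---: |
| global | 98.35 | 98.36 | 98.35 | 492'243 |
| PER | 96.22 | 97.41 | 96.81 | 27'842 |
| LOC | 93.93 | 93.50 | 93.72 | 31'431 |
| ORG | 85.13 | 87.08 | 86.10 | 7'662 |
| MISC | 88.55 | 81.84 | 85.06 | 13'553 |
| O | 99.40 | 99.55 | 99.47 | 411'755 |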
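As a quick sanity check on the evaluation table, the f1 column can be recomputed from the precision and recall columns as their harmonic mean. The sketch below does this for the per-class rows only; the global row is left out because the averaging method behind it is not stated here. The helper name `f1_score` is ours, and small differences (about 0.01) are expected since the inputs are already rounded to two decimals.

```python
# Per-class precision, recall and f1 (in %) copied from the evaluation table.
REPORTED = {
    "PER": (96.22, 97.41, 96.81),
    "LOC": (93.93, 93.50, 93.72),
    "ORG": (85.13, 87.08, 86.10),
    "MISC": (88.55, 81.84, 85.06),
    "O": (99.40, 99.55, 99.47),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for label, (precision, recall, reported_f1) in REPORTED.items():
    recomputed = f1_score(precision, recall)
    print(f"{label:>4}: recomputed f1 = {recomputed:.2f}, reported f1 = {reported_f1:.2f}")
```

MISC shows the largest gap between precision and recall, which is why its f1 sits noticeably below its precision.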

Benchmark

The performance of this model is compared with 2 reference models (see below) using the MCC (Matthews Correlation Coefficient) metric. Scores are multiplied by 100, and the relative difference with respect to DistilCamemBERT-NER, used as the reference, is given in parentheses. The mean inference time was measured on an AMD Ryzen 5 4500U @ 2.3GHz with 6 cores:

| model | time (ms) | PER | LOC | ORG | MISC | O |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| cmarkea/distilcamembert-base-ner | 43.44 | 93.91 | 88.26 | 84.03 | 82.74 | 91.45 |
| Davlan/bert-base-multilingual-cased-ner-hrl | 87.56 (+102%) | 79.93 (-15%) | 70.39 (-22%) | 60.26 (-28%) | n/a (n/a%) | 69.95 (-24%) |
| flair/ner-french | 314.96 (+625%) | 80.18 (-15%) | 72.11 (-18%) | 67.29 (-20%) | 72.39 (-17%) | 74.34 (-19%) |
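To make the two quantities in this benchmark concrete, here is a minimal sketch of how a per-class MCC and a relative difference can be computed. It assumes each class is scored one-vs-rest from binary confusion counts, which may differ from the exact protocol used for the table; the function names `mcc` and `relative_delta_percent` are illustrative, not part of any library.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Binary Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention when a row or column of the confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


def relative_delta_percent(value: float, reference: float) -> float:
    """Relative difference of `value` with respect to `reference`, in percent."""
    return 100.0 * (value - reference) / reference


# Mean inference times (ms) copied from the benchmark table.
REFERENCE_TIME_MS = 43.44  # cmarkea/distilcamembert-base-ner
for name, time_ms in [
    ("Davlan/bert-base-multilingual-cased-ner-hrl", 87.56),
    ("flair/ner-french", 314.96),
]:
    delta = relative_delta_percent(time_ms, REFERENCE_TIME_MS)
    print(f"{name}: {time_ms} ms ({delta:+.0f}%)")
```

MCC ranges from -1 (total disagreement) to 1 (perfect prediction) and stays informative when one class dominates, as the O label does in this dataset.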

How to use DistilCamemBERT-NER

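from transformers import pipeline

# Load the fine-tuned model and its tokenizer into a token-classification pipeline.
# aggregation_strategy="simple" groups sub-word tokens into whole entity spans.
ner = pipeline(
    task="ner",
    model="cmarkea/distilcamembert-base-ner",
    tokenizer="cmarkea/distilcamembert-base-ner",
    aggregation_strategy="simple",
)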
result = ner(
    "Le Crédit Mutuel Arkéa est une banque Française, elle comprend le CMB "
    "qui est une banque située en Bretagne et le CMSO qui est une banque "
    "qui se situe principalement en Aquitaine. C'est sous la présidence de "
    "Louis Lichou, dans les années 1980 que différentes filiales sont créées "
    "au sein du CMB et forment les principales filiales du groupe qui "
    "existent encore aujourd'hui (Federal Finance, Suravenir, Financo, etc.)."
)

result
[{'entity_group': 'ORG',
  'score': 0.9974479,
  'word': 'Crédit Mutuel Arkéa',
  'start': 3,
  'end': 22},
 {'entity_group': 'LOC',
  'score': 0.9000358,
  'word': 'Française',
  'start': 38,
  'end': 47},
 {'entity_group': 'ORG',
  'score': 0.9788757,
  'word': 'CMB',
  'start': 66,
  'end': 69},
 {'entity_group': 'LOC',
  'score': 0.99919766,
  'word': 'Bretagne',
  'start': 99,
  'end': 107},
 {'entity_group': 'ORG',
  'score': 0.9594884,
  'word': 'CMSO',
  'start': 114,
  'end': 118},
 {'entity_group': 'LOC',
  'score': 0.99935514,
  'word': 'Aquitaine',
  'start': 169,
  'end': 178},
 {'entity_group': 'PER',
  'score': 0.99911094,
  'word': 'Louis Lichou',
  'start': 208,
  'end': 220},
 {'entity_group': 'ORG',
  'score': 0.96226394,
  'word': 'CMB',
  'start': 291,
  'end': 294},
 {'entity_group': 'ORG',
  'score': 0.9983959,
  'word': 'Federal Finance',
  'start': 374,
  'end': 389},
 {'entity_group': 'ORG',
  'score': 0.9984454,
  'word': 'Suravenir',
  'start': 391,
  'end': 400},
 {'entity_group': 'ORG',
  'score': 0.9985084,
  'word': 'Financo',
  'start': 402,
  'end': 409}]
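
The `start` and `end` fields are character offsets into the input string, so each entity can be recovered by slicing the text. The sketch below groups entities by label and optionally drops low-confidence ones. It runs on the first two entries of the output above, hard-coded so that it works without downloading the model; `group_by_label` and its `min_score` parameter are hypothetical helpers, not part of transformers.

```python
from collections import defaultdict

# Beginning of the example sentence and the first two entities returned for it.
text = "Le Crédit Mutuel Arkéa est une banque Française"
entities = [
    {"entity_group": "ORG", "score": 0.9974479,
     "word": "Crédit Mutuel Arkéa", "start": 3, "end": 22},
    {"entity_group": "LOC", "score": 0.9000358,
     "word": "Française", "start": 38, "end": 47},
]


def group_by_label(entities, text, min_score=0.0):
    """Map each label to the text spans of its entities scoring at least min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            # Slice the original text rather than trusting the detokenized "word".
            span = text[entity["start"]:entity["end"]]
            grouped[entity["entity_group"]].append(span)
    return dict(grouped)


print(group_by_label(entities, text))
print(group_by_label(entities, text, min_score=0.95))
```

Slicing the original text preserves the exact casing and spacing of the input, which the `word` field does not always guarantee after tokenization.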