Cyrile committed on
Commit
e1762a5
1 Parent(s): 4283dc1

Create README.md

  1. README.md +90 -0
README.md ADDED
@@ -0,0 +1,90 @@
1
+ ---
2
+ language: fr
3
+ license: mit
4
+ datasets:
5
+ - Jean-Baptiste/wikiner_fr
6
+ widget:
7
+ - text: "Boulanger, habitant à Boulanger, a acheté une télé à Boulanger."
8
+ ---
9
+ DistilCamemBERT-NER
10
+ ==================
11
+
12
+ We present DistilCamemBERT-NER, which is [DistilCamemBERT](https://huggingface.co/cmarkea/distilcamembert-base) fine-tuned for the NER (Named Entity Recognition) task in French. The work is inspired by [Jean-Baptiste/camembert-ner](https://huggingface.co/Jean-Baptiste/camembert-ner), which is based on the [CamemBERT](https://huggingface.co/camembert-base) model. The problem with CamemBERT-based models arises at scaling time, for example in the production phase, where inference cost can become a technological issue. To counteract this effect, we propose this model, which **divides the inference time by 2** at the same power consumption, thanks to [DistilCamemBERT](https://huggingface.co/cmarkea/distilcamembert-base).
13
+
14
+ Dataset
15
+ ----------
16
+
17
+ The dataset used is [wikiner_fr](https://huggingface.co/datasets/Jean-Baptiste/wikiner_fr), which contains ~170k sentences labeled with 5 categories:
18
+ * I-PER: person;
19
+ * I-LOC: location;
20
+ * I-ORG: organization;
21
+ * I-MISC: miscellaneous entities;
22
+ * O: background (other).
23
+
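The labels above follow an IO-style scheme: every token inside an entity carries an `I-` tag and everything else is `O`. As a rough illustration of how token-level tags become entity spans, here is a minimal sketch that merges consecutive tokens sharing the same tag. The function name `merge_io_tags` and the pre-tokenized input are assumptions made for this example; this is not the aggregation code used by `transformers`.

```python
from itertools import groupby


def merge_io_tags(tokens, tags):
    """Merge runs of consecutive tokens that share the same I-XXX tag.

    Hypothetical helper for illustration only: it assumes one tag per
    whitespace-separated token and ignores sub-word tokenization.
    """
    entities = []
    for tag, group in groupby(zip(tokens, tags), key=lambda pair: pair[1]):
        words = [token for token, _ in group]
        if tag != "O":
            # "I-PER" -> "PER"
            entities.append((tag.split("-", 1)[1], " ".join(words)))
    return entities


tokens = ["Louis", "Lichou", "préside", "le", "CMB"]
tags = ["I-PER", "I-PER", "O", "O", "I-ORG"]
print(merge_io_tags(tokens, tags))
```

One consequence of an IO scheme is that two distinct entities of the same type that are directly adjacent cannot be told apart, since there is no begin marker to separate them.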
24
+ Evaluation results
25
+ ------------------------
26
+
27
+ | class | precision (%) | recall (%) | F1 (%) | support |
28
+ | :----: | :---------: | :-----------: | :-----: | :------: |
29
+ | global | 98.35 | 98.36 | 98.35 | 492,243 |
30
+ | I-PER | 96.22 | 97.41 | 96.81 | 27,842 |
31
+ | I-LOC | 93.93 | 93.50 | 93.72 | 31,431 |
32
+ | I-ORG | 85.13 | 87.08 | 86.10 | 7,662 |
33
+ | I-MISC | 88.55 | 81.84 | 85.06 | 13,553 |
34
+ | O | 99.40 | 99.55 | 99.47 | 411,755 |
35
+
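To make the table easier to read, the sketch below recomputes F1 from the reported precision and recall using the standard harmonic-mean formula, and compares the `global` row with a support-weighted average of the per-class rows. The README does not say how the `global` row was obtained; treating it as a support-weighted average is an inference from the numbers. Small differences in the last digit are expected because the inputs are already rounded.

```python
# (precision %, recall %, F1 %, support) copied from the evaluation table
rows = {
    "I-PER": (96.22, 97.41, 96.81, 27842),
    "I-LOC": (93.93, 93.50, 93.72, 31431),
    "I-ORG": (85.13, 87.08, 86.10, 7662),
    "I-MISC": (88.55, 81.84, 85.06, 13553),
    "O": (99.40, 99.55, 99.47, 411755),
}
reported_global = (98.35, 98.36, 98.35, 492243)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def weighted_average(column):
    """Support-weighted average of one metric column (0=P, 1=R, 2=F1)."""
    total = sum(row[3] for row in rows.values())
    return sum(row[column] * row[3] for row in rows.values()) / total


for label, (p, r, f1, _) in rows.items():
    print(f"{label:6s} recomputed F1 = {f1_score(p, r):.2f} (reported {f1:.2f})")

for name, column in (("precision", 0), ("recall", 1), ("F1", 2)):
    print(f"weighted {name} = {weighted_average(column):.2f} "
          f"(reported {reported_global[column]:.2f})")
```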
36
+ How to use DistilCamemBERT-NER
37
+ ------------------------------------------------
38
+
39
+ ```python
40
+ from transformers import pipeline
41
+
42
+ ner = pipeline('ner', model="cmarkea/distilcamembert-base-ner", tokenizer="cmarkea/distilcamembert-base-ner", aggregation_strategy="simple")
43
+ result = ner("Le Crédit Mutuel Arkéa est une banque Francaise et le CMB est une banque de Bretagne. C'est sous la présidence de Louis Lichou, dans les années 1980 que différentes filiales sont créées au sein du CMB et forme les principales filiales du groupe qui existent encore aujourd'hui (Federal Finance, Suravenir, Financo, etc.).")
44
+ # result
45
+ # [{'entity_group': 'ORG',
46
+ # 'score': 0.9882848,
47
+ # 'word': 'Crédit Mutuel Arkéa',
48
+ # 'start': 3,
49
+ # 'end': 22},
50
+ # {'entity_group': 'LOC',
51
+ # 'score': 0.94114804,
52
+ # 'word': 'Francaise',
53
+ # 'start': 38,
54
+ # 'end': 47},
55
+ # {'entity_group': 'ORG',
56
+ # 'score': 0.8854897,
57
+ # 'word': 'CMB',
58
+ # 'start': 54,
59
+ # 'end': 57},
60
+ # {'entity_group': 'LOC',
61
+ # 'score': 0.9873087,
62
+ # 'word': 'Bretagne',
63
+ # 'start': 76,
64
+ # 'end': 84},
65
+ # {'entity_group': 'PER',
66
+ # 'score': 0.9989073,
67
+ # 'word': 'Louis Lichou',
68
+ # 'start': 114,
69
+ # 'end': 126},
70
+ # {'entity_group': 'ORG',
71
+ # 'score': 0.89991987,
72
+ # 'word': 'CMB',
73
+ # 'start': 197,
74
+ # 'end': 200},
75
+ # {'entity_group': 'ORG',
76
+ # 'score': 0.9965075,
77
+ # 'word': 'Federal Finance',
78
+ # 'start': 278,
79
+ # 'end': 293},
80
+ # {'entity_group': 'ORG',
81
+ # 'score': 0.99657035,
82
+ # 'word': 'Suravenir',
83
+ # 'start': 295,
84
+ # 'end': 304},
85
+ # {'entity_group': 'ORG',
86
+ # 'score': 0.9965148,
87
+ # 'word': 'Financo',
88
+ # 'start': 306,
89
+ # 'end': 313}]
90
+ ```
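The pipeline returns a flat list of dictionaries, where `start` and `end` are character offsets into the input string. A small post-processing sketch, assuming you want the entities grouped by type and optionally filtered by confidence, could look like this. The helper `group_entities` and its `min_score` parameter are hypothetical names introduced for this example; the input reuses the first sentence and the first four entries of the sample output above, so the model does not need to be downloaded to try it.

```python
from collections import defaultdict

text = "Le Crédit Mutuel Arkéa est une banque Francaise et le CMB est une banque de Bretagne."

# First four entries of the sample output shown above
result = [
    {"entity_group": "ORG", "score": 0.9882848, "word": "Crédit Mutuel Arkéa", "start": 3, "end": 22},
    {"entity_group": "LOC", "score": 0.94114804, "word": "Francaise", "start": 38, "end": 47},
    {"entity_group": "ORG", "score": 0.8854897, "word": "CMB", "start": 54, "end": 57},
    {"entity_group": "LOC", "score": 0.9873087, "word": "Bretagne", "start": 76, "end": 84},
]


def group_entities(entities, source, min_score=0.0):
    """Group entity surface forms by type, keeping those scored >= min_score.

    The surface form is sliced from the source text with the character
    offsets rather than taken from the 'word' field.
    """
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            span = source[entity["start"]:entity["end"]]
            grouped[entity["entity_group"]].append(span)
    return dict(grouped)


print(group_entities(result, text))
print(group_entities(result, text, min_score=0.9))
```

Slicing with the offsets keeps the original casing and spacing of the input, which can differ from the `word` field when the tokenizer normalizes text.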