Commit 23ee87a by dumitrescustefan (1 parent: 8ec9a62)
Create README.md

README.md ADDED
@@ -0,0 +1,94 @@
---
language: ro
datasets:
- ronecv2
license: mit
---
# bert-base-romanian-ner

## Model description

**bert-base-romanian-ner** is a fine-tuned BERT model that is ready to use for **Named Entity Recognition** and achieves **state-of-the-art performance** on the NER task. It has been trained to recognize **15** types of entities: persons, geo-political entities, locations, organizations, languages, national_religious_political entities, datetime, period, quantity, money, numeric, ordinal, facilities, works of art and events.

Specifically, this model is a [bert-base-romanian-cased-v1](https://huggingface.co/dumitrescustefan/bert-base-romanian-cased-v1) model that was fine-tuned on [RONEC version 2.0](https://github.com/dumitrescustefan/ronec), which holds 12,330 sentences with over 0.5M tokens, for a total of 80,283 distinctly annotated entities. RONECv2 is a BIO2-annotated corpus, meaning this model will generate "B-" and "I-" style labels for entities.
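
To illustrate what BIO2 labels look like in practice, here is a minimal sketch that groups a tag sequence into entity spans. The function name `bio2_to_spans`, the sample tokens, and the exact label strings (such as `B-PERSON`) are assumptions made for illustration; they are not taken from the model's configuration.

```python
from typing import List, Tuple


def bio2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO2 tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    start, current = None, None  # start index and type of the open span
    for i, tag in enumerate(tags):
        # An open span continues only on an "I-" tag of the same type.
        if current is not None and tag == f"I-{current}":
            continue
        # Any other tag closes the open span.
        if current is not None:
            spans.append((current, start, i))
            start, current = None, None
        # "B-" opens a new span. A stray "I-" with no matching open span is
        # also treated as an opening tag, which is a lenient design choice.
        if tag.startswith(("B-", "I-")):
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Hypothetical, hand-written tokens and tags (not real model output).
tokens = ["Alex", "merge", "la", "Cluj", "Napoca", "."]
tags = ["B-PERSON", "O", "O", "B-GPE", "I-GPE", "O"]
for label, start, end in bio2_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

Because every entity starts with a "B-" tag in BIO2, two adjacent entities of the same type remain distinguishable, which is why the sketch closes the open span whenever a new "B-" tag appears.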

### How to use

There are 2 ways to use this model:

#### Directly in Transformers:

You can use this model with the Transformers *pipeline* for NER; you will have to handle the cases where a single word is split into multiple subtokens that receive different labels.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the tokenizer and the fine-tuned token-classification model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-ner")
model = AutoModelForTokenClassification.from_pretrained("dumitrescustefan/bert-base-romanian-ner")

# Build the NER pipeline and run it on a Romanian example sentence
# ("Alex buys a ticket for train 3118 towards Cluj, departing at 13:00.")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Alex cumpără un bilet pentru trenul 3118 în direcția Cluj cu plecare la ora 13:00."
ner_results = nlp(example)
print(ner_results)
```
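
One possible way to handle words that are split into several subtokens is sketched below. It assumes the pipeline returns a list of dictionaries with `word`, `entity`, `index`, `start` and `end` keys, and that continuation pieces are marked with a `##` prefix (the WordPiece convention). The helper name `merge_subtokens` and the sample data are hypothetical, and giving each word the label of its first piece is only one of several reasonable policies.

```python
from typing import Dict, List


def merge_subtokens(pieces: List[Dict]) -> List[Dict]:
    """Merge "##" continuation pieces into whole words.

    Each merged word keeps the label of its first piece. The "index" field
    of a merged word holds the index of the last piece merged into it.
    """
    words: List[Dict] = []
    for piece in pieces:
        text = piece["word"]
        continues_previous = (
            text.startswith("##")
            and len(words) > 0
            and piece["index"] == words[-1]["index"] + 1
        )
        if continues_previous:
            # Extend the previous word; its label stays unchanged.
            words[-1]["word"] += text[2:]
            words[-1]["end"] = piece["end"]
            words[-1]["index"] = piece["index"]
        else:
            words.append({
                "word": text[2:] if text.startswith("##") else text,
                "entity": piece["entity"],
                "start": piece["start"],
                "end": piece["end"],
                "index": piece["index"],
            })
    return words


# Hypothetical, hand-written pipeline-style output (not real model output).
pieces = [
    {"entity": "B-PERSON", "index": 1, "word": "Alex", "start": 0, "end": 4},
    {"entity": "B-GPE", "index": 11, "word": "Clu", "start": 53, "end": 56},
    {"entity": "I-GPE", "index": 12, "word": "##j", "start": 56, "end": 57},
]
print(merge_subtokens(pieces))
```

Recent Transformers releases also accept an `aggregation_strategy` argument when creating the NER pipeline, which may make a helper like this unnecessary; check the documentation of the version you have installed.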

#### Use in a Python package

Install package
Use named_persons_only


#### Don't forget!

Remember to always sanitize your text! Replace the _s_ and _t_ cedilla letters with their comma-below counterparts **before processing your text** with these models, using:

```python
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```
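
If you sanitize text in several places, the same replacement can be wrapped in a small helper. The sketch below uses a translation table instead of chained `replace` calls; the names `sanitize` and `_CEDILLA_TO_COMMA` are hypothetical, and the code points are written as escapes so the two visually similar letter forms cannot be confused.

```python
# Map each cedilla letter to its comma-below counterpart.
_CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
})


def sanitize(text: str) -> str:
    """Return text with cedilla letters replaced by comma-below letters."""
    return text.translate(_CEDILLA_TO_COMMA)


# Text that already uses comma-below letters is returned unchanged.
print(sanitize("\u015fi \u0163ar\u0103"))
```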

## NER evaluation results

| metric    | dev  | test |
|-----------|:----:|:----:|
| f1        | 95.1 | 91.3 |
| precision | 95.0 | 90.7 |
| recall    | 95.3 | 91.9 |

## Corpus details

The corpus has the following classes and distribution in the train/valid/test splits:

| Classes      | Total #   | Train # | Train % | Valid # | Valid % | Test # | Test % |
|--------------|:---------:|:-------:|:-------:|:-------:|:-------:|:------:|:------:|
| PERSON       | **26130** | 19167   | 73.35   | 2733    | 10.46   | 4230   | 16.19  |
| GPE          | **11103** | 8193    | 73.79   | 1182    | 10.65   | 1728   | 15.56  |
| LOC          | **2467**  | 1824    | 73.94   | 270     | 10.94   | 373    | 15.12  |
| ORG          | **7880**  | 5688    | 72.18   | 880     | 11.17   | 1312   | 16.65  |
| LANGUAGE     | **467**   | 342     | 73.23   | 52      | 11.13   | 73     | 15.63  |
| NAT_REL_POL  | **4970**  | 3673    | 73.90   | 516     | 10.38   | 781    | 15.71  |
| DATETIME     | **9614**  | 6960    | 72.39   | 1029    | 10.70   | 1625   | 16.90  |
| PERIOD       | **1188**  | 862     | 72.56   | 129     | 10.86   | 197    | 16.58  |
| QUANTITY     | **1588**  | 1161    | 73.11   | 181     | 11.40   | 246    | 15.49  |
| MONEY        | **1424**  | 1041    | 73.10   | 159     | 11.17   | 224    | 15.73  |
| NUMERIC      | **7735**  | 5734    | 74.13   | 814     | 10.52   | 1187   | 15.35  |
| ORDINAL      | **1893**  | 1377    | 72.74   | 212     | 11.20   | 304    | 16.06  |
| FACILITY     | **1126**  | 840     | 74.60   | 113     | 10.04   | 173    | 15.36  |
| WORK_OF_ART  | **1596**  | 1157    | 72.49   | 176     | 11.03   | 263    | 16.48  |
| EVENT        | **1102**  | 826     | 74.95   | 107     | 9.71    | 169    | 15.34  |
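
The percentage columns are each split's share of the class total. As a quick check, the sketch below recomputes them for the PERSON row; the function name `split_percentages` is hypothetical, and rounding to two decimals is an assumption that matches how the table is printed.

```python
from typing import Tuple


def split_percentages(train: int, valid: int, test: int) -> Tuple[float, ...]:
    """Return each split's share of the class total, in percent."""
    total = train + valid + test
    return tuple(round(100 * n / total, 2) for n in (train, valid, test))


# PERSON row from the table: 19167 train, 2733 valid, 4230 test entities.
print(split_percentages(19167, 2733, 4230))
```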

### BibTeX entry and citation info

Please consider citing the following [paper](https://arxiv.org/abs/1909.01247) as a thank-you to the authors of RONEC, even though it describes v1 of the corpus and you are using a model trained on v2:

```
Dumitrescu, Stefan Daniel, and Andrei-Marius Avram. "Introducing RONEC--the Romanian Named Entity Corpus." arXiv preprint arXiv:1909.01247 (2019).
```

or in BibTeX format:

```
@article{dumitrescu2019introducing,
  title={Introducing RONEC--the Romanian Named Entity Corpus},
  author={Dumitrescu, Stefan Daniel and Avram, Andrei-Marius},
  journal={arXiv preprint arXiv:1909.01247},
  year={2019}
}
```