---
license: cc-by-4.0
datasets:
- wikiann
language:
- bg
model-index:
- name: bert-base-ner-bulgarian
  results: []
metrics:
- f1
pipeline_tag: token-classification
widget:
- text: 'Философът Барух Спиноза е роден в Амстердам.'
---

# 🇧🇬 BERT - Bulgarian Named Entity Recognition
This model is [rmihaylov/bert-base-bg](https://huggingface.co/rmihaylov/bert-base-bg) fine-tuned on the Bulgarian subset of [wikiann](https://huggingface.co/datasets/wikiann).
It achieves an F1-score of *0.99* on that dataset.

## Usage
Import the libraries:
```python
from pprint import pprint

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
```

Load the model:
```python
MODEL_ID = "auhide/bert-base-ner-bulgarian"
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

ner = pipeline(task="ner", model=model, tokenizer=tokenizer)
```

Do inference:
```python
text = "Философът Барух Спиноза е роден в Амстердам."
pprint(ner(text))
```

```
[{'end': 13,
  'entity': 'B-PER',
  'index': 3,
  'score': 0.9954899,
  'start': 9,
  'word': '▁Бар'},
 {'end': 15,
  'entity': 'I-PER',
  'index': 4,
  'score': 0.9660787,
  'start': 13,
  'word': 'ух'},
 {'end': 23,
  'entity': 'I-PER',
  'index': 5,
  'score': 0.99728084,
  'start': 15,
  'word': '▁Спиноза'},
 {'end': 43,
  'entity': 'B-LOC',
  'index': 9,
  'score': 0.8990479,
  'start': 33,
  'word': '▁Амстердам'}]
```

Note: The model predicts three entity types: `PER` (person), `ORG` (organization), and `LOC` (location).
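The pipeline returns subword-level predictions, so a single entity such as *Барух Спиноза* arrives as several B-/I- tagged pieces. A minimal sketch of merging those pieces back into full entity spans, using the sample output shown above (the `merge_entities` helper and its merging rule are illustrative, not part of the model):

```python
# Sample subword predictions, copied from the pipeline output above.
tokens = [
    {"entity": "B-PER", "start": 9, "end": 13, "word": "▁Бар"},
    {"entity": "I-PER", "start": 13, "end": 15, "word": "ух"},
    {"entity": "I-PER", "start": 15, "end": 23, "word": "▁Спиноза"},
    {"entity": "B-LOC", "start": 33, "end": 43, "word": "▁Амстердам"},
]
text = "Философът Барух Спиноза е роден в Амстердам."


def merge_entities(tokens, text):
    """Merge adjacent B-/I- subword predictions into full entity spans."""
    spans = []
    for tok in tokens:
        label = tok["entity"].split("-", 1)[-1]
        # Extend the previous span only for an I- tag of the same label
        # that directly follows it; otherwise start a new span.
        if (tok["entity"].startswith("I-") and spans
                and spans[-1]["label"] == label
                and tok["start"] <= spans[-1]["end"] + 1):
            spans[-1]["end"] = tok["end"]
        else:
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    # Slice the original text by character offsets to recover entity strings.
    return [{"label": s["label"], "text": text[s["start"]:s["end"]].strip()}
            for s in spans]


print(merge_entities(tokens, text))
# [{'label': 'PER', 'text': 'Барух Спиноза'}, {'label': 'LOC', 'text': 'Амстердам'}]
```

Alternatively, `transformers` can do this grouping for you if you construct the pipeline with an aggregation strategy, e.g. `pipeline(task="ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")`.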