---
license: mit
language:
- fi
metrics:
- f1
- precision
- recall
library_name: transformers
pipeline_tag: token-classification
---

## Finnish named entity recognition **WORK IN PROGRESS**

The model performs named entity recognition on Finnish text input.
It was trained by fine-tuning [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1),
using 10 named entity categories. The training data contains the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
as well as an annotated dataset consisting of Finnish document data from the 1970s onwards, digitized by the National Archives of Finland.
Since the latter dataset also contains sensitive data, it has not been made publicly available.


## Intended uses & limitations

The model has been trained to recognize the following named entities from a text in Finnish:

- PERSON (person names)
- ORG (organizations)
- LOC (locations)
- GPE (geopolitical locations)
- PRODUCT (products)
- EVENT (events)
- DATE (dates)
- JON (Finnish journal numbers (diaarinumero))
- FIBC (Finnish business identity codes (y-tunnus))
- NORP (nationality, religious and political groups)

Some entities, like EVENT, LOC and JON, are less common in the training data than the others, which means that
recognition accuracy for these entities also tends to be lower.

The training data is relatively recent, so the model might have difficulties when the input
contains, for example, older names or writing styles.

## How to use

The easiest way to use the model is with the Transformers pipeline for token classification:

```python
from transformers import pipeline

model_checkpoint = "Kansallisarkisto/finbert-ner"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)
predictions = token_classifier("Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki vuonna 1812.")
print(predictions)
```
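
With `aggregation_strategy="simple"`, the pipeline merges subword tokens and returns one dictionary per predicted entity span. The structure below is illustrative only; the entity groups, scores and offsets are example values, not actual model output:

```python
# Illustrative output structure (example values, not actual predictions):
# [{'entity_group': 'GPE', 'score': 0.99, 'word': 'Helsingistä', 'start': 0, 'end': 11},
#  {'entity_group': 'DATE', 'score': 0.98, 'word': 'vuonna 1812', 'start': 56, 'end': 67}]
```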

## Training data

Some of the entity types annotated in the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
dataset (for instance WORK_OF_ART, LAW and MONEY) were filtered out of the dataset used for training the model.
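
A minimal sketch of such label filtering, assuming CoNLL-style BIO tags (`FILTERED_TYPES` and the helper below are illustrative, not the project's actual preprocessing code):

```python
# Entity types to drop from the annotations (illustrative selection).
FILTERED_TYPES = {"WORK_OF_ART", "LAW", "MONEY"}

def filter_labels(tags):
    """Replace B-/I- tags of excluded entity types with the 'O' label."""
    return [
        "O" if tag != "O" and tag.split("-", 1)[1] in FILTERED_TYPES else tag
        for tag in tags
    ]

print(filter_labels(["B-PERSON", "I-PERSON", "O", "B-LAW", "I-LAW"]))
# -> ['B-PERSON', 'I-PERSON', 'O', 'O', 'O']
```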

In addition to this dataset, OCR'd and annotated content of
digitized documents from Finnish public administration was also used for model training.
The numbers of entities belonging to the different
entity classes in the training, validation and test datasets are listed below:

### Number of entity types in the data
Dataset|PERSON|ORG|LOC|GPE|PRODUCT|EVENT|DATE|JON|FIBC|NORP
-|-|-|-|-|-|-|-|-|-|-
Train|11691|30026|868|12999|7473|1184|14918|1360|1879|2068
Val|1542|4042|108|1654|879|160|1858|177|257|299
Test|1267|3698|86|1713|901|137|1843|174|233|260

## Training procedure

This model was trained using an NVIDIA RTX A6000 GPU with the following hyperparameters (a setup sketch follows the list):

- learning rate: 2e-05
- train batch size: 16
- epochs: 10
- optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
- scheduler: linear scheduler with num_warmup_steps=round(len(train_dataloader)/5) and num_training_steps=len(train_dataloader)*epochs
- maximum length of data sequence: 512
- patience: 2 epochs
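
As a rough sketch, the optimizer and scheduler above can be set up as follows; `model` and `train_dataloader` are assumed to already exist, so this illustrates the listed hyperparameters rather than reproducing the project's actual training script:

```python
import torch
from transformers import get_linear_schedule_with_warmup

epochs = 10

# AdamW with the betas and epsilon listed above.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=2e-5, betas=(0.9, 0.999), eps=1e-8
)

# Linear schedule: warm up over ~20% of one epoch, decay over all training steps.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=round(len(train_dataloader) / 5),
    num_training_steps=len(train_dataloader) * epochs,
)
```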

In the preprocessing stage, the input texts were split into chunks with a maximum length of 300 tokens,
so that the tokenized chunks would not exceed the maximum sequence length of 512. Tokenization was performed
using the tokenizer of the [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1)
model.
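
A minimal chunking sketch under these assumptions (greedy packing of whitespace-delimited words; the actual preprocessing script may differ):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")

def chunk_text(text, max_tokens=300):
    """Greedily pack words into chunks of at most max_tokens subword tokens."""
    chunks, current, current_len = [], [], 0
    for word in text.split():
        n = len(tokenizer.tokenize(word))
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(word)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```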

The training code with instructions will be available soon [here](https://github.com/DALAI-hanke/BERT_NER).

## Evaluation results

Evaluation results using the test dataset are listed below:

Entity|Precision|Recall|F1-score
-|-|-|-
PERSON|0.91|0.91|0.91
ORG|0.88|0.89|0.89
LOC|0.87|0.89|0.88
GPE|0.93|0.94|0.93
PRODUCT|0.77|0.82|0.80
EVENT|0.66|0.71|0.69
DATE|0.89|0.92|0.91
JON|0.78|0.83|0.80
FIBC|0.88|0.94|0.91
NORP|0.91|0.95|0.93

The metrics were calculated using the [seqeval](https://github.com/chakki-works/seqeval) library.
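
For reference, per-class scores of this kind can be computed with seqeval's `classification_report`; the label sequences below are made-up examples:

```python
from seqeval.metrics import classification_report

# Toy gold and predicted BIO label sequences, one inner list per sentence.
y_true = [["B-PERSON", "I-PERSON", "O", "B-GPE", "O"]]
y_pred = [["B-PERSON", "I-PERSON", "O", "O", "O"]]

print(classification_report(y_true, y_pred))
```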

## Acknowledgements

The model was developed in an ERDF-funded project "Using Artificial Intelligence to Improve the Quality and Usability of Digital Records" 
(Dalai) in 2021-2023. The purpose of the project was to develop the automation of the digitisation of cultural heritage materials and the 
automated description of such materials through artificial intelligence. The main target group comprises memory organisations, archives, 
museums and libraries that digitise and provide digital materials to their customers, as well as companies that develop services related 
to digitisation and the processing of digital materials.

Project partners were the National Archives of Finland, Central Archives for Finnish Business Records (Elka), 
South-Eastern Finland University of Applied Sciences Ltd (Xamk) and Disec Ltd.

The selection and definition of the named entity categories, the formulation of the annotation guidelines and the annotation process have been 
carried out in cooperation with the [FIN-CLARIAH research infrastructure / University of Jyväskylä](https://jyu.fi/fin-clariah).