|
--- |
|
license: mit |
|
language: |
|
- fi |
|
metrics: |
|
- f1 |
|
- precision |
|
- recall |
|
library_name: transformers |
|
pipeline_tag: token-classification |
|
--- |
|
|
|
## Finnish named entity recognition ** WORK IN PROGRESS ** |
|
|
|
The model performs named entity recognition from text input in Finnish. |
|
It was trained by fine-tuning [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1), |
|
using 10 named entity categories. Training data contains the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one) |
|
as well as an annotated dataset consisting of Finnish document data from the 1970s onwards, digitized by the National Archives of Finland. |
|
Since the latter dataset also contains sensitive data, it has not been made publicly available. |
|
|
|
|
|
## Intended uses & limitations |
|
|
|
The model has been trained to recognize the following named entities from a text in Finnish: |
|
|
|
- PERSON (person names) |
|
- ORG (organizations) |
|
- LOC (locations) |
|
- GPE (geopolitical locations) |
|
- PRODUCT (products) |
|
- EVENT (events) |
|
- DATE (dates) |
|
- JON (Finnish journal numbers (diaarinumero)) |
|
- FIBC (Finnish business identity codes (y-tunnus)) |
|
- NORP (nationalities, religious and political groups) |
|
|
|
Some entities, like EVENT, LOC and JON, are less common in the training data than the others, which means that |
|
recognition accuracy for these entities also tends to be lower. |
|
|
|
The training data is relatively recent, so the model may struggle when the input |
|
contains, for example, old names or writing styles. |
|
|
|
## How to use |
|
|
|
The easiest way to use the model is with the Transformers pipeline for token classification: |
|
|
|
```python |
|
from transformers import pipeline |
|
|
|
model_checkpoint = "Kansallisarkisto/finbert-ner" |
|
token_classifier = pipeline( |
|
"token-classification", model=model_checkpoint, aggregation_strategy="simple" |
|
) |
|
predictions = token_classifier("Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki vuonna 1812.") |
|
print(predictions) |
|
``` |
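With an aggregation strategy set, the pipeline returns a list of dictionaries with keys such as `entity_group`, `score`, `word`, `start` and `end`. The following is a minimal sketch of grouping such output by entity type. The helper `group_entities`, the `min_score` threshold and the sample predictions are hypothetical and only illustrate the shape of the data; they are not actual model output.

```python
from collections import defaultdict


def group_entities(predictions, min_score=0.5):
    """Group aggregated pipeline predictions by entity type.

    `predictions` is assumed to be the list of dicts returned by a
    token-classification pipeline with an aggregation strategy set.
    """
    grouped = defaultdict(list)
    for pred in predictions:
        # Skip low-confidence spans; the threshold is an arbitrary choice.
        if pred["score"] >= min_score:
            grouped[pred["entity_group"]].append(pred["word"])
    return dict(grouped)


# Hypothetical predictions, hand-written to illustrate the output format.
sample_predictions = [
    {"entity_group": "GPE", "score": 0.99, "word": "Helsingistä", "start": 0, "end": 11},
    {"entity_group": "GPE", "score": 0.97, "word": "Suomen", "start": 17, "end": 23},
    {"entity_group": "DATE", "score": 0.42, "word": "1812", "start": 63, "end": 67},
]

print(group_entities(sample_predictions))
```

Lowering `min_score` keeps more spans at the cost of more false positives, which may matter most for the rarer entity types.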
|
|
|
## Training data |
|
|
|
Some of the entities (for instance WORK_OF_ART, LAW, MONEY) that have been annotated in the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one) |
|
dataset were filtered out from the dataset used for training the model. |
|
|
|
In addition to this dataset, OCR'd and annotated content of |
|
digitized documents from Finnish public administration was also used for model training. |
|
The numbers of entities in each entity class |
|
in the training, validation and test datasets are listed below: |
|
|
|
### Number of entities per type in the data |
|
Dataset|PERSON|ORG|LOC|GPE|PRODUCT|EVENT|DATE|JON|FIBC|NORP |
|
-|-|-|-|-|-|-|-|-|-|- |
|
Train|11691|30026|868|12999|7473|1184|14918|1360|1879|2068 |
|
Val|1542|4042|108|1654|879|160|1858|177|257|299 |
|
Test|1267|3698|86|1713|901|137|1843|174|233|260 |
|
|
|
## Training procedure |
|
|
|
This model was trained using an NVIDIA RTX A6000 GPU with the following hyperparameters: |
|
|
|
- learning rate: 2e-05 |
|
- train batch size: 16 |
|
- epochs: 10 |
|
- optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08 |
|
- scheduler: linear scheduler with num_warmup_steps=round(len(train_dataloader)/5) and num_training_steps=len(train_dataloader)*epochs |
|
- maximum length of data sequence: 512 |
|
- patience: 2 epochs |
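To make the scheduler setting concrete, here is a minimal pure-Python sketch of a linear warmup followed by linear decay, using the `num_warmup_steps` and `num_training_steps` formulas listed above. It mirrors the general behaviour of a linear schedule with warmup; whether the training code uses exactly this formulation is an assumption, and `batches_per_epoch` is a hypothetical stand-in for `len(train_dataloader)`.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 up to the base learning rate.
        return step / max(1, num_warmup_steps)
    # Linear decay from the base learning rate down to 0.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-05
epochs = 10
batches_per_epoch = 1000  # hypothetical value standing in for len(train_dataloader)

num_warmup_steps = round(batches_per_epoch / 5)
num_training_steps = batches_per_epoch * epochs

# Print the learning rate at a few selected steps.
for step in (0, 100, 200, 5100, 10000):
    factor = linear_schedule_factor(step, num_warmup_steps, num_training_steps)
    print(step, base_lr * factor)
```

With these formulas the warmup covers the first fifth of the first epoch, after which the learning rate decays linearly to zero over the remaining steps.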
|
|
|
In the preprocessing stage, the input texts were split into chunks with a maximum length of 300 tokens, |
|
in order to avoid the tokenized chunks exceeding the maximum length of 512. Tokenization was performed |
|
using the tokenizer for the [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1) |
|
model. |
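The chunking step could be sketched roughly as follows. This assumes that "tokens" here means whitespace-separated words, which the subword tokenizer may later split into several pieces each (hence the margin below 512); the actual preprocessing code may differ. The function name `split_into_chunks` and the sample text are hypothetical.

```python
def split_into_chunks(text, max_tokens=300):
    """Split text into consecutive chunks of at most `max_tokens` words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


# Hypothetical input of 650 words to show how the chunks are formed.
sample_text = " ".join(f"sana{i}" for i in range(650))
chunks = split_into_chunks(sample_text)
print([len(chunk.split()) for chunk in chunks])
```

A fixed-size split like this can cut an entity in half at a chunk boundary, so splitting at sentence boundaries may be preferable when they are available.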
|
|
|
The training code with instructions will be available soon [here](https://github.com/DALAI-hanke/BERT_NER). |
|
|
|
## Evaluation results |
|
|
|
Evaluation results using the test dataset are listed below: |
|
|
|
||Precision|Recall|F1-score |
|
-|-|-|- |
|
PERSON|0.91|0.91|0.91 |
|
ORG|0.88|0.89|0.89 |
|
LOC|0.87|0.89|0.88 |
|
GPE|0.93|0.94|0.93 |
|
PRODUCT|0.77|0.82|0.80 |
|
EVENT|0.66|0.71|0.69 |
|
DATE|0.89|0.92|0.91 |
|
JON|0.78|0.83|0.80 |
|
FIBC|0.88|0.94|0.91 |
|
NORP|0.91|0.95|0.93 |
|
|
|
The metrics were calculated using the [seqeval](https://github.com/chakki-works/seqeval) library. |
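As a reminder of how the columns relate, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the rounded values in the table; because the inputs are rounded to two decimals, a recomputed score may differ from the reported one by about 0.01. The function name `f1_score` is only illustrative and is not the seqeval API.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall for NORP, taken from the table above.
print(round(f1_score(0.91, 0.95), 2))
```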
|
|
|
## Acknowledgements |
|
|
|
The model was developed in an ERDF-funded project "Using Artificial Intelligence to Improve the Quality and Usability of Digital Records" |
|
(Dalai) in 2021-2023. The purpose of the project was to develop the automation of the digitisation of cultural heritage materials and the |
|
automated description of such materials through artificial intelligence. The main target group comprises memory organisations, archives, |
|
museums and libraries that digitise and provide digital materials to their customers, as well as companies that develop services related |
|
to digitisation and the processing of digital materials. |
|
|
|
Project partners were the National Archives of Finland, Central Archives for Finnish Business Records (Elka), |
|
South-Eastern Finland University of Applied Sciences Ltd (Xamk) and Disec Ltd. |
|
|
|
The selection and definition of the named entity categories, the formulation of the annotation guidelines and the annotation process have been |
|
carried out in cooperation with the [FIN-CLARIAH research infrastructure / University of Jyväskylä](https://jyu.fi/fin-clariah). |
|
|
|
|
|
|