---
license: mit
language:
- fi
metrics:
- f1
- accuracy
library_name: transformers
pipeline_tag: token-classification
---

## Finnish named entity recognition **WORK IN PROGRESS**

The model performs named entity recognition on Finnish text input.
It was trained by fine-tuning [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1),
using 10 named entity categories. The training data contains the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
as well as an annotated dataset consisting of Finnish document data from the 1970s onwards, digitized by the National Archives of Finland.
Since the latter dataset also contains sensitive data, it has not been made publicly available.

## Intended uses & limitations

The model has been trained to recognize the following named entities in Finnish text (the model's own label inventory can also be inspected programmatically, as shown after the list):

- PERSON (person names)
- ORG (organizations)
- LOC (locations)
- GPE (geopolitical locations)
- PRODUCT (products)
- EVENT (events)
- DATE (dates)
- JON (Finnish journal numbers, *diaarinumero*)
- FIBC (Finnish business identity codes, *y-tunnus*)
- NORP (nationality, religious and political groups)
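
The exact label inventory (typically B-/I- prefixed variants of these categories plus an O tag for non-entity tokens) can be read from the model configuration rather than assumed:

```python
from transformers import AutoConfig

# Loads only the configuration file, not the model weights.
config = AutoConfig.from_pretrained("Kansallisarkisto/finbert-ner")

# id2label maps each classifier output index to its tag string.
print(config.id2label)
```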

Some entity types, such as EVENT, LOC and JON, are less common in the training data than the others,
so recognition accuracy for these types also tends to be lower.

Because the training data is relatively recent, the model may struggle with input that contains,
for example, older names or writing styles.

## How to use

The easiest way to use the model is with the Transformers pipeline for token classification:

```python
from transformers import pipeline

model_checkpoint = "Kansallisarkisto/finbert-ner"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)
token_classifier("Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki vuonna 1812.")
```
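
With `aggregation_strategy="simple"`, the pipeline merges subword tokens into whole entities, and each prediction is a dictionary with `entity_group`, `score`, `word`, `start` and `end` keys, so the output can be post-processed directly:

```python
results = token_classifier(
    "Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki vuonna 1812."
)
for entity in results:
    # Prints one line per recognized entity: category, surface form, confidence.
    print(f"{entity['entity_group']:8} {entity['word']} ({entity['score']:.2f})")
```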

## Training data

Some of the entity types annotated in the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
(for instance WORK_OF_ART, LAW and MONEY) were filtered out of the dataset used for training the model;
a sketch of this filtering is shown after the table. In addition to this corpus, OCR'd and annotated content of
digitized documents from Finnish public administration was also used for model training. The number of entities
in each class in the training, validation and test datasets is listed below:

Number of entities per class in the data

Dataset|PERSON|ORG|LOC|GPE|PRODUCT|EVENT|DATE|JON|FIBC|NORP
-|-|-|-|-|-|-|-|-|-|-
Train|0|0|0|0|0|0|0|0|0|0
Val|1560|4077|108|1643|880|165|1897|185|265|299
Test|1284|3742|87|1713|906|137|1864|179|234|261
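
A minimal sketch of the kind of label filtering described above, assuming CoNLL-style BIO tags (the helper function is illustrative, not the actual preprocessing code):

```python
# Example excluded categories named above; the full excluded set may be larger.
EXCLUDED = {"WORK_OF_ART", "LAW", "MONEY"}

def filter_tag(tag: str) -> str:
    """Map tags of excluded categories to "O"; keep all other tags."""
    if tag != "O" and tag.split("-", 1)[1] in EXCLUDED:
        return "O"
    return tag

print(filter_tag("B-LAW"))     # -> "O"
print(filter_tag("I-PERSON"))  # -> "I-PERSON"
```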

## Training procedure

The model was trained on an NVIDIA RTX A6000 GPU with the following hyperparameters (a sketch of how they fit together follows the list):

- learning rate: 2e-05
- train batch size: 16
- epochs: 10
- optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
- scheduler: linear scheduler with num_warmup_steps=round(len(train_dataloader)/5) and num_training_steps=len(train_dataloader)*epochs
- maximum length of data sequence: 512
- patience: 2 epochs
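
A minimal sketch of how these hyperparameters might be wired together (illustrative only: `train_dataloader` is a placeholder for the real tokenized corpus, and the label count of 21 assumes BIO tagging over the 10 categories):

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForTokenClassification, get_linear_schedule_with_warmup

epochs = 10
# Placeholder; the real dataloader iterates over the tokenized NER corpus.
train_dataloader = DataLoader(list(range(1000)), batch_size=16)

# 10 entity categories -> 21 labels under a BIO scheme (assumption; see the model config).
model = AutoModelForTokenClassification.from_pretrained(
    "TurkuNLP/bert-base-finnish-cased-v1", num_labels=21
)
optimizer = torch.optim.AdamW(
    model.parameters(), lr=2e-05, betas=(0.9, 0.999), eps=1e-08
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=round(len(train_dataloader) / 5),
    num_training_steps=len(train_dataloader) * epochs,
)
```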

The training code with instructions is available [here](https://github.com/DALAI-hanke/BERT_NER).