Kansallisarkisto/finbert-ner

@@ -14,7 +14,8 @@ pipeline_tag: token-classification
 The model performs named entity recognition from text input in Finnish.
 It was trained by fine-tuning [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1),
-using 10 named entity categories. Training data contains the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
 as well as an annotated dataset consisting of Finnish document data from the 1970s onwards, digitized by the National Archives of Finland.
 Since the latter dataset also contains sensitive data, it has not been made publicly available.
@@ -34,10 +35,10 @@ The model has been trained to recognize the following named entities from a text
 - FIBC (Finnish business identity codes (y-tunnus))
 - NORP (nationality, religious and political groups)
-Some entities, like EVENT, LOC and JON, are less common in the training data than the others, which means that
 recognition accuracy for these entities also tends to be lower.
-The training data is relatively recent, so that the model might face difficulties when the input
 contains for example old names or writing styles.
 ## How to use
@@ -58,7 +59,17 @@ print(predictions)
 ## Training data
 Some of the entities (for instance WORK_OF_ART, LAW, MONEY) that have been annotated in the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
-dataset were filtered out from the dataset used for training the model.
 In addition to this dataset, OCR'd and annotated content of
 digitized documents from Finnish public administration was also used for model training.
@@ -124,4 +135,3 @@ South-Eastern Finland University of Applied Sciences Ltd (Xamk) and Disec Ltd.
 The selection and definition of the named entity categories, the formulation of the annotation guidelines and the annotation process have been
 carried out in cooperation with the [FIN-CLARIAH research infrastructure / University of Jyväskylä](https://jyu.fi/fin-clariah).

 The model performs named entity recognition from text input in Finnish.
 It was trained by fine-tuning [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1),
+using 10 named entity categories. The training data includes, for instance, the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one),
+the Finnish part of the [NewsEye dataset](https://zenodo.org/record/4573313)
 as well as an annotated dataset consisting of Finnish document data from the 1970s onwards, digitized by the National Archives of Finland.
 Since the latter dataset also contains sensitive data, it has not been made publicly available.
 - FIBC (Finnish business identity codes (y-tunnus))
 - NORP (nationality, religious and political groups)
+Some entities, like EVENT and LOC, are less common in the training data than the others, which means that
 recognition accuracy for these entities also tends to be lower.
+Most of the training data is relatively recent, so the model may struggle when the input
 contains for example old names or writing styles.
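Because recognition accuracy is noted to be lower for the less common entities, one possible way to handle this downstream is to apply a stricter confidence threshold to those labels. The following is only a minimal sketch, not part of the model or its documented usage. It assumes predictions in the aggregated format of the `transformers` token-classification pipeline (dicts with `entity_group`, `score` and `word` keys). The function name `filter_predictions`, the threshold values and the sample data are hypothetical and chosen purely for illustration.

```python
# Hypothetical thresholds: stricter for the labels the card flags as less common.
DEFAULT_THRESHOLD = 0.5
STRICT_THRESHOLDS = {"EVENT": 0.8, "LOC": 0.8}


def filter_predictions(predictions, thresholds=None, default=DEFAULT_THRESHOLD):
    """Keep only predictions whose score meets the threshold for their label."""
    thresholds = STRICT_THRESHOLDS if thresholds is None else thresholds
    kept = []
    for pred in predictions:
        label = pred["entity_group"]
        # Labels without an explicit threshold fall back to the default.
        if pred["score"] >= thresholds.get(label, default):
            kept.append(pred)
    return kept


# Made-up predictions for illustration only, not real model output.
sample = [
    {"entity_group": "LOC", "score": 0.65, "word": "Mikkeli"},
    {"entity_group": "LOC", "score": 0.93, "word": "Helsinki"},
    {"entity_group": "NORP", "score": 0.65, "word": "suomalainen"},
]

print([p["word"] for p in filter_predictions(sample)])
```

Suitable threshold values would need to be chosen by evaluating the model on representative data; the numbers above are placeholders.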
 ## How to use
 ## Training data
 Some of the entities (for instance WORK_OF_ART, LAW, MONEY) that have been annotated in the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
+dataset were filtered out from the dataset used for training the model. On the other hand, entities that were missing from the [NewsEye dataset](https://zenodo.org/record/4573313)
+were added during the annotation process. The different data sources used in model training are listed below:
+Dataset|Period covered by the texts|Text type
+-|-|-
+[Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)|2000s|Online texts
+[NewsEye dataset](https://zenodo.org/record/4573313)|1850-1950|OCR'd newspaper articles
+Finnish document data digitized by the National Archives of Finland|1970s - 2000s|OCR'd
 In addition to this dataset, OCR'd and annotated content of
 digitized documents from Finnish public administration was also used for model training.
 The selection and definition of the named entity categories, the formulation of the annotation guidelines and the annotation process have been
 carried out in cooperation with the [FIN-CLARIAH research infrastructure / University of Jyväskylä](https://jyu.fi/fin-clariah).