MikkoLipsanen committed on
Commit
a3c4f82
1 Parent(s): 3186d31

Update README.md

  1. README.md +15 -5
README.md CHANGED
@@ -14,7 +14,8 @@ pipeline_tag: token-classification
14
 
15
  The model performs named entity recognition from text input in Finnish.
16
  It was trained by fine-tuning [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1),
17
- using 10 named entity categories. Training data contains the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
 
18
  as well as an annotated dataset consisting of Finnish document data from the 1970s onwards, digitized by the National Archives of Finland.
19
  Since the latter dataset contains also sensitive data, it has not been made publicly available.
20
 
@@ -34,10 +35,10 @@ The model has been trained to recognize the following named entities from a text
34
  - FIBC (Finnish business identity codes (y-tunnus))
35
  - NORP (nationality, religious and political groups)
36
 
37
- Some entities, like EVENT, LOC and JON, are less common in the training data than the others, which means that
38
  recognition accuracy for these entities also tends to be lower.
39
 
40
- The training data is relatively recent, so that the model might face difficulties when the input
41
  contains for example old names or writing styles.
42
 
43
  ## How to use
@@ -58,7 +59,17 @@ print(predictions)
58
  ## Training data
59
 
60
  Some of the entities (for instance WORK_OF_ART, LAW, MONEY) that have been annotated in the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
61
- dataset were filtered out from the dataset used for training the model.
62
 
63
  In addition to this dataset, OCR'd and annotated content of
64
  digitized documents from Finnish public administration was also used for model training.
@@ -124,4 +135,3 @@ South-Eastern Finland University of Applied Sciences Ltd (Xamk) and Disec Ltd.
124
  The selection and definition of the named entity categories, the formulation of the annotation guidelines and the annotation process have been
125
  carried out in cooperation with the [FIN-CLARIAH research infrastructure / University of Jyväskylä](https://jyu.fi/fin-clariah).
126
 
127
-
 
14
 
15
  The model performs named entity recognition from text input in Finnish.
16
  It was trained by fine-tuning [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1),
17
+ using 10 named entity categories. Training data contains for instance the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one),
18
+ the Finnish part of the [NewsEye dataset](https://zenodo.org/record/4573313)
19
  as well as an annotated dataset consisting of Finnish document data from the 1970s onwards, digitized by the National Archives of Finland.
20
  Since the latter dataset contains also sensitive data, it has not been made publicly available.
21
 
 
35
  - FIBC (Finnish business identity codes (y-tunnus))
36
  - NORP (nationality, religious and political groups)
37
 
38
+ Some entities, like EVENT and LOC, are less common in the training data than the others, which means that
39
  recognition accuracy for these entities also tends to be lower.
40
 
41
+ Most of the training data is relatively recent, so that the model might face difficulties when the input
42
  contains for example old names or writing styles.
43
 
44
  ## How to use
 
59
  ## Training data
60
 
61
  Some of the entities (for instance WORK_OF_ART, LAW, MONEY) that have been annotated in the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
62
+ dataset were filtered out from the dataset used for training the model. On the other hand, entities that were missing from the [NewsEye dataset](https://zenodo.org/record/4573313)
63
+ were added during the annotation process. The different data sources used in model training are listed below:
64
+
65
+ Dataset|Period covered by the texts|Text type
66
+ -|-|-
67
+ [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)|2000s|Online texts
68
+ [NewsEye dataset](https://zenodo.org/record/4573313)|1850-1950|OCR'd newspaper articles
69
+ Finnish document data digitized by the National Archives of Finland|1970s - 2000s|OCR'd
70
+
71
+
72
+
73
 
74
  In addition to this dataset, OCR'd and annotated content of
75
  digitized documents from Finnish public administration was also used for model training.
 
135
  The selection and definition of the named entity categories, the formulation of the annotation guidelines and the annotation process have been
136
  carried out in cooperation with the [FIN-CLARIAH research infrastructure / University of Jyväskylä](https://jyu.fi/fin-clariah).
137