MikkoLipsanen committed
Commit a3c4f82 • 1 Parent(s): 3186d31
Update README.md
README.md CHANGED
@@ -14,7 +14,8 @@ pipeline_tag: token-classification
|
|
14 |
|
15 |
The model performs named entity recognition from text input in Finnish.
|
16 |
It was trained by fine-tuning [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1),
|
17 |
-
using 10 named entity categories. Training data contains the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
|
18 |
as well as an annotated dataset consisting of Finnish document data from the 1970s onwards, digitized by the National Archives of Finland.
|
19 |
Since the latter dataset also contains sensitive data, it has not been made publicly available.
|
20 |
|
@@ -34,10 +35,10 @@ The model has been trained to recognize the following named entities from a text
|
|
34 |
- FIBC (Finnish business identity codes (y-tunnus))
|
35 |
- NORP (nationality, religious and political groups)
|
36 |
|
37 |
-
Some entities, like EVENT
|
38 |
recognition accuracy for these entities also tends to be lower.
|
39 |
|
40 |
-
|
41 |
contains for example old names or writing styles.
|
42 |
|
43 |
## How to use
|
@@ -58,7 +59,17 @@ print(predictions)
|
|
58 |
## Training data
|
59 |
|
60 |
Some of the entities (for instance WORK_OF_ART, LAW, MONEY) that have been annotated in the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
|
61 |
-
dataset were filtered out from the dataset used for training the model.
|
62 |
|
63 |
In addition to this dataset, OCR'd and annotated content of
|
64 |
digitized documents from Finnish public administration was also used for model training.
|
@@ -124,4 +135,3 @@ South-Eastern Finland University of Applied Sciences Ltd (Xamk) and Disec Ltd.
|
|
124 |
The selection and definition of the named entity categories, the formulation of the annotation guidelines and the annotation process have been
|
125 |
carried out in cooperation with the [FIN-CLARIAH research infrastructure / University of Jyväskylä](https://jyu.fi/fin-clariah).
|
126 |
|
127 |
-
|
|
|
14 |
|
15 |
The model performs named entity recognition from text input in Finnish.
|
16 |
It was trained by fine-tuning [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1),
|
17 |
+
using 10 named entity categories. The training data includes, for instance, the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one),
|
18 |
+
the Finnish part of the [NewsEye dataset](https://zenodo.org/record/4573313)
|
19 |
as well as an annotated dataset consisting of Finnish document data from the 1970s onwards, digitized by the National Archives of Finland.
|
20 |
Since the latter dataset also contains sensitive data, it has not been made publicly available.
|
21 |
|
|
|
35 |
- FIBC (Finnish business identity codes (y-tunnus))
|
36 |
- NORP (nationality, religious and political groups)
|
37 |
|
38 |
+
Some entities, like EVENT and LOC, are less common in the training data than the others, which means that
|
39 |
recognition accuracy for these entities also tends to be lower.
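
To see which categories are rare in a given dataset, one option is to count entity mentions per category. The sketch below is only an illustration: it assumes BIO-style tags (such as `B-LOC`, `I-LOC`, `O`), which the README does not confirm, and the function name `count_entities` and the toy tag sequences are hypothetical.

```python
from collections import Counter


def count_entities(tag_sequences):
    """Count entity mentions per category in BIO-tagged sentences.

    Assumes a BIO tagging scheme; this is an assumption made for the
    example, not something stated in the README.
    """
    counts = Counter()
    for tags in tag_sequences:
        for tag in tags:
            # Each "B-" tag opens exactly one entity mention.
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    return counts


# Toy tag sequences, invented for illustration only.
toy_tags = [
    ["B-NORP", "O", "B-LOC", "I-LOC"],
    ["B-NORP", "O", "B-FIBC"],
    ["O", "B-EVENT", "I-EVENT", "B-NORP"],
]

counts = count_entities(toy_tags)
# List categories from most to least frequent.
for category, n in counts.most_common():
    print(category, n)
```

Categories at the bottom of such a listing are the ones for which lower recognition accuracy would be expected.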
|
40 |
|
41 |
+
Most of the training data is relatively recent, so the model may struggle when the input
|
42 |
contains for example old names or writing styles.
|
43 |
|
44 |
## How to use
|
|
|
59 |
## Training data
|
60 |
|
61 |
Some of the entities (for instance WORK_OF_ART, LAW, MONEY) that have been annotated in the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
|
62 |
+
dataset were filtered out from the dataset used for training the model. On the other hand, entities that were missing from the [NewsEye dataset](https://zenodo.org/record/4573313)
|
63 |
+
were added during the annotation process. The different data sources used in model training are listed below:
|
64 |
+
|
65 |
+
Dataset|Period covered by the texts|Text type
|
66 |
+
-|-|-
|
67 |
+
[Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)|2000s|Online texts
|
68 |
+
[NewsEye dataset](https://zenodo.org/record/4573313)|1850-1950|OCR'd newspaper articles
|
69 |
+
Finnish document data digitized by the National Archives of Finland|1970s - 2000s|OCR'd
|
70 |
+
|
71 |
+
|
72 |
+
|
73 |
|
74 |
In addition to this dataset, OCR'd and annotated content of
|
75 |
digitized documents from Finnish public administration was also used for model training.
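
The filtering of Turku OntoNotes categories described in this section could look roughly like the following. This is a minimal sketch under stated assumptions, not the authors' preprocessing code: it assumes BIO-style tags and assumes that filtered categories are relabeled as `O` rather than their sentences being dropped. The names `DROPPED` and `filter_tags` are hypothetical, and the set lists only the three categories the README names as examples.

```python
# Categories named in the README as examples of filtered-out entities.
# The full list used in training is not given, so this set is partial.
DROPPED = {"WORK_OF_ART", "LAW", "MONEY"}


def filter_tags(tags, dropped=DROPPED):
    """Replace tags of dropped categories with "O".

    Assumes BIO-style tags such as "B-LAW" or "I-LAW"; relabeling as
    "O" is an assumption made for this sketch.
    """
    filtered = []
    for tag in tags:
        # "B-WORK_OF_ART" -> "WORK_OF_ART"; plain "O" has no category.
        category = tag.split("-", 1)[1] if "-" in tag else None
        filtered.append("O" if category in dropped else tag)
    return filtered


example = ["B-LAW", "I-LAW", "O", "B-LOC", "B-WORK_OF_ART"]
print(filter_tags(example))
```

Splitting on the first hyphen only keeps category names that contain underscores, such as `WORK_OF_ART`, intact.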
|
|
|
135 |
The selection and definition of the named entity categories, the formulation of the annotation guidelines and the annotation process have been
|
136 |
carried out in cooperation with the [FIN-CLARIAH research infrastructure / University of Jyväskylä](https://jyu.fi/fin-clariah).
|
137 |
|
|