MikkoLipsanen committed
Commit 0a5858c
Parent(s): bded733
Update README.md
README.md
CHANGED
@@ -15,7 +15,7 @@ pipeline_tag: token-classification
 The model performs named entity recognition from text input in Finnish.
 It was trained by fine-tuning [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1),
 using 10 named entity categories. Training data contains the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
-as well as an annotated dataset consisting of Finnish document
+as well as an annotated dataset consisting of Finnish document data from the 1970s onwards, digitized by the National Archives of Finland.
 Since the latter dataset contains also sensitive data, it has not been made publicly available.
 
 
@@ -84,7 +84,7 @@ This model was trained using a NVIDIA RTX A6000 GPU with the following hyperpara
 - maximum length of data sequence: 512
 - patience: 2 epochs
 
-In the
+In the preprocessing stage, the input texts were split into chunks with a maximum length of 300 tokens,
 in order to avoid the tokenized chunks exceeding the maximum length of 512. Tokenization was performed
 using the tokenizer for the [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1)
 model.
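The chunking step added in this commit (split input texts into at most 300 tokens so that the subword-tokenized chunks stay under the model's 512-token limit) could be sketched as below. This is a minimal illustration, not the actual preprocessing code from the model's training pipeline; it assumes whitespace-level splitting, and the function name `split_into_chunks` is hypothetical.

```python
def split_into_chunks(text: str, max_tokens: int = 300) -> list[str]:
    """Split `text` into chunks of at most `max_tokens` whitespace-separated
    tokens, so that subword tokenization (which expands each word into one or
    more pieces) is unlikely to exceed the 512-piece model limit."""
    words = text.split()
    return [
        " ".join(words[i : i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]
```

Each chunk would then be passed through the bert-base-finnish-cased-v1 tokenizer and fed to the model independently.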