MikkoLipsanen committed 373cfeb (1 parent: 5afd4b8)

Update README.md

Files changed (1): README.md +15 -3
README.md CHANGED
@@ -56,8 +56,11 @@ token_classifier("'Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki vu
## Training data

Some of the entities (for instance WORK_OF_ART, LAW, MONEY) that have been annotated in the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
- dataset were filtered out from the dataset used for training the model. In addition to this dataset, OCR'd and annotated content of
- digitized documents from Finnish public administration was also used for model training. The number of entities belonging to the different
+ dataset were filtered out from the dataset used for training the model.
+
+ In addition to this dataset, OCR'd and annotated content of
+ digitized documents from Finnish public administration was also used for model training.
+ The number of entities belonging to the different
entity classes contained in training, validation and test datasets are listed below:

### Number of entity types in the data
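
For reference, the filtering step kept in this hunk's rewording can be sketched as follows. This is a minimal sketch, assuming a CoNLL-style token-per-line file with tab-separated BIO tags; the file names, the tab layout, and the exact set of filtered classes beyond the three named are assumptions, not details recorded in this commit:

```python
# Minimal sketch: rewrite labels of filtered-out entity classes to "O"
# in a CoNLL-style file (token<TAB>tag per line, blank line between sentences).
# File names and the exact filtered set are assumptions for illustration.

FILTERED_CLASSES = {"WORK_OF_ART", "LAW", "MONEY"}

def filter_entity_classes(in_path: str, out_path: str) -> None:
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                token, tag = parts
                # Strip the BIO prefix ("B-LAW" -> "LAW") before the lookup.
                if tag.split("-", 1)[-1] in FILTERED_CLASSES:
                    tag = "O"
                fout.write(f"{token}\t{tag}\n")
            else:
                fout.write(line)  # pass sentence breaks and comments through

filter_entity_classes("turku_one_train.tsv", "turku_one_train.filtered.tsv")
```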
@@ -67,6 +70,10 @@ Train|11691|30026|868|12999|7473|1184|14918|01360|1879|2068
Val|1542|4042|108|1654|879|160|1858|177|257|299
Test|1267|3698|86|1713|901|137|1843|174|233|260

+ The annotation of the data was performed as a collaboration between the National Archives of Finland
+ and the [FIN-CLARIAH](https://www.kielipankki.fi/organization/fin-clariah/) research infrastructure
+ for Social Sciences and Humanities.
+
## Training procedure

This model was trained using an NVIDIA RTX A6000 GPU with the following hyperparameters:
@@ -79,4 +86,9 @@ This model was trained using an NVIDIA RTX A6000 GPU with the following hyperparameters:
- maximum length of data sequence: 512
- patience: 2 epochs

- The training code with instructions is available [here](https://github.com/DALAI-hanke/BERT_NER).
+ In the preprocessing stage, the input texts were split into chunks with a maximum length of 300 tokens,
+ in order to avoid the tokenized chunks exceeding the maximum length of 512. Tokenization was performed
+ using the tokenizer for the [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1)
+ model.
+
+ The training code with instructions will be available soon [here](https://github.com/DALAI-hanke/BERT_NER).
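
The chunking added in this hunk can be sketched as below. It is a sketch only: whether the 300-token limit counts whitespace-separated words or tokenizer output is not stated in the commit, so words are assumed here, with subword truncation at 512 as a safety net:

```python
# Sketch of the described preprocessing: split long texts into chunks of at
# most 300 whitespace-separated words, then tokenize each chunk so that no
# tokenized sequence exceeds the model's 512-token limit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")

def chunk_words(text: str, max_words: int = 300):
    words = text.split()
    for start in range(0, len(words), max_words):
        yield " ".join(words[start:start + max_words])

document = "..."  # an OCR'd input document
for chunk in chunk_words(document):
    encoded = tokenizer(chunk, truncation=True, max_length=512)
    assert len(encoded["input_ids"]) <= 512
```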
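
On the hyperparameters above: "patience: 2 epochs" describes early stopping, and "maximum length of data sequence: 512" is applied at tokenization time. A rough mapping onto the transformers Trainer API is sketched below; the actual training code lives in the linked DALAI-hanke/BERT_NER repository and may be structured differently, and the output directory here is a placeholder (note that `eval_strategy` was named `evaluation_strategy` in older transformers releases):

```python
from transformers import EarlyStoppingCallback, TrainingArguments

# Only the two hyperparameters visible in this hunk are filled in; all other
# values keep library defaults and are NOT taken from the commit.
args = TrainingArguments(
    output_dir="finbert-ner-run",       # placeholder path
    eval_strategy="epoch",              # evaluate once per epoch, so patience counts epochs
    save_strategy="epoch",              # must match eval_strategy for load_best_model_at_end
    load_best_model_at_end=True,        # restore the best checkpoint when training stops
    metric_for_best_model="eval_loss",  # assumption: validation loss is monitored
)

# "patience: 2 epochs" -> stop after two consecutive epochs without improvement.
early_stopping = EarlyStoppingCallback(early_stopping_patience=2)

# "maximum length of data sequence: 512" is enforced when encoding the chunks:
# tokenizer(chunks, truncation=True, max_length=512)
```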