Contrary to BERT, the masking is done dynamically during pretraining (e.g., it changes at each epoch and is not fixed).
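As an illustration (not the authors' actual training code), dynamic masking is what the Hugging Face `DataCollatorForLanguageModeling` provides out of the box: mask positions are re-sampled every time a batch is assembled, so they differ from epoch to epoch. The model ID and the 15% masking probability below are the usual RoBERTa defaults, assumed for this example:

```python
# Illustrative sketch of dynamic masking with the transformers data collator;
# mask positions are drawn anew for every batch, so they change each epoch.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# The model ID and 15% masking rate are assumptions for this example.
tokenizer = AutoTokenizer.from_pretrained("Finnish-NLP/roberta-large-finnish")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```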
### Pretraining
The model was trained on a TPUv3-8 VM, sponsored by the [Google TPU Research Cloud](https://sites.research.google/trc/about/), for 520k training steps (2 epochs, batch size 512) with a sequence length of 128, and then for another 520k steps (1 epoch, batch size 64) with a sequence length of 512. The optimizer used for the 128-sequence training was AdamW; for the 512-sequence training it was Adafactor (to save memory). The learning rate was 2e-4 with \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and \\(\epsilon = 1e-6\\), warmed up for 1500 steps and decayed linearly afterwards.
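As a rough sketch (not the actual training script), the optimizer and schedule for the 128-sequence phase could be set up with optax along these lines; that the linear decay runs to zero over the remaining steps is an assumption:

```python
# Minimal optax sketch of the 128-sequence-phase optimizer described above.
import optax

PEAK_LR = 2e-4
WARMUP_STEPS = 1_500
TOTAL_STEPS = 520_000  # 2 epochs at batch size 512

# Linear warmup for 1500 steps, then linear decay (assumed to reach zero).
lr_schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, PEAK_LR, WARMUP_STEPS),
        optax.linear_schedule(PEAK_LR, 0.0, TOTAL_STEPS - WARMUP_STEPS),
    ],
    boundaries=[WARMUP_STEPS],
)

# AdamW with the hyperparameters quoted above; the 512-sequence phase
# used optax.adafactor instead, to save memory.
optimizer = optax.adamw(learning_rate=lr_schedule, b1=0.9, b2=0.98, eps=1e-6)
```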
## Evaluation results
When fine-tuned on those datasets, this model (the first row of the table) achieves the following accuracy results:
|Finnish-NLP/roberta-large-finnish |88.02 |94.53 |95.23 |74.30 |
|TurkuNLP/bert-base-finnish-cased-v1 |**88.82** |**94.90** |**95.49** |**76.07** |
To conclude, this model did not significantly improve on our previous [Finnish-NLP/roberta-large-finnish](https://huggingface.co/Finnish-NLP/roberta-large-finnish) model, and it also trails the [FinBERT (Finnish BERT)](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1) model slightly (by about 1%).
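For reference, a fine-tuning evaluation like the one above would typically start by loading the checkpoint with a classification head. This sketch assumes the standard transformers API; the model ID and label count are placeholders, not the exact benchmark setup:

```python
# Illustrative sketch of loading a checkpoint for classification fine-tuning;
# model_id and num_labels are placeholders for the actual benchmark config.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "Finnish-NLP/roberta-large-finnish"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
```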
## Team Members