aapot committed
Commit fef751f
1 Parent(s): d172bb3

Update README.md

Files changed (1):
  1. README.md +4 -3
README.md CHANGED
@@ -106,18 +106,19 @@ vocabulary size of 50,257. The inputs are sequences of 512 consecutive tokens.
 
 ### Pretraining
 
-The model was trained on TPUv3-8 VM, sponsored by the [Google TPU Research Cloud](https://sites.research.google/trc/about/), for 300k steps. The optimizer used was a second-order optimization method called [Distributed Shampoo](https://github.com/google-research/google-research/tree/master/scalable_shampoo) with learning rate 1e-4, learning rate warmup for 4000 steps and cosine decay of the learning rate after.
+The model was trained on TPUv3-8 VM, sponsored by the [Google TPU Research Cloud](https://sites.research.google/trc/about/), for 300k steps (a bit over 2 epochs, 256 batch size). The optimizer used was a second-order optimization method called [Distributed Shampoo](https://github.com/google-research/google-research/tree/master/scalable_shampoo) with learning rate 1e-4, learning rate warmup for 4000 steps and cosine decay of the learning rate after.
 
 At first, commonly used Adam optimizer was tried but there were significant issues getting the model to converge even with multiple different learning rate trials so then Adam optimizer was replaced with the Distributed Shampoo which worked a lot better.
 
 ## Evaluation results
 
-Evaluation was done using the *validation* split of the [mc4_fi_cleaned](https://huggingface.co/datasets/Finnish-NLP/mc4_fi_cleaned) dataset with [Perplexity](https://huggingface.co/course/chapter7/3#perplexity-for-language-models) (smaller score the better) as the evaluation metric. As seen from the table below, this model (the first row of the table) loses to our bigger [gpt2-medium-finnish](https://huggingface.co/Finnish-NLP/gpt2-medium-finnish) model variant.
+Evaluation was done using the *validation* split of the [mc4_fi_cleaned](https://huggingface.co/datasets/Finnish-NLP/mc4_fi_cleaned) dataset with [Perplexity](https://huggingface.co/course/chapter7/3#perplexity-for-language-models) (smaller score the better) as the evaluation metric. As seen from the table below, this model (the first row of the table) loses to our bigger model variants.
 
 | | Perplexity |
 |------------------------------------------|------------|
 |Finnish-NLP/gpt2-finnish |44.19 |
-|Finnish-NLP/gpt2-medium-finnish |**34.08** |
+|Finnish-NLP/gpt2-medium-finnish |34.08 |
+|Finnish-NLP/gpt2-large-finnish |**30.74** |
 
 ## Team Members
 
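The updated Pretraining paragraph describes a linear warmup to a peak learning rate of 1e-4 over 4,000 steps, followed by cosine decay. Below is a minimal sketch of that schedule in plain Python; the 300k-step decay horizon and the final learning rate of 0 are assumptions, and the actual run used the Distributed Shampoo optimizer linked in the paragraph rather than this standalone function.

```python
import math


def lr_schedule(step: int,
                peak_lr: float = 1e-4,
                warmup_steps: int = 4_000,
                total_steps: int = 300_000,  # assumption: decay over the full 300k-step run
                end_lr: float = 0.0) -> float:  # assumption: decay all the way to zero
    """Linear warmup to peak_lr, then cosine decay to end_lr."""
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr towards end_lr over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return end_lr + 0.5 * (peak_lr - end_lr) * (1.0 + math.cos(math.pi * progress))


# A few sample points of the schedule.
for step in (0, 2_000, 4_000, 150_000, 300_000):
    print(f"step {step:>7}: lr = {lr_schedule(step):.2e}")
```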
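Perplexity, the metric in the evaluation table, is the exponential of the model's mean cross-entropy loss on held-out text. The following is a minimal sketch using the Transformers library, scoring the Finnish-NLP/gpt2-finnish checkpoint on a single illustrative sentence; the reported numbers come from the full mc4_fi_cleaned validation split, not from one example.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Finnish-NLP/gpt2-finnish"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Illustrative sentence only; the scores in the table are computed over the
# mc4_fi_cleaned *validation* split.
text = "Helsinki on Suomen pääkaupunki."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing the input ids as labels makes the model return the mean
    # cross-entropy loss over the predicted tokens.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {math.exp(loss.item()):.2f}")
```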