Update README.md
README.md CHANGED
@@ -1,6 +1,6 @@
 This model was pretrained on the bookcorpus dataset using knowledge distillation.
 
-The particularity of this model is that even though it shares the same architecture as BERT, it has a hidden size of
+The particularity of this model is that even though it shares the same architecture as BERT, it has a hidden size of 256. Since it has 4 attention heads, the head size is 64 just as for the BERT base model.
 
 The knowledge distillation was performed using multiple loss functions.
 
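The head-size claim added in this commit can be sketched as a quick arithmetic check (the 256 and 4 come from the README text; the 768 and 12 are the standard BERT base values, assumed here for comparison):

```python
# Values stated in the updated README for this distilled model:
hidden_size = 256
num_attention_heads = 4
head_size = hidden_size // num_attention_heads  # 256 / 4 = 64

# BERT base (assumed standard config: hidden size 768, 12 heads) for comparison.
bert_base_head_size = 768 // 12  # 64

# Same per-head dimensionality despite the much smaller hidden size.
assert head_size == bert_base_head_size == 64
```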