lysandre committed
Commit 6b6560e
1 Parent(s): a1e0407

Add size details

Files changed (1): README.md +10 -0
README.md CHANGED
@@ -36,6 +36,16 @@ classifier using the features produced by the ALBERT model as inputs.
  ALBERT is particular in that it shares its layers across its Transformer. Therefore, all layers have the same weights. Using repeating layers results in a small memory footprint; however, the computational cost remains similar to a BERT-like architecture with the same number of hidden layers, as it has to iterate through the same number of (repeating) layers.
 
+ This is the second version of the base model. Version 2 differs from version 1 in its dropout rates, additional training data, and longer training time, and it achieves better results on nearly all downstream tasks.
+
+ This model has the following configuration:
+
+ - 12 repeating layers
+ - 128 embedding dimension
+ - 768 hidden dimension
+ - 12 attention heads
+ - 11M parameters
+
  ## Intended uses & limitations
 
  You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
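
The size details added in this commit can be checked against the checkpoint itself. Below is a minimal sketch, assuming this card belongs to the `albert-base-v2` checkpoint on the Hub and that the `transformers` library (with PyTorch) is installed:

```python
from transformers import AutoModel, pipeline

# Load the checkpoint and inspect its configuration; the values should
# match the size details listed in the card.
model = AutoModel.from_pretrained("albert-base-v2")
config = model.config

print(config.num_hidden_layers)    # 12 repeating layers
print(config.embedding_size)       # 128 embedding dimension
print(config.hidden_size)          # 768 hidden dimension
print(config.num_attention_heads)  # 12 attention heads

# Layer sharing: the 12 layer iterations reuse a single group of weights,
# which is why the parameter count stays around 11M.
print(config.num_hidden_groups)                            # 1 shared group
print(f"{sum(p.numel() for p in model.parameters()):,}")   # roughly 11M parameters

# Using the raw model for masked language modeling, as the card describes.
unmasker = pipeline("fill-mask", model="albert-base-v2")
print(unmasker("Hello I'm a [MASK] model."))
```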