albert/albert-xlarge-v2

@@ -36,15 +36,15 @@ classifier using the features produced by the ALBERT model as inputs.
 ALBERT is particular in that it shares its layers across its Transformer. Therefore, all layers have the same weights. Using repeating layers results in a small memory footprint; however, the computational cost remains similar to a BERT-like architecture with the same number of hidden layers, as it has to iterate through the same number of (repeating) layers.
-This is the first version of the xxlarge model. Version 2 is different from version 1 due to different dropout rates, additional training data, and longer training. It has better results in nearly all downstream tasks.
 This model has the following configuration:
-- 12 repeating layers
 - 128 embedding dimension
-- 4096 hidden dimension
-- 64 attention heads
-- 223M parameters
 ## Intended uses & limitations
@@ -62,7 +62,7 @@ You can use this model directly with a pipeline for masked language modeling:
 ```python
 >>> from transformers import pipeline
->>> unmasker = pipeline('fill-mask', model='albert-xxlarge-v1')
 >>> unmasker("Hello I'm a [MASK] model.")
 [
    {
@@ -102,8 +102,8 @@ Here is how to use this model to get the features of a given text in PyTorch:
 ```python
 from transformers import AlbertTokenizer, AlbertModel
-tokenizer = AlbertTokenizer.from_pretrained('albert-xxlarge-v1')
-model = AlbertModel.from_pretrained("albert-xxlarge-v1")
 text = "Replace me by any text you'd like."
 encoded_input = tokenizer(text, return_tensors='pt')
 output = model(**encoded_input)
@@ -113,8 +113,8 @@ and in TensorFlow:
 ```python
 from transformers import AlbertTokenizer, TFAlbertModel
-tokenizer = AlbertTokenizer.from_pretrained('albert-xxlarge-v1')
-model = TFAlbertModel.from_pretrained("albert-xxlarge-v1")
 text = "Replace me by any text you'd like."
 encoded_input = tokenizer(text, return_tensors='tf')
 output = model(encoded_input)
@@ -127,7 +127,7 @@ predictions:
 ```python
 >>> from transformers import pipeline
->>> unmasker = pipeline('fill-mask', model='albert-xxlarge-v1')
 >>> unmasker("The man worked as a [MASK].")
 [

 ALBERT is particular in that it shares its layers across its Transformer. Therefore, all layers have the same weights. Using repeating layers results in a small memory footprint; however, the computational cost remains similar to a BERT-like architecture with the same number of hidden layers, as it has to iterate through the same number of (repeating) layers.
+This is the second version of the xlarge model. Version 2 is different from version 1 due to different dropout rates, additional training data, and longer training. It has better results in nearly all downstream tasks.
 This model has the following configuration:
+- 24 repeating layers
 - 128 embedding dimension
+- 2048 hidden dimension
+- 16 attention heads
+- 58M parameters
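
As a rough sanity check on the configuration listed above, the sketch below estimates the parameter count with and without layer sharing. It is a back-of-the-envelope calculation, not the library's own accounting. The vocabulary size (30000), maximum position count (512), token-type count (2), feed-forward width (4x the hidden dimension), and the presence of a pooler layer are assumptions not stated in this card, and the function name `albert_param_estimate` is hypothetical.

```python
def albert_param_estimate(num_layers, hidden, embedding,
                          vocab=30000, max_pos=512, type_vocab=2,
                          intermediate=None, share_layers=True):
    """Estimate encoder parameters for an ALBERT-style model.

    Defaults for vocab, max_pos, type_vocab and intermediate are
    assumptions for illustration, not values taken from the model card.
    """
    if intermediate is None:
        intermediate = 4 * hidden  # assumed feed-forward width

    # Factorized embeddings: word, position, token-type tables + LayerNorm.
    emb = (vocab + max_pos + type_vocab) * embedding + 2 * embedding
    # Projection from the small embedding size up to the hidden size.
    proj = embedding * hidden + hidden

    # One Transformer layer: Q, K, V, output projections + LayerNorm.
    attn = 4 * (hidden * hidden + hidden) + 2 * hidden
    # Feed-forward block (two linear maps) + LayerNorm.
    ffn = (hidden * intermediate + intermediate
           + intermediate * hidden + hidden
           + 2 * hidden)
    layer = attn + ffn

    # Assumed pooler: one hidden-to-hidden linear layer.
    pooler = hidden * hidden + hidden

    # With sharing, only one set of layer weights is stored,
    # even though the forward pass still applies it num_layers times.
    unique_layers = 1 if share_layers else num_layers
    return emb + proj + unique_layers * layer + pooler


shared = albert_param_estimate(num_layers=24, hidden=2048, embedding=128)
unshared = albert_param_estimate(num_layers=24, hidden=2048, embedding=128,
                                 share_layers=False)
print(f"shared layers:   {shared / 1e6:.1f}M parameters")
print(f"unshared layers: {unshared / 1e6:.1f}M parameters")
```

Under these assumptions the shared estimate lands close to the 58M figure in the list, while storing 24 independent layers would need more than a billion parameters. The number of layer applications per forward pass is the same in both cases, which is why sharing reduces memory but not compute.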
 ## Intended uses & limitations
 ```python
 >>> from transformers import pipeline
+>>> unmasker = pipeline('fill-mask', model='albert-xlarge-v2')
 >>> unmasker("Hello I'm a [MASK] model.")
 [
    {
 ```python
 from transformers import AlbertTokenizer, AlbertModel
+tokenizer = AlbertTokenizer.from_pretrained('albert-xlarge-v2')
+model = AlbertModel.from_pretrained("albert-xlarge-v2")
 text = "Replace me by any text you'd like."
 encoded_input = tokenizer(text, return_tensors='pt')
 output = model(**encoded_input)
 ```python
 from transformers import AlbertTokenizer, TFAlbertModel
+tokenizer = AlbertTokenizer.from_pretrained('albert-xlarge-v2')
+model = TFAlbertModel.from_pretrained("albert-xlarge-v2")
 text = "Replace me by any text you'd like."
 encoded_input = tokenizer(text, return_tensors='tf')
 output = model(encoded_input)
 ```python
 >>> from transformers import pipeline
+>>> unmasker = pipeline('fill-mask', model='albert-xlarge-v2')
 >>> unmasker("The man worked as a [MASK].")
 [