# ALBERT XLarge v1

Pretrained model on the English language using a masked language modeling (MLM) objective. It was introduced in [this paper](https://arxiv.org/abs/1909.11942).



ALBERT is distinctive in that it shares its layers across its Transformer, so all layers have the same weights. Using repeating layers results in a small memory footprint; however, the computational cost remains similar to a BERT-like architecture with the same number of hidden layers, because it has to iterate through the same number of (repeating) layers.
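The trade-off between memory and compute can be made concrete with a rough parameter count. This is a minimal sketch, not taken from the ALBERT code base: `layer_params` is a hypothetical helper that counts the weights and biases of one standard Transformer encoder layer, and the feed-forward size of 4x the hidden size is an assumption.

```python
def layer_params(hidden: int, ffn: int) -> int:
    """Approximate weight count of one Transformer encoder layer."""
    # Query, key, value and output projections (weights + biases)
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> ffn and ffn -> hidden (weights + biases)
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Scale and bias for two layer norms
    layer_norms = 2 * (2 * hidden)
    return attention + feed_forward + layer_norms


NUM_LAYERS = 24       # repeating layers in this model
HIDDEN = 2048         # hidden dimension of this model
FFN = 4 * HIDDEN      # assumption: feed-forward size is 4x the hidden size

per_layer = layer_params(HIDDEN, FFN)

# Shared (ALBERT-style): one set of layer weights is stored and reused.
shared_stored = per_layer
# Unshared (BERT-like): every layer stores its own weights.
unshared_stored = NUM_LAYERS * per_layer

# Compute is the same in both cases: the input still passes through 24 layers.
shared_layer_calls = NUM_LAYERS
unshared_layer_calls = NUM_LAYERS

print(f"stored encoder weights, shared:   {shared_stored:,}")
print(f"stored encoder weights, unshared: {unshared_stored:,}")
print(f"layer applications per forward pass: {shared_layer_calls} vs {unshared_layer_calls}")
```

Under these assumptions the shared encoder stores 24 times fewer layer weights, while the number of layer applications per forward pass is unchanged.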

This is the first version of the xlarge model. Version 2 differs from version 1 in its dropout rates, additional training data, and longer training, and it has better results in nearly all downstream tasks.

This model has the following configuration:

- 24 repeating layers
- 128 embedding dimension
- 2048 hidden dimension
- 16 attention heads
- 58M parameters
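The gap between the 128-dimensional embeddings and the 2048-dimensional hidden states reflects ALBERT's factorized embedding parameterization: tokens are embedded in a small space and then projected up to the hidden size. The sketch below estimates the savings. The vocabulary size of 30,000 is an assumption not stated in this card, and biases and position embeddings are ignored.

```python
VOCAB_SIZE = 30_000   # assumption: not stated in this card
EMBEDDING_DIM = 128   # from the configuration above
HIDDEN_DIM = 2048     # from the configuration above

# Factorized: a vocab x embedding table plus an embedding x hidden projection.
factorized = VOCAB_SIZE * EMBEDDING_DIM + EMBEDDING_DIM * HIDDEN_DIM
# Unfactorized: a single vocab x hidden table.
unfactorized = VOCAB_SIZE * HIDDEN_DIM

print(f"factorized embedding weights:   {factorized:,}")
print(f"unfactorized embedding weights: {unfactorized:,}")
print(f"reduction factor: {unfactorized / factorized:.1f}x")
```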

## Intended uses & limitations

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='albert-xlarge-v1')
>>> unmasker("Hello I'm a [MASK] model.")
[
```

```python
from transformers import AlbertTokenizer, AlbertModel

# Load the tokenizer and the PyTorch model
tokenizer = AlbertTokenizer.from_pretrained('albert-xlarge-v1')
model = AlbertModel.from_pretrained("albert-xlarge-v1")

text = "Replace me by any text you'd like."
# Tokenize into PyTorch tensors, then run a forward pass
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

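```python
from transformers import AlbertTokenizer, TFAlbertModel

# Load the tokenizer and the TensorFlow model
tokenizer = AlbertTokenizer.from_pretrained('albert-xlarge-v1')
model = TFAlbertModel.from_pretrained("albert-xlarge-v1")

text = "Replace me by any text you'd like."
# Tokenize into TensorFlow tensors, then run a forward pass
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```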

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='albert-xlarge-v1')
>>> unmasker("The man worked as a [MASK].")
[
```
