---
tags:
- exbert
language: en
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
---

# ALBERT XXLarge v2

Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://arxiv.org/abs/1909.11942) and first released in
[this repository](https://github.com/google-research/albert). This model, like all ALBERT models, is uncased: it does not make a difference
between english and English.

Disclaimer: The team releasing ALBERT did not write a model card for this model, so this model card has been written by
the Hugging Face team.

## Model description

ALBERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it
was pretrained with two objectives:

- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs
the entire masked sentence through the model and has to predict the masked words. This is different from traditional
recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like
GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the
sentence.
- Sentence Order Prediction (SOP): ALBERT uses a pretraining loss based on predicting the ordering of two consecutive segments of text.

This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard
classifier using the features produced by the ALBERT model as inputs.
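
For instance, a minimal sketch of such a fine-tuning setup for sentence classification (the two example sentences and their labels below are made up for illustration):

```python
import torch
from transformers import AlbertTokenizer, AlbertForSequenceClassification

tokenizer = AlbertTokenizer.from_pretrained("albert-xxlarge-v2")
# A classification head is added on top of the pretrained encoder and trained during fine-tuning.
model = AlbertForSequenceClassification.from_pretrained("albert-xxlarge-v2", num_labels=2)

# Toy labeled batch; in practice these would come from your dataset.
inputs = tokenizer(["I loved this movie.", "What a waste of time."], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

outputs = model(**inputs, labels=labels)
print(outputs.loss)    # cross-entropy loss to minimize during fine-tuning
print(outputs.logits)  # one score per class for each sentence
```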

ALBERT is particular in that it shares its layers across its Transformer. Therefore, all layers have the same weights. Using repeating layers results in a small memory footprint; however, the computational cost remains similar to a BERT-like architecture with the same number of hidden layers, as it has to iterate through the same number of (repeating) layers.
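
This weight sharing is visible in the model configuration; a minimal sketch, assuming the `transformers` library is installed, is to compare the number of layer repetitions with the number of distinct parameter groups:

```python
from transformers import AlbertConfig

config = AlbertConfig.from_pretrained("albert-xxlarge-v2")
print(config.num_hidden_layers)  # 12: the shared block is applied 12 times
print(config.num_hidden_groups)  # 1: only one set of layer weights is stored
```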

This is the second version of the xxlarge model. Version 2 is different from version 1 due to different dropout rates, additional training data, and longer training. It has better results in nearly all downstream tasks.

This model has the following configuration (the sketch after the list shows one way to check these values):

- 12 repeating layers
- 128 embedding dimension
- 4096 hidden dimension
- 64 attention heads
- 223M parameters
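
These numbers can be read from the configuration; instantiating the model from the config (with random weights, no checkpoint download) is shown only as a sketch, and the parameter count is approximate:

```python
from transformers import AlbertConfig, AlbertModel

config = AlbertConfig.from_pretrained("albert-xxlarge-v2")
print(config.embedding_size, config.hidden_size, config.num_attention_heads)  # 128 4096 64

# Building the model from the config is enough to count parameters,
# which should come out to roughly 223M.
model = AlbertModel(config)
print(sum(p.numel() for p in model.parameters()))
```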

## Intended uses & limitations

You can use the raw model for either masked language modeling or sentence order prediction, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=albert) to look for
fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation, you should look at a model like GPT2.

### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='albert-xxlarge-v2')
>>> unmasker("Hello I'm a [MASK] model.")
[
  {
    "sequence":"[CLS] hello i'm a modeling model.[SEP]",
    "score":0.05816134437918663,
    "token":12807,
    "token_str":"▁modeling"
  },
  {
    "sequence":"[CLS] hello i'm a modelling model.[SEP]",
    "score":0.03748830780386925,
    "token":23089,
    "token_str":"▁modelling"
  },
  {
    "sequence":"[CLS] hello i'm a model model.[SEP]",
    "score":0.033725276589393616,
    "token":1061,
    "token_str":"▁model"
  },
  {
    "sequence":"[CLS] hello i'm a runway model.[SEP]",
    "score":0.017313428223133087,
    "token":8014,
    "token_str":"▁runway"
  },
  {
    "sequence":"[CLS] hello i'm a lingerie model.[SEP]",
    "score":0.014405295252799988,
    "token":29104,
    "token_str":"▁lingerie"
  }
]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import AlbertTokenizer, AlbertModel
tokenizer = AlbertTokenizer.from_pretrained('albert-xxlarge-v2')
model = AlbertModel.from_pretrained("albert-xxlarge-v2")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import AlbertTokenizer, TFAlbertModel
tokenizer = AlbertTokenizer.from_pretrained('albert-xxlarge-v2')
model = TFAlbertModel.from_pretrained("albert-xxlarge-v2")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

### Limitations and bias

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
predictions:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='albert-xxlarge-v2')
>>> unmasker("The man worked as a [MASK].")

[
  {
    "sequence":"[CLS] the man worked as a chauffeur.[SEP]",
    "score":0.029577180743217468,
    "token":28744,
    "token_str":"▁chauffeur"
  },
  {
    "sequence":"[CLS] the man worked as a janitor.[SEP]",
    "score":0.028865724802017212,
    "token":29477,
    "token_str":"▁janitor"
  },
  {
    "sequence":"[CLS] the man worked as a shoemaker.[SEP]",
    "score":0.02581118606030941,
    "token":29024,
    "token_str":"▁shoemaker"
  },
  {
    "sequence":"[CLS] the man worked as a blacksmith.[SEP]",
    "score":0.01849772222340107,
    "token":21238,
    "token_str":"▁blacksmith"
  },
  {
    "sequence":"[CLS] the man worked as a lawyer.[SEP]",
    "score":0.01820771023631096,
    "token":3672,
    "token_str":"▁lawyer"
  }
]

>>> unmasker("The woman worked as a [MASK].")

[
  {
    "sequence":"[CLS] the woman worked as a receptionist.[SEP]",
    "score":0.04604868218302727,
    "token":25331,
    "token_str":"▁receptionist"
  },
  {
    "sequence":"[CLS] the woman worked as a janitor.[SEP]",
    "score":0.028220869600772858,
    "token":29477,
    "token_str":"▁janitor"
  },
  {
    "sequence":"[CLS] the woman worked as a paramedic.[SEP]",
    "score":0.0261906236410141,
    "token":23386,
    "token_str":"▁paramedic"
  },
  {
    "sequence":"[CLS] the woman worked as a chauffeur.[SEP]",
    "score":0.024797942489385605,
    "token":28744,
    "token_str":"▁chauffeur"
  },
  {
    "sequence":"[CLS] the woman worked as a waitress.[SEP]",
    "score":0.024124596267938614,
    "token":13678,
    "token_str":"▁waitress"
  }
]
```

This bias will also affect all fine-tuned versions of this model.

## Training data

The ALBERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038
unpublished books and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and
headers).

## Training procedure

### Preprocessing

The texts are lowercased and tokenized using SentencePiece and a vocabulary size of 30,000. The inputs of the model are
then of the form:

```
[CLS] Sentence A [SEP] Sentence B [SEP]
```
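
As a quick sketch of this preprocessing (the example sentences are arbitrary, and the exact SentencePiece pieces depend on the learned vocabulary), the tokenizer shipped with the model reproduces this format:

```python
from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-xxlarge-v2")
print(tokenizer.vocab_size)  # 30000

# A sentence pair is lowercased, split into SentencePiece pieces and wrapped
# in the special tokens shown above.
encoded = tokenizer("Sentence A", "Sentence B")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# something like ['[CLS]', '▁sentence', '▁a', '[SEP]', '▁sentence', '▁b', '[SEP]']
```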

### Training

The ALBERT procedure follows the BERT setup.

The details of the masking procedure for each sentence are the following (a short sketch follows the list):
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the 10% remaining cases, the masked tokens are left as is.
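
A minimal sketch of this masking scheme, using the `transformers` data collator (which applies the same 15% / 80-10-10 strategy) as a stand-in for the original pretraining code:

```python
from transformers import AlbertTokenizer, DataCollatorForLanguageModeling

tokenizer = AlbertTokenizer.from_pretrained("albert-xxlarge-v2")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Collating a single tokenized sentence applies the random masking on the fly.
batch = collator([tokenizer("The quick brown fox jumps over the lazy dog.")])
print(tokenizer.decode(batch["input_ids"][0]))  # some pieces replaced by [MASK] or random tokens
print(batch["labels"][0])                       # original ids at masked positions, -100 elsewhere
```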

## Evaluation results

When fine-tuned on downstream tasks, the ALBERT models achieve the following results:

|                | Average | SQuAD1.1  | SQuAD2.0  | MNLI | SST-2 | RACE |
|----------------|---------|-----------|-----------|------|-------|------|
| V2             |         |           |           |      |       |      |
| ALBERT-base    | 82.3    | 90.2/83.2 | 82.1/79.3 | 84.6 | 92.9  | 66.8 |
| ALBERT-large   | 85.7    | 91.8/85.2 | 84.9/81.8 | 86.5 | 94.9  | 75.2 |
| ALBERT-xlarge  | 87.9    | 92.9/86.4 | 87.9/84.1 | 87.9 | 95.4  | 80.7 |
| ALBERT-xxlarge | 90.9    | 94.6/89.1 | 89.8/86.9 | 90.6 | 96.8  | 86.8 |
| V1             |         |           |           |      |       |      |
| ALBERT-base    | 80.1    | 89.3/82.3 | 80.0/77.1 | 81.6 | 90.3  | 64.0 |
| ALBERT-large   | 82.4    | 90.6/83.9 | 82.3/79.4 | 83.5 | 91.7  | 68.5 |
| ALBERT-xlarge  | 85.5    | 92.5/86.1 | 86.1/83.1 | 86.4 | 92.4  | 74.8 |
| ALBERT-xxlarge | 91.0    | 94.8/89.3 | 90.2/87.4 | 90.8 | 96.9  | 86.5 |


### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-1909-11942,
  author    = {Zhenzhong Lan and
               Mingda Chen and
               Sebastian Goodman and
               Kevin Gimpel and
               Piyush Sharma and
               Radu Soricut},
  title     = {{ALBERT:} {A} Lite {BERT} for Self-supervised Learning of Language
               Representations},
  journal   = {CoRR},
  volume    = {abs/1909.11942},
  year      = {2019},
  url       = {http://arxiv.org/abs/1909.11942},
  archivePrefix = {arXiv},
  eprint    = {1909.11942},
  timestamp = {Fri, 27 Sep 2019 13:04:21 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1909-11942.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```

<a href="https://huggingface.co/exbert/?model=albert-xxlarge-v2">
	<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>