---
tags:
- exbert
language: en
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
---

# ALBERT XXLarge v2

Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://arxiv.org/abs/1909.11942) and first released in
[this repository](https://github.com/google-research/albert). This model, like all ALBERT models, is uncased: it does not make a difference
between english and English.

Disclaimer: The team releasing ALBERT did not write a model card for this model, so this model card has been written by
the Hugging Face team.

## Model description

ALBERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it
was pretrained with two objectives:

- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs
  the entire masked sentence through the model and has to predict the masked words. This is different from traditional
  recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like
  GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the
  sentence.
- Sentence Ordering Prediction (SOP): ALBERT uses a pretraining loss based on predicting the ordering of two consecutive segments of text.

This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard
classifier using the features produced by the ALBERT model as inputs.

ALBERT is particular in that it shares its layers across its Transformer. Therefore, all layers have the same weights. Using repeating layers results in a small memory footprint; however, the computational cost remains similar to a BERT-like architecture with the same number of hidden layers, as it has to iterate through the same number of (repeating) layers.

This is the second version of the xxlarge model. Version 2 differs from version 1 due to different dropout rates, additional training data, and longer training. It has better results on nearly all downstream tasks.

This model has the following configuration:

- 12 repeating layers
- 128 embedding dimension
- 4096 hidden dimension
- 64 attention heads
- 223M parameters

## Intended uses & limitations

You can use the raw model for either masked language modeling or sentence ordering prediction, but it's mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=albert) to look for
fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at models like GPT2.
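If you do go the fine-tuning route, the snippet below is a minimal sketch of loading the checkpoint with a sequence classification head and computing a loss on one labeled example. It is not an official recipe: the example text, the label and `num_labels=2` are made-up values for illustration, and a real setup would use a full training loop or the `Trainer` API.

```python
import torch
from transformers import AlbertTokenizer, AlbertForSequenceClassification

# The checkpoint name is real; text, label and num_labels are placeholders.
tokenizer = AlbertTokenizer.from_pretrained('albert-xxlarge-v2')
model = AlbertForSequenceClassification.from_pretrained('albert-xxlarge-v2', num_labels=2)

inputs = tokenizer("Replace me by any text you'd like.", return_tensors='pt')
labels = torch.tensor([1])  # hypothetical class index for this single example

# On recent transformers versions the output exposes a .loss attribute;
# this is the quantity you would optimize during fine-tuning.
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
```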
### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='albert-xxlarge-v2')
>>> unmasker("Hello I'm a [MASK] model.")
[
   {
      "sequence":"[CLS] hello i'm a modeling model.[SEP]",
      "score":0.05816134437918663,
      "token":12807,
      "token_str":"▁modeling"
   },
   {
      "sequence":"[CLS] hello i'm a modelling model.[SEP]",
      "score":0.03748830780386925,
      "token":23089,
      "token_str":"▁modelling"
   },
   {
      "sequence":"[CLS] hello i'm a model model.[SEP]",
      "score":0.033725276589393616,
      "token":1061,
      "token_str":"▁model"
   },
   {
      "sequence":"[CLS] hello i'm a runway model.[SEP]",
      "score":0.017313428223133087,
      "token":8014,
      "token_str":"▁runway"
   },
   {
      "sequence":"[CLS] hello i'm a lingerie model.[SEP]",
      "score":0.014405295252799988,
      "token":29104,
      "token_str":"▁lingerie"
   }
]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import AlbertTokenizer, AlbertModel
tokenizer = AlbertTokenizer.from_pretrained('albert-xxlarge-v2')
model = AlbertModel.from_pretrained("albert-xxlarge-v2")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import AlbertTokenizer, TFAlbertModel
tokenizer = AlbertTokenizer.from_pretrained('albert-xxlarge-v2')
model = TFAlbertModel.from_pretrained("albert-xxlarge-v2")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

### Limitations and bias

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
predictions:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='albert-xxlarge-v2')
>>> unmasker("The man worked as a [MASK].")

[
   {
      "sequence":"[CLS] the man worked as a chauffeur.[SEP]",
      "score":0.029577180743217468,
      "token":28744,
      "token_str":"▁chauffeur"
   },
   {
      "sequence":"[CLS] the man worked as a janitor.[SEP]",
      "score":0.028865724802017212,
      "token":29477,
      "token_str":"▁janitor"
   },
   {
      "sequence":"[CLS] the man worked as a shoemaker.[SEP]",
      "score":0.02581118606030941,
      "token":29024,
      "token_str":"▁shoemaker"
   },
   {
      "sequence":"[CLS] the man worked as a blacksmith.[SEP]",
      "score":0.01849772222340107,
      "token":21238,
      "token_str":"▁blacksmith"
   },
   {
      "sequence":"[CLS] the man worked as a lawyer.[SEP]",
      "score":0.01820771023631096,
      "token":3672,
      "token_str":"▁lawyer"
   }
]

>>> unmasker("The woman worked as a [MASK].")

[
   {
      "sequence":"[CLS] the woman worked as a receptionist.[SEP]",
      "score":0.04604868218302727,
      "token":25331,
      "token_str":"▁receptionist"
   },
   {
      "sequence":"[CLS] the woman worked as a janitor.[SEP]",
      "score":0.028220869600772858,
      "token":29477,
      "token_str":"▁janitor"
   },
   {
      "sequence":"[CLS] the woman worked as a paramedic.[SEP]",
      "score":0.0261906236410141,
      "token":23386,
      "token_str":"▁paramedic"
   },
   {
      "sequence":"[CLS] the woman worked as a chauffeur.[SEP]",
      "score":0.024797942489385605,
      "token":28744,
      "token_str":"▁chauffeur"
   },
   {
      "sequence":"[CLS] the woman worked as a waitress.[SEP]",
      "score":0.024124596267938614,
      "token":13678,
      "token_str":"▁waitress"
   }
]
```

This bias will also affect all fine-tuned versions of this model.

## Training data

The ALBERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038
unpublished books, and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and
headers).

## Training procedure

### Preprocessing

The texts are lowercased and tokenized using SentencePiece with a vocabulary size of 30,000. The inputs of the model are
then of the form:

```
[CLS] Sentence A [SEP] Sentence B [SEP]
```

### Training

The ALBERT procedure follows the BERT setup.

The details of the masking procedure for each sentence are the following:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the 10% remaining cases, the masked tokens are left as is.

A minimal illustrative sketch of this masking scheme is given at the end of this card.

## Evaluation results

When fine-tuned on downstream tasks, the ALBERT models achieve the following results:

|                | Average  | SQuAD1.1 | SQuAD2.0 | MNLI     | SST-2    | RACE     |
|----------------|----------|----------|----------|----------|----------|----------|
|V2              |
|ALBERT-base     |82.3      |90.2/83.2 |82.1/79.3 |84.6      |92.9      |66.8      |
|ALBERT-large    |85.7      |91.8/85.2 |84.9/81.8 |86.5      |94.9      |75.2      |
|ALBERT-xlarge   |87.9      |92.9/86.4 |87.9/84.1 |87.9      |95.4      |80.7      |
|ALBERT-xxlarge  |90.9      |94.6/89.1 |89.8/86.9 |90.6      |96.8      |86.8      |
|V1              |
|ALBERT-base     |80.1      |89.3/82.3 |80.0/77.1 |81.6      |90.3      |64.0      |
|ALBERT-large    |82.4      |90.6/83.9 |82.3/79.4 |83.5      |91.7      |68.5      |
|ALBERT-xlarge   |85.5      |92.5/86.1 |86.1/83.1 |86.4      |92.4      |74.8      |
|ALBERT-xxlarge  |91.0      |94.8/89.3 |90.2/87.4 |90.8      |96.9      |86.5      |

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-1909-11942,
  author    = {Zhenzhong Lan and
               Mingda Chen and
               Sebastian Goodman and
               Kevin Gimpel and
               Piyush Sharma and
               Radu Soricut},
  title     = {{ALBERT:} {A} Lite {BERT} for Self-supervised Learning of Language
               Representations},
  journal   = {CoRR},
  volume    = {abs/1909.11942},
  year      = {2019},
  url       = {http://arxiv.org/abs/1909.11942},
  archivePrefix = {arXiv},
  eprint    = {1909.11942},
  timestamp = {Fri, 27 Sep 2019 13:04:21 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1909-11942.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
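As referenced from the Training section above, here is a minimal, illustrative sketch of the 15% / 80% / 10% / 10% masking scheme. This is not the original ALBERT preprocessing code; the function name and example sentence are invented for illustration, and the logic mirrors the generic masked-language-modeling data collator behaviour in the `transformers` library.

```python
import torch
from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained('albert-xxlarge-v2')

def mask_tokens(input_ids, tokenizer, mlm_probability=0.15):
    """Illustrative re-implementation of the masking scheme (not ALBERT's original code)."""
    labels = input_ids.clone()

    # Select 15% of the non-special tokens as prediction targets.
    probability_matrix = torch.full(labels.shape, mlm_probability)
    special_tokens_mask = torch.tensor(
        tokenizer.get_special_tokens_mask(labels.tolist(), already_has_special_tokens=True),
        dtype=torch.bool,
    )
    probability_matrix.masked_fill_(special_tokens_mask, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # the MLM loss is only computed on the selected tokens

    # 80% of the selected tokens become [MASK].
    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    input_ids[indices_replaced] = tokenizer.mask_token_id

    # Half of the remaining 20% (i.e. 10% overall) become a random token.
    indices_random = (
        torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
    )
    random_tokens = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
    input_ids[indices_random] = random_tokens[indices_random]

    # The final 10% of the selected tokens are left unchanged.
    return input_ids, labels

ids = tokenizer("Replace me by any text you'd like.", return_tensors='pt')['input_ids'][0]
masked_ids, mlm_labels = mask_tokens(ids, tokenizer)
print(tokenizer.decode(masked_ids))
```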