ALBERT
======
*************** Changes from Original Implementation ***************
1. Removed the sentence-order prediction task in `run_pretraining.py`.
2. Modified the `_is_start_piece_sp` function in `create_pretraining_data.py` to account for non-English languages.
***************New March 28, 2020 ***************
Added a Colab [tutorial](https://github.com/google-research/albert/blob/master/albert_glue_fine_tuning_tutorial.ipynb) for running fine-tuning on the GLUE datasets.
***************New January 7, 2020 ***************
v2 TF-Hub models should be working now with TF 1.15, as we removed the
native Einsum op from the graph. See updated TF-Hub links below.
***************New December 30, 2019 ***************
Chinese models are released. We would like to thank the [CLUE team](https://github.com/CLUEbenchmark/CLUE) for providing the training data.
- [Base](https://storage.googleapis.com/albert_models/albert_base_zh.tar.gz)
- [Large](https://storage.googleapis.com/albert_models/albert_large_zh.tar.gz)
- [Xlarge](https://storage.googleapis.com/albert_models/albert_xlarge_zh.tar.gz)
- [Xxlarge](https://storage.googleapis.com/albert_models/albert_xxlarge_zh.tar.gz)
Version 2 of ALBERT models is released.
- Base: [[Tar file](https://storage.googleapis.com/albert_models/albert_base_v2.tar.gz)] [[TF-Hub](https://tfhub.dev/google/albert_base/3)]
- Large: [[Tar file](https://storage.googleapis.com/albert_models/albert_large_v2.tar.gz)] [[TF-Hub](https://tfhub.dev/google/albert_large/3)]
- Xlarge: [[Tar file](https://storage.googleapis.com/albert_models/albert_xlarge_v2.tar.gz)] [[TF-Hub](https://tfhub.dev/google/albert_xlarge/3)]
- Xxlarge: [[Tar file](https://storage.googleapis.com/albert_models/albert_xxlarge_v2.tar.gz)] [[TF-Hub](https://tfhub.dev/google/albert_xxlarge/3)]
In this version, we apply 'no dropout', 'additional training data' and 'long training time' strategies to all models. We train ALBERT-base for 10M steps and other models for 3M steps.
The comparison of results with the v1 models is as follows:
| Models         | Average  | SQuAD1.1 | SQuAD2.0 | MNLI     | SST-2    | RACE     |
|----------------|----------|----------|----------|----------|----------|----------|
| V2             |          |          |          |          |          |          |
| ALBERT-base    | 82.3     | 90.2/83.2| 82.1/79.3| 84.6     | 92.9     | 66.8     |
| ALBERT-large   | 85.7     | 91.8/85.2| 84.9/81.8| 86.5     | 94.9     | 75.2     |
| ALBERT-xlarge  | 87.9     | 92.9/86.4| 87.9/84.1| 87.9     | 95.4     | 80.7     |
| ALBERT-xxlarge | 90.9     | 94.6/89.1| 89.8/86.9| 90.6     | 96.8     | 86.8     |
| V1             |          |          |          |          |          |          |
| ALBERT-base    | 80.1     | 89.3/82.3| 80.0/77.1| 81.6     | 90.3     | 64.0     |
| ALBERT-large   | 82.4     | 90.6/83.9| 82.3/79.4| 83.5     | 91.7     | 68.5     |
| ALBERT-xlarge  | 85.5     | 92.5/86.1| 86.1/83.1| 86.4     | 92.4     | 74.8     |
| ALBERT-xxlarge | 91.0     | 94.8/89.3| 90.2/87.4| 90.8     | 96.9     | 86.5     |
The comparison shows that for ALBERT-base, ALBERT-large, and ALBERT-xlarge, v2 is much better than v1, indicating the importance of applying the above three strategies. On average, ALBERT-xxlarge is slightly worse than v1, for two reasons: 1) training for an additional 1.5M steps (the only difference between these two models is training for 1.5M vs. 3M steps) did not lead to significant performance improvement; 2) for v1, we did a bit of hyperparameter search among the parameter sets used by BERT, RoBERTa, and XLNet, whereas for v2 we simply adopted the parameters from v1, except for RACE, where we use a learning rate of 1e-5 and 0 [ALBERT DR](https://arxiv.org/pdf/1909.11942.pdf) (dropout rate for ALBERT in fine-tuning). The original (v1) RACE hyperparameters cause model divergence for the v2 models. Given that the downstream tasks are sensitive to the fine-tuning hyperparameters, we should be careful about so-called slight improvements.
ALBERT is "A Lite" version of BERT, a popular unsupervised language
representation learning algorithm. ALBERT uses parameter-reduction techniques
that allow for large-scale configurations, overcome previous memory limitations,
and achieve better behavior with respect to model degradation.
For a technical description of the algorithm, see our paper:
[ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942)
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut
Release Notes
=============
- Initial release: 10/9/2019
Results
=======
Performance of ALBERT on the GLUE benchmark using a single-model setup on the
dev set:
| Models | MNLI | QNLI | QQP | RTE | SST | MRPC | CoLA | STS |
|-------------------|----------|----------|----------|----------|----------|----------|----------|----------|
| BERT-large | 86.6 | 92.3 | 91.3 | 70.4 | 93.2 | 88.0 | 60.6 | 90.0 |
| XLNet-large | 89.8 | 93.9 | 91.8 | 83.8 | 95.6 | 89.2 | 63.6 | 91.8 |
| RoBERTa-large | 90.2 | 94.7 | **92.2** | 86.6 | 96.4 | **90.9** | 68.0 | 92.4 |
| ALBERT (1M) | 90.4 | 95.2 | 92.0 | 88.1 | 96.8 | 90.2 | 68.7 | 92.7 |
| ALBERT (1.5M) | **90.8** | **95.3** | **92.2** | **89.2** | **96.9** | **90.9** | **71.4** | **93.0** |
Performance of ALBERT-xxlarge on the SQuAD and RACE benchmarks using a single-model setup:
|Models | SQuAD1.1 dev | SQuAD2.0 dev | SQuAD2.0 test | RACE test (Middle/High) |
|--------------------------|---------------|---------------|---------------|-------------------------|
|BERT-large | 90.9/84.1 | 81.8/79.0 | 89.1/86.3 | 72.0 (76.6/70.1) |
|XLNet | 94.5/89.0 | 88.8/86.1 | 89.1/86.3 | 81.8 (85.5/80.2) |
|RoBERTa | 94.6/88.9 | 89.4/86.5 | 89.8/86.8 | 83.2 (86.5/81.3) |
|UPM | - | - | 89.9/87.2 | - |
|XLNet + SG-Net Verifier++ | - | - | 90.1/87.2 | - |
|ALBERT (1M) | 94.8/89.2 | 89.9/87.2 | - | 86.0 (88.2/85.1) |
|ALBERT (1.5M) | **94.8/89.3** | **90.2/87.4** | **90.9/88.1** | **86.5 (89.0/85.5)** |
Pre-trained Models
==================
TF-Hub modules are available:
- Base: [[Tar file](https://storage.googleapis.com/albert_models/albert_base_v1.tar.gz)] [[TF-Hub](https://tfhub.dev/google/albert_base/1)]
- Large: [[Tar file](https://storage.googleapis.com/albert_models/albert_large_v1.tar.gz)] [[TF-Hub](https://tfhub.dev/google/albert_large/1)]
- Xlarge: [[Tar file](https://storage.googleapis.com/albert_models/albert_xlarge_v1.tar.gz)] [[TF-Hub](https://tfhub.dev/google/albert_xlarge/1)]
- Xxlarge: [[Tar file](https://storage.googleapis.com/albert_models/albert_xxlarge_v1.tar.gz)] [[TF-Hub](https://tfhub.dev/google/albert_xxlarge/1)]
Example usage of the TF-Hub module in code:
```
import tensorflow_hub as hub

# `input_ids`, `input_mask`, and `segment_ids` are int32 Tensors produced by
# the ALBERT tokenizer; `is_training` selects the training graph of the module.
tags = set()
if is_training:
  tags.add("train")
albert_module = hub.Module("https://tfhub.dev/google/albert_base/1", tags=tags,
                           trainable=True)
albert_inputs = dict(
    input_ids=input_ids,
    input_mask=input_mask,
    segment_ids=segment_ids)
albert_outputs = albert_module(
    inputs=albert_inputs,
    signature="tokens",
    as_dict=True)
# If you want to use the token-level output, use
# albert_outputs["sequence_output"] instead.
output_layer = albert_outputs["pooled_output"]
```
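For fine-tuning, the pooled output is typically fed into a small task-specific head. The following is only an illustrative sketch in the same TF1 style, not code from this repository; `num_labels` and the initializer are placeholders:
```
# Hypothetical classification head on top of the pooled output (sketch only).
import tensorflow.compat.v1 as tf

num_labels = 3  # placeholder, e.g. three classes for MNLI
logits = tf.layers.dense(
    output_layer,
    num_labels,
    kernel_initializer=tf.truncated_normal_initializer(stddev=0.02))
probabilities = tf.nn.softmax(logits, axis=-1)
```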
Most of the fine-tuning scripts in this repository support TF-hub modules
via the `--albert_hub_module_handle` flag.
Pre-training Instructions
=========================
To pretrain ALBERT, use `run_pretraining.py`:
```
pip install -r albert/requirements.txt
python -m albert.run_pretraining \
--input_file=... \
--output_dir=... \
--init_checkpoint=... \
--albert_config_file=... \
--do_train \
--do_eval \
--train_batch_size=4096 \
--eval_batch_size=64 \
--max_seq_length=512 \
--max_predictions_per_seq=20 \
--optimizer='lamb' \
--learning_rate=.00176 \
--num_train_steps=125000 \
--num_warmup_steps=3125 \
--save_checkpoints_steps=5000
```
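The `--input_file` above points to TFRecords produced by `create_pretraining_data.py` (mentioned in the changes section above). A rough invocation sketch is shown below; the flag names follow the upstream BERT-style data-generation script and are assumptions here, so check the script's own flag definitions before running it:
```
# Sketch only: flag names are assumed from the BERT-style data script;
# verify them against create_pretraining_data.py before use.
python -m albert.create_pretraining_data \
    --input_file=... \
    --output_file=... \
    --spm_model_file=... \
    --vocab_file=... \
    --max_seq_length=512 \
    --max_predictions_per_seq=20 \
    --dupe_factor=10
```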
Fine-tuning on GLUE
===================
To fine-tune and evaluate a pretrained ALBERT on GLUE, please see the
convenience script `run_glue.sh`.
Lower-level use cases may want to use the `run_classifier.py` script directly.
The `run_classifier.py` script is used both for fine-tuning and evaluation of
ALBERT on individual GLUE benchmark tasks, such as MNLI:
```
pip install -r albert/requirements.txt
python -m albert.run_classifier \
--data_dir=... \
--output_dir=... \
--init_checkpoint=... \
--albert_config_file=... \
--spm_model_file=... \
--do_train \
--do_eval \
--do_predict \
--do_lower_case \
--max_seq_length=128 \
--optimizer=adamw \
--task_name=MNLI \
--warmup_step=1000 \
--learning_rate=3e-5 \
--train_step=10000 \
--save_checkpoints_steps=100 \
--train_batch_size=128
```
Good default flag values for each GLUE task can be found in `run_glue.sh`.
You can fine-tune the model starting from TF-Hub modules instead of raw
checkpoints by setting e.g.
`--albert_hub_module_handle=https://tfhub.dev/google/albert_base/1` instead
of `--init_checkpoint`.
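For example, the MNLI command above might become the following (a sketch only; the remaining flags stay as in the full example, with `--init_checkpoint` dropped in favor of the hub handle):
```
python -m albert.run_classifier \
    --albert_hub_module_handle=https://tfhub.dev/google/albert_base/1 \
    --data_dir=... \
    --output_dir=... \
    --spm_model_file=... \
    --task_name=MNLI \
    --do_train \
    --do_eval
```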
You can find the `spm_model_file` in the tar files or under the assets folder of
the TF-Hub module; the model file is named `30k-clean.model`.
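If you want to inspect the vocabulary or tokenize text yourself, the model file can be loaded with the `sentencepiece` Python package; a minimal sketch (the file path is a placeholder):
```
import sentencepiece as spm

# Load the SentencePiece model shipped with the pretrained checkpoints.
sp = spm.SentencePieceProcessor()
sp.load("30k-clean.model")

print(sp.get_piece_size())  # vocabulary size, 30000 for the released models
print(sp.encode_as_pieces("ALBERT uses a SentencePiece vocabulary."))
```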
After evaluation, the script should report some output like this:
```
***** Eval results *****
global_step = ...
loss = ...
masked_lm_accuracy = ...
masked_lm_loss = ...
sentence_order_accuracy = ...
sentence_order_loss = ...
```
Fine-tuning on SQuAD
====================
To fine-tune and evaluate a pretrained model on SQuAD v1, use the
`run_squad_v1.py` script:
```
pip install -r albert/requirements.txt
python -m albert.run_squad_v1 \
--albert_config_file=... \
--output_dir=... \
--train_file=... \
--predict_file=... \
--train_feature_file=... \
--predict_feature_file=... \
--predict_feature_left_file=... \
--init_checkpoint=... \
--spm_model_file=... \
--do_lower_case \
--max_seq_length=384 \
--doc_stride=128 \
--max_query_length=64 \
--do_train=true \
--do_predict=true \
--train_batch_size=48 \
--predict_batch_size=8 \
--learning_rate=5e-5 \
--num_train_epochs=2.0 \
--warmup_proportion=.1 \
--save_checkpoints_steps=5000 \
--n_best_size=20 \
--max_answer_length=30
```
You can fine-tune the model starting from TF-Hub modules instead of raw
checkpoints by setting e.g.
`--albert_hub_module_handle=https://tfhub.dev/google/albert_base/1` instead
of `--init_checkpoint`.
For SQuAD v2, use the `run_squad_v2.py` script:
```
pip install -r albert/requirements.txt
python -m albert.run_squad_v2 \
--albert_config_file=... \
--output_dir=... \
--train_file=... \
--predict_file=... \
--train_feature_file=... \
--predict_feature_file=... \
--predict_feature_left_file=... \
--init_checkpoint=... \
--spm_model_file=... \
--do_lower_case \
--max_seq_length=384 \
--doc_stride=128 \
--max_query_length=64 \
--do_train \
--do_predict \
--train_batch_size=48 \
--predict_batch_size=8 \
--learning_rate=5e-5 \
--num_train_epochs=2.0 \
--warmup_proportion=.1 \
--save_checkpoints_steps=5000 \
--n_best_size=20 \
--max_answer_length=30
```
You can fine-tune the model starting from TF-Hub modules instead of raw
checkpoints by setting e.g.
`--albert_hub_module_handle=https://tfhub.dev/google/albert_base/1` instead
of `--init_checkpoint`.
Fine-tuning on RACE
===================
For RACE, use the `run_race.py` script:
```
pip install -r albert/requirements.txt
python -m albert.run_race \
--albert_config_file=... \
--output_dir=... \
--train_file=... \
--eval_file=... \
  --data_dir=... \
--init_checkpoint=... \
--spm_model_file=... \
--max_seq_length=512 \
--max_qa_length=128 \
--do_train \
--do_eval \
--train_batch_size=32 \
--eval_batch_size=8 \
--learning_rate=1e-5 \
--train_step=12000 \
--warmup_step=1000 \
--save_checkpoints_steps=100
```
You can fine-tune the model starting from TF-Hub modules instead of raw
checkpoints by setting e.g.
`--albert_hub_module_handle=https://tfhub.dev/google/albert_base/1` instead
of `--init_checkpoint`.
SentencePiece
=============
Command for generating the SentencePiece vocabulary:
```
spm_train \
  --input all.txt --model_prefix=30k-clean --vocab_size=30000 --logtostderr \
  --pad_id=0 --unk_id=1 --eos_id=-1 --bos_id=-1 \
  --control_symbols=[CLS],[SEP],[MASK] \
  --user_defined_symbols="(,),\",-,.,–,£,€" \
  --shuffle_input_sentence=true --input_sentence_size=10000000 \
  --character_coverage=0.99995 --model_type=unigram
```
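The same vocabulary can also be generated from Python via the `sentencepiece` package; a sketch that mirrors the command above (the input file `all.txt` is a placeholder, as above):
```
import sentencepiece as spm

# Mirrors the spm_train command above; adjust paths before running.
spm.SentencePieceTrainer.train(
    "--input=all.txt --model_prefix=30k-clean --vocab_size=30000 --logtostderr "
    "--pad_id=0 --unk_id=1 --eos_id=-1 --bos_id=-1 "
    "--control_symbols=[CLS],[SEP],[MASK] "
    "--user_defined_symbols=(,),\",-,.,–,£,€ "
    "--shuffle_input_sentence=true --input_sentence_size=10000000 "
    "--character_coverage=0.99995 --model_type=unigram")
```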