# Training with Quantization Noise for Extreme Model Compression ({Fan\*, Stock\*} *et al.*, 2020)
This page contains information for how to train and quantize models with Quantization Noise, for both scalar quantization like `int8` and Iterative Product Quantization.
Check out our paper [here](https://arxiv.org/abs/2004.07320).
Looking for pretrained models? They will be added shortly.
Looking for code to train vision models? We are working on open sourcing our code as part of ClassyVision. Please check back, but note that both the Scalar and Iterative Product Quantization counterparts of the `nn.Conv2d` module are already included in this release.
**Contents**:
- [Walk through the code](#walk-through-the-code)
- [Reproduce NLP Results](#looking-to-reproduce-the-nlp-results-in-the-paper)
- [Reproduce Vision Results](#looking-to-reproduce-the-vision-results-in-the-paper)
## Citation
```bibtex
@article{fan2020training,
title={Training with Quantization Noise for Extreme Model Compression},
author={Angela Fan* and Pierre Stock* and Benjamin Graham and Edouard Grave and Remi Gribonval and Herve Jegou and Armand Joulin},
year={2020},
eprint={2004.07320},
archivePrefix={arXiv},
primaryClass={cs.ML}
}
```
## Walk through the code
Training a model with Quant-Noise improves performance under subsequent inference-time quantization, because the model learns to be robust to quantization. The technique is useful for both scalar and product quantization, and across multiple domains. Below we describe how to train and quantize models, and how to integrate our code to quantize your own models.
### Scalar Quantization
Unlike [Iterative Product Quantization](#iterative-product-quantization), which gives state-of-the-art compression, this section shows that our approach also helps simple scalar quantization baselines such as int8, using on-GPU Fake Quantization.
#### Training
Scalar quantization with Quant-Noise consists of randomly quantizing a proportion `p` of the weights during training. Scalar quantization is implemented [here](https://github.com/pytorch/fairseq/tree/main/fairseq/modules/quantization/scalar) in the form of Fake Quantization, meaning that we emulate int8 on GPU by quantizing and de-quantizing both the weights and the activations. We rely on PyTorch's [quantization primitives](https://github.com/pytorch/pytorch/tree/master/torch/quantization).
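To make the idea concrete, here is a minimal NumPy sketch of Fake Quantization with scalar Quant-Noise. It is not the fairseq implementation: the function names, the min/max range estimate, and the absence of any gradient handling are assumptions made for illustration only.

```python
import numpy as np


def fake_quantize_int8(w: np.ndarray) -> np.ndarray:
    """Quantize to 256 levels and immediately de-quantize (emulated int8)."""
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0
    if scale == 0.0:  # constant tensor: nothing to quantize
        return w.copy()
    zero_point = round(-w_min / scale)
    q = np.clip(np.round(w / scale + zero_point), 0, 255)
    return ((q - zero_point) * scale).astype(w.dtype)


def scalar_quant_noise(w: np.ndarray, p: float, rng: np.random.Generator) -> np.ndarray:
    """Replace a random proportion p of the weights by their fake-quantized value."""
    mask = rng.random(w.shape) < p
    return np.where(mask, fake_quantize_int8(w), w)


rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 8)).astype(np.float32)

# Training-time noise: roughly half of the weights see quantization error.
noisy = scalar_quant_noise(weights, p=0.5, rng=rng)

# With p=1 every weight is quantized, which is what evaluation uses.
fully_quantized = scalar_quant_noise(weights, p=1.0, rng=rng)
```

With `p=0` the weights are returned unchanged, and with `p=1` the result is the fully fake-quantized tensor.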
To train a model with Quant-Noise, add the following flag:
```
--quant-noise-scalar 0.5
```
Large values of noise make the network easier to quantize but may result in higher non-quantized test and validation perplexities.
#### Quantization
When evaluating a network, all quantized modules and activation hooks automatically switch to `p=1`, so the validation accuracy reported by Fairseq is already the quantized one. There is nothing more to do.
#### Integration with your own code
Looking to quantize your own models with Quant-Noise + Scalar Quantization?
- Use the function `quantize_model_` implemented [here](https://github.com/pytorch/fairseq/tree/main/fairseq/modules/quantization/scalar/utils.py) to (1) replace all your modules by their quantized counterparts and (2) add hooks to those modules to quantize the activations.
- Then, perform your training as usual. Note that in `eval()` mode, the network is always fully quantized (weights and activations) by default (`p=1`).
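The two points above can be pictured with a toy module tree. This sketch does not use fairseq or PyTorch: `Module`, `Linear`, `QuantLinear` and `swap_linears_` are hypothetical stand-ins that only illustrate in-place module replacement and the switch to `p=1` in evaluation mode. Activation hooks are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical minimal module tree (stand-in for torch.nn.Module)."""

    children: dict = field(default_factory=dict)
    training: bool = True

    def eval(self):
        self.training = False
        for child in self.children.values():
            child.eval()
        return self


@dataclass
class Linear(Module):
    in_features: int = 0
    out_features: int = 0


@dataclass
class QuantLinear(Linear):
    p: float = 0.0  # proportion of weights quantized during training

    @property
    def effective_p(self):
        # In eval mode every weight is quantized, whatever p was used in training.
        return self.p if self.training else 1.0


def swap_linears_(module, p):
    """Replace every Linear below `module` by a QuantLinear, in place."""
    for name, child in module.children.items():
        if type(child) is Linear:
            module.children[name] = QuantLinear(
                training=child.training,
                in_features=child.in_features,
                out_features=child.out_features,
                p=p,
            )
        else:
            swap_linears_(child, p)
    return module


model = Module(children={
    "encoder": Module(children={"fc1": Linear(in_features=16, out_features=64)}),
    "out_proj": Linear(in_features=64, out_features=16),
})
swap_linears_(model, p=0.5)
p_train = model.children["encoder"].children["fc1"].effective_p
model.eval()
p_eval = model.children["encoder"].children["fc1"].effective_p
```

Here `p_train` is the noise level chosen for training, while `p_eval` is always 1.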
### Iterative Product Quantization
Iterative Product Quantization with Quant-Noise proceeds in two steps. First, a model must be trained uncompressed with Quant-Noise. Second, the model must be quantized with iPQ. Note that we implement here the simplest form of noise, which consists of randomly dropping a proportion `p` of blocks; in our experiments this worked as well as assigning those blocks to their current centroid.
#### Training
To train a model with Quant-Noise, add the following flags:
```
--quant-noise-pq 0.1 --quant-noise-pq-block-size 8
```
`quant-noise-pq` controls how much dropout is applied to the blocks of the weight matrix. `quant-noise-pq-block-size` controls the size of the weight matrix blocks.
We recommend training with a Quant-Noise value between 0.05 and 0.2, a range that worked well in our experiments. For the block size, we recommend 8. Note that `input_features` must be a multiple of the block size; see the size checks [here](https://github.com/pytorch/fairseq/tree/main/fairseq/modules/quant_noise.py). Larger block sizes give a higher compression ratio but may reduce accuracy.
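The link between block size and compression ratio can be checked with a little arithmetic. The sketch below is a back-of-the-envelope estimate, not fairseq's size accounting: it assumes float32 storage, 256 centroids per layer (so one byte per block index), and a hypothetical 4096 x 1024 weight matrix.

```python
def pq_compressed_bytes(out_features, in_features, block_size,
                        n_centroids=256, bytes_per_float=4):
    """Storage of one PQ-compressed weight matrix: block assignments plus codebook."""
    if in_features % block_size != 0:
        raise ValueError("in_features must be a multiple of block_size")
    n_blocks = out_features * in_features // block_size
    # One index per block; 256 centroids fit in exactly one byte.
    bits_per_index = (n_centroids - 1).bit_length()
    index_bytes = n_blocks * bits_per_index / 8
    codebook_bytes = n_centroids * block_size * bytes_per_float
    return index_bytes + codebook_bytes


def compression_ratio(out_features, in_features, block_size,
                      n_centroids=256, bytes_per_float=4):
    """Dense size divided by PQ-compressed size for a single matrix."""
    dense_bytes = out_features * in_features * bytes_per_float
    compressed = pq_compressed_bytes(
        out_features, in_features, block_size, n_centroids, bytes_per_float
    )
    return dense_bytes / compressed


# Larger blocks mean fewer indices to store, hence a higher ratio.
for block_size in (4, 8, 16):
    print(block_size, round(compression_ratio(4096, 1024, block_size), 1))
```

Under these assumptions, doubling the block size roughly doubles the ratio, because the index storage dominates and the codebook stays small.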
We currently support training Transformer based models, such as sequence-to-sequence, language models, and BERT architectures. The `quant_noise` function [here](https://github.com/pytorch/fairseq/tree/main/fairseq/modules/quant_noise.py) wraps a module. It splits a weight matrix into blocks and applies random dropout to these blocks.
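As a rough illustration of the block dropout described above, the following NumPy sketch zeroes a random proportion `p` of contiguous blocks in a weight matrix. It is a simplified stand-in, not the fairseq `quant_noise` function, which operates on PyTorch modules. The `block_dropout` name and the `1 / (1 - p)` rescaling, borrowed from standard dropout, are assumptions of this sketch.

```python
import numpy as np


def block_dropout(weight, p, block_size, rng):
    """Zero a random proportion p of contiguous blocks, then rescale by 1 / (1 - p)."""
    out_features, in_features = weight.shape
    if in_features % block_size != 0:
        raise ValueError("in_features must be a multiple of block_size")
    n_blocks_per_row = in_features // block_size
    # One Bernoulli draw per block, expanded to cover the block's columns.
    drop = rng.random((out_features, n_blocks_per_row)) < p
    mask = np.repeat(drop, block_size, axis=1)
    return np.where(mask, 0.0, weight) / (1.0 - p)


rng = np.random.default_rng(0)
weight = np.ones((4, 32))
noisy_weight = block_dropout(weight, p=0.1, block_size=8, rng=rng)

# Each block of 8 consecutive inputs is either kept (and rescaled) or dropped whole.
blocks = noisy_weight.reshape(4, 4, 8)
```

Dropping whole blocks, rather than individual weights, mirrors the unit that product quantization later replaces by a centroid.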
In the Transformer architectures, quant-noise is applied to the input and output embeddings, the attention, and the FFN.
Quant-Noise can also be combined with **LayerDrop** (see [here](https://github.com/pytorch/fairseq/tree/main/examples/layerdrop)) to add its pruning effect to the quantized model and make the model even smaller. We recommend training with LayerDrop 0.1 or 0.2.
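For readers unfamiliar with LayerDrop, the idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not fairseq code: the function names are hypothetical, layers are modelled as simple callables, and keeping every other layer is only one possible pruning choice.

```python
import random


def forward_with_layerdrop(x, layers, layerdrop, training, rng):
    """Apply `layers` in order, skipping each with probability `layerdrop` in training."""
    for layer in layers:
        if training and rng.random() < layerdrop:
            continue  # the whole layer is dropped for this forward pass
        x = layer(x)
    return x


def prune_every_other_layer(layers):
    """One possible inference-time pruning: keep only the even-indexed layers."""
    return layers[::2]


# Six toy "layers" that each add 1 to their input.
layers = [lambda v: v + 1 for _ in range(6)]
rng = random.Random(0)

train_out = forward_with_layerdrop(0, layers, layerdrop=0.2, training=True, rng=rng)
eval_out = forward_with_layerdrop(0, layers, layerdrop=0.2, training=False, rng=rng)
pruned_out = forward_with_layerdrop(
    0, prune_every_other_layer(layers), layerdrop=0.2, training=False, rng=rng
)
```

Because the model has learned to cope with missing layers, the pruned stack has fewer weights left to quantize.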
#### Quantization
We implement an improved version of product quantization from Stock et al., **iPQ**, described [here](https://arxiv.org/abs/1907.05686); see the code with the old API [here](https://github.com/facebookresearch/kill-the-bits). Note that we improved the iPQ API in terms of both compute speed and usability, as described below.
For the particular case of PQ, quantization is performed sequentially. We recommend first quantizing the FFNs, then the EMBs, and finally the ATTNs. Quantization is done in two sub-steps:
- First, perform `n` steps of Product Quantization (generally `n=20` is enough).
- Then, finetune the obtained centroids.
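The two sub-steps can be sketched with NumPy as follows. This is a simplified illustration using plain k-means on the weight blocks. It does not reproduce the objective or the details of iPQ or of the fairseq code, and all function names are hypothetical. The last function shows the chain-rule fact that centroid finetuning relies on: the gradient of a centroid is the sum of the gradients of the blocks assigned to it.

```python
import numpy as np


def nearest_centroid(blocks, centroids):
    """Index of the closest centroid (squared Euclidean distance) for every block."""
    dists = ((blocks[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def product_quantize(weight, block_size, n_centroids, n_steps=20, seed=0):
    """Return (centroids, assignments) from plain k-means over the weight blocks."""
    if weight.shape[1] % block_size != 0:
        raise ValueError("in_features must be a multiple of block_size")
    blocks = weight.reshape(-1, block_size)
    rng = np.random.default_rng(seed)
    # Initialise the codebook with randomly chosen blocks.
    centroids = blocks[rng.choice(len(blocks), n_centroids, replace=False)].copy()
    for _ in range(n_steps):
        assignments = nearest_centroid(blocks, centroids)
        # Update step: each non-empty centroid becomes the mean of its blocks.
        for k in range(n_centroids):
            members = blocks[assignments == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return centroids, nearest_centroid(blocks, centroids)


def reconstruct(centroids, assignments, shape):
    """Rebuild the dense matrix from the codebook and the block assignments."""
    return centroids[assignments].reshape(shape)


def centroid_gradient(block_grads, assignments, n_centroids):
    """Gradient w.r.t. each centroid: sum of the gradients of its assigned blocks."""
    grad = np.zeros((n_centroids, block_grads.shape[1]))
    np.add.at(grad, assignments, block_grads)
    return grad


rng = np.random.default_rng(0)
weight = rng.standard_normal((16, 32))
centroids, assignments = product_quantize(weight, block_size=8, n_centroids=8)
approx = reconstruct(centroids, assignments, weight.shape)
```

After quantization only `centroids` and `assignments` need to be stored, and finetuning updates `centroids` while the assignments stay fixed.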
#### Integration with your own code
Looking to quantize your own models with Quant-Noise + iPQ?
- First, wrap your modules with the module-agnostic `quant_noise` function [here](https://github.com/pytorch/fairseq/tree/main/fairseq/modules/quant_noise.py), then train your favorite model.
- Then, quantize your trained model using the code [here](https://github.com/pytorch/fairseq/tree/main/fairseq/modules/quantization/pq). This can be done *without any changes to your training loop*. Below is example code for integration.
Note that we tried our approach only on Transformers and various Convolutional Models such as EfficientNets.
```python
import logging

from fairseq.modules.quantization.pq import quantize_model_, SizeTracker

logger = logging.getLogger(__name__)

# `model`, `config` and `trainer` are assumed to come from your own training setup

# get configuration parameters
n_centroids_config = config["n_centroids"]
block_sizes_config = config["block_sizes"]
layers_to_quantize = config["layers_to_quantize"]

# size tracker for keeping track of assignments, centroids and non-compressed sizes
size_tracker = SizeTracker(model)

# Quantize model by stages
for step in range(len(layers_to_quantize)):
    # quantize model in-place
    quantized_layers = quantize_model_(
        model,
        size_tracker,
        layers_to_quantize,
        block_sizes_config,
        n_centroids_config,
        step=step,
    )
    logger.info(f"Finetuning stage {step}, quantized layers: {quantized_layers}")
    logger.info(f"{size_tracker}")

    # Don't forget to re-create/update trainer/optimizer since model parameters have changed
    optimizer = ...

    # Finetune the centroids with your usual training loop for a few epochs
    trainer.train_epoch()
```
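To picture how a staged `layers_to_quantize` list could select layers, here is a small standalone sketch. Everything in it is illustrative: the module names and the one-regular-expression-per-stage layout are assumptions, and the exact schema expected by fairseq's `quantize_model_` is not reproduced here.

```python
import re

# Hypothetical module names, loosely modelled on a Transformer decoder.
module_names = [
    "decoder.embed_tokens",
    "decoder.layers.0.self_attn.q_proj",
    "decoder.layers.0.self_attn.out_proj",
    "decoder.layers.0.fc1",
    "decoder.layers.0.fc2",
    "decoder.layers.1.self_attn.q_proj",
    "decoder.layers.1.fc1",
]

# Hypothetical stage list in the recommended order: FFN, then EMB, then ATTN.
layers_to_quantize = [
    r"decoder\.layers\.\d+\.fc[12]",
    r"decoder\.embed_tokens",
    r"decoder\.layers\.\d+\.self_attn\.(q_proj|k_proj|v_proj|out_proj)",
]


def layers_for_stage(names, patterns, step):
    """Names fully matched by the pattern of stage `step`."""
    pattern = re.compile(patterns[step])
    return [name for name in names if pattern.fullmatch(name)]


for step in range(len(layers_to_quantize)):
    print(step, layers_for_stage(module_names, layers_to_quantize, step))
```

Each stage quantizes only the layers its pattern selects; the centroids are then finetuned before the next stage starts.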
## Looking to reproduce the NLP results in the paper?
We detail below how to reproduce the state-of-the-art results reported in the paper for Quant-Noise + Iterative Product Quantization.
### Training with Quant-Noise
To **train** RoBERTa + QuantNoise, we followed the setup described [here](https://github.com/pytorch/fairseq/tree/main/examples/roberta).
The following command can be used to train a RoBERTa Base + QuantNoise model:
```bash
TOTAL_UPDATES=125000
WARMUP_UPDATES=10000
PEAK_LR=0.0005
TOKENS_PER_SAMPLE=512
MAX_POSITIONS=512
MAX_SENTENCES=16
UPDATE_FREQ=2
DATA_DIR=/path/to/data/here
fairseq-train $DATA_DIR \
--task masked_lm --criterion masked_lm --arch roberta_base \
--sample-break-mode complete \
--tokens-per-sample $TOKENS_PER_SAMPLE --max-positions $MAX_POSITIONS \
--optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-6 \
--clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $PEAK_LR \
--warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
--dropout 0.1 --attention-dropout 0.1 \
--weight-decay 0.01 \
--batch-size $MAX_SENTENCES \
--update-freq $UPDATE_FREQ --max-update $TOTAL_UPDATES \
--save-dir checkpoint/roberta \
--ddp-backend legacy_ddp --encoder-layerdrop 0.2 \
--quant-noise-pq 0.2 --quant-noise-pq-block-size 8 --untie-weights-roberta
```
To **finetune** RoBERTa + QuantNoise, we followed the setup described [here](https://github.com/pytorch/fairseq/blob/main/examples/roberta/README.glue.md).
The following command can be used to finetune a RoBERTa Base + QuantNoise model on the RTE dataset:
```bash
TOTAL_NUM_UPDATES=2036
WARMUP_UPDATES=122
LR=2e-05
NUM_CLASSES=2
MAX_SENTENCES=16
ROBERTA_PATH=/path/to/roberta_quantnoise/model.pt
fairseq-train /path/to/rte/data/ \
--restore-file $ROBERTA_PATH \
--max-positions 512 \
--batch-size $MAX_SENTENCES \
--max-tokens 4400 \
--task sentence_prediction \
--reset-optimizer --reset-dataloader --reset-meters \
--required-batch-size-multiple 1 \
--init-token 0 --separator-token 2 \
--arch roberta_base \
--criterion sentence_prediction \
--num-classes $NUM_CLASSES \
--dropout 0.1 --attention-dropout 0.1 \
--weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
--clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
--fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
--max-epoch 10 \
--find-unused-parameters \
--best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
--ddp-backend legacy_ddp \
--quant-noise-pq 0.2 --quant-noise-pq-block-size 8
```
To **train** Language Models on Wikitext-103, we followed the setup described [here](https://github.com/pytorch/fairseq/tree/main/examples/language_model).
The following command can be used to train a Transformer + QuantNoise model on Wikitext-103:
```bash
fairseq-train --task language_modeling /path/to/wikitext-103/data \
--save-dir checkpoints/transformer_wikitext-103 \
--adaptive-input --adaptive-input-cutoff 20000,60000 --adaptive-input-factor 4 \
--adaptive-softmax-cutoff 20000,60000 --adaptive-softmax-dropout 0.2 --adaptive-softmax-factor 4.0 \
--tie-adaptive-proj --tie-adaptive-weights \
--arch transformer_lm_gbw \
--attention-dropout 0.1 --dropout 0.2 --relu-dropout 0.1 \
--clip-norm 0.1 --criterion adaptive_loss \
--ddp-backend legacy_ddp \
--decoder-attention-heads 8 --decoder-embed-dim 1024 --decoder-ffn-embed-dim 4096 --decoder-input-dim 1024 \
--decoder-layers 16 --decoder-normalize-before --decoder-output-dim 1024 \
--min-lr 0.0001 --lr-period-updates 270000 --lr-scheduler cosine --lr-shrink 0.75 --lr 1.0 --t-mult 2.0 \
--max-tokens 3072 --tokens-per-sample 3072 --momentum 0.99 --optimizer nag \
--sample-break-mode none --update-freq 3 \
--warmup-init-lr 1e-07 --warmup-updates 16000 \
--weight-decay 0 --seed 1 --stop-min-lr 1e-09 \
--quant-noise-pq 0.05 --quant-noise-pq-block-size 8
```
To **evaluate** this model, note that you need to use the `fairseq-eval-lm` command. The following command can be used to evaluate:
```bash
fairseq-eval-lm /path/to/wikitext-103/data --path /path/to/model/checkpoint \
--sample-break-mode complete \
--max-tokens 3072 \
--context-window 2560 \
--softmax-batch 1024 \
--gen-subset valid
```
Change `--gen-subset` to `test` if you would like to evaluate on the test set instead.
### Iterative Product Quantization
To quantize the finetuned RoBERTa model, we use this command on 1 GPU. This should run in a day.
```bash
TOTAL_NUM_UPDATES=6108 # 2036 updates for each iteration
WARMUP_UPDATES=122
LR=2e-05
NUM_CLASSES=2
MAX_SENTENCES=16
fairseq-train --task sentence_prediction /path/to/data/ \
--restore-file $ROBERTA_PATH \
--save-dir checkpoints/roberta_finetuned \
--max-positions 512 \
--batch-size $MAX_SENTENCES \
--max-tokens 4400 \
--init-token 0 --separator-token 2 \
--arch roberta_base \
--criterion sentence_prediction \
--num-classes $NUM_CLASSES \
--dropout 0.1 --attention-dropout 0.1 \
--weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
--clip-norm 0.0 --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
--fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
--no-progress-bar --skip-invalid-size-inputs-valid-test --ddp-backend legacy_ddp \
--quantization-config-path /path/to/config/yaml
```
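The comment in the command above says that 6108 total updates correspond to 2036 updates per iteration. The short sketch below spells out that arithmetic. It assumes three sequential quantization stages (FFN, EMB, ATTN, following the recommended order) and an even split of the update budget. The helper name is hypothetical.

```python
import math


def updates_per_stage(total_updates, n_stages):
    """Split a total update budget evenly across sequential quantization stages."""
    return math.ceil(total_updates / n_stages)


stages = ["FFN", "EMB", "ATTN"]
per_stage = updates_per_stage(6108, len(stages))

# First and last update index of each stage.
schedule = {
    name: (i * per_stage + 1, (i + 1) * per_stage) for i, name in enumerate(stages)
}
for name, (first, last) in schedule.items():
    print(f"{name}: updates {first} to {last}")
```

If you quantize in more or fewer stages, scale the total number of updates accordingly so that each stage keeps enough finetuning steps.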
To quantize the trained Language Model, we use this command on 8 V100 32GB GPUs. This should run in a couple of hours.
```bash
fairseq-train --task language_modeling /path/to/wikitext-103/data \
--save-dir checkpoints/transformer_wikitext-103 \
--adaptive-input --adaptive-input-cutoff 20000,60000 --adaptive-input-factor 4 \
--adaptive-softmax-cutoff 20000,60000 --adaptive-softmax-dropout 0.2 --adaptive-softmax-factor 4.0 \
--arch transformer_lm_gbw \
--attention-dropout 0.1 --dropout 0.2 --relu-dropout 0.1 \
--bucket-cap-mb 25 --char-embedder-highway-layers 2 --character-embedding-dim 4 \
--clip-norm 0.1 --criterion adaptive_loss \
--ddp-backend legacy_ddp \
--decoder-attention-heads 8 --decoder-embed-dim 1024 --decoder-ffn-embed-dim 4096 --decoder-input-dim 1024 --decoder-layers 16 --decoder-normalize-before --decoder-output-dim 1024 \
--fp16 --keep-last-epochs -1 \
--min-lr 0.0001 --lr-period-updates 270000 --lr-scheduler cosine --lr-shrink 0.75 --lr 0.05 --stop-min-lr 1e-09 \
--max-tokens 2944 --tokens-per-sample 2944 \
--momentum 0.99 --no-epoch-checkpoints --no-progress-bar --optimizer nag --required-batch-size-multiple 8 \
--sample-break-mode none --t-mult 2.0 --skip-invalid-size-inputs-valid-test \
--tie-adaptive-proj --tie-adaptive-weights --update-freq 3 --weight-decay 0 --seed 1 \
--log-interval 100 \
--restore-file path/to/trained/lm/with/quant/noise \
--max-update 13500 --quantization-config-path /path/to/config/yaml
```
If you have less capacity or if your distributed training freezes, try reducing `--max-tokens` and `--tokens-per-sample` (this may reduce the quantized accuracy a bit).
### Remarks
We try to keep the open-sourced code as readable and as easy-to-plug as possible. Therefore, we did not test it for the following cases:
- Scalar quantization with RoBERTa.
- Quantization with iPQ and `int8` combined.
If you have trouble adapting it, we will be more than happy to help!
## Looking to reproduce the Vision results in the paper?
We are working on open sourcing our code as part of ClassyVision. Please check back.
## Having an issue or have a question?
Please open an issue in this repository with the details of your question. Thanks!