<!---
Copyright 2021 NVIDIA Corporation. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# QDQBERT base model (uncased)

## Model description
The [QDQBERT](https://huggingface.co/docs/transformers/model_doc/qdqbert) model inserts fake quantization operations (pairs of QuantizeLinear/DequantizeLinear operators) into the BERT model at (i) linear layer inputs and weights, (ii) matmul inputs, and (iii) residual add inputs.

The QDQBERT model can be loaded from any checkpoint of a HuggingFace BERT model (for example, bert-base-uncased) and used to perform Quantization Aware Training or Post Training Quantization.

In this model card, **qdqbert-base-uncased** corresponds to the **bert-base-uncased** model with QuantizeLinear/DequantizeLinear ops (**Q/DQ nodes**) inserted. In the same way, one can use the QDQBERT model for qdqbert-large-cased corresponding to bert-large-cased, and so on.
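For illustration, a minimal loading sketch, assuming a transformers version that still ships the QDQBERT architecture; as described in the sections below, the default quantizers must be set before the model is created:

```python
import pytorch_quantization.nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor
from transformers import AutoTokenizer, QDQBertModel

# Default quantizers must be set *before* instantiating the model
quant_nn.QuantLinear.set_default_quant_desc_input(QuantDescriptor(num_bits=8, calib_method="max"))
quant_nn.QuantLinear.set_default_quant_desc_weight(QuantDescriptor(num_bits=8, axis=(0,)))

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Loads the BERT checkpoint with Q/DQ nodes inserted
model = QDQBertModel.from_pretrained("bert-base-uncased")

outputs = model(**tokenizer("Hello, world!", return_tensors="pt"))
```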

## How to run QDQBERT using Transformers

### Prerequisites
QDQBERT depends on the [Pytorch Quantization Toolkit](https://github.com/NVIDIA/TensorRT/tree/main/tools/pytorch-quantization). To install it, run
```
pip install pytorch-quantization --extra-index-url https://pypi.ngc.nvidia.com
```

### Set default quantizers
The QDQBERT model inserts Q/DQ nodes into BERT via **TensorQuantizer** from the Pytorch Quantization Toolkit. **TensorQuantizer** is the module that quantizes tensors, with **QuantDescriptor** defining how a tensor should be quantized. Refer to the [Pytorch Quantization Toolkit user guide](https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/userguide.html) for more details.

Before creating the QDQBERT model, one has to set the default **QuantDescriptor**, which defines the default tensor quantizers. Example:

```python
import pytorch_quantization.nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

# The default tensor quantizer is set to use the Max calibration method
input_desc = QuantDescriptor(num_bits=8, calib_method="max")
# The default tensor quantizer is set to per-channel quantization for weights
weight_desc = QuantDescriptor(num_bits=8, axis=(0,))
quant_nn.QuantLinear.set_default_quant_desc_input(input_desc)
quant_nn.QuantLinear.set_default_quant_desc_weight(weight_desc)
```
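The toolkit also supports histogram-based calibration; a small variant sketch (the amax is then computed at calibration time rather than tracked online):

```python
# Variant: collect histograms instead of running max; the amax is computed
# later during calibration, e.g. module.load_calib_amax("percentile", percentile=99.99)
input_desc = QuantDescriptor(num_bits=8, calib_method="histogram")
quant_nn.QuantLinear.set_default_quant_desc_input(input_desc)
```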

### Calibration
Calibration is the process of passing data samples to the quantizer and deciding the best scaling factors for the tensors. After setting up the tensor quantizers, one can use the following example to calibrate the model:

```python
# Find the TensorQuantizer modules and enable calibration
for name, module in model.named_modules():
    if name.endswith('_input_quantizer'):
        module.enable_calib()
        module.disable_quant()  # Use full-precision data to calibrate

# Feed data samples
model(x)
# ...

# Finalize calibration
for name, module in model.named_modules():
    if name.endswith('_input_quantizer'):
        module.load_calib_amax()
        module.enable_quant()

# If running on GPU, call .cuda() again because new tensors
# will be created by the calibration process
model.cuda()

# Keep running the quantized model
# ...
```
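In practice, the `model(x)` step above is a loop over a small calibration split. A minimal sketch of that loop (the `calib_dataloader` name and batch keys are illustrative assumptions):

```python
import torch

model.eval()
with torch.no_grad():
    # Hypothetical DataLoader over a calibration split; forward passes only,
    # so the quantizers collect statistics without quantizing
    for i, batch in enumerate(calib_dataloader):
        model(batch["input_ids"], batch["attention_mask"], batch["token_type_ids"])
        if i >= 100:  # a few hundred samples typically suffice for calibration
            break
```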

### Export to ONNX
The goal of exporting to ONNX is to deploy inference with [TensorRT](https://developer.nvidia.com/tensorrt). Fake quantization is broken into a pair of QuantizeLinear/DequantizeLinear ONNX ops. After setting the static member of TensorQuantizer to use PyTorch's own fake quantization functions, the fake-quantized model can be exported to ONNX by following the instructions in [torch.onnx](https://pytorch.org/docs/stable/onnx.html). Example:

```python
from pytorch_quantization.nn import TensorQuantizer
TensorQuantizer.use_fb_fake_quant = True

# Load the calibrated model
...
# ONNX export
torch.onnx.export(...)
```
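A fuller sketch of the export step (the file name, dummy input shapes, and opset choice here are illustrative assumptions, not part of the original example):

```python
import torch
from pytorch_quantization.nn import TensorQuantizer

TensorQuantizer.use_fb_fake_quant = True  # export with PyTorch's fake-quantization ops

model.eval()
dummy = torch.ones(1, 128, dtype=torch.long, device="cuda")  # (batch, seq_len)
torch.onnx.export(
    model,
    (dummy, dummy, dummy),  # input_ids, attention_mask, token_type_ids
    "qdqbert.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    opset_version=13,  # opset >= 13 supports per-channel QuantizeLinear/DequantizeLinear
    do_constant_folding=True,
)
```

The exported file can then be compiled into a TensorRT engine, for example with `trtexec --onnx=qdqbert.onnx --int8`.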

## Complete example
A complete example of using the QDQBERT model to perform Quantization Aware Training and Post Training Quantization on the SQuAD task can be found at [transformers/examples/research_projects/quantization-qdqbert](https://github.com/huggingface/transformers/tree/master/examples/research_projects/quantization-qdqbert).