Transformers documentation

QDQBERT

You are viewing v4.14.1 version. A newer version v4.40.1 is available.
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

QDQBERT

Overview

The QDQBERT model can be referenced in Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.

The abstract from the paper is the following:

Quantization techniques can reduce the size of Deep Neural Networks and improve inference latency and throughput by taking advantage of high throughput integer instructions. In this paper we review the mathematical aspects of quantization parameters and evaluate their choices on a wide range of neural network models for different application domains, including vision, speech, and language. We focus on quantization techniques that are amenable to acceleration by processors with high-throughput integer math pipelines. We also present a workflow for 8-bit quantization that is able to maintain accuracy within 1% of the floating-point baseline on all networks studied, including models that are more difficult to quantize, such as MobileNets and BERT-large.

Tips:

  • QDQBERT model adds fake quantization operations (pair of QuantizeLinear/DequantizeLinear ops) to (i) linear layer inputs and weights, (ii) matmul inputs, (iii) residual add inputs, in BERT model.

  • QDQBERT requires the dependency of Pytorch Quantization Toolkit. To install pip install pytorch-quantization --extra-index-url https://pypi.ngc.nvidia.com

  • QDQBERT model can be loaded from any checkpoint of HuggingFace BERT model (for example bert-base-uncased), and perform Quantization Aware Training/Post Training Quantization.

  • A complete example of using QDQBERT model to perform Quatization Aware Training and Post Training Quantization for SQUAD task can be found at transformers/examples/research_projects/quantization-qdqbert/.

This model was contributed by shangz.

Set default quantizers

QDQBERT model adds fake quantization operations (pair of QuantizeLinear/DequantizeLinear ops) to BERT by TensorQuantizer in Pytorch Quantization Toolkit. TensorQuantizer is the module for quantizing tensors, with QuantDescriptor defining how the tensor should be quantized. Refer to Pytorch Quantization Toolkit userguide for more details.

Before creating QDQBERT model, one has to set the default QuantDescriptor defining default tensor quantizers. Example:

>>> import pytorch_quantization.nn as quant_nn
>>> from pytorch_quantization.tensor_quant import QuantDescriptor

>>> # The default tensor quantizer is set to use Max calibration method
>>> input_desc = QuantDescriptor(num_bits=8, calib_method="max")
>>> # The default tensor quantizer is set to be per-channel quantization for weights
>>> weight_desc = QuantDescriptor(num_bits=8, axis=((0,)))
>>> quant_nn.QuantLinear.set_default_quant_desc_input(input_desc)
>>> quant_nn.QuantLinear.set_default_quant_desc_weight(weight_desc)

Calibration

Calibration is the terminology of passing data samples to the quantizer and deciding the best scaling factors for tensors. After setting up the tensor quantizers, one can use the following example to calibrate the model:

>>> # Find the TensorQuantizer and enable calibration
>>> for name, module in model.named_modules():
>>>     if name.endswith('_input_quantizer'):
>>>         module.enable_calib()
>>>         module.disable_quant()  # Use full precision data to calibrate

>>> # Feeding data samples
>>> model(x)
>>> # ...

>>> # Finalize calibration
>>> for name, module in model.named_modules():
>>>     if name.endswith('_input_quantizer'):
>>>         module.load_calib_amax()
>>>         module.enable_quant()

>>> # If running on GPU, it needs to call .cuda() again because new tensors will be created by calibration process
>>> model.cuda()

>>> # Keep running the quantized model
>>> # ...

Export to ONNX

The goal of exporting to ONNX is to deploy inference by TensorRT. Fake quantization will be broken into a pair of QuantizeLinear/DequantizeLinear ONNX ops. After setting static member of TensorQuantizer to use Pytorch’s own fake quantization functions, fake quantized model can be exported to ONNX, follow the instructions in torch.onnx. Example:

>>> from pytorch_quantization.nn import TensorQuantizer
>>> TensorQuantizer.use_fb_fake_quant = True

>>> # Load the calibrated model
>>> ...
>>> # ONNX export
>>> torch.onnx.export(...)

QDQBertConfig

class transformers.QDQBertConfig < >

( vocab_size = 30522 hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout_prob = 0.1 attention_probs_dropout_prob = 0.1 max_position_embeddings = 512 type_vocab_size = 2 initializer_range = 0.02 layer_norm_eps = 1e-12 use_cache = True is_encoder_decoder = False pad_token_id = 1 bos_token_id = 0 eos_token_id = 2 **kwargs )

Parameters

  • vocab_size (int, optional, defaults to 30522) — Vocabulary size of the QDQBERT model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling QDQBertModel.
  • hidden_size (int, optional, defaults to 768) — Dimension of the encoder layers and the pooler layer.
  • num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
  • num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
  • intermediate_size (int, optional, defaults to 3072) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
  • hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.
  • hidden_dropout_prob (float, optional, defaults to 0.1) — The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
  • attention_probs_dropout_prob (float, optional, defaults to 0.1) — The dropout ratio for the attention probabilities.
  • max_position_embeddings (int, optional, defaults to 512) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
  • type_vocab_size (int, optional, defaults to 2) — The vocabulary size of the token_type_ids passed when calling QDQBertModel.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
  • layer_norm_eps (float, optional, defaults to 1e-12) — The epsilon used by the layer normalization layers.
  • use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

This is the configuration class to store the configuration of a QDQBertModel. It is used to instantiate an QDQBERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the BERT bert-base-uncased architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Examples:

>>> from transformers import QDQBertModel, QDQBertConfig

>>> # Initializing a QDQBERT bert-base-uncased style configuration
>>> configuration = QDQBertConfig()

>>> # Initializing a model from the bert-base-uncased style configuration
>>> model = QDQBertModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

QDQBertModel

QDQBertLMHeadModel

QDQBertForMaskedLM

QDQBertForSequenceClassification

QDQBertForNextSentencePrediction

QDQBertForMultipleChoice

QDQBertForTokenClassification

QDQBertForQuestionAnswering