# QDQBERT

## Overview

The QDQBERT model follows the quantization approach described in Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.

The abstract from the paper is the following:

Quantization techniques can reduce the size of Deep Neural Networks and improve inference latency and throughput by taking advantage of high throughput integer instructions. In this paper we review the mathematical aspects of quantization parameters and evaluate their choices on a wide range of neural network models for different application domains, including vision, speech, and language. We focus on quantization techniques that are amenable to acceleration by processors with high-throughput integer math pipelines. We also present a workflow for 8-bit quantization that is able to maintain accuracy within 1% of the floating-point baseline on all networks studied, including models that are more difficult to quantize, such as MobileNets and BERT-large.

Tips:

• The QDQBERT model adds fake quantization operations (pairs of QuantizeLinear/DequantizeLinear ops) to (i) linear layer inputs and weights, (ii) matmul inputs, and (iii) residual add inputs in the BERT model.

• QDQBERT depends on the Pytorch Quantization Toolkit. To install it, run pip install pytorch-quantization --extra-index-url https://pypi.ngc.nvidia.com

• The QDQBERT model can be loaded from any checkpoint of a HuggingFace BERT model (for example bert-base-uncased) and then used for Quantization Aware Training or Post Training Quantization.

• A complete example of using the QDQBERT model for Quantization Aware Training and Post Training Quantization on the SQuAD task can be found at transformers/examples/research_projects/quantization-qdqbert/.

This model was contributed by shangz.

### Set default quantizers

The QDQBERT model adds fake quantization operations (pairs of QuantizeLinear/DequantizeLinear ops) to BERT using TensorQuantizer from the Pytorch Quantization Toolkit. TensorQuantizer is the module for quantizing tensors, with QuantDescriptor defining how a tensor should be quantized. Refer to the Pytorch Quantization Toolkit user guide for more details.
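
To make the idea of a quantize/dequantize pair concrete, here is a minimal plain-Python sketch of symmetric 8-bit fake quantization. It does not use the Pytorch Quantization Toolkit: the function names `quantize`, `dequantize` and `fake_quantize`, and the choice of `amax / 127` as the scale, are assumptions made for illustration rather than the toolkit's implementation.

```python
def quantize(x, amax, num_bits=8):
    """Map a float to a signed integer level (symmetric, zero-point 0)."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    scale = amax / qmax                      # float value of one integer step
    q = round(x / scale)                     # Python rounds halves to even
    return max(-qmax, min(qmax, q)), scale   # clamp to the representable range


def dequantize(q, scale):
    """Map an integer level back to a float."""
    return q * scale


def fake_quantize(x, amax, num_bits=8):
    """Quantize then immediately dequantize: the result is still a float,
    but it is restricted to the grid of representable values."""
    q, scale = quantize(x, amax, num_bits)
    return dequantize(q, scale)


# 5.0 lies outside [-amax, amax] and is therefore clamped to about amax.
values = [0.0, 0.3, -1.27, 5.0]
print([fake_quantize(v, amax=1.27) for v in values])
```

Because the output stays in floating point, a model containing such pairs can still be trained or evaluated with ordinary float kernels while experiencing the rounding and clamping error of 8-bit inference.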

Before creating the QDQBERT model, one has to set the default QuantDescriptor, which defines the default tensor quantizers. Example:

>>> import pytorch_quantization.nn as quant_nn
>>> from pytorch_quantization.tensor_quant import QuantDescriptor

>>> # The default tensor quantizer is set to use Max calibration method
>>> input_desc = QuantDescriptor(num_bits=8, calib_method="max")
>>> # The default tensor quantizer is set to be per-channel quantization for weights
>>> weight_desc = QuantDescriptor(num_bits=8, axis=((0,)))
>>> quant_nn.QuantLinear.set_default_quant_desc_input(input_desc)
>>> quant_nn.QuantLinear.set_default_quant_desc_weight(weight_desc)
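
The `axis=((0,))` setting above requests one scaling factor per output channel (one per row of the weight matrix) instead of a single factor for the whole tensor. The plain-Python sketch below, which uses nested lists in place of tensors and hypothetical helper names, illustrates why this can matter when rows differ greatly in magnitude.

```python
def amax_per_tensor(weight):
    """One absolute maximum for the whole matrix."""
    return max(abs(v) for row in weight for v in row)


def amax_per_channel(weight):
    """One absolute maximum per row (axis 0), i.e. per output channel."""
    return [max(abs(v) for v in row) for row in weight]


# Toy 2x3 weight: the largest entry of the second row is 100x smaller
# than the largest entry of the first row.
weight = [
    [1.0, -2.0, 0.5],
    [0.012, -0.02, 0.004],
]

qmax = 127  # largest signed 8-bit level
tensor_scale = amax_per_tensor(weight) / qmax
channel_scales = [a / qmax for a in amax_per_channel(weight)]

# With a single shared scale the small row collapses onto very few integer
# levels; with its own scale it spreads over the full 8-bit range.
levels_shared = [round(v / tensor_scale) for v in weight[1]]
levels_per_channel = [round(v / channel_scales[1]) for v in weight[1]]
print(levels_shared)
print(levels_per_channel)
```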

### Calibration

Calibration refers to passing data samples through the quantizers and determining the best scaling factors for the tensors. After setting up the tensor quantizers, one can use the following example to calibrate the model:

>>> # `model` is a QDQBERT model and `x` is a batch of calibration inputs
>>> # Find the TensorQuantizer modules and enable calibration
>>> for name, module in model.named_modules():
...     if name.endswith('_input_quantizer'):
...         module.enable_calib()
...         module.disable_quant()  # Use full precision data to calibrate

>>> # Feed data samples
>>> model(x)
>>> # ...

>>> # Finalize calibration
>>> for name, module in model.named_modules():
...     if name.endswith('_input_quantizer'):
...         module.load_calib_amax()  # Load the collected ranges into the quantizer
...         module.enable_quant()

>>> # If running on GPU, call .cuda() again because the calibration process creates new tensors
>>> model.cuda()

>>> # Keep running the quantized model
>>> # ...
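
Conceptually, max calibration records the largest absolute value seen across the calibration batches and derives the scaling factor from it. The class below is a plain-Python sketch of that bookkeeping. `MaxCalibratorSketch` and its methods are hypothetical names that only loosely mirror the enable, feed and finalize steps above; they are not the toolkit's API.

```python
class MaxCalibratorSketch:
    """Track the running absolute maximum of everything it is shown."""

    def __init__(self, num_bits=8):
        self.qmax = 2 ** (num_bits - 1) - 1
        self.amax = None  # unknown until at least one batch is collected

    def collect(self, batch):
        """Update the running maximum with one batch of float values."""
        batch_max = max(abs(v) for v in batch)
        self.amax = batch_max if self.amax is None else max(self.amax, batch_max)

    def compute_scale(self):
        """Turn the collected range into the float size of one integer step."""
        if self.amax is None:
            raise RuntimeError("no calibration data was collected")
        return self.amax / self.qmax


calibrator = MaxCalibratorSketch()
for batch in ([0.1, -0.7, 0.3], [2.54, -0.2], [-1.0, 0.9]):
    calibrator.collect(batch)

print(calibrator.amax)             # largest |value| over all batches
print(calibrator.compute_scale())
```

A single outlier in the calibration data widens the range for every value, which is one reason a toolkit may offer calibration methods other than max.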

### Export to ONNX

The goal of exporting to ONNX is to deploy inference with TensorRT. Fake quantization is broken into a pair of QuantizeLinear/DequantizeLinear ONNX ops. After setting the static member use_fb_fake_quant of TensorQuantizer so that it uses Pytorch’s own fake quantization functions, the fake-quantized model can be exported to ONNX by following the instructions in torch.onnx. Example:

>>> from pytorch_quantization.nn import TensorQuantizer
>>> TensorQuantizer.use_fb_fake_quant = True

>>> # Load the calibrated model
>>> ...
>>> # ONNX export
>>> torch.onnx.export(...)

## QDQBertConfig

class transformers.QDQBertConfig

( vocab_size = 30522, hidden_size = 768, num_hidden_layers = 12, num_attention_heads = 12, intermediate_size = 3072, hidden_act = 'gelu', hidden_dropout_prob = 0.1, attention_probs_dropout_prob = 0.1, max_position_embeddings = 512, type_vocab_size = 2, initializer_range = 0.02, layer_norm_eps = 1e-12, use_cache = True, is_encoder_decoder = False, pad_token_id = 1, bos_token_id = 0, eos_token_id = 2, **kwargs )

This is the configuration class to store the configuration of a QDQBertModel. It is used to instantiate a QDQBERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the BERT bert-base-uncased architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Examples:

>>> from transformers import QDQBertModel, QDQBertConfig

>>> # Initializing a QDQBERT bert-base-uncased style configuration
>>> configuration = QDQBertConfig()

>>> # Initializing a model from the bert-base-uncased style configuration
>>> model = QDQBertModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
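
As a rough illustration of how some of the default arguments relate to each other, the sketch below mirrors a few of them in a plain dataclass and derives the per-head width. `ConfigSketch` is a hypothetical stand-in, not QDQBertConfig itself, and the divisibility check reflects an assumption about how BERT-style attention splits the hidden size evenly across heads.

```python
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    # Defaults copied from the signature above.
    hidden_size: int = 768
    num_hidden_layers: int = 12
    num_attention_heads: int = 12
    intermediate_size: int = 3072

    def head_size(self) -> int:
        """Width of each attention head when the hidden size is split evenly."""
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be a multiple of num_attention_heads")
        return self.hidden_size // self.num_attention_heads


default = ConfigSketch()
print(default.head_size())  # 768 // 12

small = ConfigSketch(hidden_size=256, num_attention_heads=4, intermediate_size=1024)
print(small.head_size())    # 256 // 4
```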

## QDQBertModel

class transformers.QDQBertModel

( config, add_pooling_layer = True )

The bare QDQBERT Model transformer outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of cross-attention is added between the self-attention layers, following the architecture described in Attention is all you need by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.

To behave as a decoder, the model needs to be initialized with the is_decoder argument of the configuration set to True. To be used in a Seq2Seq model, the model needs to be initialized with both the is_decoder argument and add_cross_attention set to True; an encoder_hidden_states is then expected as an input to the forward pass.

forward

( input_ids = None, attention_mask = None, token_type_ids = None, position_ids = None, head_mask = None, inputs_embeds = None, encoder_hidden_states = None, encoder_attention_mask = None, past_key_values = None, use_cache = None, output_attentions = None, output_hidden_states = None, return_dict = None ) -> BaseModelOutputWithPoolingAndCrossAttentions or tuple(torch.FloatTensor)

The QDQBertModel forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import BertTokenizer, QDQBertModel
>>> import torch

>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = QDQBertModel.from_pretrained('bert-base-uncased')

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)

>>> last_hidden_states = outputs.last_hidden_state

## QDQBertLMHeadModel

class transformers.QDQBertLMHeadModel

( config )

QDQBERT Model with a language modeling head on top for CLM fine-tuning.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_ids = None, attention_mask = None, token_type_ids = None, position_ids = None, head_mask = None, inputs_embeds = None, encoder_hidden_states = None, encoder_attention_mask = None, labels = None, past_key_values = None, use_cache = None, output_attentions = None, output_hidden_states = None, return_dict = None ) -> CausalLMOutputWithCrossAttentions or tuple(torch.FloatTensor)

The QDQBertLMHeadModel forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import BertTokenizer, QDQBertLMHeadModel, QDQBertConfig
>>> import torch

>>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
>>> config = QDQBertConfig.from_pretrained("bert-base-cased")
>>> config.is_decoder = True
>>> model = QDQBertLMHeadModel.from_pretrained('bert-base-cased', config=config)

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)

>>> prediction_logits = outputs.logits

## QDQBertForMaskedLM

class transformers.QDQBertForMaskedLM

( config )

QDQBERT Model with a language modeling head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_ids = None, attention_mask = None, token_type_ids = None, position_ids = None, head_mask = None, inputs_embeds = None, encoder_hidden_states = None, encoder_attention_mask = None, labels = None, output_attentions = None, output_hidden_states = None, return_dict = None ) -> MaskedLMOutput or tuple(torch.FloatTensor)

The QDQBertForMaskedLM forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import BertTokenizer, QDQBertForMaskedLM
>>> import torch

>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = QDQBertForMaskedLM.from_pretrained('bert-base-uncased')

>>> inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
>>> labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]

>>> outputs = model(**inputs, labels=labels)
>>> loss = outputs.loss
>>> logits = outputs.logits

## QDQBertForSequenceClassification

class transformers.QDQBertForSequenceClassification

( config )

Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_ids = None, attention_mask = None, token_type_ids = None, position_ids = None, head_mask = None, inputs_embeds = None, labels = None, output_attentions = None, output_hidden_states = None, return_dict = None ) -> SequenceClassifierOutput or tuple(torch.FloatTensor)

The QDQBertForSequenceClassification forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example of single-label classification:

>>> from transformers import BertTokenizer, QDQBertForSequenceClassification
>>> import torch

>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = QDQBertForSequenceClassification.from_pretrained('bert-base-uncased')

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
>>> outputs = model(**inputs, labels=labels)
>>> loss = outputs.loss
>>> logits = outputs.logits

Example of multi-label classification:

>>> from transformers import BertTokenizer, QDQBertForSequenceClassification
>>> import torch

>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = QDQBertForSequenceClassification.from_pretrained('bert-base-uncased', problem_type="multi_label_classification")

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> labels = torch.tensor([[1, 1]], dtype=torch.float) # need dtype=float for BCEWithLogitsLoss
>>> outputs = model(**inputs, labels=labels)
>>> loss = outputs.loss
>>> logits = outputs.logits
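
The two examples differ mainly in the loss applied to the logits: single-label classification treats the logits as one softmax distribution over classes, while multi-label classification scores each class independently with a sigmoid, which is why its labels must be floats. A plain-Python sketch of both losses for a single example follows. The function names are hypothetical, and the formulas are the standard cross-entropy and binary cross-entropy with logits rather than code taken from the library.

```python
import math


def cross_entropy(logits, label):
    """Single-label loss: negative log-softmax of the true class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def bce_with_logits(logits, targets):
    """Multi-label loss: mean binary cross-entropy over independent classes."""
    losses = []
    for z, t in zip(logits, targets):
        # Stable form of -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))]
        losses.append(max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z))))
    return sum(losses) / len(losses)


logits = [0.0, 0.0]                          # an uninformative two-class head
print(cross_entropy(logits, label=1))        # log(2)
print(bce_with_logits(logits, [1.0, 1.0]))   # also log(2) for zero logits
```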

## QDQBertForNextSentencePrediction

class transformers.QDQBertForNextSentencePrediction

( config )

Bert Model with a next sentence prediction (classification) head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_ids = None, attention_mask = None, token_type_ids = None, position_ids = None, head_mask = None, inputs_embeds = None, labels = None, output_attentions = None, output_hidden_states = None, return_dict = None, **kwargs ) -> NextSentencePredictorOutput or tuple(torch.FloatTensor)

The QDQBertForNextSentencePrediction forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import BertTokenizer, QDQBertForNextSentencePrediction
>>> import torch

>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = QDQBertForNextSentencePrediction.from_pretrained('bert-base-uncased')

>>> prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
>>> next_sentence = "The sky is blue due to the shorter wavelength of blue light."
>>> encoding = tokenizer(prompt, next_sentence, return_tensors='pt')

>>> outputs = model(**encoding, labels=torch.LongTensor([1]))
>>> logits = outputs.logits
>>> assert logits[0, 0] < logits[0, 1] # next sentence was random

## QDQBertForMultipleChoice

class transformers.QDQBertForMultipleChoice

( config )

Bert Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a softmax) e.g. for RocStories/SWAG tasks.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_ids = None, attention_mask = None, token_type_ids = None, position_ids = None, head_mask = None, inputs_embeds = None, labels = None, output_attentions = None, output_hidden_states = None, return_dict = None ) -> MultipleChoiceModelOutput or tuple(torch.FloatTensor)

The QDQBertForMultipleChoice forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import BertTokenizer, QDQBertForMultipleChoice
>>> import torch

>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = QDQBertForMultipleChoice.from_pretrained('bert-base-uncased')

>>> prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
>>> choice0 = "It is eaten with a fork and a knife."
>>> choice1 = "It is eaten while held in the hand."
>>> labels = torch.tensor(0).unsqueeze(0)  # choice0 is correct (according to Wikipedia ;)), batch size 1

>>> encoding = tokenizer([prompt, prompt], [choice0, choice1], return_tensors='pt', padding=True)
>>> outputs = model(**{k: v.unsqueeze(0) for k,v in encoding.items()}, labels=labels)  # batch size is 1

>>> # the linear classifier still needs to be trained
>>> loss = outputs.loss
>>> logits = outputs.logits

## QDQBertForTokenClassification

class transformers.QDQBertForTokenClassification

( config )

QDQBERT Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_ids = None, attention_mask = None, token_type_ids = None, position_ids = None, head_mask = None, inputs_embeds = None, labels = None, output_attentions = None, output_hidden_states = None, return_dict = None ) -> TokenClassifierOutput or tuple(torch.FloatTensor)

The QDQBertForTokenClassification forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import BertTokenizer, QDQBertForTokenClassification
>>> import torch

>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = QDQBertForTokenClassification.from_pretrained('bert-base-uncased')

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> labels = torch.tensor([1] * inputs["input_ids"].size(1)).unsqueeze(0)  # Batch size 1

>>> outputs = model(**inputs, labels=labels)
>>> loss = outputs.loss
>>> logits = outputs.logits

## QDQBertForQuestionAnswering

class transformers.QDQBertForQuestionAnswering

( config )

QDQBERT Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_ids = None, attention_mask = None, token_type_ids = None, position_ids = None, head_mask = None, inputs_embeds = None, start_positions = None, end_positions = None, output_attentions = None, output_hidden_states = None, return_dict = None ) -> QuestionAnsweringModelOutput or tuple(torch.FloatTensor)

The QDQBertForQuestionAnswering forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import BertTokenizer, QDQBertForQuestionAnswering
>>> import torch

>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = QDQBertForQuestionAnswering.from_pretrained('bert-base-uncased')

>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
>>> inputs = tokenizer(question, text, return_tensors='pt')
>>> start_positions = torch.tensor([1])
>>> end_positions = torch.tensor([3])

>>> outputs = model(**inputs, start_positions=start_positions, end_positions=end_positions)
>>> loss = outputs.loss
>>> start_scores = outputs.start_logits
>>> end_scores = outputs.end_logits
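
At inference time, the start and end logits are typically turned into an answer by choosing the highest-scoring pair of positions with start <= end. The plain-Python sketch below illustrates one common way to do this on lists of floats. `best_span`, the `max_answer_len` limit and the toy logits are assumptions for illustration and are not part of the QDQBERT API.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logit + end_logit
    subject to start <= end < start + max_answer_len."""
    best = None
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Toy logits for a 6-token sequence. Taking each argmax independently would
# give start=2 and end=1, which is not a valid span.
start_logits = [0.1, 0.2, 3.0, 0.3, 0.1, 0.0]
end_logits = [0.0, 2.5, 0.1, 0.4, 2.0, 0.1]

start, end, score = best_span(start_logits, end_logits)
print(start, end, score)
```

The token ids between the chosen start and end positions (inclusive) can then be decoded with the tokenizer to recover the answer text.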