Transformers documentation


You are viewing v4.19.2 version. A newer version v4.42.0 is available.


There is a growing field of study concerned with investigating the inner workings of large-scale transformers like BERT (which some call “BERTology”). Some good examples of this field are:

To help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models that give access to their inner representations, mainly adapted from the great work of Paul Michel:

  • accessing all the hidden-states of BERT/GPT/GPT-2,
  • accessing all the attention weights for each head of BERT/GPT/GPT-2,
  • retrieving head output values and gradients, so that head importance scores can be computed and heads pruned, as explained in
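
The first two features in the list above can be illustrated with a short sketch. It assumes PyTorch is installed and, to avoid downloading weights, builds a tiny randomly initialized BERT; the configuration sizes and the random input are arbitrary illustration values, not part of the original page. The `attn_implementation="eager"` setting is an extra assumption, added because more recent releases may otherwise pick an attention kernel that does not return attention weights.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny, randomly initialized BERT so the sketch runs without a download.
# All sizes here are arbitrary illustration values, not a real checkpoint.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=64,
    attn_implementation="eager",  # assumption: only matters on newer releases
)
model = BertModel(config)
model.eval()

# Random token ids of shape (batch_size, sequence_length).
input_ids = torch.randint(0, config.vocab_size, (1, 6))

with torch.no_grad():
    outputs = model(
        input_ids,
        output_hidden_states=True,  # return every layer's hidden states
        output_attentions=True,     # return every layer's attention weights
    )

# Embedding output plus one tensor per layer,
# each of shape (batch_size, sequence_length, hidden_size).
hidden_states = outputs.hidden_states
print(len(hidden_states), tuple(hidden_states[-1].shape))  # 3 (1, 6, 32)

# One tensor per layer,
# each of shape (batch_size, num_heads, sequence_length, sequence_length).
attentions = outputs.attentions
print(len(attentions), tuple(attentions[0].shape))  # 2 (1, 4, 6, 6)
```

With a pretrained checkpoint the same two keyword arguments should apply; only the model construction would change.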

To help you understand and use these features, we have added a specific example script, which extracts information from and prunes a model pre-trained on GLUE.
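
The example script itself is not reproduced here. As a rough sketch of the general idea only, the snippet below scores heads by the gradient of the loss with respect to a head mask and then prunes the lowest-scoring head in each layer. The tiny random model, the random batch and labels, and the one-head-per-layer pruning rule are all assumptions made for illustration; a real run would accumulate scores over a dataset with a fine-tuned model. It is written against the 4.x API this page documents (`head_mask`, `prune_heads`).

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny random classifier standing in for a model fine-tuned on a GLUE task.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=64,
    num_labels=2,
    attn_implementation="eager",  # assumption: only matters on newer releases
)
model = BertForSequenceClassification(config)
model.eval()  # disable dropout so the scores are deterministic for this batch

# Hypothetical batch: random token ids and random binary labels.
input_ids = torch.randint(0, config.vocab_size, (8, 6))
labels = torch.randint(0, 2, (8,))

# One mask entry per (layer, head); all ones leaves the forward pass unchanged.
head_mask = torch.ones(
    config.num_hidden_layers, config.num_attention_heads, requires_grad=True
)

outputs = model(input_ids, head_mask=head_mask, labels=labels)
outputs.loss.backward()

# Importance proxy: magnitude of the loss gradient with respect to each mask entry.
head_importance = head_mask.grad.abs()

# Illustrative rule: drop the single least important head in every layer.
heads_to_prune = {
    layer: [int(head_importance[layer].argmin())]
    for layer in range(config.num_hidden_layers)
}
model.prune_heads(heads_to_prune)

remaining = [
    layer.attention.self.num_attention_heads for layer in model.bert.encoder.layer
]
print(remaining)  # [3, 3]
```

Pruning physically removes the corresponding rows and columns from the attention projections, so the pruned model can be used for inference exactly as before, with fewer parameters.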