transformers documentation




The TrOCR model was proposed in TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to perform optical character recognition (OCR).

Please refer to the VisionEncoderDecoder class on how to use this model.

This model was contributed by Niels Rogge.

The original code can be found here.



TrOCR’s VisionEncoderDecoderModel model accepts images as input and makes use of generate() to autoregressively generate text given the input image.
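The autoregressive loop inside generate() can be sketched in plain Python. The `next_token` function and token ids below are hypothetical stand-ins for the decoder and its language-modeling head; the real model also conditions on the encoder's image features at every step.

```python
# Toy sketch of greedy autoregressive decoding, the default strategy of
# generate(). `next_token` is a hypothetical stand-in for the decoder +
# LM head; the real model also attends to the encoded image.
BOS, EOS = 0, 2

def next_token(prefix):
    # Dummy scoring rule: emit tokens 10, 11, 12, then end-of-sequence.
    return 10 + len(prefix) - 1 if len(prefix) < 4 else EOS

def greedy_generate(max_length=20):
    ids = [BOS]
    while len(ids) < max_length:
        tok = next_token(ids)
        ids.append(tok)
        if tok == EOS:
            break
    return ids

print(greedy_generate())  # [0, 10, 11, 12, 2]
```

The loop always starts from the decoder start token and stops at end-of-sequence or `max_length`, which is exactly the contract the generated ids are decoded against.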

The ViTFeatureExtractor class is responsible for preprocessing the input image and RobertaTokenizer decodes the generated target tokens to the target string. The TrOCRProcessor wraps ViTFeatureExtractor and RobertaTokenizer into a single instance to both extract the input features and decode the predicted token ids.

  • Step-by-step Optical Character Recognition (OCR)
>>> from transformers import TrOCRProcessor, VisionEncoderDecoderModel
>>> import requests
>>> from PIL import Image

>>> processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
>>> model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

>>> # load image from the IAM dataset
>>> url = ""
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

>>> pixel_values = processor(image, return_tensors="pt").pixel_values
>>> generated_ids = model.generate(pixel_values)

>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

See the model hub to look for TrOCR checkpoints.


class transformers.TrOCRConfig

( vocab_size = 50265 d_model = 1024 decoder_layers = 12 decoder_attention_heads = 16 decoder_ffn_dim = 4096 activation_function = 'gelu' max_position_embeddings = 512 dropout = 0.1 attention_dropout = 0.0 activation_dropout = 0.0 decoder_start_token_id = 2 classifier_dropout = 0.0 init_std = 0.02 decoder_layerdrop = 0.0 use_cache = False scale_embedding = False use_learned_position_embeddings = True layernorm_embedding = True pad_token_id = 1 bos_token_id = 0 eos_token_id = 2 **kwargs )

This is the configuration class to store the configuration of a TrOCRForCausalLM. It is used to instantiate a TrOCR model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the TrOCR microsoft/trocr-base architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.


>>> from transformers import TrOCRForCausalLM, TrOCRConfig

>>> # Initializing a TrOCR-base style configuration
>>> configuration = TrOCRConfig()

>>> # Initializing a model from the TrOCR-base style configuration
>>> model = TrOCRForCausalLM(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
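Configuration classes follow a defaults-plus-overrides pattern: every argument in the signature above has a default, and any keyword you pass replaces it, with unrecognized keywords stored as attributes via **kwargs. A minimal self-contained sketch of that pattern (MiniConfig is a hypothetical stand-in, not the real PretrainedConfig):

```python
# Minimal sketch of the defaults-plus-overrides pattern used by
# configuration classes. MiniConfig is hypothetical; it only mirrors
# how keyword arguments override the signature defaults.
class MiniConfig:
    def __init__(self, vocab_size=50265, d_model=1024, decoder_layers=12, **kwargs):
        self.vocab_size = vocab_size
        self.d_model = d_model
        self.decoder_layers = decoder_layers
        # Extra keyword arguments become attributes too, as with the
        # **kwargs in the signature above.
        for key, value in kwargs.items():
            setattr(self, key, value)

config = MiniConfig(d_model=512, is_decoder=True)
print(config.vocab_size, config.d_model, config.is_decoder)  # 50265 512 True
```

Passing the same keywords to TrOCRConfig (e.g. `TrOCRConfig(d_model=512)`) customizes the architecture the model is built with.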


class transformers.TrOCRProcessor

( feature_extractor tokenizer )

Constructs a TrOCR processor which wraps a vision feature extractor and a TrOCR tokenizer into a single processor.

TrOCRProcessor offers all the functionalities of AutoFeatureExtractor and RobertaTokenizer. See the call() and decode() for more information.

__call__

( *args **kwargs )

When used in normal mode, this method forwards all its arguments to AutoFeatureExtractor’s __call__() and returns its output. When used within the as_target_processor() context, this method forwards all its arguments to TrOCRTokenizer’s __call__. Please refer to the docstrings of the above two methods for more information.

from_pretrained

( pretrained_model_name_or_path **kwargs )

Instantiate a TrOCRProcessor from a pretrained TrOCR processor.

This class method is simply calling AutoFeatureExtractor’s from_pretrained and TrOCRTokenizer’s from_pretrained. Please refer to the docstrings of the methods above for more information.

save_pretrained

( save_directory )

Save a TrOCR feature extractor object and TrOCR tokenizer object to the directory save_directory, so that it can be re-loaded using the from_pretrained() class method.

This class method simply calls the feature extractor’s save_pretrained and the tokenizer’s save_pretrained. Please refer to the docstrings of the methods above for more information.

batch_decode

( *args **kwargs )

This method forwards all its arguments to TrOCRTokenizer’s batch_decode(). Please refer to the docstring of this method for more information.

decode

( *args **kwargs )

This method forwards all its arguments to TrOCRTokenizer’s decode(). Please refer to the docstring of this method for more information.

as_target_processor

( )

Temporarily sets the tokenizer for processing the input. Useful for encoding the labels when fine-tuning TrOCR.
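Under the hood this is a context manager that temporarily routes the processor's __call__ to the tokenizer and restores the feature extractor on exit. A self-contained sketch of that mechanism (MiniProcessor and its two toy callables are hypothetical stand-ins for TrOCRProcessor with ViTFeatureExtractor and RobertaTokenizer):

```python
from contextlib import contextmanager

class MiniProcessor:
    """Hypothetical sketch of the processor's target-mode switch."""

    def __init__(self, feature_extractor, tokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer
        self.current_processor = feature_extractor

    def __call__(self, *args, **kwargs):
        # In normal mode this hits the feature extractor; in target
        # mode it hits the tokenizer.
        return self.current_processor(*args, **kwargs)

    @contextmanager
    def as_target_processor(self):
        # Temporarily route calls to the tokenizer, then restore.
        self.current_processor = self.tokenizer
        try:
            yield
        finally:
            self.current_processor = self.feature_extractor

# Toy callables standing in for the feature extractor and tokenizer.
processor = MiniProcessor(lambda x: f"pixels({x})", lambda x: f"ids({x})")
print(processor("img"))             # pixels(img)
with processor.as_target_processor():
    print(processor("label text"))  # ids(label text)
print(processor("img"))             # pixels(img)
```

This is why, when fine-tuning, images are encoded with a plain `processor(...)` call while label strings are encoded inside a `with processor.as_target_processor():` block.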


class transformers.TrOCRForCausalLM

( config )

The TrOCR Decoder with a language modeling head. Can be used as the decoder part of EncoderDecoderModel and VisionEncoderDecoderModel. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_ids = None attention_mask = None encoder_hidden_states = None encoder_attention_mask = None head_mask = None cross_attn_head_mask = None past_key_values = None inputs_embeds = None labels = None use_cache = None output_attentions = None output_hidden_states = None return_dict = None ) ⟶ CausalLMOutputWithCrossAttentions or tuple(torch.FloatTensor)


>>> from transformers import VisionEncoderDecoderModel, TrOCRForCausalLM, ViTModel, TrOCRConfig, ViTConfig

>>> encoder = ViTModel(ViTConfig())
>>> decoder = TrOCRForCausalLM(TrOCRConfig())

>>> # init vision2text model
>>> model = VisionEncoderDecoderModel(encoder=encoder, decoder=decoder)