You are viewing main version, which requires installation from source. If you'd like regular pip install, checkout the latest stable version (v4.57.3).

Image Processor

An image processor is in charge of loading images (optionally), preparing input features for vision models, and post-processing their outputs. This includes transformations such as resizing, normalization, and conversion to PyTorch and NumPy tensors. It may also include model-specific post-processing such as converting logits to segmentation masks.

Fast image processors are available for a few models, and more will be added in the future. They are based on the torchvision library and provide a significant speed-up, especially when processing on GPU. They have the same API as the base image processors and can be used as drop-in replacements. To use a fast image processor, install the torchvision library and set the use_fast argument to True when instantiating the image processor:

from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)

Note that use_fast will be set to True by default in a future release.

When using a fast image processor, you can also set the device argument to specify the device on which the processing should be done. By default, the processing is done on the same device as the inputs if the inputs are tensors, or on the CPU otherwise.

from torchvision.io import read_image
from transformers import DetrImageProcessorFast

images = read_image("image.jpg")
processor = DetrImageProcessorFast.from_pretrained("facebook/detr-resnet-50")
images_processed = processor(images, return_tensors="pt", device="cuda")

Here are some speed comparisons between the base and fast image processors for the DETR and RT-DETR models, and how they impact overall inference time:

These benchmarks were run on an AWS EC2 g5.2xlarge instance, utilizing an NVIDIA A10G Tensor Core GPU.

ImageProcessingMixin

class transformers.ImageProcessingMixin

( **kwargs )

This is an image processor mixin used to provide saving/loading functionality for sequential and image feature extractors.

from_pretrained

( pretrained_model_name_or_path: typing.Union[str, os.PathLike] cache_dir: typing.Union[str, os.PathLike, NoneType] = None force_download: bool = False local_files_only: bool = False token: typing.Union[str, bool, NoneType] = None revision: str = 'main' **kwargs )

Parameters

pretrained_model_name_or_path (str or os.PathLike) — This can be either:
- a string, the model id of a pretrained image_processor hosted inside a model repo on huggingface.co.
- a path to a directory containing an image processor file saved using the save_pretrained() method, e.g., ./my_model_directory/.
- a path or url to a saved image processor JSON file, e.g., ./my_model_directory/preprocessor_config.json.
cache_dir (str or os.PathLike, optional) — Path to a directory in which a downloaded pretrained model image processor should be cached if the standard cache should not be used.
force_download (bool, optional, defaults to False) — Whether or not to force to (re-)download the image processor files and override the cached versions if they exist.
proxies (dict[str, str], optional) — A dictionary of proxy servers to use by protocol or endpoint, e.g., {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. The proxies are used on each request.
token (str or bool, optional) — The token to use as HTTP bearer authorization for remote files. If True, or not specified, will use the token generated when running hf auth login (stored in ~/.huggingface).
revision (str, optional, defaults to "main") — The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so revision can be any identifier allowed by git.

Instantiate a type of ImageProcessingMixin from an image processor.

Examples:

from transformers import CLIPImageProcessor

# The base class *ImageProcessingMixin* can't be instantiated directly, so the examples use a
# derived class: *CLIPImageProcessor*
image_processor = CLIPImageProcessor.from_pretrained(
    "openai/clip-vit-base-patch32"
)  # Download image_processing_config from huggingface.co and cache.
image_processor = CLIPImageProcessor.from_pretrained(
    "./test/saved_model/"
)  # E.g. image processor (or model) was saved using *save_pretrained('./test/saved_model/')*
image_processor = CLIPImageProcessor.from_pretrained("./test/saved_model/preprocessor_config.json")
image_processor = CLIPImageProcessor.from_pretrained(
    "openai/clip-vit-base-patch32", do_normalize=False, foo=False
)
assert image_processor.do_normalize is False
image_processor, unused_kwargs = CLIPImageProcessor.from_pretrained(
    "openai/clip-vit-base-patch32", do_normalize=False, foo=False, return_unused_kwargs=True
)
assert image_processor.do_normalize is False
assert unused_kwargs == {"foo": False}

save_pretrained

( save_directory: typing.Union[str, os.PathLike] push_to_hub: bool = False **kwargs )

Parameters

save_directory (str or os.PathLike) — Directory where the image processor JSON file will be saved (will be created if it does not exist).
push_to_hub (bool, optional, defaults to False) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace).
kwargs (dict[str, Any], optional) — Additional keyword arguments passed along to the push_to_hub() method.

Save an image processor object to the directory save_directory, so that it can be re-loaded using the from_pretrained() class method.
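As a rough illustration of the save/load round trip described above, the sketch below writes a processor's settings to a JSON file and reads them back, with keyword overrides and optional return of unrecognized keyword arguments. TinyImageProcessor and its fields are hypothetical stand-ins written for this example. Only the file name preprocessor_config.json and the return_unused_kwargs behavior are taken from the examples on this page. It is not the library's implementation.

```python
import json
import os
import tempfile


class TinyImageProcessor:
    """Hypothetical stand-in for an image processor; not part of transformers."""

    CONFIG_NAME = "preprocessor_config.json"

    def __init__(self, do_normalize=True, rescale_factor=1 / 255):
        self.do_normalize = do_normalize
        self.rescale_factor = rescale_factor

    def save_pretrained(self, save_directory):
        # Create the directory if needed, then write the settings as JSON.
        os.makedirs(save_directory, exist_ok=True)
        path = os.path.join(save_directory, self.CONFIG_NAME)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(vars(self), f, indent=2)
        return path

    @classmethod
    def from_pretrained(cls, path, return_unused_kwargs=False, **kwargs):
        # Accept either a directory or a direct path to the JSON file.
        if os.path.isdir(path):
            path = os.path.join(path, cls.CONFIG_NAME)
        with open(path, encoding="utf-8") as f:
            config = json.load(f)
        # Known keys override the saved values; unknown keys are set aside.
        unused = {k: v for k, v in kwargs.items() if k not in config}
        config.update({k: v for k, v in kwargs.items() if k in config})
        processor = cls(**config)
        return (processor, unused) if return_unused_kwargs else processor


with tempfile.TemporaryDirectory() as tmp:
    TinyImageProcessor().save_pretrained(tmp)
    reloaded, unused = TinyImageProcessor.from_pretrained(
        tmp, do_normalize=False, foo=False, return_unused_kwargs=True
    )
```

Here the override do_normalize=False replaces the saved value, while foo is not a known setting and is returned separately, mirroring the unused_kwargs example above.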

BatchFeature

class transformers.BatchFeature

( data: typing.Optional[dict[str, typing.Any]] = None tensor_type: typing.Union[NoneType, str, transformers.utils.generic.TensorType] = None )

Parameters

data (dict, optional) — Dictionary of lists/arrays/tensors returned by the call/pad methods (‘input_values’, ‘attention_mask’, etc.).
tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers into PyTorch/NumPy tensors at initialization.

Holds the output of the pad() and feature extractor specific __call__ methods.

This class is derived from a python dictionary and can be used as a dictionary.

convert_to_tensors

( tensor_type: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None )

Parameters

tensor_type (str or TensorType, optional) — The type of tensors to use. If str, should be one of the values of the enum TensorType. If None, no modification is done.

Convert the inner content to tensors.
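To make the idea of a dictionary-derived container with optional tensor conversion concrete, here is a minimal sketch. TinyBatchFeature is a hypothetical, simplified stand-in written for this example, not transformers.BatchFeature, and it only handles NumPy conversion (the "np" string is an assumption of this sketch).

```python
import numpy as np


class TinyBatchFeature(dict):
    """Hypothetical dict subclass; a simplified stand-in, not transformers.BatchFeature."""

    def __init__(self, data=None, tensor_type=None):
        super().__init__(data or {})
        self.convert_to_tensors(tensor_type)

    def convert_to_tensors(self, tensor_type=None):
        # With no tensor type, leave the values untouched.
        if tensor_type is None:
            return self
        if tensor_type != "np":
            raise ValueError("this sketch only handles tensor_type='np'")
        # Iterate over a snapshot so values can be replaced safely.
        for key, value in list(self.items()):
            self[key] = np.asarray(value)
        return self


batch = TinyBatchFeature({"pixel_values": [[0.1, 0.2], [0.3, 0.4]]}, tensor_type="np")
```

Because the container is a dict, the usual access patterns such as batch["pixel_values"] and batch.keys() keep working after conversion.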

to

( *args **kwargs ) → BatchFeature

Parameters

args (Tuple) — Will be passed to the to(...) function of the tensors.
kwargs (Dict, optional) — Will be passed to the to(...) function of the tensors. To enable asynchronous data transfer, set the non_blocking flag in kwargs (defaults to False).

Returns

The same instance after modification.

Send all values to a device by calling v.to(*args, **kwargs) (PyTorch only). This should support casting to different dtypes and sending the BatchFeature to a different device.

BaseImageProcessor

class transformers.BaseImageProcessor

( **kwargs )

center_crop

( image: ndarray size: dict data_format: typing.Union[transformers.image_utils.ChannelDimension, str, NoneType] = None input_data_format: typing.Union[transformers.image_utils.ChannelDimension, str, NoneType] = None **kwargs )

Parameters

image (np.ndarray) — Image to center crop.
size (dict[str, int]) — Size of the output image.
data_format (str or ChannelDimension, optional) — The channel dimension format for the output image. If unset, the channel dimension format of the input image is used. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.

Center crop an image to (size["height"], size["width"]). If the input size is smaller than crop_size along any edge, the image is padded with 0’s and then center cropped.
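The pad-then-crop behavior can be sketched in NumPy as follows. This is an illustrative sketch for channels-first arrays, not the library's code: center_crop_sketch is a hypothetical helper, and the choice of which side receives the extra pixel when the size difference is odd is an assumption that may differ from the actual implementation.

```python
import numpy as np


def center_crop_sketch(image, size):
    """Center crop a channels-first (C, H, W) array, zero-padding if it is too small."""
    crop_h, crop_w = size["height"], size["width"]
    _, h, w = image.shape

    # Pad with zeros (extra pixel goes to the bottom/right) so both edges reach the crop size.
    pad_h, pad_w = max(crop_h - h, 0), max(crop_w - w, 0)
    image = np.pad(
        image,
        ((0, 0), (pad_h // 2, pad_h - pad_h // 2), (pad_w // 2, pad_w - pad_w // 2)),
        mode="constant",
        constant_values=0,
    )

    # Take the central window of the (possibly padded) image.
    _, h, w = image.shape
    top, left = (h - crop_h) // 2, (w - crop_w) // 2
    return image[:, top : top + crop_h, left : left + crop_w]


large = np.arange(3 * 6 * 6).reshape(3, 6, 6)
small = np.ones((3, 2, 2), dtype=np.uint8)
cropped = center_crop_sketch(large, {"height": 4, "width": 4})
padded = center_crop_sketch(small, {"height": 4, "width": 4})
```

The 6x6 input is reduced to its central 4x4 window, while the 2x2 input ends up centered inside a 4x4 frame of zeros.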

normalize

( image: ndarray mean: typing.Union[float, collections.abc.Iterable[float]] std: typing.Union[float, collections.abc.Iterable[float]] data_format: typing.Union[transformers.image_utils.ChannelDimension, str, NoneType] = None input_data_format: typing.Union[transformers.image_utils.ChannelDimension, str, NoneType] = None **kwargs ) → np.ndarray

Parameters

image (np.ndarray) — Image to normalize.
mean (float or Iterable[float]) — Image mean to use for normalization.
std (float or Iterable[float]) — Image standard deviation to use for normalization.
data_format (str or ChannelDimension, optional) — The channel dimension format for the output image. If unset, the channel dimension format of the input image is used. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.

Returns

np.ndarray

The normalized image.

Normalize an image. image = (image - image_mean) / image_std.
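The formula above is applied per channel, so the mean and standard deviation have to broadcast along the channel axis. A minimal NumPy sketch, assuming a float image and a hypothetical helper name normalize_sketch (not the library's method):

```python
import numpy as np


def normalize_sketch(image, mean, std, channels_first=True):
    """Apply (image - mean) / std per channel to an image."""
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    if channels_first and mean.ndim == 1:
        # Reshape (C,) to (C, 1, 1) so the statistics broadcast over height and width.
        mean, std = mean[:, None, None], std[:, None, None]
    # For channels-last (H, W, C) images, (C,) already broadcasts over the last axis.
    return (image.astype(np.float32) - mean) / std


image_chw = np.full((3, 2, 2), 0.5, dtype=np.float32)
mean, std = [0.0, 0.25, 0.5], [0.5, 0.25, 0.5]
out_chw = normalize_sketch(image_chw, mean, std)
out_hwc = normalize_sketch(image_chw.transpose(1, 2, 0), mean, std, channels_first=False)
```

Both layouts give the same values once the axes are aligned; only the broadcasting shape of the statistics changes.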

rescale

( image: ndarray scale: float data_format: typing.Union[transformers.image_utils.ChannelDimension, str, NoneType] = None input_data_format: typing.Union[transformers.image_utils.ChannelDimension, str, NoneType] = None **kwargs ) → np.ndarray

Parameters

image (np.ndarray) — Image to rescale.
scale (float) — The scaling factor to rescale pixel values by.
data_format (str or ChannelDimension, optional) — The channel dimension format for the output image. If unset, the channel dimension format of the input image is used. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.

Returns

np.ndarray

The rescaled image.

Rescale an image by a scale factor. image = image * scale.
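A typical use of rescaling is mapping 8-bit pixel values from the range 0 to 255 into the range 0 to 1. The sketch below assumes a scale of 1/255 and converts to float32 before multiplying. Both the helper name rescale_sketch and the dtype handling are assumptions of this example rather than the library's behavior.

```python
import numpy as np


def rescale_sketch(image, scale, dtype=np.float32):
    """Multiply pixel values by `scale`, converting to a float dtype first."""
    return image.astype(dtype) * scale


pixels = np.array([[0, 128, 255]], dtype=np.uint8)
rescaled = rescale_sketch(pixels, 1 / 255)
```

Converting before multiplying avoids integer truncation, which would otherwise collapse almost every value to zero.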

BaseImageProcessorFast

class transformers.BaseImageProcessorFast

( **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs] )

center_crop

( image: torch.Tensor size: SizeDict **kwargs ) → torch.Tensor

Parameters

image ("torch.Tensor") — Image to center crop.
size (dict[str, int]) — Size of the output image.

Returns

torch.Tensor

The center cropped image.

Note: this overrides torchvision’s center_crop to match the behavior of the slow processor. Center crop an image to (size["height"], size["width"]). If the input size is smaller than crop_size along any edge, the image is padded with 0’s and then center cropped.

compile_friendly_resize

( image: torch.Tensor new_size: tuple interpolation: typing.Optional[ForwardRef('F.InterpolationMode')] = None antialias: bool = True )

A wrapper around F.resize so that it is compatible with torch.compile when the image is a uint8 tensor.

convert_to_rgb

( image: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] ) → ImageInput

Parameters

image (ImageInput) — The image to convert.

Returns

ImageInput

The converted image.

Converts an image to RGB format. Only converts if the image is of type PIL.Image.Image, otherwise returns the image as is.

filter_out_unused_kwargs

( kwargs: dict )

Filter out the unused kwargs from the kwargs dictionary.

normalize

( image: torch.Tensor mean: typing.Union[float, collections.abc.Iterable[float]] std: typing.Union[float, collections.abc.Iterable[float]] **kwargs ) → torch.Tensor

Parameters

image (torch.Tensor) — Image to normalize.
mean (torch.Tensor, float or Iterable[float]) — Image mean to use for normalization.
std (torch.Tensor, float or Iterable[float]) — Image standard deviation to use for normalization.

Returns

torch.Tensor

The normalized image.

Normalize an image. image = (image - image_mean) / image_std.

pad

( images: list pad_size: SizeDict = None fill_value: typing.Optional[int] = 0 padding_mode: typing.Optional[str] = 'constant' return_mask: bool = False disable_grouping: typing.Optional[bool] = False is_nested: typing.Optional[bool] = False **kwargs ) → Union[tuple[torch.Tensor, torch.Tensor], torch.Tensor]

Parameters

images (list[torch.Tensor]) — Images to pad.
pad_size (SizeDict, optional) — Dictionary in the format {"height": int, "width": int} specifying the size of the output image.
fill_value (int, optional, defaults to 0) — The constant value used to fill the padded area.
padding_mode (str, optional, defaults to “constant”) — The padding mode to use. Can be any of the modes supported by torch.nn.functional.pad (e.g. constant, reflection, replication).
return_mask (bool, optional, defaults to False) — Whether to return a pixel mask to denote padded regions.
disable_grouping (bool, optional, defaults to False) — Whether to disable grouping of images by size.

Returns

Union[tuple[torch.Tensor, torch.Tensor], torch.Tensor]

The padded images and pixel masks if return_mask is True.

Pads images to (pad_size["height"], pad_size["width"]) or to the largest size in the batch.
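Padding to the largest size in the batch, together with a pixel mask marking the original content, can be sketched in NumPy as below. pad_batch_sketch is a hypothetical helper, and padding only at the bottom and right is an assumption of this sketch; it does not reproduce the size grouping or nested-input handling of the actual method.

```python
import numpy as np


def pad_batch_sketch(images, pad_size=None, fill_value=0):
    """Pad channels-first (C, H, W) arrays at the bottom and right to one common size."""
    if pad_size is None:
        # Default target: the largest height and width found in the batch.
        target_h = max(img.shape[1] for img in images)
        target_w = max(img.shape[2] for img in images)
    else:
        target_h, target_w = pad_size["height"], pad_size["width"]

    padded, masks = [], []
    for img in images:
        _, h, w = img.shape
        padded.append(
            np.pad(
                img,
                ((0, 0), (0, target_h - h), (0, target_w - w)),
                mode="constant",
                constant_values=fill_value,
            )
        )
        # Pixel mask: 1 over the original image, 0 over the padded area.
        mask = np.zeros((target_h, target_w), dtype=np.int64)
        mask[:h, :w] = 1
        masks.append(mask)
    return np.stack(padded), np.stack(masks)


batch = [np.ones((3, 2, 4)), np.ones((3, 3, 2))]
pixel_values, pixel_mask = pad_batch_sketch(batch)
```

Once every image shares the same height and width, the batch can be stacked into a single array, and the mask lets downstream code ignore the filled region.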

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] *args **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs] ) → BatchFeature

Parameters

images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
do_convert_rgb (bool, optional) — Whether to convert the image to RGB.
do_resize (bool, optional) — Whether to resize the image.
size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — Describes the maximum input dimensions to the model.
crop_size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — Size of the output image after applying center_crop.
resample (Annotated[Union[PILImageResampling, int, NoneType], None]) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
do_rescale (bool, optional) — Whether to rescale the image.
rescale_factor (float, optional) — Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional) — Whether to normalize the image.
image_mean (Union[float, list[float], tuple[float, ...], NoneType]) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
image_std (Union[float, list[float], tuple[float, ...], NoneType]) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
do_pad (bool, optional) — Whether to pad the image. Padding is done either to the largest size in the batch or to a fixed square size per image. The exact padding strategy depends on the model.
pad_size (Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]) — The size in {"height": int, "width": int} to pad the images to. Must be larger than any image size provided for preprocessing. If pad_size is not provided, images will be padded to the largest height and width in the batch. Applied only when do_pad=True.
do_center_crop (bool, optional) — Whether to center crop the image.
data_format (Union[~image_utils.ChannelDimension, str, NoneType]) — Only ChannelDimension.FIRST is supported. Added for compatibility with slow processors.
input_data_format (Union[~image_utils.ChannelDimension, str, NoneType]) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.
device (Annotated[Union[str, torch.device, NoneType], None]) — The device to process the images on. If unset, the device is inferred from the input images.
return_tensors (Annotated[Union[str, ~utils.generic.TensorType, NoneType], None]) — Returns stacked tensors if set to "pt", otherwise returns a list of tensors.
disable_grouping (bool, optional) — Whether to disable grouping of images by size to process them individually and not in batches. If None, will be set to True if the images are on CPU, and False otherwise. This choice is based on empirical observations, as detailed here: https://github.com/huggingface/transformers/pull/38157
image_seq_length (int, optional) — The number of image tokens to be used for each image in the input. Added for backward compatibility but this should be set as a processor attribute in future models.

Returns

BatchFeature

data (dict) — Dictionary of lists/arrays/tensors returned by the call method (‘pixel_values’, etc.).
tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers into PyTorch/NumPy tensors at initialization.
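The disable_grouping parameter refers to grouping images by size so that images with the same shape can be processed together in one batched call. The following NumPy sketch shows the general idea only. group_by_shape and restore_order are hypothetical helpers written for this example and are not the library's utilities.

```python
from collections import defaultdict

import numpy as np


def group_by_shape(images):
    """Stack same-shaped images into batches and remember where each one came from."""
    buckets, index = defaultdict(list), []
    for img in images:
        shape = img.shape
        index.append((shape, len(buckets[shape])))
        buckets[shape].append(img)
    return {shape: np.stack(imgs) for shape, imgs in buckets.items()}, index


def restore_order(processed, index):
    """Rebuild the original ordering from the per-shape batches."""
    return [processed[shape][position] for shape, position in index]


images = [np.zeros((3, 4, 4)), np.ones((3, 2, 2)), np.full((3, 4, 4), 2.0)]
grouped, index = group_by_shape(images)
# One vectorized call per distinct shape instead of one call per image.
processed = {shape: batch * 0.5 for shape, batch in grouped.items()}
outputs = restore_order(processed, index)
```

Three input images collapse into two batched operations here, and the recorded index returns the results in their original order.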

rescale

( image: torch.Tensor scale: float **kwargs ) → torch.Tensor

Parameters

image (torch.Tensor) — Image to rescale.
scale (float) — The scaling factor to rescale pixel values by.

Returns

torch.Tensor

The rescaled image.

Rescale an image by a scale factor. image = image * scale.

rescale_and_normalize

( images: torch.Tensor do_rescale: bool rescale_factor: float do_normalize: bool image_mean: typing.Union[float, list[float]] image_std: typing.Union[float, list[float]] )

Rescale and normalize images.
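Rescaling followed by normalization can be folded into a single normalization, because (x * s - m) / d equals (x - m / s) / (d / s). The sketch below checks this identity numerically in NumPy. Whether and how the library fuses the two steps is not stated on this page, so treat this as an illustration of the arithmetic rather than a description of the implementation. The function names and the mean and standard deviation values are illustrative assumptions.

```python
import numpy as np


def rescale_then_normalize(images, rescale_factor, mean, std):
    """Two-step reference: rescale first, then normalize per channel."""
    mean = np.asarray(mean, dtype=np.float64)[:, None, None]
    std = np.asarray(std, dtype=np.float64)[:, None, None]
    return (images.astype(np.float64) * rescale_factor - mean) / std


def fused_rescale_normalize(images, rescale_factor, mean, std):
    """One-step variant: fold the rescale factor into the mean and std."""
    mean = np.asarray(mean, dtype=np.float64)[:, None, None] / rescale_factor
    std = np.asarray(std, dtype=np.float64)[:, None, None] / rescale_factor
    return (images.astype(np.float64) - mean) / std


rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(2, 3, 4, 4), dtype=np.uint8)
mean, std = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]
two_step = rescale_then_normalize(images, 1 / 255, mean, std)
one_step = fused_rescale_normalize(images, 1 / 255, mean, std)
```

The fused form touches each pixel once instead of twice, which is the usual motivation for combining the two operations.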

resize

( image: torch.Tensor size: SizeDict interpolation: typing.Optional[ForwardRef('F.InterpolationMode')] = None antialias: bool = True **kwargs ) → torch.Tensor

Parameters

image (torch.Tensor) — Image to resize.
size (SizeDict) — Dictionary in the format {"height": int, "width": int} specifying the size of the output image.
interpolation (InterpolationMode, optional, defaults to InterpolationMode.BILINEAR) — InterpolationMode filter to use when resizing the image e.g. InterpolationMode.BICUBIC.
antialias (bool, optional, defaults to True) — Whether to use antialiasing.

Returns

torch.Tensor

The resized image.

Resize an image to (size["height"], size["width"]).
