Transformers documentation

Feature Extractor

You are viewing version v4.20.1. A newer version, v4.41.2, is available.

A feature extractor is in charge of preparing input features for a multi-modal model. This includes feature extraction from sequences (e.g., pre-processing audio files into Log-Mel Spectrogram features), feature extraction from images (e.g., cropping image files), as well as padding, normalization, and conversion to NumPy, PyTorch, and TensorFlow tensors.


class transformers.FeatureExtractionMixin

( **kwargs )

This is a feature extraction mixin used to provide saving/loading functionality for sequential and image feature extractors.


from_pretrained

( pretrained_model_name_or_path: typing.Union[str, os.PathLike], **kwargs )


  • pretrained_model_name_or_path (str or os.PathLike) — This can be either:

    • a string, the model id of a pretrained feature_extractor hosted inside a model repo on huggingface.co. Valid model ids can be located at the root level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased.
    • a path to a directory containing a feature extractor file saved using the save_pretrained() method, e.g., ./my_model_directory/.
    • a path or url to a saved feature extractor JSON file, e.g., ./my_model_directory/preprocessor_config.json.
  • cache_dir (str or os.PathLike, optional) — Path to a directory in which a downloaded pretrained model feature extractor should be cached if the standard cache should not be used.
  • force_download (bool, optional, defaults to False) — Whether or not to force to (re-)download the feature extractor files and override the cached versions if they exist.
  • resume_download (bool, optional, defaults to False) — Whether or not to delete incompletely received files. Attempts to resume the download if such a file exists.
  • proxies (Dict[str, str], optional) — A dictionary of proxy servers to use by protocol or endpoint, e.g., {'http': '', 'http://hostname': ''}. The proxies are used on each request.
  • use_auth_token (str or bool, optional) — The token to use as HTTP bearer authorization for remote files. If True, will use the token generated when running transformers-cli login (stored in ~/.huggingface).
  • revision (str, optional, defaults to "main") — The specific model version to use. It can be a branch name, a tag name, or a commit id. Since a git-based system is used for storing models and other artifacts on huggingface.co, revision can be any identifier allowed by git.
  • return_unused_kwargs (bool, optional, defaults to False) — If False, then this function returns just the final feature extractor object. If True, then this function returns a Tuple(feature_extractor, unused_kwargs) where unused_kwargs is a dictionary consisting of the key/value pairs whose keys are not feature extractor attributes, i.e., the part of kwargs which has not been used to update feature_extractor and is otherwise ignored.
  • kwargs (Dict[str, Any], optional) — The values in kwargs of any keys which are feature extractor attributes will be used to override the loaded values. Behavior concerning key/value pairs whose keys are not feature extractor attributes is controlled by the return_unused_kwargs keyword parameter.

Instantiate a type of FeatureExtractionMixin from a feature extractor, e.g. a derived class of SequenceFeatureExtractor.

Passing use_auth_token=True is required when you want to use a private model.


from transformers import Wav2Vec2FeatureExtractor

# The base classes FeatureExtractionMixin and SequenceFeatureExtractor cannot be instantiated directly,
# so the examples use a derived class: Wav2Vec2FeatureExtractor.

# Download the feature extractor configuration from the Hub and cache it.
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")

# Load from a directory, e.g. one written with save_pretrained('./test/saved_model/').
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("./test/saved_model/")

# Load from a path to the JSON file itself.
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("./test/saved_model/preprocessor_config.json")

# Keyword arguments matching feature extractor attributes override the loaded values; others (foo) are ignored.
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(
    "facebook/wav2vec2-base-960h", return_attention_mask=False, foo=False
)
assert feature_extractor.return_attention_mask is False

# With return_unused_kwargs=True, the unmatched keyword arguments are returned as well.
feature_extractor, unused_kwargs = Wav2Vec2FeatureExtractor.from_pretrained(
    "facebook/wav2vec2-base-960h", return_attention_mask=False, foo=False, return_unused_kwargs=True
)
assert feature_extractor.return_attention_mask is False
assert unused_kwargs == {"foo": False}
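
The handling of kwargs and return_unused_kwargs described above can be pictured on a plain dictionary. This is only a sketch of the documented behavior, not the library's code; the helper name apply_overrides and the sample attribute values are hypothetical.

```python
from typing import Any, Dict, Tuple, Union


def apply_overrides(
    loaded: Dict[str, Any], return_unused_kwargs: bool = False, **kwargs: Any
) -> Union[Dict[str, Any], Tuple[Dict[str, Any], Dict[str, Any]]]:
    """Hypothetical helper: mimic the documented kwargs handling on a plain dict."""
    attributes = dict(loaded)  # copy so the loaded values are left untouched
    unused = {}
    for key, value in kwargs.items():
        if key in attributes:
            attributes[key] = value  # known attribute: override the loaded value
        else:
            unused[key] = value  # unknown key: not applied to the attributes
    if return_unused_kwargs:
        return attributes, unused
    return attributes


# Assumed sample values standing in for a loaded configuration.
loaded = {"sampling_rate": 16000, "return_attention_mask": True}
attributes, unused = apply_overrides(
    loaded, return_unused_kwargs=True, return_attention_mask=False, foo=False
)
```

Here attributes carries the overridden return_attention_mask, while foo ends up in unused because it does not match any known attribute.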


save_pretrained

( save_directory: typing.Union[str, os.PathLike], push_to_hub: bool = False, **kwargs )


  • save_directory (str or os.PathLike) — Directory where the feature extractor JSON file will be saved (will be created if it does not exist).
  • push_to_hub (bool, optional, defaults to False) — Whether or not to push your feature extractor to the Hugging Face model hub after saving it.

    Using push_to_hub=True will synchronize the repository you are pushing to with save_directory, which requires save_directory to be a local clone of the repo you are pushing to if it’s an existing folder. Pass along temp_dir=True to use a temporary directory instead.

  • kwargs — Additional keyword arguments passed along to the push_to_hub() method.

Save a feature_extractor object to the directory save_directory, so that it can be re-loaded using the from_pretrained() class method.
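
The save/load round trip can be illustrated without the library. The following is a minimal sketch, assuming the configuration is a flat dictionary written as JSON to a file named preprocessor_config.json (the file name mentioned in the from_pretrained() parameters). The helpers save_config and load_config and the sample values are hypothetical and are not part of transformers.

```python
import json
import os
import tempfile

CONFIG_NAME = "preprocessor_config.json"  # file name taken from the text above


def save_config(config: dict, save_directory: str) -> str:
    """Hypothetical helper: write `config` as JSON into `save_directory`."""
    os.makedirs(save_directory, exist_ok=True)  # created if it does not exist
    path = os.path.join(save_directory, CONFIG_NAME)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2, sort_keys=True)
    return path


def load_config(path_or_directory: str) -> dict:
    """Hypothetical helper: accept a directory or a direct path to the JSON file."""
    path = path_or_directory
    if os.path.isdir(path):
        path = os.path.join(path, CONFIG_NAME)
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Assumed sample attributes, named after the SequenceFeatureExtractor parameters.
config = {"feature_size": 1, "sampling_rate": 16000, "padding_value": 0.0}
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "saved_model")
    saved_path = save_config(config, target)
    from_dir = load_config(target)  # load via the directory
    from_file = load_config(saved_path)  # load via the JSON file path
```

Both loading routes return the same dictionary, mirroring the directory and JSON-file options accepted by pretrained_model_name_or_path.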


class transformers.SequenceFeatureExtractor

( feature_size: int, sampling_rate: int, padding_value: float, **kwargs )


  • feature_size (int) — The feature dimension of the extracted features.
  • sampling_rate (int) — The sampling rate at which the audio files should be digitized, expressed in hertz (Hz).
  • padding_value (float) — The value that is used to fill the padding values / vectors.

This is a general feature extraction class for speech recognition.


pad

( processed_features: typing.Union[transformers.feature_extraction_utils.BatchFeature, typing.List[transformers.feature_extraction_utils.BatchFeature], typing.Dict[str, transformers.feature_extraction_utils.BatchFeature], typing.Dict[str, typing.List[transformers.feature_extraction_utils.BatchFeature]], typing.List[typing.Dict[str, transformers.feature_extraction_utils.BatchFeature]]], padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = True, max_length: typing.Optional[int] = None, truncation: bool = False, pad_to_multiple_of: typing.Optional[int] = None, return_attention_mask: typing.Optional[bool] = None, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None )


  • processed_features (BatchFeature, list of BatchFeature, Dict[str, List[float]], Dict[str, List[List[float]]] or List[Dict[str, List[float]]]) — Processed inputs. Can represent one input (BatchFeature or Dict[str, List[float]]) or a batch of input values / vectors (list of BatchFeature, Dict[str, List[List[float]]] or List[Dict[str, List[float]]]), so you can use this method during preprocessing as well as in a PyTorch DataLoader collate function.

    Instead of List[float] you can have tensors (NumPy arrays, PyTorch tensors or TensorFlow tensors); see the note below for the return type.

  • padding (bool, str or PaddingStrategy, optional, defaults to True) — Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:

    • True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
    • 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
    • False or 'do_not_pad': No padding (i.e., can output a batch with sequences of different lengths).
  • max_length (int, optional) — Maximum length of the returned list and optionally padding length (see above).
  • truncation (bool) — Activates truncation to cut input sequences longer than max_length to max_length.
  • pad_to_multiple_of (int, optional) — If set, will pad the sequence to a multiple of the provided value.

    This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.

  • return_attention_mask (bool, optional) — Whether to return the attention mask. If left to the default, will return the attention mask according to the specific feature_extractor’s default.

  • return_tensors (str or TensorType, optional) — If set, will return tensors instead of lists of Python integers. Acceptable values are:

    • 'tf': Return TensorFlow tf.constant objects.
    • 'pt': Return PyTorch torch.Tensor objects.
    • 'np': Return Numpy np.ndarray objects.

Pad input values / input vectors or a batch of input values / input vectors up to a predefined length or to the maximum sequence length in the batch.

The padding side (left/right) and padding values are defined at the feature extractor level (with self.padding_side and self.padding_value).

If the processed_features passed are a dictionary of NumPy arrays, PyTorch tensors or TensorFlow tensors, the result will use the same type unless you provide a different tensor type with return_tensors. In the case of PyTorch tensors, however, you will lose the specific device of your tensors.
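
The padding strategies can be sketched in plain Python. This is an illustrative re-implementation on lists of floats, not the library's code: the function name pad_batch is hypothetical, truncation is left out, and 'max_length' requires an explicit max_length here instead of falling back to a model maximum.

```python
from typing import List, Optional, Tuple


def pad_batch(
    sequences: List[List[float]],
    padding="longest",
    max_length: Optional[int] = None,
    pad_to_multiple_of: Optional[int] = None,
    padding_value: float = 0.0,
    padding_side: str = "right",
) -> Tuple[List[List[float]], List[List[int]]]:
    """Hypothetical sketch: return (padded sequences, attention masks)."""
    if padding in (False, "do_not_pad"):
        # No padding: sequences may keep different lengths.
        return [list(s) for s in sequences], [[1] * len(s) for s in sequences]
    if padding in (True, "longest"):
        target = max(len(s) for s in sequences)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("this sketch needs an explicit max_length")
        target = max_length
    else:
        raise ValueError(f"unknown padding strategy: {padding!r}")
    if pad_to_multiple_of:
        # Round the target length up to the next multiple.
        target = -(-target // pad_to_multiple_of) * pad_to_multiple_of

    padded, masks = [], []
    for seq in sequences:
        n_pad = max(target - len(seq), 0)
        fill, zeros, ones = [padding_value] * n_pad, [0] * n_pad, [1] * len(seq)
        if padding_side == "right":
            padded.append(list(seq) + fill)
            masks.append(ones + zeros)
        else:
            padded.append(fill + list(seq))
            masks.append(zeros + ones)
    return padded, masks


values, mask = pad_batch([[0.1, 0.2, 0.3], [0.4]], padding="longest", pad_to_multiple_of=4)
```

With pad_to_multiple_of=4, the longest length of 3 is rounded up to 4, so both sequences come back with four entries and the mask marks the padded positions with 0.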


class transformers.BatchFeature

( data: typing.Union[typing.Dict[str, typing.Any], NoneType] = None, tensor_type: typing.Union[NoneType, str, transformers.utils.generic.TensorType] = None )


  • data (dict) — Dictionary of lists/arrays/tensors returned by the call/pad methods (‘input_values’, ‘attention_mask’, etc.).
  • tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers into PyTorch/TensorFlow/NumPy tensors at initialization.

Holds the output of the pad() and feature extractor specific __call__ methods.

This class is derived from a python dictionary and can be used as a dictionary.


convert_to_tensors

( tensor_type: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None )


  • tensor_type (str or TensorType, optional) — The type of tensors to use. If str, should be one of the values of the enum TensorType. If None, no modification is done.

Convert the inner content to tensors.
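
As a rough illustration of a dictionary-like container that converts its contents in place, here is a minimal sketch. The class MiniBatch is a hypothetical stand-in, not BatchFeature itself, and it only handles the 'np' tensor type.

```python
from collections import UserDict

import numpy as np


class MiniBatch(UserDict):
    """Hypothetical stand-in for a dictionary-like batch container (NumPy only)."""

    def convert_to_tensors(self, tensor_type=None):
        if tensor_type is None:
            return self  # no modification, as described above
        if tensor_type != "np":
            raise ValueError("this sketch only supports tensor_type='np'")
        for key, value in list(self.data.items()):
            self.data[key] = np.asarray(value)  # convert nested lists to an array
        return self


batch = MiniBatch(
    {"input_values": [[0.1, 0.2], [0.3, 0.0]], "attention_mask": [[1, 1], [1, 0]]}
)
batch.convert_to_tensors("np")
```

After the call, batch still behaves like a dictionary, but each value is a NumPy array.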


to

( device: typing.Union[str, ForwardRef('torch.device')] ) → BatchFeature


  • device (str or torch.device) — The device to put the tensors on.



Returns: BatchFeature. The same instance after modification.

Send all values to device by calling v.to(device) (PyTorch only).


class transformers.ImageFeatureExtractionMixin

( )

Mixin that contains utilities for preparing image features.


center_crop

( image, size ) → new_image


  • image (PIL.Image.Image or np.ndarray or torch.Tensor of shape (n_channels, height, width) or (height, width, n_channels)) — The image to crop.
  • size (int or Tuple[int, int]) — The size to which to crop the image.



Returns: new_image. A center-cropped PIL.Image.Image or np.ndarray or torch.Tensor of shape (n_channels, height, width).

Crops image to the given size using a center crop. Note that if the image is too small to be cropped to the size given, it will be padded (so the returned result has the size asked).
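
The crop-or-pad behavior can be sketched with NumPy. This is an illustration only, not the library's implementation: it assumes a channels-first array, treats a tuple size as (height, width), and pads with zeros, which the text above does not specify.

```python
import numpy as np


def center_crop(image: np.ndarray, size) -> np.ndarray:
    """Sketch: center-crop a (n_channels, height, width) array, zero-padding if too small."""
    crop_h, crop_w = (size, size) if isinstance(size, int) else size
    _, height, width = image.shape
    top = (height - crop_h) // 2  # negative when the image is too short
    left = (width - crop_w) // 2  # negative when the image is too narrow

    # Start from a zero canvas of the requested size (assumed padding value),
    # then copy the region where the source and the canvas overlap.
    out = np.zeros((image.shape[0], crop_h, crop_w), dtype=image.dtype)
    src_top, src_left = max(top, 0), max(left, 0)
    dst_top, dst_left = max(-top, 0), max(-left, 0)
    h = min(height - src_top, crop_h - dst_top)
    w = min(width - src_left, crop_w - dst_left)
    out[:, dst_top:dst_top + h, dst_left:dst_left + w] = image[
        :, src_top:src_top + h, src_left:src_left + w
    ]
    return out


image = np.arange(3 * 4 * 6).reshape(3, 4, 6)
cropped = center_crop(image, (2, 8))  # shorter than the image, but wider
```

The height is cropped from 4 to 2 rows, while the width is padded from 6 to 8 columns, so the result always has the requested size.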


convert_rgb

( image )


  • image (PIL.Image.Image) — The image to convert.

Converts PIL.Image.Image to RGB format.


expand_dims

( image )


  • image (PIL.Image.Image or np.ndarray or torch.Tensor) — The image to expand.

Expands 2-dimensional image to 3 dimensions.


normalize

( image, mean, std )


  • image (PIL.Image.Image or np.ndarray or torch.Tensor) — The image to normalize.
  • mean (List[float] or np.ndarray or torch.Tensor) — The mean (per channel) to use for normalization.
  • std (List[float] or np.ndarray or torch.Tensor) — The standard deviation (per channel) to use for normalization.

Normalizes image with mean and std. Note that this will trigger a conversion of image to a NumPy array if it’s a PIL Image.
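
Per-channel normalization amounts to a broadcasted subtraction and division. A minimal NumPy sketch, assuming a channels-first (n_channels, height, width) array; the function below is an illustration rather than the mixin's own method.

```python
import numpy as np


def normalize(image, mean, std) -> np.ndarray:
    """Sketch: (image - mean) / std, applied per channel on a channels-first array."""
    image = np.asarray(image, dtype=np.float32)
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    # Reshape (n_channels,) to (n_channels, 1, 1) so the statistics
    # broadcast over the height and width axes.
    return (image - mean[:, None, None]) / std[:, None, None]


image = np.ones((3, 2, 2), dtype=np.float32)
normalized = normalize(image, mean=[0.5, 0.5, 0.5], std=[0.25, 0.5, 1.0])
```

Each channel is shifted and scaled by its own statistics, so the three channels of this constant image end up with different values.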


resize

( image, size, resample = <Resampling.BILINEAR: 2>, default_to_square = True, max_size = None ) → image


  • image (PIL.Image.Image or np.ndarray or torch.Tensor) — The image to resize.
  • size (int or Tuple[int, int]) — The size to use for resizing the image. If size is a sequence like (h, w), output size will be matched to this.

    If size is an int and default_to_square is True, then the image will be resized to (size, size). If size is an int and default_to_square is False, then the smaller edge of the image will be matched to this number, i.e., if height > width, the image will be rescaled to (size * height / width, size).

  • resample (int, optional, defaults to PIL.Image.BILINEAR) — The filter to use for resampling.
  • default_to_square (bool, optional, defaults to True) — How to convert size when it is a single int. If set to True, the size will be converted to a square (size, size). If set to False, will replicate torchvision.transforms.Resize with support for resizing only the smallest edge and providing an optional max_size.
  • max_size (int, optional, defaults to None) — The maximum allowed for the longer edge of the resized image: if the longer edge of the image is greater than max_size after being resized according to size, then the image is resized again so that the longer edge is equal to max_size. As a result, size might be overruled, i.e., the smaller edge may be shorter than size. Only used if default_to_square is False.



Returns: image. A resized PIL.Image.Image.

Resizes image. Enforces conversion of input to PIL.Image.
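
How size, default_to_square and max_size interact is easiest to see by computing the target dimensions alone. The helper below is a hypothetical sketch of the rules stated in the parameter descriptions; it returns (height, width), truncates fractional sizes to integers (an assumption), and performs no actual resampling.

```python
def get_resize_output_size(height, width, size, default_to_square=True, max_size=None):
    """Hypothetical helper: compute the (height, width) a resize would produce."""
    if isinstance(size, (tuple, list)):
        return tuple(size)  # an explicit (h, w) is used as-is
    if default_to_square:
        return (size, size)

    # Match the shorter edge to `size`, keeping the aspect ratio.
    short, long = (height, width) if height <= width else (width, height)
    new_short, new_long = size, int(size * long / short)

    if max_size is not None and new_long > max_size:
        # The longer edge is capped, so the shorter edge may end up below `size`.
        new_short, new_long = int(max_size * new_short / new_long), max_size

    return (new_short, new_long) if height <= width else (new_long, new_short)


square = get_resize_output_size(480, 640, 256)
keep_ratio = get_resize_output_size(480, 640, 256, default_to_square=False)
capped = get_resize_output_size(480, 640, 256, default_to_square=False, max_size=300)
```

For a 480 x 640 image and size=256, the three calls give a square target, an aspect-preserving target with the shorter edge at 256, and a target whose longer edge is capped at 300.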


to_numpy_array

( image, rescale = None, channel_first = True )


  • image (PIL.Image.Image or np.ndarray or torch.Tensor) — The image to convert to a NumPy array.
  • rescale (bool, optional) — Whether or not to apply the scaling factor (to make pixel values floats between 0. and 1.). Will default to True if the image is a PIL Image or an array/tensor of integers, False otherwise.
  • channel_first (bool, optional, defaults to True) — Whether or not to permute the dimensions of the image to put the channel dimension first.

Converts image to a numpy array. Optionally rescales it and puts the channel dimension as the first dimension.
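
The rescale and channel_first options can be sketched as follows. This is an illustration under stated assumptions, not the mixin's code: the input is taken to be a (height, width, n_channels) array, and the scaling factor is assumed to be 1/255.

```python
import numpy as np


def to_channel_first_array(image, rescale=None, channel_first=True) -> np.ndarray:
    """Sketch: convert to a NumPy array, optionally rescaling and moving channels first."""
    array = np.asarray(image)
    if rescale is None:
        # Default: rescale integer inputs only.
        rescale = np.issubdtype(array.dtype, np.integer)
    if rescale:
        array = array.astype(np.float32) / 255.0  # assumed factor: 0..255 -> 0..1
    if channel_first and array.ndim == 3:
        array = array.transpose(2, 0, 1)  # (H, W, C) -> (C, H, W)
    return array


image = np.zeros((4, 5, 3), dtype=np.uint8)
image[..., 0] = 255  # fully saturated first channel
converted = to_channel_first_array(image)
```

The uint8 input is rescaled to floats in the 0 to 1 range and the channel axis moves to the front, giving an array of shape (3, 4, 5).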


to_pil_image

( image, rescale = None )


  • image (PIL.Image.Image or numpy.ndarray or torch.Tensor) — The image to convert to the PIL Image format.
  • rescale (bool, optional) — Whether or not to apply the scaling factor (to make pixel values integers between 0 and 255). Will default to True if the image type is a floating type, False otherwise.

Converts image to a PIL Image. Optionally rescales it and puts the channel dimension back as the last axis if needed.