Feature Extractor

A feature extractor is in charge of preparing input features for a multi-modal model. This includes feature extraction from sequences, e.g., pre-processing audio files into Log-Mel Spectrogram features, and feature extraction from images, e.g., cropping image files, as well as padding, normalization, and conversion to NumPy, PyTorch, and TensorFlow tensors.


class transformers.feature_extraction_utils.FeatureExtractionMixin

( **kwargs )

This is a feature extraction mixin used to provide saving/loading functionality for sequential and image feature extractors.

from_pretrained

( pretrained_model_name_or_path: typing.Union[str, os.PathLike], **kwargs )


  • pretrained_model_name_or_path (str or os.PathLike) — This can be either:

    • a string, the model id of a pretrained feature_extractor hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased.
    • a path to a directory containing a feature extractor file saved using the save_pretrained() method, e.g., ./my_model_directory/.
    • a path or url to a saved feature extractor JSON file, e.g., ./my_model_directory/preprocessor_config.json.
  • cache_dir (str or os.PathLike, optional) — Path to a directory in which a downloaded pretrained model feature extractor should be cached if the standard cache should not be used.
  • force_download (bool, optional, defaults to False) — Whether or not to force to (re-)download the feature extractor files and override the cached versions if they exist.
  • resume_download (bool, optional, defaults to False) — Whether or not to resume an interrupted download instead of deleting the incompletely received file. If True, attempts to resume the download if such a file exists.
  • proxies (Dict[str, str], optional) — A dictionary of proxy servers to use by protocol or endpoint, e.g., {'http': '', 'http://hostname': ''}. The proxies are used on each request.
  • use_auth_token (str or bool, optional) — The token to use as HTTP bearer authorization for remote files. If True, will use the token generated when running transformers-cli login (stored in ~/.huggingface).
  • revision (str, optional, defaults to "main") — The specific model version to use. It can be a branch name, a tag name, or a commit id. Because models and other artifacts are stored on huggingface.co in a git-based system, revision can be any identifier allowed by git.
  • return_unused_kwargs (bool, optional, defaults to False) — If False, then this function returns just the final feature extractor object. If True, then this function returns a Tuple(feature_extractor, unused_kwargs) where unused_kwargs is a dictionary consisting of the key/value pairs whose keys are not feature extractor attributes: i.e., the part of kwargs which has not been used to update feature_extractor and is otherwise ignored.
  • kwargs (Dict[str, Any], optional) — The values in kwargs of any keys which are feature extractor attributes will be used to override the loaded values. Behavior concerning key/value pairs whose keys are not feature extractor attributes is controlled by the return_unused_kwargs keyword parameter.


Returns a feature extractor of type FeatureExtractionMixin.

Instantiate a type of FeatureExtractionMixin from a feature extractor, e.g. a derived class of SequenceFeatureExtractor.

Passing use_auth_token=True is required when you want to use a private model.


# The base classes FeatureExtractionMixin and SequenceFeatureExtractor cannot be instantiated directly,
# so the examples use a derived class: Wav2Vec2FeatureExtractor.
from transformers import Wav2Vec2FeatureExtractor

# Download the feature extraction config from huggingface.co and cache it.
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('facebook/wav2vec2-base-960h')
# Load from a directory, e.g. one written by save_pretrained('./test/saved_model/').
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('./test/saved_model/')
# Load from the JSON file itself.
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('./test/saved_model/preprocessor_config.json')
# Keyword arguments that match feature extractor attributes override the loaded values.
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('facebook/wav2vec2-base-960h', return_attention_mask=False, foo=False)
assert feature_extractor.return_attention_mask is False
# With return_unused_kwargs=True, keys that are not attributes are returned separately.
feature_extractor, unused_kwargs = Wav2Vec2FeatureExtractor.from_pretrained(
    'facebook/wav2vec2-base-960h', return_attention_mask=False, foo=False, return_unused_kwargs=True
)
assert feature_extractor.return_attention_mask is False
assert unused_kwargs == {'foo': False}

save_pretrained

( save_directory: typing.Union[str, os.PathLike] )


  • save_directory (str or os.PathLike) — Directory where the feature extractor JSON file will be saved (will be created if it does not exist).

Save a feature_extractor object to the directory save_directory, so that it can be re-loaded using the from_pretrained() class method.
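
The save/load contract described above can be sketched without the library. The class below, TinyFeatureExtractor, is a hypothetical stand-in written only for illustration; it is not the library's implementation. It assumes the attributes are stored as JSON in a file named preprocessor_config.json, the file name mentioned in the parameter list of from_pretrained.

```python
import json
import os
import tempfile

CONFIG_NAME = "preprocessor_config.json"  # file name taken from the parameter docs


class TinyFeatureExtractor:
    """Hypothetical stand-in that only illustrates the save/load contract."""

    def __init__(self, sampling_rate=16000, padding_value=0.0, return_attention_mask=True):
        self.sampling_rate = sampling_rate
        self.padding_value = padding_value
        self.return_attention_mask = return_attention_mask

    def save_pretrained(self, save_directory):
        # Create the directory if it does not exist, then write the attributes as JSON.
        os.makedirs(save_directory, exist_ok=True)
        with open(os.path.join(save_directory, CONFIG_NAME), "w", encoding="utf-8") as f:
            json.dump(vars(self), f, indent=2, sort_keys=True)

    @classmethod
    def from_pretrained(cls, path, return_unused_kwargs=False, **kwargs):
        # Accept either the directory or the JSON file itself.
        if os.path.isdir(path):
            path = os.path.join(path, CONFIG_NAME)
        with open(path, encoding="utf-8") as f:
            extractor = cls(**json.load(f))
        # kwargs that match attributes override loaded values; the rest are unused.
        unused = {}
        for key, value in kwargs.items():
            if hasattr(extractor, key):
                setattr(extractor, key, value)
            else:
                unused[key] = value
        return (extractor, unused) if return_unused_kwargs else extractor


with tempfile.TemporaryDirectory() as tmp:
    TinyFeatureExtractor(sampling_rate=8000).save_pretrained(tmp)
    reloaded, unused = TinyFeatureExtractor.from_pretrained(
        tmp, return_attention_mask=False, foo=False, return_unused_kwargs=True
    )
    print(reloaded.sampling_rate, reloaded.return_attention_mask, unused)
```

The round trip keeps the saved sampling_rate, applies the return_attention_mask override, and hands back foo as an unused keyword argument, mirroring the behavior documented for return_unused_kwargs.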


class transformers.SequenceFeatureExtractor

( feature_size: int, sampling_rate: int, padding_value: float, **kwargs )


  • feature_size (int) — The feature dimension of the extracted features.
  • sampling_rate (int) — The sampling rate at which the audio files should be digitized, expressed in hertz (Hz).
  • padding_value (float) — The value that is used to fill the padding values / vectors.

This is a general feature extraction class for speech recognition.

pad

( processed_features: typing.Union[transformers.feature_extraction_utils.BatchFeature, typing.List[transformers.feature_extraction_utils.BatchFeature], typing.Dict[str, transformers.feature_extraction_utils.BatchFeature], typing.Dict[str, typing.List[transformers.feature_extraction_utils.BatchFeature]], typing.List[typing.Dict[str, transformers.feature_extraction_utils.BatchFeature]]], padding: typing.Union[bool, str, transformers.file_utils.PaddingStrategy] = True, max_length: typing.Optional[int] = None, truncation: bool = False, pad_to_multiple_of: typing.Optional[int] = None, return_attention_mask: typing.Optional[bool] = None, return_tensors: typing.Union[str, transformers.file_utils.TensorType, NoneType] = None )


  • processed_features (BatchFeature, list of BatchFeature, Dict[str, List[float]], Dict[str, List[List[float]]] or List[Dict[str, List[float]]]) — Processed inputs. Can represent one input (BatchFeature or Dict[str, List[float]]) or a batch of input values / vectors (list of BatchFeature, Dict[str, List[List[float]]] or List[Dict[str, List[float]]]), so you can use this method during preprocessing as well as in a PyTorch DataLoader collate function.

    Instead of List[float] you can have tensors (NumPy arrays, PyTorch tensors or TensorFlow tensors); see the note below for the return type.

  • padding (bool, str or PaddingStrategy, optional, defaults to True) — Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:

    • True or 'longest' (default): Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
    • 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
    • False or 'do_not_pad': No padding (i.e., can output a batch with sequences of different lengths).
  • max_length (int, optional) — Maximum length of the returned list and optionally padding length (see above).
  • truncation (bool) — Activates truncation to cut input sequences longer than max_length to max_length.
  • pad_to_multiple_of (int, optional) — If set will pad the sequence to a multiple of the provided value.

    This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.0 (Volta), or on TPUs, which benefit from having sequence lengths be a multiple of 128.

  • return_attention_mask (bool, optional) — Whether to return the attention mask. If left to the default, will return the attention mask according to the specific feature_extractor’s default.

    What are attention masks?

  • return_tensors (str or TensorType, optional) — If set, will return tensors instead of Python lists. Acceptable values are:

    • 'tf': Return TensorFlow tf.constant objects.
    • 'pt': Return PyTorch torch.Tensor objects.
    • 'np': Return Numpy np.ndarray objects.

Pad input values / input vectors or a batch of input values / input vectors up to predefined length or to the max sequence length in the batch.

The padding side (left/right) and the padding value are defined at the feature extractor level (with self.padding_side and self.padding_value).

If the processed_features passed is a dictionary of NumPy arrays, PyTorch tensors or TensorFlow tensors, the result will use the same type unless you provide a different tensor type with return_tensors. In the case of PyTorch tensors, however, you will lose the specific device of your tensors.
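
The padding strategies listed above can be illustrated with a small pure-Python sketch. The function pad_batch and its defaults are hypothetical and written only for this illustration; it is not the library's pad() method and it works on plain lists of floats only. One simplification is labeled in the code: with padding='max_length' it requires max_length instead of falling back to a model maximum.

```python
from typing import Dict, List, Optional, Union


def pad_batch(
    sequences: List[List[float]],
    padding: Union[bool, str] = True,
    max_length: Optional[int] = None,
    truncation: bool = False,
    pad_to_multiple_of: Optional[int] = None,
    padding_value: float = 0.0,
    padding_side: str = "right",
) -> Dict[str, list]:
    """Illustrative re-implementation of the padding strategies, not the library code."""
    # Optionally cut sequences that are longer than max_length.
    if truncation and max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]

    # Resolve the target length from the chosen strategy.
    if padding is True or padding == "longest":
        target = max(len(seq) for seq in sequences)
    elif padding == "max_length":
        if max_length is None:
            # Simplification: this sketch has no model-level maximum to fall back on.
            raise ValueError("padding='max_length' needs max_length in this sketch")
        target = max_length
    else:  # False or 'do_not_pad'
        target = None

    # Round the target up to the next multiple, if requested.
    if target is not None and pad_to_multiple_of:
        remainder = target % pad_to_multiple_of
        if remainder:
            target += pad_to_multiple_of - remainder

    padded, masks = [], []
    for seq in sequences:
        missing = 0 if target is None else max(target - len(seq), 0)
        filler = [padding_value] * missing
        mask_real, mask_pad = [1] * len(seq), [0] * missing
        if padding_side == "right":
            padded.append(list(seq) + filler)
            masks.append(mask_real + mask_pad)
        else:
            padded.append(filler + list(seq))
            masks.append(mask_pad + mask_real)
    return {"input_values": padded, "attention_mask": masks}


batch = pad_batch([[0.1, 0.2, 0.3], [0.4]], padding="longest", pad_to_multiple_of=4)
print(batch["input_values"])    # [[0.1, 0.2, 0.3, 0.0], [0.4, 0.0, 0.0, 0.0]]
print(batch["attention_mask"])  # [[1, 1, 1, 0], [1, 0, 0, 0]]
```

Here the longest sequence has length 3, which pad_to_multiple_of=4 rounds up to 4; the attention mask marks real values with 1 and padding with 0.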


class transformers.BatchFeature

( data: typing.Union[typing.Dict[str, typing.Any], NoneType] = None, tensor_type: typing.Union[NoneType, str, transformers.file_utils.TensorType] = None )


  • data (dict) — Dictionary of lists/arrays/tensors returned by the __call__/pad methods (‘input_values’, ‘attention_mask’, etc.).
  • tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of values into PyTorch/TensorFlow/NumPy tensors at initialization.

Holds the output of the pad() and feature extractor specific __call__ methods.

This class is derived from a Python dictionary and can be used as a dictionary.

convert_to_tensors

( tensor_type: typing.Union[str, transformers.file_utils.TensorType, NoneType] = None )


  • tensor_type (str or TensorType, optional) — The type of tensors to use. If str, should be one of the values of the enum TensorType. If None, no modification is done.

Convert the inner content to tensors.

to

( device: typing.Union[str, ForwardRef('torch.device')] ) → BatchFeature


  • device (str or torch.device) — The device to put the tensors on.



Returns the same instance after modification.

Send all values to device by calling v.to(device) (PyTorch only).


class transformers.ImageFeatureExtractionMixin

( )

Mixin that contains utilities for preparing image features.

center_crop

( image, size )


  • image (PIL.Image.Image or np.ndarray or torch.Tensor) — The image to crop.
  • size (int or Tuple[int, int]) — The size to which to crop the image.

Crops image to the given size using a center crop. Note that if the image is too small to be cropped to the size given, it will be padded (so the returned result has the size asked).
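
The crop-or-pad behavior described above can be sketched on a plain 2-D list. The function below is a hypothetical illustration, not the library's center_crop: it assumes a single-channel image stored as rows of floats, an int size meaning a square crop, a (height, width) order for tuple sizes, and zero as the padding value.

```python
from typing import List, Tuple, Union

Image2D = List[List[float]]


def center_crop_2d(image: Image2D, size: Union[int, Tuple[int, int]], fill: float = 0.0) -> Image2D:
    """Center-crop a height x width image; pad with `fill` when the image is too small."""
    target_h, target_w = (size, size) if isinstance(size, int) else size
    height, width = len(image), len(image[0])
    # The offsets become negative when the image is smaller than the target.
    top = (height - target_h) // 2
    left = (width - target_w) // 2
    result = []
    for row in range(top, top + target_h):
        new_row = []
        for col in range(left, left + target_w):
            inside = 0 <= row < height and 0 <= col < width
            new_row.append(image[row][col] if inside else fill)
        result.append(new_row)
    return result


large = [[float(4 * r + c) for c in range(4)] for r in range(4)]
print(center_crop_2d(large, 2))                     # [[5.0, 6.0], [9.0, 10.0]]
print(center_crop_2d([[1.0, 2.0], [3.0, 4.0]], 4))  # original values surrounded by zeros
```

In both cases the result has exactly the requested size, which is the property the description above guarantees.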

normalize

( image, mean, std )


  • image (PIL.Image.Image or np.ndarray or torch.Tensor) — The image to normalize.
  • mean (List[float] or np.ndarray or torch.Tensor) — The mean (per channel) to use for normalization.
  • std (List[float] or np.ndarray or torch.Tensor) — The standard deviation (per channel) to use for normalization.

Normalizes image with mean and std. Note that this will trigger a conversion of image to a NumPy array if it’s a PIL Image.
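
Per-channel normalization amounts to computing (pixel - mean) / std with one mean and one std per channel. The sketch below shows that arithmetic on nested lists; normalize_channels is a hypothetical name, and the channel-first [C][H][W] layout is an assumption made for the illustration.

```python
from typing import List

ImageCHW = List[List[List[float]]]


def normalize_channels(image: ImageCHW, mean: List[float], std: List[float]) -> ImageCHW:
    """Apply (pixel - mean) / std per channel to a channel-first image."""
    return [
        [[(pixel - m) / s for pixel in row] for row in channel]
        for channel, m, s in zip(image, mean, std)
    ]


image = [
    [[0.5, 1.0]],   # channel 0, one row of two pixels
    [[0.0, 0.25]],  # channel 1
]
print(normalize_channels(image, mean=[0.5, 0.25], std=[0.5, 0.25]))
# [[[0.0, 1.0]], [[-1.0, 0.0]]]
```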

resize

( image, size, resample = 2 )


  • image (PIL.Image.Image or np.ndarray or torch.Tensor) — The image to resize.
  • size (int or Tuple[int, int]) — The size to use for resizing the image.
  • resample (int, optional, defaults to PIL.Image.BILINEAR) — The filter to use for resampling.

Resizes image. Note that this will trigger a conversion of image to a PIL Image.

to_numpy_array

( image, rescale = None, channel_first = True )


  • image (PIL.Image.Image or np.ndarray or torch.Tensor) — The image to convert to a NumPy array.
  • rescale (bool, optional) — Whether or not to apply the scaling factor (to make pixel values floats between 0. and 1.). Will default to True if the image is a PIL Image or an array/tensor of integers, False otherwise.
  • channel_first (bool, optional, defaults to True) — Whether or not to permute the dimensions of the image to put the channel dimension first.

Converts image to a numpy array. Optionally rescales it and puts the channel dimension as the first dimension.
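
The two optional steps, rescaling and moving the channel dimension first, can be sketched on nested lists. The helper to_channel_first is hypothetical and does not use NumPy; it assumes a height x width x channel input with integer pixel values in 0..255 and a scaling factor of 1/255, which matches the "floats between 0. and 1." description but is an assumption about the exact factor.

```python
def to_channel_first(image_hwc, rescale=True):
    """Convert a [H][W][C] nested list to [C][H][W], optionally dividing by 255."""
    height, width, channels = len(image_hwc), len(image_hwc[0]), len(image_hwc[0][0])
    return [
        [
            [
                image_hwc[y][x][c] / 255.0 if rescale else image_hwc[y][x][c]
                for x in range(width)
            ]
            for y in range(height)
        ]
        for c in range(channels)
    ]


# One row with a pure red pixel and a pure green pixel.
pixels = [[[255, 0, 0], [0, 255, 0]]]
print(to_channel_first(pixels))
# [[[1.0, 0.0]], [[0.0, 1.0]], [[0.0, 0.0]]]
```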

to_pil_image

( image, rescale = None )


  • image (PIL.Image.Image or numpy.ndarray or torch.Tensor) — The image to convert to the PIL Image format.
  • rescale (bool, optional) — Whether or not to apply the scaling factor (to make pixel values integers between 0 and 255). Will default to True if the image type is a floating type, False otherwise.

Converts image to a PIL Image. Optionally rescales it and puts the channel dimension back as the last axis if needed.