Feature Extractor

A feature extractor is in charge of preparing input features for a multi-modal model. This includes feature extraction from sequences, e.g., pre-processing audio files to Log-Mel Spectrogram features, feature extraction from images e.g. cropping image image files, but also padding, normalization, and conversion to Numpy, PyTorch, and TensorFlow tensors.


class transformers.feature_extraction_utils.FeatureExtractionMixin(**kwargs)[source]

This is a feature extraction mixin used to provide saving/loading functionality for sequential and image feature extractors.

classmethod from_pretrained(pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) → SequenceFeatureExtractor[source]

Instantiate a type of FeatureExtractionMixin from a feature extractor, e.g. a derived class of SequenceFeatureExtractor.

  • pretrained_model_name_or_path (str or os.PathLike) –

    This can be either:

    • a string, the model id of a pretrained feature_extractor hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased.

    • a path to a directory containing a feature extractor file saved using the save_pretrained() method, e.g., ./my_model_directory/.

    • a path or url to a saved feature extractor JSON file, e.g., ./my_model_directory/feature_extraction_config.json.

  • cache_dir (str or os.PathLike, optional) – Path to a directory in which a downloaded pretrained model feature extractor should be cached if the standard cache should not be used.

  • force_download (bool, optional, defaults to False) – Whether or not to force to (re-)download the feature extractor files and override the cached versions if they exist.

  • resume_download (bool, optional, defaults to False) – Whether or not to delete incompletely received file. Attempts to resume the download if such a file exists.

  • proxies (Dict[str, str], optional) – A dictionary of proxy servers to use by protocol or endpoint, e.g., {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. The proxies are used on each request.

  • use_auth_token (str or bool, optional) – The token to use as HTTP bearer authorization for remote files. If True, will use the token generated when running transformers-cli login (stored in huggingface).

  • revision (str, optional, defaults to "main") – The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so revision can be any identifier allowed by git.

  • return_unused_kwargs (bool, optional, defaults to False) – If False, then this function returns just the final feature extractor object. If True, then this functions returns a Tuple(feature_extractor, unused_kwargs) where unused_kwargs is a dictionary consisting of the key/value pairs whose keys are not feature extractor attributes: i.e., the part of kwargs which has not been used to update feature_extractor and is otherwise ignored.

  • kwargs (Dict[str, Any], optional) – The values in kwargs of any keys which are feature extractor attributes will be used to override the loaded values. Behavior concerning key/value pairs whose keys are not feature extractor attributes is controlled by the return_unused_kwargs keyword parameter.


Passing use_auth_token=True is required when you want to use a private model.


A feature extractor of type FeatureExtractionMixin.


# We can't instantiate directly the base class `FeatureExtractionMixin` nor `SequenceFeatureExtractor` so let's show the examples on a
# derived class: `Wav2Vec2FeatureExtractor`
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('facebook/wav2vec2-base-960h')    # Download feature_extraction_config from huggingface.co and cache.
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('./test/saved_model/')  # E.g. feature_extractor (or model) was saved using `save_pretrained('./test/saved_model/')`
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('./test/saved_model/preprocessor_config.json')
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('facebook/wav2vec2-base-960h', return_attention_mask=False, foo=False)
assert feature_extractor.return_attention_mask is False
feature_extractor, unused_kwargs = Wav2Vec2FeatureExtractor.from_pretrained('facebook/wav2vec2-base-960h', return_attention_mask=False,
                                                   foo=False, return_unused_kwargs=True)
assert feature_extractor.return_attention_mask is False
assert unused_kwargs == {'foo': False}
save_pretrained(save_directory: Union[str, os.PathLike])[source]

Save a feature_extractor object to the directory save_directory, so that it can be re-loaded using the from_pretrained() class method.


save_directory (str or os.PathLike) – Directory where the feature extractor JSON file will be saved (will be created if it does not exist).


class transformers.SequenceFeatureExtractor(feature_size: int, sampling_rate: int, padding_value: float, **kwargs)[source]

This is a general feature extraction class for speech recognition.

  • feature_size (int) – The feature dimension of the extracted features.

  • sampling_rate (int) – The sampling rate at which the audio files should be digitalized expressed in Hertz per second (Hz).

  • padding_value (float) – The value that is used to fill the padding values / vectors.

pad(processed_features: Union[transformers.feature_extraction_utils.BatchFeature, List[transformers.feature_extraction_utils.BatchFeature], Dict[str, transformers.feature_extraction_utils.BatchFeature], Dict[str, List[transformers.feature_extraction_utils.BatchFeature]], List[Dict[str, transformers.feature_extraction_utils.BatchFeature]]], padding: Union[bool, str, transformers.file_utils.PaddingStrategy] = True, max_length: Optional[int] = None, pad_to_multiple_of: Optional[int] = None, return_attention_mask: Optional[bool] = None, return_tensors: Optional[Union[str, transformers.file_utils.TensorType]] = None) → transformers.feature_extraction_utils.BatchFeature[source]

Pad input values / input vectors or a batch of input values / input vectors up to predefined length or to the max sequence length in the batch.

Padding side (left/right) padding values are defined at the feature extractor level (with self.padding_side, self.padding_value)


If the processed_features passed are dictionary of numpy arrays, PyTorch tensors or TensorFlow tensors, the result will use the same type unless you provide a different tensor type with return_tensors. In the case of PyTorch tensors, you will lose the specific device of your tensors however.

  • processed_features (BatchFeature, list of BatchFeature, Dict[str, List[float]], Dict[str, List[List[float]] or List[Dict[str, List[float]]]) –

    Processed inputs. Can represent one input (BatchFeature or Dict[str, List[float]]) or a batch of input values / vectors (list of BatchFeature, Dict[str, List[List[float]]] or List[Dict[str, List[float]]]) so you can use this method during preprocessing as well as in a PyTorch Dataloader collate function.

    Instead of List[float] you can have tensors (numpy arrays, PyTorch tensors or TensorFlow tensors), see the note above for the return type.

  • padding (bool, str or PaddingStrategy, optional, defaults to True) –

    Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:

    • True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence if provided).

    • 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.

    • False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).

  • max_length (int, optional) – Maximum length of the returned list and optionally padding length (see above).

  • pad_to_multiple_of (int, optional) –

    If set will pad the sequence to a multiple of the provided value.

    This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.

  • return_attention_mask (bool, optional) –

    Whether to return the attention mask. If left to the default, will return the attention mask according to the specific feature_extractor’s default.

    What are attention masks?

  • return_tensors (str or TensorType, optional) –

    If set, will return tensors instead of list of python integers. Acceptable values are:

    • 'tf': Return TensorFlow tf.constant objects.

    • 'pt': Return PyTorch torch.Tensor objects.

    • 'np': Return Numpy np.ndarray objects.


class transformers.BatchFeature(data: Optional[Dict[str, Any]] = None, tensor_type: Union[None, str, transformers.file_utils.TensorType] = None)[source]

Holds the output of the pad() and feature extractor specific __call__ methods.

This class is derived from a python dictionary and can be used as a dictionary.

  • data (dict) – Dictionary of lists/arrays/tensors returned by the __call__/pad methods (‘input_values’, ‘attention_mask’, etc.).

  • tensor_type (Union[None, str, TensorType], optional) – You can give a tensor_type here to convert the lists of integers in PyTorch/TensorFlow/Numpy Tensors at initialization.

convert_to_tensors(tensor_type: Optional[Union[str, transformers.file_utils.TensorType]] = None)[source]

Convert the inner content to tensors.


tensor_type (str or TensorType, optional) – The type of tensors to use. If str, should be one of the values of the enum TensorType. If None, no modification is done.

items() → a set-like object providing a view on D’s items[source]
keys() → a set-like object providing a view on D’s keys[source]
to(device: Union[str, torch.device])BatchFeature[source]

Send all values to device by calling v.to(device) (PyTorch only).


device (str or torch.device) – The device to put the tensors on.


The same instance after modification.

Return type


values() → an object providing a view on D’s values[source]


class transformers.image_utils.ImageFeatureExtractionMixin[source]

Mixin that contain utilities for preparing image features.

center_crop(image, size)[source]

Crops image to the given size using a center crop. Note that if the image is too small to be cropped to the size given, it will be padded (so the returned result has the size asked).

  • image (PIL.Image.Image or np.ndarray or torch.Tensor) – The image to resize.

  • size (int or Tuple[int, int]) – The size to which crop the image.

normalize(image, mean, std)[source]

Normalizes image with mean and std. Note that this will trigger a conversion of image to a NumPy array if it’s a PIL Image.

  • image (PIL.Image.Image or np.ndarray or torch.Tensor) – The image to normalize.

  • mean (List[float] or np.ndarray or torch.Tensor) – The mean (per channel) to use for normalization.

  • std (List[float] or np.ndarray or torch.Tensor) – The standard deviation (per channel) to use for normalization.

resize(image, size, resample=2)[source]

Resizes image. Note that this will trigger a conversion of image to a PIL Image.

  • image (PIL.Image.Image or np.ndarray or torch.Tensor) – The image to resize.

  • size (int or Tuple[int, int]) – The size to use for resizing the image.

  • resample (int, optional, defaults to PIL.Image.BILINEAR) – The filter to user for resampling.

to_numpy_array(image, rescale=None, channel_first=True)[source]

Converts image to a numpy array. Optionally rescales it and puts the channel dimension as the first dimension.

  • image (PIL.Image.Image or np.ndarray or torch.Tensor) – The image to convert to a NumPy array.

  • rescale (bool, optional) – Whether or not to apply the scaling factor (to make pixel values floats between 0. and 1.). Will default to True if the image is a PIL Image or an array/tensor of integers, False otherwise.

  • channel_first (bool, optional, defaults to True) – Whether or not to permute the dimensions of the image to put the channel dimension first.

to_pil_image(image, rescale=None)[source]

Converts image to a PIL Image. Optionally rescales it and puts the channel dimension back as the last axis if needed.

  • image (PIL.Image.Image or numpy.ndarray or torch.Tensor) – The image to convert to the PIL Image format.

  • rescale (bool, optional) – Whether or not to apply the scaling factor (to make pixel values integers between 0 and 255). Will default to True if the image type is a floating type, False otherwise.