Transformers documentation

Utilities for `FeatureExtractors`

You are viewing version v4.27.0. A newer version, v4.41.0, is available.

This page lists all the utility functions that can be used by the audio FeatureExtractor to compute special features from raw audio using common algorithms such as the Short-Time Fourier Transform or the log-mel spectrogram.

Most of those are only useful if you are studying the code of the audio feature extractors in the library.

Audio Transformations


hertz_to_mel( freq: float, mel_scale: str = 'htk' ) → mels (float)


  • freq (float) — Frequency in Hertz
  • mel_scale (str, optional, defaults to "htk") — Scale to use, htk or slaney.


mels (float)

Frequency in Mels

Convert Hertz to Mels.
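The two scales can be sketched as follows. This is a minimal illustration based on the commonly published HTK and Slaney formulas, not the library's own code; `hz_to_mel_sketch` is a hypothetical helper name introduced here.

```python
import math


def hz_to_mel_sketch(freq: float, mel_scale: str = "htk") -> float:
    """Illustrative Hertz -> mel conversion (hypothetical helper, not the library function)."""
    if mel_scale == "htk":
        # HTK formula: logarithmic over the whole frequency range.
        return 2595.0 * math.log10(1.0 + freq / 700.0)
    if mel_scale == "slaney":
        # Slaney formula: linear below 1 kHz, logarithmic above.
        f_sp = 200.0 / 3.0                  # Hertz per mel in the linear region
        min_log_hertz = 1000.0              # start of the logarithmic region
        min_log_mel = min_log_hertz / f_sp  # 15 mels
        logstep = math.log(6.4) / 27.0      # step size in the logarithmic region
        if freq >= min_log_hertz:
            return min_log_mel + math.log(freq / min_log_hertz) / logstep
        return freq / f_sp
    raise ValueError('mel_scale should be one of "htk" or "slaney"')


htk_1k = hz_to_mel_sketch(1000.0, "htk")        # close to 1000 mels
slaney_1k = hz_to_mel_sketch(1000.0, "slaney")  # close to 15 mels
```

Note that the two scales use very different units: 1000 Hertz maps to roughly 1000 mels on the HTK scale but to about 15 on the Slaney scale, so values from the two scales should not be mixed.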


mel_to_hertz( mels: array, mel_scale: str = 'htk' ) → freqs (np.array)


  • mels (np.array) — Mel frequencies
  • mel_scale (str, optional, defaults to "htk") — Scale to use: htk or slaney.


freqs (np.array)

Mels converted to Hertz

Convert mel bin numbers to frequencies.
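As a rough check of the inverse mapping, the sketch below inverts the HTK formula with NumPy and verifies that a round trip recovers the original frequencies. The helper names `hz_to_mel_htk` and `mel_to_hz_htk` are hypothetical and only the "htk" scale is covered.

```python
import numpy as np


def hz_to_mel_htk(freqs):
    """Hertz -> mel with the HTK formula (hypothetical helper)."""
    freqs = np.asarray(freqs, dtype=np.float64)
    return 2595.0 * np.log10(1.0 + freqs / 700.0)


def mel_to_hz_htk(mels):
    """Mel -> Hertz: algebraic inverse of the HTK formula (hypothetical helper)."""
    mels = np.asarray(mels, dtype=np.float64)
    return 700.0 * (10.0 ** (mels / 2595.0) - 1.0)


freqs = np.array([0.0, 440.0, 1000.0, 8000.0])
recovered = mel_to_hz_htk(hz_to_mel_htk(freqs))  # should match freqs up to rounding
```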


get_mel_filter_banks( nb_frequency_bins: int, nb_mel_filters: int, frequency_min: float, frequency_max: float, sample_rate: int, norm: typing.Optional[str] = None, mel_scale: str = 'htk' ) → np.ndarray


  • nb_frequency_bins (int) — Number of frequencies used to compute the spectrogram (should be the same as in stft).
  • nb_mel_filters (int) — Number of Mel filters to generate.
  • frequency_min (float) — Minimum frequency of interest (Hertz).
  • frequency_max (float) — Maximum frequency of interest (Hertz).
  • sample_rate (int) — Sample rate of the audio waveform.
  • norm (str, optional) — If "slaney", divide the triangular Mel weights by the width of the mel band (area normalization).
  • mel_scale (str, optional, defaults to "htk") — Scale to use: "htk" or "slaney".



Triangular filter banks (fb matrix) of shape (nb_frequency_bins, nb_mel_filters). This matrix is a projection matrix to go from a spectrogram to a Mel Spectrogram.

Create a frequency bin conversion matrix used to obtain the Mel Spectrogram. This is called a mel filter bank, and various implementations exist, which differ in the number of filters, the shape of the filters, the way the filters are spaced, the bandwidth of the filters, and the manner in which the spectrum is warped. The goal of these features is to approximate the non-linear human perception of the variation in pitch with respect to the frequency. This code is heavily inspired by the torchaudio implementation; refer to it for more details.


  • Different banks of Mel filters were introduced in the literature. The following variations are supported:
    • MFCC FB-20: introduced in 1980 by Davis and Mermelstein, it assumes a sampling frequency of 10 kHz and a speech bandwidth of [0, 4600] Hz.
    • MFCC FB-24 HTK: from the Cambridge HMM Toolkit (HTK) (1995), uses a filter bank of 24 filters for a speech bandwidth of [0, 8000] Hz (sampling rate ≥ 16 kHz).
    • MFCC FB-40: from the Auditory Toolbox for MATLAB written by Slaney in 1998, assumes a sampling rate of 16 kHz and a speech bandwidth of [133, 6854] Hz. This version also includes an area normalization.
    • HFCC-E FB-29 (Human Factor Cepstral Coefficients) of Skowronski and Harris (2004), assumes a sampling rate of 12.5 kHz and a speech bandwidth of [0, 6250] Hz.
  • The default parameters of torchaudio's mel filterbanks implement the "htk" filters while torchlibrosa uses the "slaney" implementation.
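To make the shape and role of the projection matrix concrete, here is a minimal NumPy sketch of a triangular mel filter bank on the "htk" scale. It follows the general construction described above (filter edges equally spaced in mels, triangles evaluated at each FFT bin frequency) and is not the library's implementation; `mel_filter_bank_sketch` and its helpers are hypothetical names, and the parameter values in the usage lines are arbitrary.

```python
import numpy as np


def hz_to_mel_htk(freq):
    return 2595.0 * np.log10(1.0 + np.asarray(freq, dtype=np.float64) / 700.0)


def mel_to_hz_htk(mels):
    return 700.0 * (10.0 ** (np.asarray(mels, dtype=np.float64) / 2595.0) - 1.0)


def mel_filter_bank_sketch(nb_frequency_bins, nb_mel_filters, frequency_min,
                           frequency_max, sample_rate, norm=None):
    """Return a (nb_frequency_bins, nb_mel_filters) matrix of triangular filters."""
    # Frequency (Hz) of each FFT bin, from 0 up to the Nyquist frequency.
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, nb_frequency_bins)
    # nb_mel_filters + 2 points equally spaced in mels: left edge, peak, right edge.
    mel_points = np.linspace(hz_to_mel_htk(frequency_min),
                             hz_to_mel_htk(frequency_max), nb_mel_filters + 2)
    hz_points = mel_to_hz_htk(mel_points)
    widths = np.diff(hz_points)
    # Signed distance from every FFT bin to every edge point.
    slopes = hz_points[None, :] - fft_freqs[:, None]
    rising = -slopes[:, :-2] / widths[:-1]   # left side of each triangle
    falling = slopes[:, 2:] / widths[1:]     # right side of each triangle
    filters = np.maximum(0.0, np.minimum(rising, falling))
    if norm == "slaney":
        # Area normalization: divide each triangle by the width of its mel band.
        filters = filters * (2.0 / (hz_points[2:] - hz_points[:-2]))
    return filters


filters = mel_filter_bank_sketch(201, 20, 0.0, 8000.0, 16000)
power_spectrogram = np.random.default_rng(0).random((10, 201))  # (num_frames, bins)
mel_spectrogram = power_spectrogram @ filters                   # (num_frames, mels)
```

With too many filters for the available frequency resolution, some of the lowest triangles can fall between two FFT bins and end up all zero, which is why the usage lines keep the filter count small.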


stft( frames: array, windowing_function: array, fft_window_size: int = None ) → spectrogram (np.ndarray)


  • frames (np.array of dimension (num_frames, fft_window_size)) — A framed audio signal obtained using audio_utils.fram_wave.
  • windowing_function (np.array of dimension (fft_window_size,)) — An array representing the function used to reduce the amplitude of the discontinuities at the boundaries of each frame when computing the STFT. Each frame will be multiplied by the windowing_function. For more information on these discontinuities, called spectral leakage, refer to [this tutorial].
  • fft_window_size (int, optional) — Size of the window on which the Fourier transform is applied. This controls the frequency resolution of the spectrogram. 400 means that the Fourier transform is computed on windows of 400 samples. The number of frequency bins (nb_frequency_bins) used to divide the window into equal strips is equal to 1 + fft_window_size // 2. Increasing fft_window_size increases the computation time accordingly.


spectrogram (np.ndarray)

A spectrogram of shape (num_frames, nb_frequency_bins) obtained using the STFT algorithm

Calculates the complex Short-Time Fourier Transform (STFT) of the given framed signal. Should give the same results as torch.stft.


>>> from transformers.audio_utils import stft, fram_wave
>>> import numpy as np

>>> audio = np.random.rand(50)
>>> fft_window_size = 10
>>> hop_length = 2
>>> framed_audio = fram_wave(audio, hop_length, fft_window_size)
>>> # Drop the last sample so the Hann window has fft_window_size points, like each frame
>>> spectrogram = stft(framed_audio, np.hanning(fft_window_size + 1)[:-1])
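For readers who want to see the underlying computation without the library, the following NumPy-only sketch applies a window to each frame and takes a real FFT. It is an illustration under the assumption that the frames and the window have the same length; `stft_sketch` is a hypothetical name, and the output layout of the library function may differ.

```python
import numpy as np


def stft_sketch(frames, window):
    """Windowed FFT of each frame (hypothetical helper).

    frames: array of shape (num_frames, frame_size)
    window: array of shape (frame_size,)
    Returns a complex array of shape (num_frames, frame_size // 2 + 1).
    """
    frames = np.asarray(frames, dtype=np.float64)
    window = np.asarray(window, dtype=np.float64)
    if window.shape[0] != frames.shape[1]:
        raise ValueError("window and frames must have the same length")
    # Tapering each frame reduces the discontinuities at its boundaries.
    windowed = frames * window
    # rfft keeps only the non-negative frequencies of a real-valued signal.
    return np.fft.rfft(windowed, axis=-1)


rng = np.random.default_rng(0)
frames = rng.random((26, 10))      # 26 frames of 10 samples each
window = np.hanning(10 + 1)[:-1]   # periodic Hann window with 10 points
spectrogram = stft_sketch(frames, window)
power_spectrogram = np.abs(spectrogram) ** 2
```

Taking the squared magnitude, as in the last line, gives the power spectrogram that a mel filter bank is usually applied to.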


power_to_db( mel_spectrogram, top_db = None, a_min = 1e-10, ref = 1.0 )


  • mel_spectrogram (np.array) — Input mel spectrogram.
  • top_db (int, optional) — The maximum decibel value.
  • a_min (float, optional, defaults to 1e-10) — Minimum value to use when clipping the mel spectrogram.
  • ref (float, optional, defaults to 1.0) — Maximum reference value used to scale the mel_spectrogram.

Convert a mel spectrogram from power to dB scale. This function is the NumPy implementation of librosa.power_to_db. It computes 10 * log10(mel_spectrogram / ref), using basic log properties for stability.


  • The motivation behind applying the log function on the mel spectrogram is that humans do not hear loudness on a linear scale. Generally, to double the perceived volume of a sound we need to put 8 times as much energy into it.
  • This means that large variations in energy may not sound all that different if the sound is loud to begin with. This compression operation makes the mel features match more closely what humans actually hear.
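The conversion described above can be sketched in NumPy as follows. This mirrors the usual power-to-decibel recipe (clip, take the log, subtract the reference, optionally limit the dynamic range) and is an assumption about the general approach rather than the library's exact code; `power_to_db_sketch` is a hypothetical name.

```python
import numpy as np


def power_to_db_sketch(spectrogram, top_db=None, a_min=1e-10, ref=1.0):
    """Convert a power spectrogram to decibels (hypothetical helper)."""
    spectrogram = np.asarray(spectrogram, dtype=np.float64)
    # log10(x / ref) = log10(x) - log10(ref); clipping at a_min avoids log10(0).
    log_spec = 10.0 * np.log10(np.maximum(a_min, spectrogram))
    log_spec -= 10.0 * np.log10(np.maximum(a_min, ref))
    if top_db is not None:
        if top_db < 0:
            raise ValueError("top_db must be non-negative")
        # Keep values within top_db decibels of the loudest entry.
        log_spec = np.maximum(log_spec, log_spec.max() - top_db)
    return log_spec


power = np.array([1.0, 10.0, 100.0, 0.0])
db = power_to_db_sketch(power)                        # zero power is clipped at a_min
db_limited = power_to_db_sketch(power, top_db=80.0)   # floor set 80 dB below the peak
```

In this sketch, top_db acts as a dynamic-range limit relative to the loudest value, which is how the librosa function treats the parameter of the same name.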


fram_wave( waveform: array, hop_length: int = 160, fft_window_size: int = 400, center: bool = True ) → framed_waveform (np.array of shape (waveform.shape[0] // hop_length, fft_window_size))


  • waveform (np.array of shape (sample_length,)) — The raw waveform which will be split into smaller chunks.
  • hop_length (int, optional, defaults to 160) — Step between each window of the waveform.
  • fft_window_size (int, optional, defaults to 400) — Defines the size of the window.
  • center (bool, defaults to True) — Whether or not to center each frame on its position in the waveform. Centering is done by reflecting the waveform on the left and on the right.


framed_waveform (np.array of shape (waveform.shape[0] // hop_length, fft_window_size))

The framed waveforms that can be fed to np.fft.

In order to compute the Short-Time Fourier Transform, the waveform needs to be split into overlapping windowed segments called frames.

The window length (fft_window_size) defines how much of the signal is contained in each frame, while the hop length defines the step between the beginning of each new frame.
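The relationship between window length, hop length and centering can be illustrated with a short NumPy sketch. It pads the waveform by reflection when center is True and then slices overlapping frames; `frame_waveform_sketch` is a hypothetical name, and the exact frame boundaries and frame count of the library function may differ from this simplified version.

```python
import numpy as np


def frame_waveform_sketch(waveform, hop_length=160, fft_window_size=400, center=True):
    """Split a 1D waveform into overlapping frames (hypothetical helper)."""
    waveform = np.asarray(waveform)
    if center:
        # Reflect-pad both ends so that frame i is centered on sample i * hop_length.
        pad = fft_window_size // 2
        waveform = np.pad(waveform, (pad, pad), mode="reflect")
    num_frames = 1 + (len(waveform) - fft_window_size) // hop_length
    # Row i holds the sample indices of frame i.
    starts = hop_length * np.arange(num_frames)
    index = starts[:, None] + np.arange(fft_window_size)[None, :]
    return waveform[index]


audio = np.arange(50, dtype=np.float64)
frames = frame_waveform_sketch(audio, hop_length=2, fft_window_size=10)
frames_uncentered = frame_waveform_sketch(audio, hop_length=2, fft_window_size=10,
                                          center=False)
```

Because the hop length (2) is smaller than the window length (10), consecutive frames share 8 samples; a larger hop length reduces the overlap and the number of frames.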