Utilities for FeatureExtractors
This page lists all the utility functions that can be used by the audio FeatureExtractor to compute special features from raw audio using common algorithms such as the Short-Time Fourier Transform or the log-Mel spectrogram.
Most of these are only useful if you are studying the code of the audio processors in the library.
Audio Transformations
transformers.audio_utils.hertz_to_mel
(freq: float, mel_scale: str = 'htk') → mels (float)
Convert Hertz to Mels.
transformers.audio_utils.mel_to_hertz
(mels: array, mel_scale: str = 'htk') → freqs (np.array)
Convert mel bin numbers to frequencies.
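As a rough illustration of the two conversions above, the sketch below uses the widely cited HTK formula, mel = 2595 * log10(1 + f / 700), and its inverse. The function names are hypothetical and this is not claimed to be the library's exact code; the "slaney" scale uses a different, piecewise formula that is not shown here.

```python
import numpy as np


def hertz_to_mel_htk(freq):
    """Hypothetical helper: Hertz -> mels with the HTK formula."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq, dtype=float) / 700.0)


def mel_to_hertz_htk(mels):
    """Hypothetical helper: mels -> Hertz, the inverse of the formula above."""
    return 700.0 * (10.0 ** (np.asarray(mels, dtype=float) / 2595.0) - 1.0)


# Round trip: converting to mels and back should recover the input frequencies.
freqs = np.array([0.0, 440.0, 1000.0, 8000.0])
recovered = mel_to_hertz_htk(hertz_to_mel_htk(freqs))
```

On this scale 0 Hz maps to 0 mel and 1000 Hz maps to roughly 1000 mel, which is the anchor point the HTK formula was designed around.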
transformers.audio_utils.get_mel_filter_banks
(nb_frequency_bins: int, nb_mel_filters: int, frequency_min: float, frequency_max: float, sample_rate: int, norm: typing.Optional[str] = None, mel_scale: str = 'htk') → np.ndarray
Parameters

nb_frequency_bins (int) — Number of frequencies used to compute the spectrogram (should be the same as in stft).
nb_mel_filters (int) — Number of Mel filters to generate.
frequency_min (float) — Minimum frequency of interest (Hertz).
frequency_max (float) — Maximum frequency of interest (Hertz).
sample_rate (int) — Sample rate of the audio waveform.
norm (str, optional) — If "slaney", divide the triangular Mel weights by the width of the mel band (area normalization).
mel_scale (str, optional, defaults to "htk") — Scale to use: "htk" or "slaney".
Returns
np.ndarray
Triangular filter banks (fb matrix) of shape (nb_frequency_bins, nb_mel_filters). This matrix is a projection matrix to go from a spectrogram to a Mel Spectrogram.
Create a frequency bin conversion matrix used to obtain the Mel Spectrogram. This is called a mel filter bank, and various implementations exist, which differ in the number of filters, the shape of the filters, the way the filters are spaced, the bandwidth of the filters, and the manner in which the spectrum is warped. The goal of these features is to approximate the nonlinear human perception of variations in pitch with respect to frequency. This code is heavily inspired by the torchaudio implementation; refer to it for more details.
Tips:
 Different banks of Mel filters were introduced in the literature. The following variations are supported:
 MFCC FB20: introduced in 1980 by Davis and Mermelstein, it assumes a sampling frequency of 10 kHertz and a speech bandwidth of [0, 4600] Hertz.
 MFCC FB24 HTK: from the Cambridge HMM Toolkit (HTK) (1995), uses a filter bank of 24 filters for a speech bandwidth of [0, 8000] Hertz (sampling rate ≥ 16 kHertz).
 MFCC FB40: from the Auditory Toolbox for MATLAB written by Slaney in 1998, assumes a sampling rate of 16 kHertz and a speech bandwidth of [133, 6854] Hertz. This version also includes an area normalization.
 HFCCE FB29 (Human Factor Cepstral Coefficients) of Skowronski and Harris (2004), assumes a sampling rate of 12.5 kHertz and a speech bandwidth of [0, 6250] Hertz.
 The default parameters of torchaudio’s mel filterbanks implement the "htk" filters while torchlibrosa uses the "slaney" implementation.
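To make the idea of a triangular mel filter bank more concrete, here is a minimal NumPy sketch. It assumes the HTK mel formula, no normalization, and a spectrogram whose bins span 0 Hz to the Nyquist frequency; the function names are hypothetical, and the library's own get_mel_filter_banks may differ in details such as normalization and the "slaney" scale.

```python
import numpy as np


def _hz_to_mel(freq):
    # HTK mel formula (assumed here for illustration).
    return 2595.0 * np.log10(1.0 + np.asarray(freq, dtype=float) / 700.0)


def _mel_to_hz(mels):
    return 700.0 * (10.0 ** (np.asarray(mels, dtype=float) / 2595.0) - 1.0)


def triangular_mel_filters(nb_frequency_bins, nb_mel_filters,
                           frequency_min, frequency_max, sample_rate):
    """Hypothetical sketch: returns a (nb_frequency_bins, nb_mel_filters) matrix."""
    # Frequency (in Hertz) represented by each spectrogram bin, 0 .. Nyquist.
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, nb_frequency_bins)
    # nb_mel_filters + 2 edge points, equally spaced on the mel scale.
    mel_points = np.linspace(_hz_to_mel(frequency_min),
                             _hz_to_mel(frequency_max),
                             nb_mel_filters + 2)
    hz_points = _mel_to_hz(mel_points)
    widths = np.diff(hz_points)                                  # (nb_mel_filters + 1,)
    distances = hz_points[np.newaxis, :] - fft_freqs[:, np.newaxis]
    rising = -distances[:, :-2] / widths[:-1]    # left slope of each triangle
    falling = distances[:, 2:] / widths[1:]      # right slope of each triangle
    # Each filter is the positive part of the lower of its two slopes.
    return np.maximum(0.0, np.minimum(rising, falling))


filters = triangular_mel_filters(201, 80, 0.0, 8000.0, 16000)
# A power spectrogram of shape (num_frames, 201) would be projected with:
# mel_spectrogram = power_spectrogram @ filters
```

Each column is one triangle that rises from the previous edge frequency to its centre and falls to the next edge, so neighbouring filters overlap by half.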
transformers.audio_utils.stft
(frames: array, windowing_function: array, fft_window_size: int = None) → spectrogram (np.ndarray)
Parameters

frames (np.array of dimension (num_frames, fft_window_size)) — A framed audio signal obtained using audio_utils.fram_wave.
windowing_function (np.array of dimension (nb_frequency_bins, nb_mel_filters)) — An array representing the function that will be used to reduce the amplitude of the discontinuities at the boundaries of each frame when computing the STFT. Each frame will be multiplied by the windowing_function. For more information on these discontinuities, called spectral leakage, refer to this tutorial: https://download.ni.com/evaluation/pxi/Understanding%20FFTs%20and%20Windowing.pdf
fft_window_size (int, optional) — Size of the window on which the Fourier transform is applied. This controls the frequency resolution of the spectrogram. 400 means that the Fourier transform is computed on windows of 400 samples. The number of frequency bins (nb_frequency_bins) used to divide the window into equal strips is equal to (1+fft_window_size)//2. Increasing fft_window_size increases the computation time proportionally.
Returns
spectrogram (np.ndarray)
A spectrogram of shape (num_frames, nb_frequency_bins) obtained using the STFT algorithm.
Calculates the complex Short-Time Fourier Transform (STFT) of the given framed signal. Should give the same results as torch.stft.
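The core of an STFT on an already framed signal can be sketched in a few lines of NumPy: multiply every frame by the window, then take a one-sided FFT of each frame. This is only an illustration with a hypothetical function name, not the library's implementation. Note that np.fft.rfft keeps n // 2 + 1 non-negative frequency bins for frames of length n, and the library function's bin count and output layout may differ.

```python
import numpy as np


def stft_sketch(frames, window):
    """Hypothetical sketch: windowed one-sided FFT of each frame.

    frames: array of shape (num_frames, frame_length)
    window: array of shape (frame_length,)
    """
    windowed = frames * window            # broadcast the window over all frames
    return np.fft.rfft(windowed, axis=-1)


frame_length = 400
n = np.arange(frame_length)
# Five identical frames containing a tone with exactly 10 cycles per frame.
tone_frames = np.tile(np.sin(2.0 * np.pi * 10.0 * n / frame_length), (5, 1))
spectrum = stft_sketch(tone_frames, np.hanning(frame_length))
peak_bins = np.abs(spectrum).argmax(axis=-1)
```

With the Hann window applied, the energy of the tone stays concentrated around bin 10 in every frame rather than leaking across the spectrum.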
transformers.audio_utils.power_to_db
(mel_spectrogram, top_db = None, a_min = 1e-10, ref = 1.0)
Convert a mel spectrogram from power to dB scale. This function is the NumPy implementation of librosa.power_to_db.
It computes 10 * log10(mel_spectrogram / ref), using basic log properties for stability.
Tips:
 The motivation behind applying the log function to the mel spectrogram is that humans do not hear loudness on a linear scale. Generally, to double the perceived volume of a sound we need to put 8 times as much energy into it.
 This means that large variations in energy may not sound all that different if the sound is loud to begin with. This compression operation makes the mel features match more closely what humans actually hear.
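The "basic log properties" mentioned above can be illustrated as follows: instead of dividing by ref before taking the logarithm, the two logarithms are computed separately and subtracted, with both arguments floored at a_min so that log10(0) never occurs. This is a sketch with a hypothetical function name; the handling of top_db (clipping values more than top_db below the maximum) follows the usual librosa-style convention and is an assumption about the library's behaviour.

```python
import numpy as np


def power_to_db_sketch(spectrogram, top_db=None, a_min=1e-10, ref=1.0):
    """Hypothetical sketch of a power -> decibel conversion."""
    spectrogram = np.asarray(spectrogram, dtype=float)
    # log10(S / ref) == log10(S) - log10(ref), with both arguments floored.
    log_spec = 10.0 * np.log10(np.maximum(a_min, spectrogram))
    log_spec -= 10.0 * np.log10(np.maximum(a_min, ref))
    if top_db is not None:
        # Assumed convention: limit the dynamic range to top_db below the peak.
        log_spec = np.maximum(log_spec, log_spec.max() - top_db)
    return log_spec


powers = np.array([1.0, 10.0, 100.0])
db = power_to_db_sketch(powers)                      # 0, 10 and 20 dB
db_clipped = power_to_db_sketch(powers, top_db=15)   # floor at 20 - 15 = 5 dB
```

Each tenfold increase in power adds 10 dB, which is the compression that the tips above describe.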
transformers.audio_utils.fram_wave
(waveform: array, hop_length: int = 160, fft_window_size: int = 400, center: bool = True) → framed_waveform (np.array of shape (waveform.shape // hop_length, fft_window_size))
Parameters

waveform (np.array of shape (sample_length,)) — The raw waveform which will be split into smaller chunks.
hop_length (int, optional, defaults to 160) — Step between each window of the waveform.
fft_window_size (int, optional, defaults to 400) — Defines the size of the window.
center (bool, defaults to True) — Whether or not to center each frame around the middle of the frame. Centering is done by reflecting the waveform on the left and on the right.
Returns
framed_waveform (np.array of shape (waveform.shape // hop_length, fft_window_size))
The framed waveforms that can be fed to np.fft.
In order to compute the Short-Time Fourier Transform, the waveform needs to be split into overlapping windowed segments called frames.
The window length (fft_window_size) defines how much of the signal is contained in each frame, while the hop length (hop_length) defines the step between the beginning of each new frame.
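The framing step can be sketched as below. This is a simplified illustration with a hypothetical function name: it reflect-pads the whole waveform by half a window on each side when center is True, then cuts frames every hop_length samples. The library's fram_wave handles the edges differently, so the exact number of frames it returns may not match this sketch.

```python
import numpy as np


def frame_waveform_sketch(waveform, hop_length=160, fft_window_size=400, center=True):
    """Hypothetical sketch: split a 1-D waveform into overlapping frames."""
    waveform = np.asarray(waveform, dtype=float)
    if center:
        # Reflect the signal at both ends so the first frame is centred on sample 0.
        half = fft_window_size // 2
        waveform = np.pad(waveform, (half, half), mode="reflect")
    # Start index of every frame that fits entirely inside the (padded) signal.
    starts = np.arange(0, len(waveform) - fft_window_size + 1, hop_length)
    # Build a (num_frames, fft_window_size) index grid and gather the samples.
    indices = starts[:, np.newaxis] + np.arange(fft_window_size)[np.newaxis, :]
    return waveform[indices]


signal = np.arange(1600, dtype=float)
centered_frames = frame_waveform_sketch(signal)
plain_frames = frame_waveform_sketch(signal, center=False)
```

With a hop of 160 and a window of 400, consecutive frames share 240 samples, which is the overlap the description above refers to.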