# REAPER: Robust Epoch And Pitch EstimatoR

This is a speech processing system. The _reaper_ program uses the
EpochTracker class to simultaneously estimate the location of
voiced-speech "epochs" or glottal closure instants (GCI), the voicing
state (voiced or unvoiced), and the fundamental frequency (F0 or "pitch").
We define the local (instantaneous) F0 as the inverse of the time
between successive GCI.
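
As a concrete illustration of this definition, the toy sketch below (not part of REAPER) computes local F0 from a handful of made-up GCI times:

```
// Toy illustration only: local (instantaneous) F0 as the inverse of the
// time between successive GCI. The GCI times below are hypothetical.
#include <cstdio>
#include <vector>

int main() {
  std::vector<double> gci = {0.100, 0.108, 0.1165, 0.125};  // seconds
  for (size_t i = 1; i < gci.size(); ++i) {
    double period = gci[i] - gci[i - 1];  // time between successive GCI
    double f0 = 1.0 / period;             // local F0 in Hz
    std::printf("t = %.4f s  F0 = %.1f Hz\n", gci[i], f0);
  }
  return 0;
}
```
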
This code was developed by David Talkin at Google. This is not an
official Google product (experimental or otherwise); it is just
code that happens to be owned by Google.
## Downloading and Building _reaper_

```
cd convenient_place_for_repository
git clone https://github.com/google/REAPER.git
cd REAPER
mkdir build   # In the REAPER top-level directory
cd build
cmake ..
make
```
_reaper_ will now be in `convenient_place_for_repository/REAPER/build/reaper`.
You may want to add that path to your PATH environment variable or
move _reaper_ to your favorite bin directory.

Example: to compute F0 (pitch) and pitchmark (GCI) tracks and write them out as ASCII files:

`reaper -i /tmp/bla.wav -f /tmp/bla.f0 -p /tmp/bla.pm -a`
## Input Signals:

As written, the input stage expects 16-bit, signed-integer samples.
Any reasonable sample rate may be used, but rates below 16 kHz will
introduce increasingly coarse quantization of the results, and higher
rates will incur a quadratic increase in computational requirements
without gaining much in output accuracy.
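
For orientation, here is a minimal sketch (not REAPER's own I/O code) of loading headerless 16-bit signed-integer PCM into memory, the sample format the input stage expects; the function name is hypothetical:

```
// Sketch only: load headerless 16-bit signed PCM samples from a file.
// REAPER's actual WAV-reading code is not reproduced here.
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

std::vector<int16_t> LoadPcm16(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  std::vector<int16_t> samples;
  int16_t s;
  while (in.read(reinterpret_cast<char*>(&s), sizeof(s)))
    samples.push_back(s);  // 16-bit, signed-integer samples
  return samples;
}
```
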
While REAPER is fairly robust to recording quality, it is designed for
use with studio-quality speech signals, such as those recorded for
concatenation text-to-speech systems. For best results, avoid phase
distortion, such as that introduced by some close-talking microphones or
by well-intended recording-studio filtering, including rumble removal.
A rumble filter is provided within REAPER as the recommended (default)
high-pass pre-filtering option; it is implemented as a symmetric FIR
filter that introduces no phase distortion.
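
The actual rumble-filter parameters are internal to REAPER and not reproduced here, but the sketch below shows the general idea of a symmetric (linear-phase) FIR high-pass: because the impulse response is symmetric, the filter delays all frequencies equally, and compensating that constant delay leaves the waveform's phase structure intact. The cutoff, tap count, and window choice are illustrative assumptions.

```
// Sketch of a symmetric (linear-phase) FIR high-pass filter. The cutoff,
// number of taps, and Hamming window are illustrative, not REAPER's actual
// rumble-filter design.
#include <cmath>
#include <vector>

const double kPi = 3.14159265358979323846;

// Windowed-sinc high-pass obtained by spectral inversion of a low-pass prototype.
std::vector<double> HighpassFir(double cutoff_hz, double sample_rate, int taps) {
  std::vector<double> h(taps);
  const double fc = cutoff_hz / sample_rate;
  const int mid = taps / 2;  // taps should be odd so the response is symmetric
  for (int n = 0; n < taps; ++n) {
    double lp = (n == mid)
                    ? 2.0 * fc
                    : std::sin(2.0 * kPi * fc * (n - mid)) / (kPi * (n - mid));
    double w = 0.54 - 0.46 * std::cos(2.0 * kPi * n / (taps - 1));  // Hamming window
    h[n] = -lp * w;  // spectral inversion turns the low-pass into a high-pass
  }
  h[mid] += 1.0;
  return h;  // symmetric coefficients => linear phase (a pure delay)
}

// Convolve and shift by the group delay (mid) so the output stays time-aligned
// with the input, i.e. no net phase distortion is introduced.
std::vector<double> Filter(const std::vector<double>& x, const std::vector<double>& h) {
  const int n = static_cast<int>(x.size());
  const int mid = static_cast<int>(h.size()) / 2;
  std::vector<double> y(x.size(), 0.0);
  for (int i = 0; i < n; ++i) {
    double acc = 0.0;
    for (int k = 0; k < static_cast<int>(h.size()); ++k) {
      int j = i + mid - k;
      if (j >= 0 && j < n) acc += h[k] * x[j];
    }
    y[i] = acc;
  }
  return y;
}
```
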
The help text _(-h)_ provided by the _reaper_ program describes
various output options, including debug output of some of the feature
signals. Of special interest is the residual waveform, which may be
used to check for the expected waveshape. (The residual has a
_.resid_ filename extension.) During non-nasalized, open-vocal-tract
vocalizations (such as /a/), each period should show a somewhat noisy
version of the derivative of the idealized glottal flow. If the computed
residual deviates radically from this ideal, the Hilbert transform
option _(-t)_ might improve matters.
## The REAPER Algorithm:

The process can be broken down into the following phases:

* Signal Conditioning
* Feature Extraction
* Lattice Generation
* Dynamic Programming
* Backtrace and Output Generation
## Signal Conditioning

DC bias and low-frequency noise are removed by high-pass filtering,
and the signal is converted to floating point. If the input is known
to have phase distortion that is impacting tracker performance, a
Hilbert transform, optionally applied at this point, may improve
performance.
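
For readers unfamiliar with this phase-correction step, the sketch below shows one standard way to compute a discrete Hilbert transform via the analytic signal; it uses a naive O(N^2) DFT for clarity and is not REAPER's implementation.

```
// Illustrative discrete Hilbert transform via the analytic signal, using a
// naive O(N^2) DFT for clarity. Not REAPER's implementation.
#include <cmath>
#include <complex>
#include <vector>

std::vector<double> HilbertTransform(const std::vector<double>& x) {
  const double kPi = 3.14159265358979323846;
  const size_t n = x.size();
  std::vector<std::complex<double>> spec(n);
  // Forward DFT.
  for (size_t k = 0; k < n; ++k) {
    std::complex<double> acc(0.0, 0.0);
    for (size_t t = 0; t < n; ++t) {
      double ang = -2.0 * kPi * static_cast<double>(k * t) / n;
      acc += x[t] * std::complex<double>(std::cos(ang), std::sin(ang));
    }
    spec[k] = acc;
  }
  // Analytic signal: keep DC (and Nyquist for even n), double the positive
  // frequencies, zero the negative frequencies.
  for (size_t k = 1; k < n; ++k) {
    if (n % 2 == 0 && k == n / 2) continue;
    if (k < (n + 1) / 2) spec[k] *= 2.0;
    else spec[k] = 0.0;
  }
  // Inverse DFT; the imaginary part is the Hilbert transform of x.
  std::vector<double> y(n);
  for (size_t t = 0; t < n; ++t) {
    std::complex<double> acc(0.0, 0.0);
    for (size_t k = 0; k < n; ++k) {
      double ang = 2.0 * kPi * static_cast<double>(k * t) / n;
      acc += spec[k] * std::complex<double>(std::cos(ang), std::sin(ang));
    }
    y[t] = acc.imag() / static_cast<double>(n);
  }
  return y;
}
```
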
## Feature Extraction

The following feature signals are derived from the conditioned input:

* Linear Prediction residual:
  This is computed using the autocorrelation method and continuous
  interpolation of the filter coefficients. It is checked for the
  expected polarity (negative impulses) and inverted, if necessary.
* Amplitude-normalized prediction residual:
  The normalization factor is based on the running, local RMS.
* Pseudo-probability of voicing:
  This is based on a local measure of low-frequency energy normalized
  by the peak energy in the utterance.
* Pseudo-probability of voicing onset:
  Based on a forward delta of lowpassed energy.
* Pseudo-probability of voicing offset:
  Based on a backward delta of lowpassed energy.
* Graded GCI candidates:
  Each negative peak in the normalized residual is compared with the
  local RMS. Peaks exceeding a threshold are selected as GCI candidates
  and then graded by a weighted combination of peak amplitude, skewness,
  and sharpness. Each of the resulting candidates is associated with the
  other feature values that occur closest in time to the candidate.
* Normalized cross-correlation functions (NCCF) for each GCI candidate:
  The correlations are computed on a weighted combination of the speech
  signal and its LP residual. The correlation reference window for
  each GCI candidate impulse is centered on the impulse, and
  correlations are computed for all lags in the expected pitch-period
  range (see the sketch after this list).
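
The window length, lag range, and the speech/residual weighting are REAPER internals; the sketch below only illustrates the general form of a normalized cross-correlation evaluated around one candidate impulse for a range of lags.

```
// Sketch of a normalized cross-correlation function (NCCF) around one GCI
// candidate. The window size, lag range, and the weighting of the speech
// signal versus the LP residual are illustrative, not REAPER's parameters.
#include <cmath>
#include <vector>

// NCCF at a single lag, with the reference window centered on `center`.
double Nccf(const std::vector<double>& s, int center, int lag, int window) {
  double num = 0.0, e0 = 0.0, e1 = 0.0;
  for (int i = -window / 2; i < window / 2; ++i) {
    int a = center + i;
    int b = center + i + lag;
    if (a < 0 || b < 0 || a >= static_cast<int>(s.size()) ||
        b >= static_cast<int>(s.size())) {
      continue;
    }
    num += s[a] * s[b];
    e0 += s[a] * s[a];
    e1 += s[b] * s[b];
  }
  return (e0 > 0.0 && e1 > 0.0) ? num / std::sqrt(e0 * e1) : 0.0;
}

// Evaluate the NCCF for every lag in the expected pitch-period range.
std::vector<double> NccfOverLags(const std::vector<double>& s, int center,
                                 int min_lag, int max_lag, int window) {
  std::vector<double> out;
  for (int lag = min_lag; lag <= max_lag; ++lag) {
    out.push_back(Nccf(s, center, lag, window));
  }
  return out;
}
```
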
## Lattice Generation

Each GCI candidate (pulse) is set into a lattice structure that links
preceding and following pulses that occur within the minimum and maximum
pitch-period limits being considered for the utterance.
These links establish all of the period hypotheses that will be
considered for the pulse. Each hypothesis is scored on "local"
evidence derived from the NCCF and peak quality measures. Each pulse
is also assigned an unvoiced hypothesis, which is likewise given a score
based on the available local evidence. The lattice is checked, and
modified if necessary, to ensure that each pulse has at least one
voiced and one unvoiced hypothesis preceding and following it, to
maintain continuity for the dynamic programming to follow.
(Note that the "scores" are used as costs during dynamic programming,
so that low scores encourage selection of the corresponding hypotheses.)
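
One way to picture the lattice is sketched below; the type and field names are hypothetical and simplified (each hypothesis is attached to the pulse where the period ends), not REAPER's actual EpochTracker data structures.

```
// Hypothetical, simplified lattice types (not REAPER's actual structures).
#include <vector>

struct PeriodHypothesis {
  int start_pulse;        // preceding GCI candidate the period starts from
                          // (-1 at the start of the utterance)
  bool voiced;            // voiced period vs. unvoiced hypothesis
  float local_cost;       // "score" from NCCF and peak-quality evidence (low = good)
  float cumulative_cost;  // filled in by the dynamic programming pass
  int backpointer;        // best hypothesis ending at start_pulse, set during DP
};

struct Pulse {
  double time_sec;                           // GCI candidate location
  std::vector<PeriodHypothesis> hypotheses;  // one per linked preceding pulse within
                                             // the min/max pitch-period limits, plus
                                             // an unvoiced hypothesis
};

using Lattice = std::vector<Pulse>;
```
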
## Dynamic Programming

```
For each pulse in the utterance:
  For each period hypothesis following the pulse:
    For each period hypothesis preceding the pulse:
      Score the transition cost of connecting the periods. Choose the
      minimum overall cost (cumulative + local + transition) preceding
      period hypothesis, and save its cost and a backpointer to it.
      The costs of making a voicing-state change are modulated by the
      probability of voicing onset and offset. The cost of a
      voiced-to-voiced transition is based on the delta F0 that
      occurs, and the cost of staying in the unvoiced state is a
      constant system parameter.
```
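
Using the hypothetical lattice types sketched in the previous section, the recursion above might look roughly as follows; the transition cost here is a crude stand-in for REAPER's onset/offset-modulated voicing costs and delta-F0 cost.

```
// Rough sketch of the dynamic-programming pass over the hypothetical Lattice
// types from the previous section. TransitionCost is a placeholder for
// REAPER's voicing-change and delta-F0 costs.
#include <limits>

double TransitionCost(const PeriodHypothesis& prev, const PeriodHypothesis& cur) {
  // Placeholder: a flat penalty for changing voicing state. REAPER modulates
  // voicing-change costs by the onset/offset probabilities, scores
  // voiced-to-voiced transitions by delta F0, and charges a constant cost for
  // staying unvoiced.
  return (prev.voiced == cur.voiced) ? 0.0 : 1.0;
}

void RunDp(Lattice& lattice) {
  for (size_t p = 0; p < lattice.size(); ++p) {
    for (PeriodHypothesis& cur : lattice[p].hypotheses) {
      double best = std::numeric_limits<double>::infinity();
      int best_idx = -1;
      if (cur.start_pulse < 0) {
        best = 0.0;  // nothing precedes the first pulse
      } else {
        const std::vector<PeriodHypothesis>& prev_hyps =
            lattice[cur.start_pulse].hypotheses;
        for (size_t h = 0; h < prev_hyps.size(); ++h) {
          double c = prev_hyps[h].cumulative_cost + TransitionCost(prev_hyps[h], cur);
          if (c < best) {
            best = c;
            best_idx = static_cast<int>(h);
          }
        }
      }
      cur.cumulative_cost = static_cast<float>(best) + cur.local_cost;
      cur.backpointer = best_idx;
    }
  }
}
```
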
## Backtrace and Output Generation

Starting at the last peak in the utterance, the lowest-cost period
candidate ending on that peak is found. This is the starting point
for backtracking. The backpointers to the best preceding period
candidates are then followed backwards through the utterance. As each
"best candidate" is found, the time location of its terminal peak is
recorded, along with the F0 corresponding to the period, or 0.0 if the
candidate is unvoiced. Instead of simply taking the inverse of the
period between GCI estimates as F0, the system refers back to the NCCF
for that GCI and takes the location of the NCCF maximum closest to
the GCI-based period as the actual period. The output array of F0 and
estimated GCI locations is then time-reversed for final output.
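
Continuing with the hypothetical lattice types from the sketches above, the backtrace might look roughly like this; the NCCF-based period refinement is indicated only by a comment.

```
// Sketch of the backtrace over the hypothetical Lattice types used above.
// The NCCF-based refinement of each period is indicated only by a comment.
#include <algorithm>
#include <utility>
#include <vector>

// Returns (time, F0) pairs in time order; F0 is 0.0 where unvoiced.
std::vector<std::pair<double, double>> Backtrace(const Lattice& lattice) {
  std::vector<std::pair<double, double>> track;
  if (lattice.empty() || lattice.back().hypotheses.empty()) return track;

  // Start from the lowest-cost hypothesis ending on the last pulse.
  int pulse = static_cast<int>(lattice.size()) - 1;
  int hyp = 0;
  for (size_t h = 1; h < lattice[pulse].hypotheses.size(); ++h) {
    if (lattice[pulse].hypotheses[h].cumulative_cost <
        lattice[pulse].hypotheses[hyp].cumulative_cost) {
      hyp = static_cast<int>(h);
    }
  }

  // Follow the backpointers toward the start of the utterance.
  while (true) {
    const PeriodHypothesis& cur = lattice[pulse].hypotheses[hyp];
    double f0 = 0.0;
    if (cur.voiced && cur.start_pulse >= 0) {
      double period = lattice[pulse].time_sec - lattice[cur.start_pulse].time_sec;
      // REAPER refines `period` using the lag of the nearest NCCF maximum
      // before inverting it; that refinement is omitted here.
      f0 = 1.0 / period;
    }
    track.push_back({lattice[pulse].time_sec, f0});
    if (cur.start_pulse < 0 || cur.backpointer < 0) break;
    pulse = cur.start_pulse;
    hyp = cur.backpointer;
  }

  std::reverse(track.begin(), track.end());  // results were collected backwards
  return track;
}
```
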