Mismatch in data preprocessing vs facebookresearch/omnilingual-asr

#6
by cy161718 - opened

While inspecting the code, I noticed that the data-processing pipeline implemented in this project differs from the one used in facebookresearch/omnilingual-asr. With this project's pipeline, setting the langs field does make it easier to obtain transcriptions in the corresponding language.

For reference, the preprocessing workflow in this project (for already segmented audio) is:
1. Decode audio to WAV with FFmpeg.
2. Export to a temporary WAV file.
3. Load the audio via Librosa to obtain floating-point waveform data.
4. Resample to 16 kHz if needed.
5. Convert to mono if multi-channel.
6. Write out a normalized WAV file.
7. Convert the waveform to a PyTorch tensor.
8. Cast from float32 to bfloat16.
9. Apply Layer Normalization over the entire audio segment.
10. Re-encode the normalized waveform into a PCM-16 WAV byte stream.
11. Feed this byte stream into the ASR model.
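Steps 7–10 can be sketched roughly as follows. This is a hypothetical reconstruction, not the project's actual code: function and variable names are mine, and the FFmpeg/Librosa steps (1–6) are omitted so the snippet stays self-contained.

```python
import io
import wave

import numpy as np
import torch

TARGET_SR = 16_000

def normalize_and_encode(wav: np.ndarray, sr: int = TARGET_SR) -> bytes:
    # `wav` is assumed to be a mono float32 waveform already resampled to
    # 16 kHz (steps 1-6, FFmpeg decode + Librosa load/resample/downmix).
    x = torch.from_numpy(wav).to(torch.bfloat16)  # step 8: float32 -> bf16
    # Step 9: layer normalization over the whole segment with no affine
    # parameters, i.e. zero mean and unit variance across all samples.
    x = (x - x.mean()) / torch.sqrt(x.var(unbiased=False) + 1e-5)
    # Step 10: re-encode as a PCM-16 WAV byte stream. A unit-variance signal
    # can exceed [-1, 1], so samples outside that range clip when scaled.
    pcm16 = (np.clip(x.float().numpy(), -1.0, 1.0) * 32767.0).astype(np.int16)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as f:
        f.setnchannels(1)   # already mono after step 5
        f.setsampwidth(2)   # 16-bit PCM
        f.setframerate(sr)
        f.writeframes(pcm16.tobytes())
    return buf.getvalue()   # step 11: these bytes go to the ASR model
```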

The main differences from the official Omnilingual-ASR preprocessing appear to be the LayerNorm applied to the raw waveform and the conversion to BF16.
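The effect of those two extra steps can be seen in isolation. In this illustrative snippet (again my own code, not the project's), whole-segment normalization scales a quiet waveform up to unit variance, so its peaks land outside the [-1, 1] range that a PCM-16 re-encode assumes:

```python
import numpy as np
import torch

# A quiet 440 Hz tone as a stand-in for low-amplitude speech audio.
t = np.arange(16_000, dtype=np.float32) / 16_000
wav = (0.05 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)

x = torch.from_numpy(wav).to(torch.bfloat16)                   # the BF16 cast
x = (x - x.mean()) / torch.sqrt(x.var(unbiased=False) + 1e-5)  # whole-segment LayerNorm
normed = x.float().numpy()

print(np.abs(wav).max())     # ~0.05: well inside PCM-16's [-1, 1] range
print(np.abs(normed).max())  # ~1.41: would clip when re-encoded to PCM-16
```

If the official pipeline feeds the raw float waveform to the model instead, this clipping alone could contribute to the metric degradation noted below, though that is speculation until the training-time pipeline is confirmed.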

In my experiments, this preprocessing does make the langs field steer the output toward the requested language more reliably.
However, when evaluating on the dataset used to train OmniASR, WER and CER increase under this preprocessing, which suggests it may not match what the model was trained with.

My questions are:

  • Did OmniASR use this normalization + BF16 preprocessing during training?
  • If not, which preprocessing pipeline corresponds to the actual training setup?

Since inference generally works best when the preprocessing matches training, understanding the intended pipeline would help clarify the observed behavior.
