Error-prone audio example

#10
by cromz22 - opened

Here's the current example of loading an audio dataset:

# let's load an audio sample from an Arabic speech corpus
from datasets import load_dataset
dataset = load_dataset("arabic_speech_corpus", split="test", streaming=True)
audio_sample = next(iter(dataset))["audio"]

# now, process it
audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")

The processor is meant to be called with the sampling_rate argument. Because the example omits it, the code produces the following warning: "It is strongly recommended to pass the sampling_rate argument to this function. Failing to do so can result in silent errors that might be hard to debug." In this example, the "hard to debug" silent error is actually happening.

The example uses arabic_speech_corpus. The default sampling rate of this corpus is 48000 Hz, whereas the SeamlessM4T model was trained on 16000 Hz audio. As a result, passing the sampling rate as recommended, as in the following code, throws a ValueError that ends with "Please make sure that the provided raw_speech input was sampled with 16000 and not 48000."

audio_inputs = processor(audios=audio_sample["array"], sampling_rate=audio_sample["sampling_rate"], return_tensors="pt")
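To illustrate why the mismatch matters, here is a minimal sketch in plain Python with no model involved: the same array of samples implies a different duration depending on the rate it is assumed to have. The helper names `duration_seconds` and `assert_expected_rate` are made up for illustration; the guard only loosely mimics the kind of check the processor performs and is not the library's implementation.

```python
def duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Length of a clip in seconds, given its sample count and rate."""
    return num_samples / sampling_rate


def assert_expected_rate(actual_rate: int, expected_rate: int = 16000) -> None:
    """Hypothetical guard, loosely mimicking the processor's rate check."""
    if actual_rate != expected_rate:
        raise ValueError(
            f"input was sampled at {actual_rate} Hz, expected {expected_rate} Hz"
        )


# One second of audio recorded at 48000 Hz contains 48000 samples.
num_samples = 48000
true_duration = duration_seconds(num_samples, 48000)     # 1.0 second
assumed_duration = duration_seconds(num_samples, 16000)  # 3.0 seconds
print(true_duration, assumed_duration)
```

Read as 16000 Hz, the clip appears three times longer (and lower in pitch) than it really is, and nothing fails unless the rate is checked explicitly.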

Of course, we can resample the audio to 16000 Hz so that it is suitable as model input, but an example should not contain such complications. (I actually tried translating the Arabic speech to English text. Without resampling, it generated "The sea is the sea of the sea of the sea of the sea of the sea of the sea of the sea of the sea". After resampling the audio, it generated "It allowed the traveling salesman to be attractive to the low-income citizen.", which seems OK, although I don't know Arabic.)
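For reference, the resampling step could look roughly like the sketch below. It is not the code used for the experiment above. `resample_linear` is a hypothetical, deliberately crude pure-Python helper (linear interpolation, no anti-aliasing filter) that only shows what changing the rate does to the array. `cast_to_16k` assumes the `datasets` library is installed and relies on its documented `Audio(sampling_rate=...)` feature with `cast_column`; it has not been tested against this particular corpus.

```python
def resample_linear(samples, orig_rate, target_rate):
    """Resample by linear interpolation (crude: no anti-aliasing filter)."""
    if orig_rate == target_rate:
        return list(samples)
    # Number of output samples covering the same duration at the new rate.
    n_out = len(samples) * target_rate // orig_rate
    step = orig_rate / target_rate  # input samples per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


def cast_to_16k(dataset):
    """Ask the datasets library to resample the audio column on access."""
    from datasets import Audio  # lazy import; requires `datasets`

    return dataset.cast_column("audio", Audio(sampling_rate=16000))


# 48000 Hz -> 16000 Hz keeps every third sample position.
print(resample_linear([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], 48000, 16000))
```

In practice a proper resampler with low-pass filtering is preferable to the interpolation helper, which is one more reason the example is simpler with a dataset that is already at 16000 Hz.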

Please consider rewriting the example to use a dataset with a default sampling rate of 16000 Hz (e.g., "hf-internal-testing/librispeech_asr_dummy") and to pass the sampling_rate argument to the processor. (The examples in the seamless-m4t-large model card and the SeamlessM4T model documentation should also be changed.)
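A possible shape for the rewritten example is sketched below. It assumes that the "clean" configuration and the "validation" split exist for this dataset (as used elsewhere in the transformers documentation) and that `processor` is the same object as in the current example. The function names `load_dummy_sample` and `process_audio` are hypothetical, and the sketch has not been verified against a specific library version.

```python
def load_dummy_sample():
    """Fetch one audio sample from the 16000 Hz dummy LibriSpeech set."""
    from datasets import load_dataset  # lazy import; needs network access

    # Assumption: config "clean" and split "validation" are available.
    dataset = load_dataset(
        "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
    )
    return dataset[0]["audio"]


def process_audio(processor, audio_sample):
    """Pass both the waveform and its sampling rate to the processor."""
    return processor(
        audios=audio_sample["array"],
        sampling_rate=audio_sample["sampling_rate"],
        return_tensors="pt",
    )
```

With `processor` defined as in the existing example, usage would be `audio_inputs = process_audio(processor, load_dummy_sample())`. Since the sampling rate is always passed, a dataset with a different rate would raise the ValueError instead of failing silently.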
