RuntimeError: Expected 2D (unbatched) or 3D (batched) input to conv1d, but got input of size: [1, 1, 2, 476544]

#3
by wuyuh

What is the cause of this problem?

I have the same problem and I don't know how to solve it. It seems like most wav2vec models that are fine-tuned for speech emotion recognition share a similar code structure, so I don't know how to avoid the predict() function.

It seems the problem is with the input tensor being passed to the convolutional layer of the neural network. The input tensor has a shape of [1, 1, 2, 476544], i.e. it is 4-dimensional, but the convolutional layer expects a 2D or 3D input tensor.
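To make that concrete, here is a minimal sketch with random data (the shapes are chosen to match the error above, not taken from the actual model) showing that conv1d accepts a 3D input but rejects the 4D one:

import torch
import torch.nn.functional as F

weight = torch.randn(4, 1, 10)           # out_channels=4, in_channels=1, kernel_size=10
mono = torch.randn(1, 1, 476544)         # [batch, channels, time]: 3D, accepted
print(F.conv1d(mono, weight).shape)

stereo = torch.randn(1, 1, 2, 476544)    # the extra channel dimension makes it 4D
try:
    F.conv1d(stereo, weight)
except RuntimeError as e:
    print(e)                             # Expected 2D (unbatched) or 3D (batched) input ...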

It is possible that the 'speech_file_to_array_fn()' function is not returning the expected shape of the audio data, which could cause issues downstream in the prediction process.
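One quick way to check this (the file name below is just a placeholder) is to print the shape of the tensor torchaudio.load returns; a stereo file comes back with two channels:

import torchaudio

speech_array, sr = torchaudio.load("example.mp3", format="mp3")
print(speech_array.shape)  # e.g. torch.Size([2, 476544]) for a stereo file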

But again, I don't really know how to solve this issue :/

I think the issue is in speech_file_to_array_fn. torchaudio.load returns a tuple of (waveform, sample_rate), and by default the waveform has shape [channels, frames] (controlled by the channels_first arg: https://pytorch.org/audio/0.8.0/backend.html#torchaudio.backend.sox_io_backend.load), so a stereo file yields a two-channel tensor. I selected just the second channel of speech_array and then the predict function worked for me.

import torchaudio

def speech_file_to_array_fn(path, sampling_rate):
    # waveform comes back as [channels, frames]; a stereo file is [2, N]
    speech_array, _sampling_rate = torchaudio.load(path, format="mp3")
    resampler = torchaudio.transforms.Resample(_sampling_rate, sampling_rate)
    # keep a single channel so the model receives a 1D array
    speech = resampler(speech_array[1]).squeeze().numpy()
    return speech
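As a variant (my own suggestion, not from the original snippet), you can downmix to mono by averaging the channels instead of dropping one, which also works for mono files:

import torchaudio

def speech_file_to_array_fn(path, sampling_rate):
    speech_array, _sampling_rate = torchaudio.load(path, format="mp3")
    resampler = torchaudio.transforms.Resample(_sampling_rate, sampling_rate)
    # mean over the channel dimension -> 1D tensor for mono and stereo alike
    speech = resampler(speech_array.mean(dim=0)).squeeze().numpy()
    return speech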
