RuntimeError: Expected 2D (unbatched) or 3D (batched) input to conv1d, but got input of size: [1, 1, 2, 63169] using example in readme

#1
by qwertyuu - opened

Hello!
Thanks for taking the time to make this model. I'm trying to use it on my PC (Windows 10 with WSL, torch 1.13).

At line 25, where the model is called, I get RuntimeError: Expected 2D (unbatched) or 3D (batched) input to conv1d, but got input of size: [1, 1, 2, 63169]

I was loading my own .wav file from disk.

Hi!

Thanks for your interest!

Could you share a Colab notebook along with the audio file you are using? That would make it easier to debug.

Hi,

I'm getting the same issue.
Here's a link to the Google Colab notebook: https://colab.research.google.com/drive/1JNQMxiPueSfX9eIM52dPmuCkbNPLEszK?usp=sharing

I'm not very familiar with Colab, so I'm not sure whether the uploaded content is available to other people. Just in case, I've attached the audio file I'm working with.

Hi @Abaddon ,

Thank you for your feedback and the provided notebook!

I've reproduced the error. It occurs when the input audio has multiple channels: the audio has to be converted to mono before it is passed to the wav2vec2 model.

Please try with the following updated code :)

import torch
import torchaudio
import torchaudio.sox_effects as ta_sox

from transformers import AutoModelForCTC, Wav2Vec2Processor

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor = Wav2Vec2Processor.from_pretrained("bhuang/asr-wav2vec2-french")
model_sample_rate = processor.feature_extractor.sampling_rate

audio_path = "example.wav"  # path to your audio file
waveform, sample_rate = torchaudio.load(audio_path)

effects = []
if model_sample_rate != sample_rate:
    # resample
    effects.append(["rate", f"{model_sample_rate}"])
if waveform.shape[0] > 1:
    # convert to mono
    effects.append(["channels", "1"])
if len(effects) > 0:
    converted_waveform, _ = ta_sox.apply_effects_tensor(waveform, sample_rate, effects)
else:
    # already mono at the model's sample rate, nothing to convert
    converted_waveform = waveform

# 1d array
converted_waveform = converted_waveform.squeeze(axis=0)

# normalize
input_dict = processor(converted_waveform, sampling_rate=model_sample_rate, return_tensors="pt")

with torch.inference_mode():
    # forward
    logits = model(input_dict.input_values.to(device)).logits

# decode
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentence = processor.batch_decode(predicted_ids)[0]

Hi,

Thank you so much for taking the time to look into this, @bofenghuang.
It works on Google Colab!
Sadly, it doesn't work locally: torchaudio (especially libsox) seems to be having a hard time with Windows.

EDIT: I added another way to resolve this: loading the file with librosa directly as 16 kHz mono. So no problem for me anymore. It is in the Colab notebook if anyone has the same issue!

bofenghuang changed discussion status to closed

Hello!
I can confirm that this fix not only works but has been running on my Discord bot for a while now. Thanks, and sorry for not updating this issue earlier!

Unlike Abaddon, I am using Ubuntu, where it works wonderfully.

That's awesome !!
