I have an issue getting the right input shape for fine-tuning Whisper

#94
by pedramaa - opened

I never changed my input shape; it matches the Common Voice setup from the Whisper fine-tuning blog (https://huggingface.co/blog/fine-tune-whisper).

The mapping also works fine and looks just like it does for Common Voice, but when I run trainer.train() it reports a dimension mismatch!
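For reference, a quick shape check I would expect to pass, using the ar_text dataset and data_collator from the code below (a sketch, not part of my original notebook):

import numpy as np

# one mapped example: the blog's prepare_dataset stores a (80, 3000) log-mel array
print(np.array(ar_text["train"][0]["input_features"]).shape)

# a collated mini-batch of two examples: the encoder's conv1d expects (batch, 80, 3000)
batch = data_collator([ar_text["train"][i] for i in range(2)])
print(batch["input_features"].shape)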


Code

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="openai/whisper-small-ar",  # change to a repo name of your choice
    per_device_train_batch_size=20,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=10,
    max_steps=10,
    gradient_checkpointing=True,
    fp16=False,
    evaluation_strategy="steps",
    per_device_eval_batch_size=2,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    # greater_is_better=False,
    # push_to_hub=True,
    dataloader_drop_last=True,
)
Note: if one does not want to upload the model checkpoints to the Hub, set push_to_hub=False.

We can forward the training arguments to the 🤗 Trainer along with our model, dataset, data collator and compute_metrics function:

from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=ar_text["train"],
    eval_dataset=ar_text["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)
len(ar_text["train"])
len(ar_text["test"])
150
ar_text
DatasetDict({
    train: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 216
    })
    test: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 150
    })
})
We'll save the processor object once before starting training. Since the processor is not trainable, it won't change over the course of training:

processor.save_pretrained(training_args.output_dir)
Training
Training will take approximately 5-10 hours depending on your GPU or the one allocated to this Google Colab. If using this Google Colab directly to fine-tune a Whisper model, you should make sure that training isn't interrupted due to inactivity. A simple workaround to prevent this is to paste the following code into the console of this tab (right mouse click -> inspect -> Console tab -> insert code).

function ConnectButton(){
    console.log("Connect pushed");
    document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click()
}
setInterval(ConnectButton, 60000);
The peak GPU memory for the given training configuration is approximately 15.8GB. Depending on the GPU allocated to the Google Colab, it is possible that you will encounter a CUDA "out-of-memory" error when you launch training. In this case, you can reduce the per_device_train_batch_size incrementally by factors of 2 and employ gradient_accumulation_steps to compensate.
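For example, a minimal sketch of that adjustment (values are illustrative; the effective batch size stays at 20 and the remaining arguments are unchanged from the cell above):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="openai/whisper-small-ar",
    per_device_train_batch_size=10,   # was 20: halved to cut peak memory
    gradient_accumulation_steps=2,    # was 1: doubled to keep the effective batch size at 20
    # ... remaining arguments as in the cell above ...
)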

To launch training, simply execute:

trainer.train()


Error

RuntimeError Traceback (most recent call last)
Cell In[360], line 1
----> 1 trainer.train()

File ~\anaconda3\Lib\site-packages\transformers\trainer.py:1534, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1529 self.model_wrapped = self.model
1531 inner_training_loop = find_executable_batch_size(
1532 self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
1533 )
-> 1534 return inner_training_loop(
1535 args=args,
1536 resume_from_checkpoint=resume_from_checkpoint,
1537 trial=trial,
1538 ignore_keys_for_eval=ignore_keys_for_eval,
1539 )

File ~\anaconda3\Lib\site-packages\transformers\trainer.py:1807, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1804 self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
1806 with self.accelerator.accumulate(model):
-> 1807 tr_loss_step = self.training_step(model, inputs)
1809 if (
1810 args.logging_nan_inf_filter
1811 and not is_torch_tpu_available()
1812 and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
1813 ):
1814 # if loss is nan or inf simply add the average of previous logged losses
1815 tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File ~\anaconda3\Lib\site-packages\transformers\trainer.py:2649, in Trainer.training_step(self, model, inputs)
2646 return loss_mb.reduce_mean().detach().to(self.args.device)
2648 with self.compute_loss_context_manager():
-> 2649 loss = self.compute_loss(model, inputs)
2651 if self.args.n_gpu > 1:
2652 loss = loss.mean() # mean() to average on multi-gpu parallel training

File ~\anaconda3\Lib\site-packages\transformers\trainer.py:2674, in Trainer.compute_loss(self, model, inputs, return_outputs)
2672 else:
2673 labels = None
-> 2674 outputs = model(**inputs)
2675 # Save past state if it exists
2676 # TODO: this needs to be fixed and made cleaner later.
2677 if self.args.past_index >= 0:

File ~\anaconda3\Lib\site-packages\torch\nn\modules\module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~\anaconda3\Lib\site-packages\transformers\models\whisper\modeling_whisper.py:1490, in WhisperForConditionalGeneration.forward(self, input_features, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
1485 if decoder_input_ids is None and decoder_inputs_embeds is None:
1486 decoder_input_ids = shift_tokens_right(
1487 labels, self.config.pad_token_id, self.config.decoder_start_token_id
1488 )
-> 1490 outputs = self.model(
1491 input_features,
1492 attention_mask=attention_mask,
1493 decoder_input_ids=decoder_input_ids,
1494 encoder_outputs=encoder_outputs,
1495 decoder_attention_mask=decoder_attention_mask,
1496 head_mask=head_mask,
1497 decoder_head_mask=decoder_head_mask,
1498 cross_attn_head_mask=cross_attn_head_mask,
1499 past_key_values=past_key_values,
1500 decoder_inputs_embeds=decoder_inputs_embeds,
1501 use_cache=use_cache,
1502 output_attentions=output_attentions,
1503 output_hidden_states=output_hidden_states,
1504 return_dict=return_dict,
1505 )
1506 lm_logits = self.proj_out(outputs[0])
1508 loss = None

File ~\anaconda3\Lib\site-packages\torch\nn\modules\module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~\anaconda3\Lib\site-packages\transformers\models\whisper\modeling_whisper.py:1346, in WhisperModel.forward(self, input_features, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, decoder_inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
1343 if encoder_outputs is None:
1344 input_features = self._mask_input_features(input_features, attention_mask=attention_mask)
-> 1346 encoder_outputs = self.encoder(
1347 input_features,
1348 head_mask=head_mask,
1349 output_attentions=output_attentions,
1350 output_hidden_states=output_hidden_states,
1351 return_dict=return_dict,
1352 )
1353 # If the user passed a tuple for encoder_outputs, we wrap it in a BaseModelOutput when return_dict=True
1354 elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):

File ~\anaconda3\Lib\site-packages\torch\nn\modules\module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~\anaconda3\Lib\site-packages\transformers\models\whisper\modeling_whisper.py:896, in WhisperEncoder.forward(self, input_features, attention_mask, head_mask, output_attentions, output_hidden_states, return_dict)
892 output_hidden_states = (
893 output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
894 )
895 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
--> 896 inputs_embeds = nn.functional.gelu(self.conv1(input_features))
897 inputs_embeds = nn.functional.gelu(self.conv2(inputs_embeds))
899 inputs_embeds = inputs_embeds.permute(0, 2, 1)

File ~\anaconda3\Lib\site-packages\torch\nn\modules\module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~\anaconda3\Lib\site-packages\torch\nn\modules\conv.py:313, in Conv1d.forward(self, input)
312 def forward(self, input: Tensor) -> Tensor:
--> 313 return self._conv_forward(input, self.weight, self.bias)

File ~\anaconda3\Lib\site-packages\torch\nn\modules\conv.py:309, in Conv1d._conv_forward(self, input, weight, bias)
305 if self.padding_mode != 'zeros':
306 return F.conv1d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
307 weight, bias, self.stride,
308 _single(0), self.dilation, self.groups)
--> 309 return F.conv1d(input, weight, bias, self.stride,
310 self.padding, self.dilation, self.groups)

RuntimeError: Expected 2D (unbatched) or 3D (batched) input to conv1d, but got input of size: [20, 1, 80, 3000]
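For reference, [20, 1, 80, 3000] has one dimension more than the (batch, 80, 3000) input the encoder's conv1d expects, as if each stored example still carries the leading batch dimension returned by the feature extractor. A minimal sketch of the blog's prepare_dataset, which drops that dimension with [0] (names are illustrative and may differ from my exact mapping code):

def prepare_dataset(batch):
    audio = batch["audio"]
    # the feature extractor returns a batch of one, shape (1, 80, 3000);
    # taking [0] stores a single (80, 3000) example so the collator can
    # stack a mini-batch into the (batch, 80, 3000) shape conv1d expects
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch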
