WER = 100% !!

#84
by Seyfelislem - opened

Hello everyone,

I am having an issue when fine-tuning OpenAI's Whisper Medium on Mozilla's Common Voice 11 dataset in Arabic.
The training and validation loss are both decreasing, but the WER jumps to 100% after some steps (especially once the loss drops below 1). The model otherwise appears to be performing well, so it looks like the WER is simply being miscalculated.

[screenshot: hf_issue.png]


I'm running into the same issue.

Hey @Seyfelislem and @lnpwcd68730 ! Thank you both for reporting this issue. You might be interested in checking out the Whisper leaderboard for finding the most performant fine-tuned Whisper checkpoints in your language: https://huggingface.co/spaces/whisper-event/leaderboard?dataset=mozilla-foundation%2Fcommon_voice_11_0&config=ar&split=test

Good to see that the eval loss is still decreasing (it's pretty easy to overfit with Whisper fine-tuning). For the WER issue, what we can do is save the references and predictions to a .txt file, and inspect them to see what sorts of errors the model is making. To do this, you can amend the compute_metrics function as follows:

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    # save references and predictions to a txt file for debugging
    with open('refs_and_preds.txt', 'w') as f:
        for ref, pred in zip(label_str, pred_str):
            f.write(f"Ref: {ref}\n")
            f.write(f"Pred: {pred}\n\n")

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

Hey @sanchit-gandhi , @amyeroberts
Thank you for your answers here and on GitHub.
(I'd like to proceed with my question here on Hugging Face.)

So, I tried your suggestion and modified compute_metrics. It turns out the transcriptions generated by the model are sometimes in Arabic (in the original script or in Buckwalter transliteration), sometimes translated into another language (French, Russian, Chinese, etc.), and sometimes even empty!

[screenshot: image.png]

Here are the results of the transformers-cli env command:

  • transformers version: 4.28.1
  • Platform: Linux-5.10.147+-x86_64-with-glibc2.31
  • Python version: 3.9.16
  • Huggingface_hub version: 0.14.1
  • Safetensors version: not installed
  • PyTorch version (GPU?): 2.0.0+cu118 (True)
  • Tensorflow version (GPU?): 2.12.0 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.6.8 (gpu)
  • Jax version: 0.4.8
  • JaxLib version: 0.4.7
  • Using GPU in script?: (True)

Hey @Seyfelislem - can you do two things please:

  1. Could you verify that you set the language correctly in your tokenizer and processor (i.e. that you set language="Arabic" and task="transcribe" in both the tokenizer and the processor)?
  2. Secondly, could you add this line right after the point where you set forced_decoder_ids=None:
model.generate = partial(model.generate, language="arabic", task="transcribe")

This will now force the model always to predict in Arabic.
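
For reference, a minimal sketch of how the two pieces fit together (the checkpoint name is only an example; use whichever model you are fine-tuning):

from functools import partial

from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-medium", language="Arabic", task="transcribe"
)

# remove the static forced decoder ids from the config...
model.config.forced_decoder_ids = None

# ...and pin the language/task at generation time instead (note the lowercase name)
model.generate = partial(model.generate, language="arabic", task="transcribe")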

Hey again @sanchit-gandhi
I can confirm that I set the language correctly in the tokenizer and the processor.
Now, after adding the line model.generate = partial(model.generate, language="arabic", task="transcribe"), there are no more problems with WER.

[screenshot: image.png]

Thank you very much for your efforts.

Hello @sanchit-gandhi ,

I have the same problem, with the WER approaching 100%, for the Czech language.

After I added the following two lines:
from functools import partial
....
model.generate = partial(model.generate, language="Czech", task="transcribe")

the following error appears during evaluation (after eval_steps):
AttributeError: 'WhisperForConditionalGeneration' object has no attribute 'language'

Result of the transformers-cli env command:

  • transformers version: 4.28.1
  • Platform: Linux-5.10.147+-x86_64-with-glibc2.31
  • Python version: 3.10.11
  • Huggingface_hub version: 0.14.1
  • Safetensors version: not installed
  • PyTorch version (GPU?): 2.0.0+cu118 (True)
  • Tensorflow version (GPU?): 2.12.0 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.6.9 (gpu)
  • Jax version: 0.4.8
  • JaxLib version: 0.4.7
  • Using GPU in script?: (True)
Detailed traceback:
Traceback (most recent call last):
  in :81
  /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1662 in train
      return inner_training_loop(args=args, resume_from_checkpoint=resume_from_checkpoint, trial=trial, ...)
  /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2006 in _inner_training_loop
      self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_k...
  /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2287 in _maybe_log_save_evaluate
      metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  /usr/local/lib/python3.10/dist-packages/transformers/trainer_seq2seq.py:159 in evaluate
      return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix...
  /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2993 in evaluate
      output = eval_loop(eval_dataloader, description="Evaluation", ...)
  /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:3174 in evaluation_loop
      loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_o...
  /usr/local/lib/python3.10/dist-packages/transformers/trainer_seq2seq.py:271 in prediction_step
      generated_tokens = self.model.generate(**inputs, **gen_kwargs)
  /usr/local/lib/python3.10/dist-packages/transformers/models/whisper/modeling_whisper.py:1576 in generate
      f"Unsupported language: {self.language}. Language should be one...
  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1614 in __getattr__
      raise AttributeError("'{}' object has no attribute '{}'".format(type(self).__name__, name))
AttributeError: 'WhisperForConditionalGeneration' object has no attribute 'language'

Capital "C" in my code language="Czech" is probably wrong. I changed it to "czech" and it is working now.

Awesome - glad both issues have been fixed! Closing as complete - feel free to open a new issue if you find anything that looks wrong.

sanchit-gandhi changed discussion status to closed

Hey @Seyfelislem - can you do two things please:

  1. Could you verify that you set the language correctly in your tokenizer and processor (i.e. that you set language="Arabic" and task="transcribe" in both the tokenizer and the processor)?
  2. Secondly, could you add this line right after the point where you set forced_decoder_ids=None:
model.generate = partial(model.generate, language="arabic", task="transcribe")

This will now force the model always to predict in Arabic.

Hello there, I was wondering which module partial is part of? I'd like to use this line, but partial is not defined.

Thanks


Hey @mohblnk ,

You should add this line to import partial:
from functools import partial

For more details about this function, check this link:
https://www.geeksforgeeks.org/partial-functions-python
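
As a quick toy example of what partial does (unrelated to Whisper):

from functools import partial

def power(base, exponent):
    return base ** exponent

# square is power with exponent pre-filled to 2
square = partial(power, exponent=2)
print(square(5))  # 25

partial(model.generate, language=..., task=...) works the same way: it returns a callable that always passes those keyword arguments on to model.generate.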

Hey @Seyfelislem and @lnpwcd68730 ! Thank you both for reporting this issue. You might be interested in checking out the Whisper leaderboard for finding the most performant fine-tuned Whisper checkpoints in your language: https://huggingface.co/spaces/whisper-event/leaderboard?dataset=mozilla-foundation%2Fcommon_voice_11_0&config=ar&split=test

Good to see that the eval loss is still decreasing (it's pretty easy to overfit with Whisper fine-tuning). For the WER issue, what we can do is save the references and predictions to a .txt file, and inspect them to see what sorts of errors the model is making. To do this, you can amend the compute_metrics function as follows:

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    # save references and predictions to a txt file for debugging
    with open('refs_and_preds.txt', 'w') as f:
        for ref, pred in zip(label_str, pred_str):
            f.write(f"Ref: {ref}\n")
            f.write(f"Pred: {pred}\n\n")

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

I had the same issue when fine-tuning Whisper Medium on Chinese and English: after some steps, the WER is 100%.

# wer is normal
Ref: Report a 3 mile final
Pred: Supporting 3 miles final.

Ref: ่ฟ›่ท‘้“28,ๅ“็ฎญ710.
Pred: ่ฟ›่ท‘้“28,ๅ“็ฎญ710.

# wer == 100%
Ref: ๆถŒๆณ‰192,่”็ณปๆœบๅช121.8 ๅ†่ง.
Pred:

Ref: Yangtze River 8314, offset 2 miles left of the track, expedite descend and maintain 7200 meters.
Pred:

Ref: ็™ฝ้นญ808, ่”็ณป็ฆๅทž่ฟ›่ฟ‘125.175ๅ†่ง.
Pred:

There is no pred_str, so I print pred.predictions at every eval:

# wer is normal
# pred_ids
[[50258 50259 50359 ... 50257 50257 50257]
 [50258 50259 50359 ... 50257 50257 50257]
 [50258 50260 50359 ... 50257 50257 50257]
 ...
 [50258 50259 50359 ... 50257 50257 50257]
 [50258 50260 50359 ... 50257 50257 50257]
 [50258 50260 50359 ... 50257 50257 50257]]
# label_ids
[[50258 50259 50359 ... 50257 50257 50257]
 [50258 50259 50359 ... 50257 50257 50257]
 [50258 50260 50359 ... 50257 50257 50257]
 ...
 [50258 50259 50359 ... 50257 50257 50257]
 [50258 50260 50359 ... 50257 50257 50257]
 [50258 50260 50359 ... 50257 50257 50257]]

# wer == 100%
# pred_ids
[[50258 50257 50257 ... 50257 50257 50257]
 [50258 50257 50257 ... 50257 50257 50257]
 [50258 50257 50257 ... 50257 50257 50257]
 ...
 [50258 50257 50257 ... 50257 50257 50257]
 [50258 50257 50257 ... 50257 50257 50257]
 [50258 50257 50257 ... 50257 50257 50257]]
# label_ids
[[50258 50259 50359 ... 50257 50257 50257]
 [50258 50259 50359 ... 50257 50257 50257]
 [50258 50260 50359 ... 50257 50257 50257]
 ...
 [50258 50259 50359 ... 50257 50257 50257]
 [50258 50260 50359 ... 50257 50257 50257]
 [50258 50260 50359 ... 50257 50257 50257]]

And when I try the fine-tuned model (the checkpoint for which the eval WER == 100%) directly, it works very well.
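
A note on those ids, which explains the symptom: in the multilingual Whisper tokenizer, 50258 is <|startoftranscript|>, 50259/50260 are the <|en|>/<|zh|> language tokens, 50359 is <|transcribe|>, and 50257 is <|endoftext|> (also used as the pad token). The failing batches therefore predict nothing but special tokens, which decode to empty strings under skip_special_tokens=True, which is why Pred: is blank and the WER saturates at 100%. A quick sanity check (reusing tokenizer and pred_ids from the snippets above):

# decode one failing row with and without special tokens
row = pred_ids[0]
print(tokenizer.decode(row, skip_special_tokens=False))
# -> "<|startoftranscript|><|endoftext|><|endoftext|>..."
print(repr(tokenizer.decode(row, skip_special_tokens=True)))
# -> '' (an empty prediction)

As above, pinning the language and task via partial on model.generate should address this here too.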
