Whisper-Large-v2 Model for Audio Transcription: Repeated and Missing Text Between Chunks

#43
by jayce777 - opened

Hi everyone,

I have been using the whisper-large-v2 model to transcribe audio files that are about 10 to 20 minutes long. I have tried many different techniques to improve the accuracy of the transcription, but so far, nothing has worked.

One of the main issues I am facing is that text is repeated at chunk boundaries, and some text is missing altogether. I have tried normalizing the audio to [-1, 1], as well as tuning hyperparameters such as chunk_length_s and stride_length_s. However, I have found that the optimal parameters vary from file to file, and no single setting generalizes.
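
For reference, the normalization step is just peak-scaling the waveform into [-1, 1], along these lines (a minimal sketch; the function name is mine):

import numpy as np

def peak_normalize(y: np.ndarray) -> np.ndarray:
    # Scale the waveform into [-1, 1] by its peak absolute value
    peak = np.max(np.abs(y))
    return y / peak if peak > 0 else y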

I am using the following environment:

  • whisper-large-v2
  • torch 1.10.0+cu111
  • transformers 4.28.1 (latest)

Here is the code I am using:

from transformers import pipeline
from pydub import AudioSegment
import librosa

MODEL_NAME = "openai/whisper-large-v2"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    device="cuda:0",
    generate_kwargs={"task": "transcribe"},
)

# Convert mp3 -> wav, then load and resample to the 16 kHz rate Whisper expects
sound = AudioSegment.from_file("sample.mp3", format="mp3")
sound.export("sample.wav", format="wav")
y, sr = librosa.load("sample.wav", sr=None)
y_resampled = librosa.resample(y, orig_sr=sr, target_sr=16000)

# Chunked transcription: 30 s windows with a 5 s stride, batched on the GPU
outputs = pipe(y_resampled, return_timestamps=True, generate_kwargs={"task": "transcribe", "language": "<|zh|>"}, chunk_length_s=30, stride_length_s=5, batch_size=16, max_new_tokens=512)
print(outputs["text"])

Overall, I feel that the performance of the model is not good enough for me to use it formally. I would appreciate any suggestions or advice on how to improve the accuracy of the transcription.

Thank you.

Hi there,

Any ideas on this? I am having the same problem with very similar code and parameters. From my tests it looks like it is most likely an issue with the chunking algorithm described here: https://huggingface.co/blog/asr-chunking

Playing around with the stride parameters sometimes gives me good results, but when I reuse the same parameters on a different file, parts of the transcript go missing again.
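
In case it helps to reproduce, this is the kind of sweep I have been running; stride_length_s also accepts a (left, right) tuple and defaults to chunk_length_s / 6 when omitted. It reuses pipe and y_resampled from the snippet above, and the values are just examples:

# Try a few chunk/stride combinations and eyeball the output for dropped
# or duplicated segments; stride_length_s may be a float or a (left, right) tuple.
for chunk_s, stride_s in [(30, 5), (30, (6, 4)), (20, (5, 3))]:
    out = pipe(
        y_resampled,
        chunk_length_s=chunk_s,
        stride_length_s=stride_s,
        return_timestamps=True,
        generate_kwargs={"task": "transcribe", "language": "<|zh|>"},
    )
    print(chunk_s, stride_s, out["text"][:80])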

I am comparing against the OpenAI Whisper API, and it always gets these files right, so it must be solvable, obviously :)
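
For completeness, the comparison on my side is just the hosted endpoint, roughly like this (pre-1.0 openai Python SDK; the API key is a placeholder):

import openai

openai.api_key = "sk-..."  # placeholder, set your own key

# Hosted Whisper endpoint; the file is sent as-is, no manual chunking needed
with open("sample.mp3", "rb") as f:
    result = openai.Audio.transcribe("whisper-1", f, language="zh")
print(result["text"])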

Thx so much
Andi

Hello Good People,

Any update on this? I'm experiencing a similar problem where the transcription contains repeated text or is missing some content, particularly when the audio length exceeds 30 seconds.
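
For what it's worth, the model itself only attends to 30-second windows, so anything longer goes through the chunking logic discussed above. As a baseline, the reference openai-whisper package decodes long files sequentially rather than in parallel chunks; a minimal sketch, assuming the openai-whisper package is installed:

import whisper

# Sequential long-form decoding from the reference implementation
model = whisper.load_model("large-v2")
result = model.transcribe("sample.mp3", language="zh", task="transcribe")
print(result["text"])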

Regards
Abaddon - The Knight of Hell
