Not transcribing the audio into text (for some audios)

#13

by uriii3 - opened Nov 30, 2022

Nov 30, 2022

I have been transcribing some audios (30 seconds long), and for some of them the output is just the first word (something like: ' the'). I changed the script and tried a different model (whisper-medium), and that one seems to work...
When it works it is just perfect but sometimes it doesn't transcribe anything for me.
Any thoughts on how to solve or overcome this?

sanchit-gandhi

Dec 1, 2022

Hey @uriii3 ! Cool to see that you're using the Whisper medium/large checkpoints for ASR!

Have you got a script that we can use to reproduce?

uriii3

Dec 1, 2022

Yes! I'm using the "pipeline" method directly. The only unusual thing is that I'm cutting the audio into 30-second segments and passing them straight to the model. I can share the code but not the audios (for confidentiality), sorry.

import os
from transformers import pipeline

pipe = pipeline(model="openai/whisper-large")

# folder_path and output_folder are set earlier in the script
for file in os.listdir(folder_path):
  if file.endswith(".wav"):  # only process audio files
    txt_path = os.path.join(output_folder, file[:-4] + '.txt')
    print("Starting document: ", file)
    # skip files that already have a transcription (e.g. when re-running)
    if not os.path.exists(txt_path):
      transcription = pipe(os.path.join(folder_path, file))
      with open(txt_path, 'w') as f:
        f.write(transcription['text'].lower())
      print("Document ", file, " ready!")

The code is nothing unusual: the only "strange" thing is that the audios I'm transcribing are 30-second segments that I previously cut with ffmpeg from longer 10-minute audios. When I join the transcriptions of all the segments, some have worked perfectly and others haven't even been transcribed.

sanchit-gandhi

Dec 2, 2022

If the model detects silence within one of these 30s segments the transcription will be terminated for that segment, hence why some are probably cut short.

You can also try using the Whisper model to transcribe the full 10-minute clips directly:

from transformers import pipeline

pipe = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-small",
    framework="pt",
    batch_size=1,
    device=0,  # run on the first GPU
    chunk_length_s=30,  # this will chunk audios to 30s
)
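
To make the chunking idea concrete, here is a rough sketch of how a long clip could be covered by overlapping 30-second windows. It is a simplified illustration, not the library's actual implementation: the function name `chunk_windows` is hypothetical, and the default overlap of `chunk_length_s / 6` on each side is an assumption made for this example.

```python
def chunk_windows(total_s, chunk_length_s=30.0, stride_length_s=None):
    """Return (start, end) times in seconds of overlapping windows over a clip.

    Illustrative only. Assumes each window overlaps its neighbours by
    stride_length_s on each side, so consecutive windows start
    chunk_length_s - 2 * stride_length_s seconds apart.
    """
    if stride_length_s is None:
        stride_length_s = chunk_length_s / 6  # assumed default for this sketch
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("stride_length_s must be less than half of chunk_length_s")

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:  # the last window reaches the end of the clip
            break
        start += step
    return windows


print(chunk_windows(70.0))
# [(0.0, 30.0), (20.0, 50.0), (40.0, 70.0)]
```

Because neighbouring windows overlap, a word that falls on a window boundary is seen in full by at least one window, which is the main difference from cutting the file into hard 30-second pieces beforehand.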

uriii3

Dec 3, 2022

Is this a new feature? I looked a few weeks ago and it didn't have this method!

Thank you very much, I'll try it right away.

uriii3

Dec 7, 2022

Hey, I'm coming back to you. I tried the method you suggested and the output looks fairly similar... some parts of the transcription are perfect and others have whole phrases and spaces left blank...

I'll try combining other chunk_length and/or stride length values, but the problem still seems to be there.

Thanks for the help anyway!

sanchit-gandhi

Dec 7, 2022

It sounds like it's related to the data in that case! You can try chunking your data based on where there are large periods of silence and see if that helps!
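
As a minimal sketch of the silence-based chunking suggested above, the following finds long quiet stretches in a waveform and cuts at their midpoints. Everything here is an assumption for illustration: the function names `find_silence_splits` and `split_at` are hypothetical, the samples are assumed to be floats in [-1, 1], and the threshold and durations are arbitrary starting values that would need tuning on real recordings.

```python
import math


def find_silence_splits(samples, sample_rate, frame_s=0.05,
                        threshold=0.01, min_silence_s=0.5):
    """Return sample indices at the midpoints of long silent stretches.

    A frame counts as silent when its RMS energy is below `threshold`.
    Silent runs at the very start or end of the clip are ignored.
    """
    frame_len = max(1, round(frame_s * sample_rate))
    min_frames = max(1, round(min_silence_s / frame_s))
    n_frames = len(samples) // frame_len

    splits = []
    run_start = None  # index of the first frame in the current silent run
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # A silent run just ended; keep it if it is long enough
            # and did not begin at the start of the clip.
            if run_start > 0 and i - run_start >= min_frames:
                splits.append(((run_start + i) // 2) * frame_len)
            run_start = None
    return splits


def split_at(samples, splits):
    """Cut `samples` into consecutive pieces at the given sample indices."""
    bounds = [0, *splits, len(samples)]
    return [samples[a:b] for a, b in zip(bounds, bounds[1:])]
```

Each piece returned by `split_at` could then be transcribed on its own, so that no cut lands in the middle of a spoken word.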

chitversion

Jun 5, 2023

Please help with this issue. When I try to transcribe a 30-second utterance in Mandarin, the output is:

" This kind of serious illegal act, trampling on Hong Kong's rule of law, undermining Hong Kong's social order, and harming Hong Kong's fundamental interests, is a blatant challenge to the bottom line of one country, two systems. We strongly condemn this."

When I slice this 30-second audio from second 1 (or 2, 3, 4, up to 6) to second 29, the output is:

" This serious illegal act of trampling on Hong Kong's rule of law, undermining Hong Kong's social order, and harming Hong Kong's fundamental interests is a blatant challenge to the bottom line of one country, two systems. We strongly condemn this. The spokesperson said, the central government firmly supports the Hong Kong Special Administrative Region government and police to fulfill their duties according to the law, and supports the criminal criminal cases of Hong Kong Special Administrative Region-related agencies to pursue violent criminals according to the law, and restore social order as soon as possible to ensure the safety of citizens, personal and property."

But when the segment from the 8th second to the 29th second is passed to the model, the output is just:
' the'

And from the 6th second to the 29th:
' the'

This is very weird. I can't understand what is going wrong.