Translating English Audio Into Spanish Text

#61
by stvnchnsn - opened

I'm trying to translate English audio into Spanish text using the code below. No errors occur, but the output text is in English, with no translation performed. Any clues?

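import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU with half precision when available, otherwise the CPU with float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"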

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

translate_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=1,
    torch_dtype=torch_dtype,
    device=device,
    return_timestamps=True,
    generate_kwargs={"language": "spanish", "task": "translate"}
)

The "language" parameter indicates the language spoken in the audio.
The "translate" task tells Whisper to translate the speech into English, so with English audio the output stays in English.
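
To make the two settings concrete, here is a minimal sketch of how generate_kwargs could be built for the tasks described above. The helper name trained_task_kwargs and its to_english flag are made up for illustration; only the "language" and "task" keys come from the code in this thread.

```python
def trained_task_kwargs(spoken_language: str, to_english: bool = False) -> dict:
    """Build generate_kwargs for transcription or translation into English.

    spoken_language: the language spoken in the audio (not the output language).
    to_english: True to translate the speech into English, False to transcribe
    it in the spoken language.
    """
    # Hypothetical helper: "translate" always targets English text
    task = "translate" if to_english else "transcribe"
    return {"language": spoken_language, "task": task}


# Spanish audio -> Spanish text
print(trained_task_kwargs("spanish"))
# Spanish audio -> English text
print(trained_task_kwargs("spanish", to_english=True))
```

The returned dict can be passed as generate_kwargs when building the pipeline.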

Whisper was trained on speech recognition (audio in X -> text in X) and speech translation to English (audio in X -> text in En).

You can also 'trick' it into performing more general speech translation (audio in X -> text in Y). The results are reasonable, though not as good as on the trained tasks. Set the language to your target language and the task to "transcribe":

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

translate_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=1,
    torch_dtype=torch_dtype,
    device=device,
    return_timestamps=True,
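    # "language" is the target language here; "transcribe" makes Whisper write its output in that language
    generate_kwargs={"language": "spanish", "task": "transcribe"}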
)
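
Once the pipeline is built, calling it on an audio file should return a dict with the full text and, because return_timestamps=True is set, a list of timestamped chunks. The sketch below assumes that output shape ("text" plus "chunks" entries holding a "timestamp" pair and a "text" string); format_chunks and the file name in the usage comment are hypothetical.

```python
from typing import List


def format_chunks(result: dict) -> List[str]:
    """Turn pipeline output into one "[start - end] text" line per chunk."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # A timestamp can be None, e.g. when the audio is cut off mid-segment
        start_label = f"{start:.2f}" if start is not None else "?"
        end_label = f"{end:.2f}" if end is not None else "?"
        lines.append(f"[{start_label} - {end_label}] {chunk['text'].strip()}")
    return lines


# Hypothetical usage with the pipeline defined above:
# result = translate_pipe("sample.wav")
# print(result["text"])
# print("\n".join(format_chunks(result)))
```

Reading a few chunks this way is a quick check that the output really is in the target language before processing longer recordings.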

Is there any way to improve performance? I couldn't find any dataset (English audio - Spanish text) for fine-tuning.

@Daniel981215
You could build such a dataset by taking an English speech-to-text dataset and translating its English transcripts into Spanish (using open-source or cloud translation tools).
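
A rough sketch of that idea, assuming each record of the English dataset is a dict with "audio" and "text" keys, and that you supply your own translate_text function (open-source model or cloud API). The function name, the record keys and the stub translator are all placeholders, not part of any library.

```python
from typing import Callable, Dict, Iterable, List


def build_speech_translation_pairs(
    records: Iterable[Dict[str, str]],
    translate_text: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each audio clip with a Spanish translation of its English transcript."""
    pairs = []
    for record in records:
        english = record["text"].strip()
        if not english:
            # Skip clips without a transcript; there is nothing to translate
            continue
        pairs.append(
            {
                "audio": record["audio"],
                "text_en": english,
                "text_es": translate_text(english),
            }
        )
    return pairs


# Stub translator for illustration only; replace it with a real EN -> ES system
def fake_translate(text: str) -> str:
    return {"Hello world.": "Hola mundo."}.get(text, text)


sample = [
    {"audio": "clip_001.wav", "text": "Hello world."},
    {"audio": "clip_002.wav", "text": "  "},
]
print(build_speech_translation_pairs(sample, fake_translate))
```

The quality of the fine-tuned model will depend on the quality of the machine translations, so spot-checking a sample of the Spanish text is worthwhile.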
