Translating English Audio Into Spanish Text

#61

by stvnchnsn - opened Dec 28, 2023

Dec 28, 2023

I'm trying to translate English audio into Spanish text using the code below. No errors occur, but the output text is in English, with no translation performed. Any clues?

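import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Assumed setup (not shown in the original post): GPU with half precision when available
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"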

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

translate_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=1,
    torch_dtype=torch_dtype,
    device=device,
    return_timestamps=True,
    generate_kwargs={"language": "spanish", "task": "translate"}
)

MathieuBsqt

Jan 11

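The "language" parameter indicates the language spoken in the audio.
The "translate" task tells the model to translate that speech into English, so "language": "spanish" with "task": "translate" means "Spanish audio -> English text", not the other way around.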

sanchit-gandhi

Jan 11

Whisper was trained on speech recognition (audio in X -> text in X) and speech translation into English (audio in X -> text in English).

You can also 'trick' it into performing more general speech translation (audio in X -> text in Y) with reasonable results, but not as good as the trained tasks. You just need to set the language to your target language, and the task to "transcribe":

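import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Assumed setup (not shown in the original post): GPU with half precision when available
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"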

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

translate_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=1,
    torch_dtype=torch_dtype,
    device=device,
    return_timestamps=True,
    # "language" is the target language here; "task" stays "transcribe"
    generate_kwargs={"language": "spanish", "task": "transcribe"}
)
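
To summarise the three cases described in this thread, here is a minimal sketch of a helper that picks the generate_kwargs for a given source and target language. The function name whisper_generate_kwargs is hypothetical and not part of transformers; it only builds the dictionary that would be passed to the pipeline, and it assumes languages are given as lowercase English names such as "spanish".

```python
def whisper_generate_kwargs(source_language: str, target_language: str) -> dict:
    """Hypothetical helper: choose generate_kwargs for a Whisper ASR pipeline."""
    source = source_language.strip().lower()
    target = target_language.strip().lower()

    if target == source:
        # Trained task: speech recognition (audio in X -> text in X).
        return {"language": source, "task": "transcribe"}
    if target == "english":
        # Trained task: speech translation into English (audio in X -> text in English).
        return {"language": source, "task": "translate"}
    # Untrained "trick" from this thread: declare the target language
    # and ask for a transcription (audio in X -> text in Y).
    return {"language": target, "task": "transcribe"}


# English audio, Spanish text: the case asked about in this thread.
print(whisper_generate_kwargs("english", "spanish"))
# {'language': 'spanish', 'task': 'transcribe'}
```

The returned dictionary can be passed as generate_kwargs when building the pipeline. Only the first two branches correspond to tasks Whisper was trained on, so output quality for the last branch may be noticeably lower.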

Daniel981215

Feb 2

Is there any way to improve performance? I didn't find any dataset (English audio, Spanish text) for fine-tuning.

hf-ds-user

Feb 27 (edited Feb 27)

Is there any way to improve performance? I didn't find any dataset (English audio, Spanish text) for fine-tuning.

@Daniel981215
You could build such a dataset by taking an English speech-to-text dataset and translating its English transcripts into Spanish (using open-source models or a cloud translation service).
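
A rough sketch of that idea, assuming each example is a dict with an "audio" path and an English "text" transcript. The names build_translation_dataset and toy_translate, the field names, and the file names are all made up for illustration; toy_translate is a stand-in lookup table, and in practice you would pass in a call to whatever translation model or service you choose.

```python
from typing import Callable, Dict, Iterable, List


def build_translation_dataset(
    examples: Iterable[Dict[str, str]],
    translate_text: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each audio clip with a machine-translated Spanish transcript."""
    dataset = []
    for example in examples:
        english = example["text"].strip()
        if not english:
            continue  # skip clips that have no usable transcript
        dataset.append(
            {
                "audio": example["audio"],
                "text_en": english,
                "text_es": translate_text(english),
            }
        )
    return dataset


# Placeholder translator for demonstration only (not a real translation system).
toy_lexicon = {"hello": "hola", "good morning": "buenos días"}


def toy_translate(text: str) -> str:
    return toy_lexicon.get(text.lower(), text)


examples = [
    {"audio": "clip_0001.wav", "text": "Hello"},
    {"audio": "clip_0002.wav", "text": "  "},  # empty transcript, dropped
    {"audio": "clip_0003.wav", "text": "Good morning"},
]

pairs = build_translation_dataset(examples, toy_translate)
for pair in pairs:
    print(pair["audio"], "->", pair["text_es"])
```

The English transcript is kept next to the Spanish one so that translation errors can be spot-checked before fine-tuning on the (audio, text_es) pairs.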
