Transcribing Spanish audio

#86
by Andrews99 - opened

How can I select Spanish as the language to get a better transcription?

I have an example, but it gives errors:

import whisper

# Load the Whisper model (using the 'base' model as an example)
model = whisper.load_model("base")

# Path to the Spanish audio file
audio_path = r'C:\Users\andre\Downloads\Example.wav'

# Load the audio and pad/trim it to the 30-second window the model expects
audio = whisper.load_audio(audio_path)
audio = whisper.pad_or_trim(audio)

# Convert to a log-Mel spectrogram and move it to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detect the spoken language (optional)
_, probs = model.detect_language(mel)
detected_language = max(probs, key=probs.get)
print(f"Detected language: {detected_language}")

# Decode the audio
options = whisper.DecodingOptions(language="es")  # specify that the language is Spanish
result = whisper.decode(model, mel, options)

# Print the recognized text
print(result.text)

Hey @Andrews99, you can do this in Transformers with the following steps. First, install Transformers:

pip install -U transformers accelerate

Then, run the following code snippet:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline


# Run on the GPU in float16 if available, otherwise on the CPU in float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

# Load the model weights with reduced CPU memory usage during loading
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Build an ASR pipeline that splits long audio into 30-second chunks
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

audio_path = r'C:\Users\andre\Downloads\Example.wav'

# Force Spanish transcription instead of relying on automatic language detection
result = pipe(audio_path, generate_kwargs={"language": "es", "task": "transcribe"})
print(result["text"])
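
Since return_timestamps=True is set, the output should also contain a "chunks" list with segment-level timestamps. A minimal sketch of reading them:

for chunk in result["chunks"]:
    # Each chunk carries a (start, end) tuple in seconds plus its text;
    # the end timestamp may be None for the final chunk
    start, end = chunk["timestamp"]
    print(f"[{start} -> {end}] {chunk['text']}")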

Alternatively, in OpenAI Whisper you can do:

from whisper import load_model, transcribe

model = load_model("large-v3")

audio_path = r'C:\Users\andre\Downloads\Example.wav'
pred_out = transcribe(model, audio=audio_path, language="es")
print(pred_out["text"])
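
transcribe returns a dictionary: besides "text" it also includes a "segments" list, which you can use to print per-segment timestamps:

for segment in pred_out["segments"]:
    # Each segment has start/end times in seconds and the decoded text
    print(f"[{segment['start']:.2f} -> {segment['end']:.2f}] {segment['text']}")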

Thank you! I have a better understanding 🙏
