eryk7381/whisper-med-pol-car

Abstract

Whisper medium model fine-tuned on speech recordings mixed with vehicle interior noise. Training set: 750 recordings. Evaluation set: 150 recordings.

The dataset is very small because training times are very long: approx. 22 h 3 min for this dataset. Predicted training time is 202 hours for 7,500 recordings and 2,000 hours for 75,000 recordings (the full BIGOS training set).

Datasets

Speech dataset: https://huggingface.co/datasets/michaljunczyk/pl-asr-bigos-v2

Noise dataset: https://zenodo.org/records/5606504
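The exact mixing procedure used for this model is not described here. As a rough illustration only, the sketch below shows one common way to add a noise recording to a speech recording at a chosen signal-to-noise ratio. The function name `mix_at_snr`, the `snr_db` parameter, and the choice to loop the noise to the speech length are all assumptions, not the author's method. It uses plain Python lists to stay dependency-free; real audio would normally be handled as NumPy arrays.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Return speech + scaled noise so that the speech-to-noise power ratio is snr_db (in dB).

    Hypothetical helper for illustration; `speech` and `noise` are sequences of
    float samples at the same sampling rate.
    """
    if len(speech) == 0 or len(noise) == 0:
        raise ValueError("speech and noise must be non-empty")

    # Repeat the noise if it is shorter than the speech, then trim to equal length.
    reps = math.ceil(len(speech) / len(noise))
    noise = (list(noise) * reps)[: len(speech)]

    # Mean power of each signal.
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0:
        return list(speech)  # silent noise: nothing to add

    # Choose scale so that speech_power / (scale**2 * noise_power) == 10 ** (snr_db / 10).
    scale = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

A lower `snr_db` makes the noise louder relative to the speech; at 0 dB both signals have equal power.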

Usage:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use CUDA if available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_id = "eryk7381/whisper-med-pol-car"
# float16 is intended for GPU inference; use float32 on the CPU
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model and move it to the selected device
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# Load the processor (tokenizer and feature extractor)
processor = AutoProcessor.from_pretrained(model_id)

# Create the speech recognition pipeline on the selected device
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a Polish audio file
audio_path = "your_audio_path.wav"
result = pipe(audio_path, generate_kwargs={"language": "polish"})
print(result["text"])
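
Because the pipeline is created with return_timestamps=True, the result also contains a "chunks" list, where each entry has a "text" string and a "timestamp" pair of start and end times in seconds. The sketch below shows one possible way to print those segments. The helper name `format_chunks` and the output layout are assumptions made for illustration, and the end time is treated as possibly missing (None), which can happen for the final segment.

```python
def format_chunks(chunks):
    """Turn ASR pipeline chunks into lines like '[0.00 - 2.50] text'.

    Hypothetical helper: expects a list of dicts with a "timestamp"
    (start, end) pair and a "text" string, as found in result["chunks"].
    """

    def label(seconds):
        # A missing boundary is shown as "?" instead of a number.
        return "?" if seconds is None else f"{seconds:.2f}"

    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        lines.append(f"[{label(start)} - {label(end)}] {chunk['text'].strip()}")
    return lines
```

With the pipeline above, this could be used as `print("\n".join(format_chunks(result["chunks"])))`.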