New Whisper implementation optimized for speaker diarization

#3
by smach - opened
Journalists on Hugging Face org

This looks interesting, via @philschmid:

“If you are using Whisper for transcription, listen⁉️👂 We created an optimized Whisper with Speaker Diarization for @huggingface Inference Endpoints 🤗 We created a reference implementation that optimizes Whisper with Flash Attention and Speculative Decoding and combines it with Diarization for speaker separations! 🤯

TL;DR:
🏎️ Ultra faster inference due to flash attention & speculative decoding
✅ Leverages the Custom Handler feature of Hugging Face Inference Endpoints
⚡️ Takes 4.15s to transcribe 60s audio for Whisper Large on 1x A10G GPU
🔬 Combines Whisper with Pyannote's diarization model
🌐 Fully customizable and adjustable to specific use cases
🔓 Open-source for easy deployment”

Blog post: https://huggingface.co/blog/asr-diarization

Python code: https://huggingface.co/sergeipetrov/asrdiarization-handler/blob/main/handler.py
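
For anyone wondering what "combining" Whisper with a diarization model amounts to in practice, here is a minimal sketch of one common approach: give each timestamped transcript chunk the speaker whose turn overlaps it the most. This is not the code from the linked handler, which remains the reference. The `Chunk` and `Turn` classes, the `assign_speakers` function, the `SPEAKER_00`-style labels, and the sample data are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A timestamped piece of transcript (hypothetical shape of ASR output)."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """A speaker turn (hypothetical shape of diarization output)."""
    start: float
    end: float
    speaker: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(chunks, turns):
    """Label each chunk with the speaker whose turn overlaps it the most."""
    labeled = []
    for chunk in chunks:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn in turns:
            shared = overlap(chunk.start, chunk.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append((best_speaker, chunk.text))
    return labeled


# Made-up sample data: three transcript chunks and three speaker turns.
chunks = [
    Chunk(0.0, 2.5, "Thanks for joining us."),
    Chunk(2.5, 5.0, "Happy to be here."),
    Chunk(5.0, 8.0, "Let's start with the budget."),
]
turns = [
    Turn(0.0, 2.4, "SPEAKER_00"),
    Turn(2.4, 5.1, "SPEAKER_01"),
    Turn(5.1, 9.0, "SPEAKER_00"),
]

for speaker, text in assign_speakers(chunks, turns):
    print(f"{speaker}: {text}")
```

For interviews, the payoff is a transcript where each line already carries a speaker label. Chunks that overlap no turn at all are marked `UNKNOWN` rather than guessed.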


I immediately thought about potential use cases in journalism when I read this! I'd be curious to know if anyone has tried it!
