Word-level timestamps & speaker identification

#29
by chan-K - opened

Are there any functions for timestamps and speaker identification?

Timestamps should be supported in the coming weeks (in the official Transformers library); however, speaker identification was not really part of the official release. I'm pretty sure you could train the model to predict a speaker_token before the predicted transcription!

@ArthurZ are you referring to the chunk-level timestamps (as in the original Whisper repo) or word-level timestamps?

@tdeboissiere That's what I wanted to ask. I mean word-level timestamps.

Has any further information been found on this subject?

@tdeboissiere @chan-K Any further info regarding word-level timestamps?

Hey all! @ArthurZ is integrating timestamp prediction into Transformers and should have it finished fairly shortly: https://github.com/huggingface/transformers/pull/20620#issuecomment-1344452967

For word-level timestamps, you can check out the WhisperX repo: https://github.com/m-bain/whisperX
This workflow combines the Whisper sequence-level timestamps with word-level timestamps from a CTC model to give accurate timestamps and text predictions.
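
To make the idea of combining the two timestamp sources concrete, here is a toy sketch of one way to attach externally aligned word timings to sequence-level segments by time overlap. It is not WhisperX's actual implementation: the `Segment` and `Word` classes, the `assign_words` helper, and the sample timings are all hypothetical and only illustrate the bookkeeping.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A sequence-level span, e.g. as produced by Whisper (hypothetical type)."""
    start: float
    end: float
    text: str


@dataclass
class Word:
    """A word-level span, e.g. as produced by a CTC aligner (hypothetical type)."""
    start: float
    end: float
    text: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Return the length in seconds of the intersection of two intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_words(segments: list[Segment], words: list[Word]) -> dict[int, list[Word]]:
    """Group each word under the segment it overlaps most; drop words with no overlap."""
    grouped: dict[int, list[Word]] = {i: [] for i in range(len(segments))}
    if not segments:
        return grouped
    for word in words:
        best = max(
            range(len(segments)),
            key=lambda i: overlap(segments[i].start, segments[i].end, word.start, word.end),
        )
        if overlap(segments[best].start, segments[best].end, word.start, word.end) > 0:
            grouped[best].append(word)
    return grouped


# Made-up example data, for illustration only.
segments = [Segment(0.0, 2.0, "hello world"), Segment(2.0, 4.0, "good morning")]
words = [
    Word(0.1, 0.5, "hello"),
    Word(0.6, 1.1, "world"),
    Word(2.2, 2.5, "good"),
    Word(2.6, 3.3, "morning"),
]
grouped = assign_words(segments, words)
```

With the sample data, `grouped[0]` holds the words of the first segment and `grouped[1]` those of the second, each with its own start and end time.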

Here is a repository to estimate word-level timestamps and confidence with Whisper: https://github.com/Jeronymous/whisper-timestamped

Unlike WhisperX, the approach here does not need an additional wav2vec model, so it should be more robust.
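
As a rough illustration of how word-level timestamps and confidences might be consumed downstream, here is a small sketch that flattens a transcription result into a list of words. The result layout (a `segments` list whose items carry a `words` list with `text`, `start`, `end` and `confidence` keys) is an assumption made for this example, as are the `flatten_words` helper and the sample values; check the repository's README for the actual output format.

```python
from typing import Any


def flatten_words(
    result: dict[str, Any], min_confidence: float = 0.0
) -> list[tuple[str, float, float]]:
    """Collect (text, start, end) for every word at or above a confidence threshold.

    Assumes a hypothetical layout: result["segments"][i]["words"][j] is a dict
    with "text", "start", "end" and, optionally, "confidence" keys.
    """
    flat: list[tuple[str, float, float]] = []
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            # Words without a confidence score are kept (treated as 1.0).
            if word.get("confidence", 1.0) >= min_confidence:
                flat.append((word["text"], word["start"], word["end"]))
    return flat


# Made-up sample result, for illustration only.
sample = {
    "segments": [
        {
            "text": "hello world",
            "words": [
                {"text": "hello", "start": 0.1, "end": 0.5, "confidence": 0.9},
                {"text": "world", "start": 0.6, "end": 1.1, "confidence": 0.4},
            ],
        }
    ]
}
all_words = flatten_words(sample)
confident_words = flatten_words(sample, min_confidence=0.5)
```

Filtering on the confidence value is one possible use of it, for example to flag words that may need manual review.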

That's very cool, @Jeronymous! Gonna check out the repo 🙌

@Jeronymous: just wanted to say thank you for this repo. Great work!

@sanchit-gandhi How can one highlight words while running WhisperX locally?
