Spaces: openai/whisper

Time stamp on word level & Speaker identification

#29

by chan-K - opened Oct 11, 2022

Discussion

chan-K

Oct 11, 2022

Are there any functions for word-level timestamps and speaker identification?

ArthurZ

Oct 11, 2022

Timestamps should be supported in the coming weeks (in the official Transformers library); however, speaker identification was not really part of the official release. I'm pretty sure you could train the model to predict a speaker_token before the predicted transcription!

tdeboissiere

Oct 11, 2022

@ArthurZ are you referring to the chunk-level timestamps (as in the original Whisper repo) or word-level timestamps?

chan-K

Oct 12, 2022

•

edited Oct 12, 2022

@tdeboissiere That's what I wanted to ask. I mean word-level timestamps.

redplanetrobot

Oct 15, 2022

Has any further information been found on this subject?

yugaljain1999

Jan 13, 2023

@tdeboissiere @chan-K Any further info regarding word level timestamps?

sanchit-gandhi

Jan 16, 2023

Hey all! @ArthurZ is integrating timestamp prediction into Transformers and should have it finished fairly shortly: https://github.com/huggingface/transformers/pull/20620#issuecomment-1344452967

For word-level timestamps, you can check out the WhisperX repo: https://github.com/m-bain/whisperX
This workflow combines Whisper's sequence-level timestamps with word-level timestamps from a CTC model to give accurate timestamps and text predictions.
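
To make the idea of combining the two timestamp sources more concrete, here is a rough, library-free sketch. It is not WhisperX's actual code: the `Segment` and `Word` types and the `attach_words` helper are hypothetical names, and the midpoint-and-clamp rule is just one simple way to attach word timings to sequence-level segments.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Segment:
    text: str
    start: float  # sequence-level start, e.g. from Whisper
    end: float    # sequence-level end
    words: list = field(default_factory=list)


def attach_words(segments, words):
    """Attach each word to the segment that contains its midpoint.

    Word boundaries are clamped to the segment boundaries so that a word
    never sticks out of the segment it belongs to. Words whose midpoint
    falls outside every segment are dropped (an assumption of this toy).
    """
    for seg in segments:
        seg.words = []
    for w in words:
        mid = (w.start + w.end) / 2
        for seg in segments:
            if seg.start <= mid <= seg.end:
                seg.words.append(
                    Word(w.text, max(w.start, seg.start), min(w.end, seg.end))
                )
                break
    return segments


# Made-up example values, standing in for real model output.
segments = [Segment("hello world", 0.0, 1.2), Segment("good morning", 1.5, 2.8)]
words = [
    Word("hello", 0.10, 0.45),
    Word("world", 0.50, 1.25),
    Word("good", 1.48, 1.90),
    Word("morning", 1.95, 2.70),
]
result = attach_words(segments, words)
```

In a real pipeline the word timings would come from the aligner and the segments from the transcription model; only the merging step is shown here.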

Jeronymous

Jan 22, 2023

Here is a repository to estimate word-level timestamps and confidence with Whisper: https://github.com/Jeronymous/whisper-timestamped

Unlike whisperX, the approach here does not need an additional wav2vec model, so it should be more robust.

sanchit-gandhi

Jan 26, 2023

That's very cool @Jeronymous ! Gonna check out the repo 🙌

joris-rumble

Nov 4, 2023

@Jeronymous: just wanted to say thank you for this repo. Great work!

Rixhabh

May 5, 2024

@sanchit-gandhi how can one highlight words while running WhisperX locally?
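
A possible starting point for this kind of highlighting, sketched under assumptions: it presumes you already have word-level results as a list of dicts with `word`, `start` and `end` keys (that shape is an assumption here, so check what your local setup actually returns), and the `highlight` helper is a hypothetical name, not part of any library.

```python
def highlight(words, t, open_mark="[", close_mark="]"):
    """Return the transcript with the word being spoken at time t marked.

    words: list of dicts with "word", "start", "end" (times in seconds).
    t:     current playback time in seconds.
    """
    out = []
    for w in words:
        token = w["word"]
        # A word is "active" from its start (inclusive) to its end (exclusive).
        if w["start"] <= t < w["end"]:
            token = f"{open_mark}{token}{close_mark}"
        out.append(token)
    return " ".join(out)


# Made-up timings for illustration.
words = [
    {"word": "hello", "start": 0.10, "end": 0.45},
    {"word": "world", "start": 0.50, "end": 1.20},
]
during_word = highlight(words, 0.6)
during_pause = highlight(words, 0.47)
```

Calling `highlight` repeatedly with the player's current time would give a karaoke-style display; the markers can be swapped for HTML tags or terminal colour codes.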

Laurin-myreha

Sep 5, 2024

•

edited Sep 5, 2024

@sanchit-gandhi CrisperWhisper (linked below) should also alleviate some of Whisper's timestamp issues, especially around pauses. It would be cool to also have it evaluated on the ASR leaderboard. We also found that removing symbols that have no clear acoustic representation, such as punctuation, from the DTW alignment improves results slightly, even for the original models. We will open a PR in the future :)

accompanying Interspeech paper: https://arxiv.org/abs/2408.16589

some further explanations of how the final model was created: https://huggingface.co/nyrahealth/CrisperWhisper

model: https://github.com/nyrahealth/CrisperWhisper/tree/main
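
The point about dropping punctuation before the DTW alignment can be illustrated with a small toy example. This is only a sketch under assumptions, not the CrisperWhisper or Whisper implementation: the helper names are hypothetical, the cost matrix is made up (in practice it would be derived from the model, with low cost where a token matches a frame), and the "no acoustic representation" test is simplified to "contains no alphanumeric character".

```python
def drop_unvoiced(tokens):
    """Keep (original_index, token) pairs that contain a letter or digit."""
    return [(i, t) for i, t in enumerate(tokens) if any(c.isalnum() for c in t)]


def dtw_path(cost):
    """Plain DTW over a cost matrix (rows = tokens, columns = audio frames).

    Returns the minimum-cost monotonic path as a list of (row, col) pairs.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = cost[i - 1][j - 1] + best_prev
    # Backtrack from the bottom-right corner to the top-left corner.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def frame_spans(kept, path):
    """Map each kept token's original index to its (first, last) frame."""
    spans = {}
    for row, col in path:
        idx = kept[row][0]
        first, last = spans.get(idx, (col, col))
        spans[idx] = (min(first, col), max(last, col))
    return spans


tokens = ["Hello", ",", "world", "!"]
kept = drop_unvoiced(tokens)  # punctuation never enters the alignment

# Made-up costs: one row per kept token, one column per audio frame.
cost = [
    [0.1, 0.2, 0.9, 0.9],  # "Hello" matches frames 0-1
    [0.9, 0.8, 0.1, 0.2],  # "world" matches frames 2-3
]
path = dtw_path(cost)
spans = frame_spans(kept, path)
```

Because the punctuation tokens are filtered out first, every frame is assigned to a token that is actually spoken; the spans are keyed by the original token index, so the punctuation can be re-attached to the text afterwards.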
