Automatic Speech Recognition
NeMo

Timestamps

#3
by Bharsh056 - opened

how can we get timestamps of transcription as it is not inheritly there

Hello Bharsh056,

Here is how to get word-level timestamps from Vaani-FastConformer:

Pass return_hypotheses=True to transcribe():

  hyps = model.transcribe([audio_path], return_hypotheses=True)
  hyp = hyps[0]

hyp.timestamp is a tensor of encoder output frame indices, one per decoded BPE token. Convert to seconds by multiplying by the effective frame stride:

  stride = model.cfg.preprocessor.window_stride * model.cfg.encoder.subsampling_factor# → 0.01s × 8 = 0.08s per frame

Group SentencePiece tokens into words using the ▁ word-start marker, then pair with the next word's start as the end time:

  import soundfile as sf

  frames = hyp.timestamp.tolist()
  token_ids = hyp.y_sequence.tolist()
  tokens = [model.tokenizer.ids_to_tokens([tid])[0] for tid in token_ids]

  words, starts = [], []
  cur_word, cur_start = "", None
  for f, tok in zip(frames, tokens):
      if tok.startswith("▁"):
          if cur_word:
              words.append(cur_word)
              starts.append(cur_start)
          cur_word, cur_start = tok[1:], f * stride
      else:
          cur_word += tok
  if cur_word:
      words.append(cur_word)
      starts.append(cur_start)

  audio_duration = sf.info(audio_path).duration
  ends = starts[1:] + [audio_duration]

  for word, start, end in zip(words, starts, ends):
      print(f"{start:.3f}s–{end:.3f}s  {word}")

Example output (test.wav, duration= 3.77s):

0.000s–0.160s ई
0.160s–0.560s दवाई
0.560s–0.880s काज
0.880s–1.120s नै
1.120s–1.600s केलक
1.600s–1.760s हम
1.760s–2.320s एखनो
2.320s–2.800s बीमारे
2.800s–3.360s लगैत
3.360s–3.773s छी

I hope this was helpful.

Thank you for providing word-level transcription; this is highly beneficial as it allows for the creation of segment-level transcriptions. I have been encountering difficulties with a specific use case where the audio quality (vaani) is poor for English, yet generic pipelines heavily rely on English as a primary component. This often leads to failures and mixed speech issues. How can I effectively address this challenge, given that no single model offers a comprehensive solution for mixed-language segmentation, especially considering my requirement for an open-source model?

Sign up or log in to comment