Timestamps
How can we get timestamps for the transcription, as they are not inherently provided?
Hello Bharsh056,
Here is how to get word-level timestamps from Vaani-FastConformer:
Pass return_hypotheses=True to transcribe():
hyps = model.transcribe([audio_path], return_hypotheses=True)
hyp = hyps[0]
hyp.timestamp is a tensor of encoder output frame indices, one per decoded BPE token. Convert to seconds by multiplying by the effective frame stride:
stride = model.cfg.preprocessor.window_stride * model.cfg.encoder.subsampling_factor  # → 0.01s × 8 = 0.08s per frame
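As a quick sanity check of that arithmetic, here is a minimal sketch that needs no model. The helper name frame_to_seconds is hypothetical, and its defaults simply reuse the 0.01s window stride and subsampling factor of 8 quoted above; in practice, read both values from model.cfg rather than hard-coding them.

```python
def frame_to_seconds(frame_index: int,
                     window_stride: float = 0.01,
                     subsampling_factor: int = 8) -> float:
    """Convert an encoder output frame index to a time offset in seconds.

    The defaults are the values quoted in this thread (assumed, not read
    from the model config).
    """
    return frame_index * window_stride * subsampling_factor


# Frame 7 at 0.08s per frame starts 0.56 seconds into the audio.
print(f"{frame_to_seconds(7):.3f}s")  # prints 0.560s
```

So a token emitted at encoder frame 7 is placed at 0.560s, which is the same 0.08s grid that the timestamps in the example output fall on.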
Group SentencePiece tokens into words using the ▁ word-start marker, then use the next word's start time as each word's end time (the last word ends at the audio duration):
import soundfile as sf
frames = hyp.timestamp.tolist()
token_ids = hyp.y_sequence.tolist()
tokens = [model.tokenizer.ids_to_tokens([tid])[0] for tid in token_ids]
words, starts = [], []
cur_word, cur_start = "", None
for f, tok in zip(frames, tokens):
    if tok.startswith("▁"):
        if cur_word:
            words.append(cur_word)
            starts.append(cur_start)
        cur_word, cur_start = tok[1:], f * stride
    else:
        cur_word += tok
if cur_word:
    words.append(cur_word)
    starts.append(cur_start)
audio_duration = sf.info(audio_path).duration
ends = starts[1:] + [audio_duration]
for word, start, end in zip(words, starts, ends):
    print(f"{start:.3f}s–{end:.3f}s {word}")
Example output (test.wav, duration = 3.77s):
0.000s–0.160s ई
0.160s–0.560s दवाई
0.560s–0.880s काज
0.880s–1.120s नै
1.120s–1.600s केलक
1.600s–1.760s हम
1.760s–2.320s एखनो
2.320s–2.800s बीमारे
2.800s–3.360s लगैत
3.360s–3.773s छी
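If you want to reuse the grouping step, it could be wrapped in a small function along these lines. This is only a sketch under a few assumptions: the function name words_with_times is hypothetical, the token list in the demo is made up for illustration (it is not real model output), and the guard for a first token without the ▁ marker is an addition that the snippet above does not have. It takes plain Python lists, so it can be tried without loading the model.

```python
WORD_START = "\u2581"  # the SentencePiece word-start marker "▁"


def words_with_times(frames, tokens, stride, audio_duration):
    """Group sub-word tokens into (word, start_sec, end_sec) tuples.

    frames: encoder frame index per token
    tokens: SentencePiece token strings, same length as frames
    stride: seconds per encoder frame
    audio_duration: used as the end time of the last word
    """
    words, starts = [], []
    cur_word, cur_start = "", None
    for frame, tok in zip(frames, tokens):
        if tok.startswith(WORD_START):
            # A new word begins: flush the word collected so far.
            if cur_word:
                words.append(cur_word)
                starts.append(cur_start)
            cur_word, cur_start = tok[len(WORD_START):], frame * stride
        else:
            # Added guard: the very first token may lack the marker.
            if cur_start is None:
                cur_start = frame * stride
            cur_word += tok
    if cur_word:
        words.append(cur_word)
        starts.append(cur_start)
    # Each word ends where the next one starts; the last ends with the audio.
    ends = starts[1:] + [audio_duration]
    return list(zip(words, starts, ends))


# Hypothetical tokens and frame indices, for illustration only.
demo = words_with_times(
    frames=[0, 3, 10, 12],
    tokens=["\u2581hel", "lo", "\u2581wor", "ld"],
    stride=0.08,
    audio_duration=1.5,
)
for word, start, end in demo:
    print(f"{start:.3f}s–{end:.3f}s {word}")
```

With the made-up input above this prints 0.000s–0.800s hello and 0.800s–1.500s world. Note that because each end time is taken from the next word's start, any silence between words is counted as part of the preceding word.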
I hope this was helpful.
Thank you for providing word-level transcription; this is highly beneficial as it allows for the creation of segment-level transcriptions. I have been encountering difficulties with a specific use case where the audio quality (vaani) is poor for English, yet generic pipelines heavily rely on English as a primary component. This often leads to failures and mixed speech issues. How can I effectively address this challenge, given that no single model offers a comprehensive solution for mixed-language segmentation, especially considering my requirement for an open-source model?