KenLM, finetuning?

#3
by RASMUS - opened

I have two questions:

  1. Could this model benefit from the same kind of KenLM language model that is available for Wav2Vec2 models (using pyctcdecode)?
  2. Are there finetuning examples available?
SpeechBrain org
  1. Yes! We got a big improvement on LibriSpeech when we decoded using an n-gram language model.

  2. I haven't tried finetuning in Hugging Face, but glancing at the wav2vec 2.0 examples, it might be possible to just replace that model with M-CTC-T, since both are just PyTorch encoders (a rough sketch is below).
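
A minimal sketch of that idea, assuming the MCTCTForCTC / MCTCTProcessor classes in transformers and the speechbrain/m-ctc-t-large checkpoint; the audio and transcript variables are placeholders, and this just mirrors the wav2vec 2.0 CTC fine-tuning examples rather than a tested recipe:

```python
# Hedged sketch: one CTC fine-tuning step for M-CTC-T, analogous to the
# wav2vec 2.0 examples. `audio` (1-D float array at 16 kHz) and
# `transcript` (its reference text) are placeholders.
import torch
from transformers import MCTCTForCTC, MCTCTProcessor

processor = MCTCTProcessor.from_pretrained("speechbrain/m-ctc-t-large")
model = MCTCTForCTC.from_pretrained("speechbrain/m-ctc-t-large")

inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
labels = processor.tokenizer(transcript, return_tensors="pt").input_ids

# Passing labels makes the CTC head return a loss you can backprop through.
outputs = model(input_features=inputs.input_features, labels=labels)
outputs.loss.backward()  # plug into your own optimizer or the HF Trainer
```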

lorenlugosch changed discussion status to closed
  1. What was the process for adding the n-gram model? Is it already supported with an object like MCTCTProcessorWithLM, as with wav2vec?
    Right now the difference would just be not removing cased characters, separators, etc.? https://huggingface.co/blog/wav2vec2-with-ngram
  2. Does this model support timestamps in the output?

With those additions, this would be a perfect tool for generating transcriptions in language X, translating them to language Y, and adding them to a video.
I have been building a demo that uses wav2vec2 + T5 for casing/punctuation correction, then uses those outputs to map out sentence-level timestamps + transcriptions --> feeds them to an OpusMT model for translation to English --> burns the translations into the original video. T5 is a bit clunky for that kind of work (it handles only 128/256 tokens at a time), and I need a bunch of ugly matching logic back to the original ASR model output to get things right.

See my short demo in this thread, @lorenlugosch: https://twitter.com/itsafiz/status/1533484258597437440

Pinging @patrickvonplaten if he has time to give his thoughts.

SpeechBrain org
  1. You could try the wav2vec2 LM decoder, assuming the interface is the same (logits as inputs)? We ran our LM experiments using Flashlight. (A rough pyctcdecode sketch is included after this list.)

  2. You could maybe generate timestamps by feeding the output logits into a tool like CTC-Segmentation: https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/alignment/ctc_segmentation.py (a rough timestamp sketch is at the end of this reply).
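
For illustration, a rough sketch of option 1 with pyctcdecode; the checkpoint name and the "lm.arpa" path are placeholders, and I haven't verified how M-CTC-T's cased, multilingual character vocabulary interacts with the decoder, so treat this as a starting point rather than a recipe:

```python
# Hedged sketch: KenLM-boosted beam search over M-CTC-T logits via pyctcdecode.
# "lm.arpa" is a placeholder for your own KenLM model.
import torch
from pyctcdecode import build_ctcdecoder
from transformers import MCTCTForCTC, MCTCTProcessor

processor = MCTCTProcessor.from_pretrained("speechbrain/m-ctc-t-large")
model = MCTCTForCTC.from_pretrained("speechbrain/m-ctc-t-large")

# The label list must be ordered by token id so it lines up with the logit columns.
vocab = [tok for tok, _ in sorted(processor.tokenizer.get_vocab().items(),
                                  key=lambda kv: kv[1])]
decoder = build_ctcdecoder(labels=vocab, kenlm_model_path="lm.arpa")

# `audio` is a placeholder 1-D float array at 16 kHz.
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(input_features=inputs.input_features).logits[0]

print(decoder.decode(logits.numpy()))
```

As discussed in the wav2vec2-with-ngram blog post linked above, you may need to massage the vocabulary (casing, the word delimiter token) so that it matches what the KenLM model was trained on.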

But note that this model doesn't handle long utterances very well because it was trained on Common Voice (which contains only short utterances), so you might need to split the audio from your videos into smaller chunks before running the model.
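
And a very rough illustration of the timestamp idea, using the standalone ctc_segmentation package rather than the SpeechBrain wrapper linked above; the function names follow that package's README, and `log_probs`, `sentences`, `audio_duration_s`, and `vocab` are placeholders (with `vocab` ordered by token id as in the decoding sketch):

```python
# Hedged sketch: sentence-level timestamps by aligning M-CTC-T CTC posteriors
# against known sentences with the `ctc_segmentation` package.
# `log_probs` is a (frames, vocab) numpy array of log-softmaxed logits,
# `sentences` a list of strings, `audio_duration_s` the clip length in seconds.
from ctc_segmentation import (CtcSegmentationParameters, ctc_segmentation,
                              determine_utterance_segments, prepare_text)

config = CtcSegmentationParameters()
config.char_list = vocab
# Seconds of audio represented by one row of log_probs (an assumption; check
# the attribute name expected by your installed version of the package).
config.index_duration = audio_duration_s / log_probs.shape[0]

ground_truth_mat, utt_begin_indices = prepare_text(config, sentences)
timings, char_probs, state_list = ctc_segmentation(config, log_probs, ground_truth_mat)
segments = determine_utterance_segments(
    config, utt_begin_indices, char_probs, timings, sentences)

for sentence, (start, end, score) in zip(sentences, segments):
    print(f"{start:7.2f}s - {end:7.2f}s  {sentence}")
```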

Thanks for the fast answer.
I will take some time to ponder what we will work on next in our applied research with @aapot.
I am now mostly working on building a pipeline for long audio files to gather pretraining material for Finnish, but hopefully we can spend some time on this too!
