Super cool model!

by patrickvonplaten

Thanks for pinging me.
Definitely looks interesting; I'll try this for my multilingual approach.

@flozi00 The Jupyter notebook for extracting tokens is at https://huggingface.co/fxtentacle/tevr-token-entropy-predictor-de. Bilingual DE+EN worked OK for me too, but I conditioned the acoustic decoder with a different BOS token for each language, similar to how mBART is trained for translation.

BTW, we cited your https://huggingface.co/flozi00/wav2vec2-xls-r-1b-5gram-german as a reference in the paper. You're the "wav2vec 2.0 XLS-R 1B 5-gram 4.38% Zimmermeister (2022)*" in the results table.

FYI the C++ code for the Linux CLI tool based on this model is now on GitHub: https://github.com/DeutscheKI/tevr-asr-tool

And BTW @flozi00 I got this comment: https://news.ycombinator.com/item?id=32413636 from nshmyrev, one of the main developers of the Vosk speech recognition toolkit. He believes that both of our models are overtrained on CommonVoice German, and he suggests a method I hadn't heard of before, based on "perplexity", to check for and prevent that, so that the models work better for general everyday recognition.

Hmm.
That seems interesting, because the discussion is about the language model, and I did not train it on CommonVoice.
The LM is trained on parts of Wikipedia and OSCAR instead.

I think what nshmyrev meant is that perhaps the LM training data contained sentences that are also in the CV test set (e.g., if the sentences used in the CV test set were sourced from Project Gutenberg texts, and those texts were also included in OSCAR), which would bias the results. I have no idea whether that's true, but it would explain the large improvement after adding the LM.
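The leakage check nshmyrev alludes to can be sketched roughly like this: score the LM's perplexity on the test sentences and on a genuinely unseen control set; if the test sentences get markedly lower perplexity, they (or near-duplicates) probably appeared in the LM training data. A minimal self-contained illustration with a toy add-one-smoothed unigram LM (the real check would use the actual n-gram LM, e.g. via KenLM's scoring API; all sentences below are made up):

```python
import math
from collections import Counter

def train_unigram(corpus_tokens, vocab_size):
    """Maximum-likelihood unigram LM with add-one smoothing."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    return lambda w: (counts[w] + 1) / (total + vocab_size)

def perplexity(prob, tokens):
    """exp of the average negative log-probability per token."""
    nll = -sum(math.log(prob(w)) for w in tokens)
    return math.exp(nll / len(tokens))

# Toy "LM training data" that accidentally contains a test sentence.
train = "die katze sitzt auf der matte der hund schläft im garten".split()
leaked_test = "die katze sitzt auf der matte".split()        # appears in training data
fresh_test = "das wetter in berlin ist heute schön".split()  # genuinely unseen

vocab = len(set(train + leaked_test + fresh_test))
lm = train_unigram(train, vocab)

# The leaked sentence scores a markedly lower perplexity than the unseen one,
# which is the signal the overlap check looks for.
print(perplexity(lm, leaked_test) < perplexity(lm, fresh_test))  # True
```

In practice one would compare the LM's perplexity on the CV test transcripts against a control corpus of similar domain and length, since absolute perplexity values are not comparable across vocabularies.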

I disagree, as the LM only lowers the WER while the CER barely changes.
The LM fixes spelling, but it doesn't improve the actual understanding in the transcription.

So I don't think the LM is overfit and biased towards CV.
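The WER/CER argument can be made concrete: when an LM fixes a one-character spelling mistake, it removes an entire word error but only a single character error, so WER drops sharply while CER barely moves. A small self-contained sketch (the example sentences are made up, not from either model's output):

```python
def levenshtein(a, b):
    """Edit distance between two sequences via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: word-level edit distance over reference length."""
    r, h = ref.split(), hyp.split()
    return levenshtein(r, h) / len(r)

def cer(ref, hyp):
    """Character error rate: character-level edit distance over reference length."""
    return levenshtein(ref, hyp) / len(ref)

ref = "das ist ein test"
acoustic = "das ist ein tesd"  # raw acoustic output with a one-character typo
with_lm = "das ist ein test"   # LM corrects the spelling

# One fixed character removes a whole word error (WER: 0.25 -> 0.0) ...
print(wer(ref, acoustic), wer(ref, with_lm))
# ... but only one of 16 character errors (CER: 0.0625 -> 0.0).
print(cer(ref, acoustic), cer(ref, with_lm))
```

So if the LM were leaking whole test sentences, you would expect both WER and CER to collapse together; a large WER gain with nearly unchanged CER is consistent with spelling correction only.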
