How to generate "unit" for the model?

#1
by scaufish - opened

In the provided code, the input to CodeHiFiGANVocoder is "unit", which, I gather, is a speech representation.
I tried running the code and reading the original papers. If I understand correctly, the unit is an intermediate output of a multilingual speech-to-speech translation model, and the vocoder presented here then converts the unit to audio.
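
To make concrete what I think a "unit" sequence is, here is a minimal sketch. It assumes units are discrete cluster indices obtained by assigning each frame of speech features to its nearest k-means centroid, with consecutive repeats optionally collapsed, as is common in discrete-unit speech work. I have not confirmed that this model's units are produced this way. The centroids, the features, and the function names `quantize_to_units` and `collapse_repeats` are hypothetical toy values, not part of this repository or of any library.

```python
import numpy as np


def quantize_to_units(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Assign each feature frame to the index of its nearest centroid.

    features:  (num_frames, dim) array of frame-level speech features.
    centroids: (num_units, dim) array of k-means cluster centers.
    """
    # Squared Euclidean distance from every frame to every centroid.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def collapse_repeats(units: np.ndarray) -> np.ndarray:
    """Drop consecutive duplicate unit ids, e.g. [0, 0, 1] becomes [0, 1]."""
    units = np.asarray(units)
    if units.size == 0:
        return units
    keep = np.concatenate(([True], units[1:] != units[:-1]))
    return units[keep]


# Toy stand-ins (assumption): real features would come from a speech encoder,
# and real centroids from a k-means model trained on such features.
centroids = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
features = np.array(
    [[0.1, -0.1], [0.0, 0.2], [0.9, 1.1], [4.8, 5.1], [5.2, 4.9], [1.2, 0.8]]
)

units = quantize_to_units(features, centroids)
reduced = collapse_repeats(units)
print(units.tolist())    # [0, 0, 1, 2, 2, 1]
print(reduced.tolist())  # [0, 1, 2, 1]
```

A sequence of integers like this is what I understand the vocoder to take as input. The sketch does not solve the actual problem, which is obtaining the trained model that produces the right units for Hokkien.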
The question is: how do we generate the units ourselves? As far as I can find, Meta did not release the speech-to-speech translation model covering Hokkien that is mentioned in this paper.
So does this mean we are missing the most important set of parameters, namely the speech translation model that generates the "units" for the final vocoder? Or am I missing something here?
Thanks for any insight!

For example, the multilingual speech translation model I found (https://huggingface.co/facebook/seamless-m4t-v2-large) does not support Hokkien.
