10 seconds for everything, seriously?
#18
by Markobes - opened
I have a 2-hour movie track with different people speaking with different emotions, and the subtitles are in a different language. How can a model learn from just ONE 10-second sample which actor said a particular phrase and how they said it, without inventing its own reading?
I need it to listen to the entire track and understand how to read the line in EVERY place.
It maps the qualities of the voice onto a skeleton, I believe. If a quality would need a longer stretch of audio to carry over to the synthetic version of the person you're replicating, it probably is not a quality that gets mapped.
But quite a bit can be learned from 10 seconds.
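To make the "skeleton" idea concrete, here is a toy sketch, not the model's actual method. It assumes the reference clip is reduced to one fixed-size vector by averaging per-frame features over time. The function name `voice_summary` and the two-number frame features (a pitch-like value and a brightness-like value) are made up for illustration.

```python
from statistics import fmean


def voice_summary(frames):
    """Average per-frame feature vectors over time into one fixed-size vector.

    Assumption: this stands in for whatever speaker representation a real
    model computes from its reference clip.
    """
    if not frames:
        raise ValueError("need at least one frame")
    dims = len(frames[0])
    return [fmean(frame[d] for frame in frames) for d in range(dims)]


# Hypothetical frames from the same speaker: (pitch-like, brightness-like).
# The calm reading barely moves in pitch; the excited one swings widely.
calm = [[100.0, 0.2], [102.0, 0.2], [98.0, 0.2], [100.0, 0.2]]
excited = [[60.0, 0.2], [140.0, 0.2], [70.0, 0.2], [130.0, 0.2]]

calm_summary = voice_summary(calm)
excited_summary = voice_summary(excited)

# Both readings collapse to the same summary: the average keeps the
# overall voice character but drops how each phrase was delivered.
print(calm_summary)
print(excited_summary)
```

In this toy setup the two summaries come out identical, so a short clip is enough to pin down the averaged voice character, while the phrase-by-phrase emotion would have to come from somewhere else.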