10 seconds for everything, seriously?

#18

by Markobes - opened Apr 20

Apr 20

I have a 2-hour movie audio track in which different people speak with different emotions, plus subtitles in another language. How can a model learn from just ONE 10-second sample which actor said a particular phrase and how they said it, without inventing its own reading?
I need it to listen to the entire track and understand how to read the line in EVERY place.

jattoedaltni

Apr 24

edited Apr 24

I believe it maps the qualities of the voice onto a skeleton. If a quality would need a longer stretch of audio to carry over to the synthetic version of the person you're replicating, it is probably not a quality that gets mapped.

But quite a bit can be learned from 10 seconds.
