posted an update 17 days ago
I'm working on talking head generation that takes audio and video as input. Can someone suggest a good existing architecture that generates videos with low latency, or can it be made to run in real time?

I think most existing OSS talking head architectures take only audio and an image as input. You can check out SadTalker, which works that way. As for streaming, you'll have to do that through an API over WebSocket; check out D-ID's streaming API.


Tried SadTalker, but it takes too much time. D-ID is proprietary, and I'm looking for something open source. I also tried Wav2Lip, and enhancing its output with GFPGAN; the output is good, but I want something fast.
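
To make "fast" concrete when comparing options, here is a rough sketch of how I'd measure it. The names `frame_budget_ms` and `realtime_factor` are my own helpers, not part of SadTalker, Wav2Lip, or GFPGAN, and `generate` is a placeholder for whatever call runs the pipeline end to end. The idea is that a pipeline keeps up with real time only if it spends less wall-clock time than the length of the audio it consumes, which at 25 fps works out to 40 ms per frame.

```python
import time
from typing import Callable


def frame_budget_ms(fps: float) -> float:
    """Time available per output frame, in milliseconds, at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def realtime_factor(generate: Callable[[], object], audio_seconds: float) -> float:
    """Wall-clock generation time divided by audio duration.

    `generate` is a hypothetical zero-argument callable that runs the whole
    talking head pipeline for one clip. A result below 1.0 means the pipeline
    is faster than real time; above 1.0 means it falls behind.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    start = time.perf_counter()
    generate()
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds
```

Wrapping the Wav2Lip inference call and the GFPGAN enhancement call separately in `realtime_factor` should show which stage is eating the budget.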