
This is a model card to link the paper https://arxiv.org/abs/2205.06421 to the HF Space demo.

Hi, everyone. It's been a while since we presented our demo at CVPR. I found that our demo reached 300+ Likes.
Thanks for liking our repo, even in the age of Diffusion and LLMs 🙂

In 2022, I received questions about our demo and paper through Hugging Face Spaces, YouTube, GitHub, and LinkedIn.
So, as a small gift, I decided to answer the most frequently asked questions.
I hope this helps with your journey in talking face generation and multilingual TTS research.

Is your demo made with Wav2Lip?

There were questions about whether or not we succeeded in training Wav2Lip with a dataset from a single person. Our demo did not start from the Wav2Lip code. We implemented our model from scratch with PyTorch Lightning.

Some people say that our training strategy is the same as Wav2Lip's, but this is incorrect. We applied the positive/negative sampling suggested in Wav2Lip, but we never used the SyncNet loss, which is the main contribution of Wav2Lip, in our training. Our paper contains the details, so please check it again.

Nonetheless, we have a lot of experience with the Wav2Lip code and paper. We also failed to train Wav2Lip with a dataset of a seen face. We suspect the following might be the reasons for the failure.

  1. The SyncNet loss adversely affects training with a seen-face dataset.
  2. The discriminator in Wav2Lip is too shallow compared to those of other models, so it only fits training with unseen-face datasets.

As a fan of the Wav2Lip research, I wish you good luck with your research.

Is it impossible to support other languages in Multilingual TTS?

That's impossible unless we train our model again. As mentioned in our paper, our demo made it possible to speak four languages with Korean speech data. To add a new language, we would have to collect utterance data in that language and use it for our baseline training. As stated in the paper, we collected a large amount of utterance data for the four languages.

There have been several "pull requests" for testing other languages (such as Arabic and Polish) with the Hugging Face Space. While this may be helpful for those who have trouble typing text in one of the four supported languages, it does not let the model synthesize utterances or generate lip movements in the new language.

Can I run your model on my local setup?

Currently, this demo is running on an AWS EC2 instance operated by MINDsLab in Korea. The Hugging Face Space sends a RESTful request to that instance, and the server sends the generated video back to the Space.

MINDsLab is a Korean startup, and its model code is closely tied to its revenue. For this reason, the executives did not allow the code to be made public.

Even if you clone or download the demo code, you won't find the model code you might expect. However, if you want to build a Gradio demo with video input/output, I expect it would be a good reference. Again, I hope you understand.
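
For reference, here is a minimal sketch of what such a Space-side setup could look like: a Gradio demo that forwards the user's text to a remote inference server over a RESTful request and returns the generated video. The endpoint URL, payload fields, and response format below are hypothetical placeholders, not the actual MINDsLab API.

```python
# Minimal sketch: Gradio demo with video output that forwards requests
# to a remote inference server.
# NOTE: the endpoint URL, payload fields, and response format are
# hypothetical placeholders -- they are NOT the actual MINDsLab API.
import tempfile

import gradio as gr
import requests

API_URL = "https://example.com/api/talking-face"  # hypothetical endpoint


def generate(text: str, language: str) -> str:
    """Send the text to the (hypothetical) backend and save the returned video."""
    response = requests.post(
        API_URL,
        json={"text": text, "lang": language},
        timeout=120,
    )
    response.raise_for_status()

    # Assume the server returns raw MP4 bytes; write them to a temp file
    # so Gradio's Video component can serve it.
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as f:
        f.write(response.content)
        return f.name


demo = gr.Interface(
    fn=generate,
    inputs=[
        gr.Textbox(label="Input text"),
        # The four languages supported in our demo.
        gr.Radio(["Korean", "English", "Japanese", "Chinese"], label="Language"),
    ],
    outputs=gr.Video(label="Generated video"),
)

if __name__ == "__main__":
    demo.launch()
```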

Would you like to do another project together?

As mentioned above, I was at MINDsLab, a company in South Korea. I have since left the company and recently joined another company to research lightweight models running on edge devices.

However, I plan to start a new side project related to talking face generation. It involves re-interpreting the talking face generation task as an audio-conditioned VideoINR to create a lightweight model that can support training with seen faces. I am always open to discussion about talking face generation and other related tasks 🙂

Lastly, to briefly introduce myself, I worked at MINDsLab for three years as alternative military service. I will receive my bachelor's degree next month (Feb. 2023).

I hope this article helps you in any way.
