Commit a893898
Parent(s): af33de2

Update README.md

README.md CHANGED
@@ -34,15 +34,16 @@ We have set up **the world simulator vision since March 2023, believing diffusion
 
 We will soon release `MuseTalk`, a real-time, high-quality lip-sync model that can be applied together with MuseV as a complete virtual human generation solution. Please stay tuned!
 
-#
+# Overview
 `MuseV` is a diffusion-based virtual human video generation framework, which
-1. supports **infinite length** generation using a novel **Parallel Denoising scheme**.
+1. supports **infinite-length** generation using a novel **Visual Conditioned Parallel Denoising scheme**.
 2. provides a checkpoint for virtual human video generation, trained on a human dataset.
 3. supports Image2Video, Text2Image2Video, and Video2Video.
 4. is compatible with the **Stable Diffusion ecosystem**, including `base_model`, `lora`, `controlnet`, etc.
 5. supports multi-reference-image technology, including `IPAdapter`, `ReferenceOnly`, `ReferenceNet`, `IPAdapterFaceID`.
 6. training code (coming very soon).
 
+
 # News
 - [03/27/2024] Release of the `MuseV` project and the trained models `musev`, `muse_referencenet`, `muse_referencenet_pose`.
 
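Feature 1 in the hunk above names a **Visual Conditioned Parallel Denoising scheme** as the mechanism behind infinite-length generation. The snippet below is only a minimal, hypothetical sketch of that general idea, not MuseV's actual code or API: a long latent sequence is split into overlapping windows, every window is denoised at each step while conditioned on the same visual condition frame, and overlapping frames are averaged so neighbouring windows stay consistent, which is what keeps errors from accumulating as the video grows. All names and shapes here (`parallel_denoise`, `denoise_step`, `WINDOW`, `OVERLAP`) are illustrative assumptions.

```python
# Minimal, hypothetical sketch of visual-conditioned parallel denoising over
# overlapping windows. Names and shapes are illustrative, NOT MuseV's real API.
import torch

WINDOW, OVERLAP, STEPS = 16, 4, 25   # frames per window, shared frames, denoising steps


def denoise_step(x: torch.Tensor, cond: torch.Tensor, t: int) -> torch.Tensor:
    """Stand-in for one UNet denoising step conditioned on a visual-frame latent."""
    # A real model would predict and remove noise given (x, cond, t); here we just
    # nudge x toward the condition latent so the sketch stays runnable.
    return x - 0.05 * (x - cond.expand_as(x))


def parallel_denoise(num_frames: int, cond_latent: torch.Tensor) -> torch.Tensor:
    """Denoise a latent 'video' of num_frames frames, window by window."""
    x = torch.randn(num_frames, *cond_latent.shape)             # start from pure noise
    starts = list(range(0, max(num_frames - OVERLAP, 1), WINDOW - OVERLAP))
    for t in range(STEPS):
        out = torch.zeros_like(x)
        weight = torch.zeros(num_frames)
        # Every window sees the same visual condition latent, so later windows do not
        # inherit accumulated error from earlier ones; the windows are independent
        # and could be batched into one forward pass.
        for s in starts:
            e = min(s + WINDOW, num_frames)
            out[s:e] += denoise_step(x[s:e], cond_latent, t)
            weight[s:e] += 1.0
        # Average the overlapping frames so adjacent windows agree with each other.
        x = out / weight.view(-1, 1, 1, 1)
    return x


if __name__ == "__main__":
    cond = torch.randn(4, 8, 8)                # latent of the visual condition frame
    video = parallel_denoise(64, cond)         # any length works in principle
    print(video.shape)                         # torch.Size([64, 4, 8, 8])
```

In a real pipeline the `denoise_step` stand-in would be a video UNet call plus a scheduler update; only the windowing, shared conditioning, and overlap blending are the point of this sketch.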
@@ -281,7 +282,19 @@ In the Duffy case, the pose of the vision condition frame is not aligned with that of the first frame of the control video
 <td>video</td>
 <td>prompt</td>
 </tr>
-
+
+<tr>
+<td>
+<img src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/fX1ND0YqDp1LV0LEh2eFN.png" width="200">
+<img src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/pe2aQt5FU66tplNZCOZaB.png" width="200">
+</td>
+<td>
+<video width="900" src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/IMPIDjR7-w5A_xc6ZHIzT.mp4" controls preload></video>
+</td>
+<td>
+(masterpiece, best quality, highres:1)
+</td>
+</tr>
 
 <tr>
 <td>
@@ -339,7 +352,19 @@ please refer to [MuseV](https://github.com/TMElyralab/MuseV)
 
 # Acknowledgements
 
-MuseV
+1. MuseV has drawn heavily on [TuneAVideo](https://github.com/showlab/Tune-A-Video), [diffusers](https://github.com/huggingface/diffusers), [Moore-AnimateAnyone](https://github.com/MooreThreads/Moore-AnimateAnyone/tree/master/src/pipelines), [animatediff](https://github.com/guoyww/AnimateDiff), [IP-Adapter](https://github.com/tencent-ailab/IP-Adapter), [AnimateAnyone](https://arxiv.org/abs/2311.17117), and [VideoFusion](https://arxiv.org/abs/2303.08320).
+2. MuseV has been built on the `ucf101` and `webvid` datasets.
+
+Thanks for open-sourcing!
+
+# Limitation
+There are still many limitations, including:
+
+1. Limited types of video generation and limited motion range, partly because of the limited types of training data. The released `MuseV` has been trained on approximately 60K human text-video pairs at `512*320` resolution. At lower resolution, `MuseV` has a greater motion range but lower video quality; it tends to generate a smaller motion range when video quality is high. Training on a larger, higher-resolution, higher-quality text-video dataset may make `MuseV` better.
+1. Watermarks may appear because of `webvid`. A cleaner dataset without watermarks may solve this issue.
+1. Limited types of long video generation. Visual Conditioned Parallel Denoising can solve the accumulated error of video generation, but the current method is only suitable for relatively fixed camera scenes.
+1. Undertrained ReferenceNet and IP-Adapter, because of limited time and resources.
+1. Understructured code. `MuseV` supports rich and dynamic features, but the code is complex and unrefactored; it takes time to become familiar with it.
 
 <!-- # Contribution: no need to organize community contributions for now -->
 