anchor committed on
Commit a893898 (1 parent: af33de2)

Update README.md

Files changed (1): README.md (+29, -4)
README.md CHANGED
@@ -34,15 +34,16 @@ We have set up **the world simulator vision** since March 2023, believing diffusion

We will soon release `MuseTalk`, a real-time, high-quality lip-sync model, which can be applied with MuseV as a complete virtual human generation solution. Please stay tuned!

- # Intro
+ # Overview
`MuseV` is a diffusion-based virtual human video generation framework, which
- 1. supports **infinite length** generation using a novel **Parallel Denoising scheme**.
+ 1. supports **infinite-length** generation using a novel **Visual Conditioned Parallel Denoising scheme**.
2. has a checkpoint available for virtual human video generation, trained on a human dataset.
3. supports Image2Video, Text2Image2Video, and Video2Video.
4. is compatible with the **Stable Diffusion ecosystem**, including `base_model`, `lora`, `controlnet`, etc.
5. supports multi-reference-image technology, including `IPAdapter`, `ReferenceOnly`, `ReferenceNet`, `IPAdapterFaceID`.
6. training code (coming very soon).
+
# News
- [03/27/2024] released the `MuseV` project and trained models `musev`, `muse_referencenet`, `muse_referencenet_pose`.
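Item 4 above refers to the Stable Diffusion ecosystem. As a rough illustration of how a `base_model`, `lora`, and `controlnet` are typically combined in that ecosystem, here is a minimal, generic `diffusers` sketch; it is not MuseV's actual loading code, and the model IDs and LoRA path are placeholders.

```python
# Generic diffusers-style sketch (not MuseV's actual loading code);
# the model IDs and LoRA path below are placeholders.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # base_model
    controlnet=controlnet,              # controlnet
    torch_dtype=torch.float16,
)
pipe.load_lora_weights("path/to/your_lora.safetensors")  # lora
pipe.to("cuda")
```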
 
@@ -281,7 +282,19 @@ In the Duffy case, the pose of the vision condition frame is not aligned with that of the first frame
<td>video</td>
<td>prompt</td>
</tr>
-
+
+ <tr>
+ <td>
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/fX1ND0YqDp1LV0LEh2eFN.png" width="200">
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/pe2aQt5FU66tplNZCOZaB.png" width="200">
+ </td>
+ <td>
+ <video width="900" src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/IMPIDjR7-w5A_xc6ZHIzT.mp4" controls preload></video>
+ </td>
+ <td>
+ (masterpiece, best quality, highres:1)
+ </td>
+ </tr>

<tr>
<td>
@@ -339,7 +352,19 @@ please refer to [MuseV](https://github.com/TMElyralab/MuseV)

# Acknowledgements

- MuseV builds on `TuneAVideo`, `diffusers`. Thanks for open-sourcing!
+ 1. MuseV has referred much to [TuneAVideo](https://github.com/showlab/Tune-A-Video), [diffusers](https://github.com/huggingface/diffusers), [Moore-AnimateAnyone](https://github.com/MooreThreads/Moore-AnimateAnyone/tree/master/src/pipelines), [animatediff](https://github.com/guoyww/AnimateDiff), [IP-Adapter](https://github.com/tencent-ailab/IP-Adapter), [AnimateAnyone](https://arxiv.org/abs/2311.17117), [VideoFusion](https://arxiv.org/abs/2303.08320).
+ 2. MuseV has been built on the `ucf101` and `webvid` datasets.
+
+ Thanks for open-sourcing!
+
+ # Limitation
+ There are still many limitations, including:
+
+ 1. Limited types of video generation and limited motion range, partly because of the limited variety of the training data. The released `MuseV` was trained on approximately 60K human text-video pairs at `512*320` resolution. `MuseV` achieves a greater motion range but lower video quality at lower resolutions, and tends to generate a smaller motion range with higher video quality. Training on a larger, higher-resolution, higher-quality text-video dataset may make `MuseV` better.
+ 1. Watermarks may appear because of `webvid`. A cleaner dataset without watermarks may solve this issue.
+ 1. Limited types of long video generation. Visual Conditioned Parallel Denoising can solve the accumulated error of video generation, but the current method is only suitable for relatively fixed camera scenes.
+ 1. Undertrained ReferenceNet and IP-Adapter, because of limited time and resources.
+ 1. Under-structured code. `MuseV` supports rich and dynamic features, but the code is complex and unrefactored; it takes time to become familiar with it.

<!-- # Contribution (no need to organize open-source co-development for now) -->
 
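To make the long-video limitation above concrete: the idea behind Visual Conditioned Parallel Denoising is to denoise overlapping windows of frames under a shared vision-condition frame and fuse the overlaps, rather than generating segments autoregressively, so per-segment errors do not accumulate. The following is a minimal sketch of that idea under stated assumptions; `denoise_window` and every other name here are hypothetical, not the MuseV implementation.

```python
import torch

def parallel_denoise(latents, cond_frame, denoise_window,
                     window=12, overlap=4, steps=30):
    """Hypothetical sketch: visual-conditioned parallel denoising.

    latents:        (num_frames, C, H, W) Gaussian noise to be denoised.
    cond_frame:     (C, H, W) latent of the shared vision-condition frame.
    denoise_window: assumed callable running one reverse-diffusion step
                    on a window of frames, conditioned on cond_frame.
    """
    n = latents.shape[0]
    stride = window - overlap
    starts = list(range(0, max(n - overlap, 1), stride))
    for t in reversed(range(steps)):          # one shared noise schedule
        acc = torch.zeros_like(latents)
        cnt = torch.zeros(n, 1, 1, 1, device=latents.device)
        for s in starts:                      # windows are independent at each
            e = min(s + window, n)            # step, so they can run in parallel
            acc[s:e] += denoise_window(latents[s:e], cond_frame, t)
            cnt[s:e] += 1
        latents = acc / cnt                   # average overlapping predictions
    return latents
```

Averaging the overlapping windows at every step keeps neighboring segments consistent, and conditioning every window on the same frame is also consistent with why the scheme favors relatively fixed camera scenes.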