Commit a893898
Parent(s): af33de2

Update README.md

README.md CHANGED
@@ -34,15 +34,16 @@ We have set up **the world simulator vision since March 2023, believing diffusion
 
 We will soon release `MuseTalk`, a real-time, high-quality lip-sync model that can be applied together with MuseV as a complete virtual human generation solution. Please stay tuned!
 
-#
+# Overview
 `MuseV` is a diffusion-based virtual human video generation framework, which
-1. supports **infinite length** generation using a novel **Parallel Denoising scheme**.
+1. supports **infinite-length** generation using a novel **Visual Conditioned Parallel Denoising scheme**.
 2. provides a checkpoint for virtual human video generation, trained on a human dataset.
 3. supports Image2Video, Text2Image2Video, and Video2Video.
 4. is compatible with the **Stable Diffusion ecosystem**, including `base_model`, `lora`, `controlnet`, etc.
 5. supports multi-reference-image technology, including `IPAdapter`, `ReferenceOnly`, `ReferenceNet`, `IPAdapterFaceID`.
 6. training code (coming very soon).
 
+
 # News
 - [03/27/2024] Release of the `MuseV` project and the trained models `musev`, `muse_referencenet`, `muse_referencenet_pose`.
 
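Feature 1 in the hunk above names a **Visual Conditioned Parallel Denoising scheme** as the mechanism behind infinite-length generation. The snippet below is only a minimal, hypothetical sketch of that general idea, not MuseV's actual code or API: a long latent sequence is split into overlapping windows, every window is denoised at each step while conditioned on the same visual condition frame, and overlapping frames are averaged so neighbouring windows stay consistent, which is what keeps errors from accumulating as the video grows. All names and shapes here (`parallel_denoise`, `denoise_step`, `WINDOW`, `OVERLAP`) are illustrative assumptions.

```python
# Minimal, hypothetical sketch of visual-conditioned parallel denoising over
# overlapping windows. Names and shapes are illustrative, NOT MuseV's real API.
import torch

WINDOW, OVERLAP, STEPS = 16, 4, 25   # frames per window, shared frames, denoising steps


def denoise_step(x: torch.Tensor, cond: torch.Tensor, t: int) -> torch.Tensor:
    """Stand-in for one UNet denoising step conditioned on a visual-frame latent."""
    # A real model would predict and remove noise given (x, cond, t); here we just
    # nudge x toward the condition latent so the sketch stays runnable.
    return x - 0.05 * (x - cond.expand_as(x))


def parallel_denoise(num_frames: int, cond_latent: torch.Tensor) -> torch.Tensor:
    """Denoise a latent 'video' of num_frames frames, window by window."""
    x = torch.randn(num_frames, *cond_latent.shape)             # start from pure noise
    starts = list(range(0, max(num_frames - OVERLAP, 1), WINDOW - OVERLAP))
    for t in range(STEPS):
        out = torch.zeros_like(x)
        weight = torch.zeros(num_frames)
        # Every window sees the same visual condition latent, so later windows do not
        # inherit accumulated error from earlier ones; the windows are independent
        # and could be batched into one forward pass.
        for s in starts:
            e = min(s + WINDOW, num_frames)
            out[s:e] += denoise_step(x[s:e], cond_latent, t)
            weight[s:e] += 1.0
        # Average the overlapping frames so adjacent windows agree with each other.
        x = out / weight.view(-1, 1, 1, 1)
    return x


if __name__ == "__main__":
    cond = torch.randn(4, 8, 8)                # latent of the visual condition frame
    video = parallel_denoise(64, cond)         # any length works in principle
    print(video.shape)                         # torch.Size([64, 4, 8, 8])
```

In a real pipeline the `denoise_step` stand-in would be a video UNet call plus a scheduler update; only the windowing, shared conditioning, and overlap blending are the point of this sketch.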
@@ -281,7 +282,19 @@ In the Duffy case, the pose of the vision condition frame is not aligned with that of the first frame of the control video
 <td>video</td>
 <td>prompt</td>
 </tr>
-
+
+<tr>
+<td>
+<img src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/fX1ND0YqDp1LV0LEh2eFN.png" width="200">
+<img src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/pe2aQt5FU66tplNZCOZaB.png" width="200">
+</td>
+<td>
+<video width="900" src="https://cdn-uploads.huggingface.co/production/uploads/65f9352ed760cfdf5eb80e16/IMPIDjR7-w5A_xc6ZHIzT.mp4" controls preload></video>
+</td>
+<td>
+(masterpiece, best quality, highres:1)
+</td>
+</tr>
 
 <tr>
 <td>
@@ -339,7 +352,19 @@ please refer to [MuseV](https://github.com/TMElyralab/MuseV)
 
 # Acknowledgements
 
-MuseV
+1. MuseV has drawn heavily on [TuneAVideo](https://github.com/showlab/Tune-A-Video), [diffusers](https://github.com/huggingface/diffusers), [Moore-AnimateAnyone](https://github.com/MooreThreads/Moore-AnimateAnyone/tree/master/src/pipelines), [animatediff](https://github.com/guoyww/AnimateDiff), [IP-Adapter](https://github.com/tencent-ailab/IP-Adapter), [AnimateAnyone](https://arxiv.org/abs/2311.17117), and [VideoFusion](https://arxiv.org/abs/2303.08320).
+2. MuseV has been built on the `ucf101` and `webvid` datasets.
+
+Thanks for open-sourcing!
+
+# Limitation
+There are still many limitations, including:
+
+1. Limited types of video generation and limited motion range, partly because of the limited types of training data. The released `MuseV` has been trained on approximately 60K human text-video pairs at `512*320` resolution. At lower resolution, `MuseV` has a greater motion range but lower video quality; it tends to generate a smaller motion range when video quality is high. Training on a larger, higher-resolution, higher-quality text-video dataset may make `MuseV` better.
+1. Watermarks may appear because of `webvid`. A cleaner dataset without watermarks may solve this issue.
+1. Limited types of long video generation. Visual Conditioned Parallel Denoising can solve the accumulated error of video generation, but the current method is only suitable for relatively fixed camera scenes.
+1. Undertrained ReferenceNet and IP-Adapter, because of limited time and resources.
+1. Understructured code. `MuseV` supports rich and dynamic features, but the code is complex and unrefactored; it takes time to become familiar with it.
 
 <!-- # Contribution: no need to organize community contributions for now -->
 