kunhaokhliu committed
Commit 5d2a97a · 1 parent: 21a626f

Add application file

This view is limited to 50 files because the commit contains too many changes.
.gitignore ADDED
@@ -0,0 +1,8 @@
1
+ __pycache__
2
+ *.egg-info
3
+ .cache
4
+
5
+ wan_models
6
+ checkpoints
7
+ videos
8
+ logs
LICENSE ADDED
@@ -0,0 +1,81 @@
1
+ Tencent is pleased to support the community by making RollingForcing available.
2
+
3
+ Copyright (C) 2025 Tencent. All rights reserved.
4
+
5
+ The open-source software and/or models included in this distribution may have been modified by Tencent (“Tencent Modifications”). All Tencent Modifications are Copyright (C) Tencent.
6
+
7
+ RollingForcing is licensed under the License Terms of RollingForcing, except for the third-party components listed below, which remain licensed under their respective original terms. RollingForcing does not impose any additional restrictions beyond those specified in the original licenses of these third-party components. Users are required to comply with all applicable terms and conditions of the original licenses and to ensure that the use of these third-party components conforms to all relevant laws and regulations.
8
+
9
+ For the avoidance of doubt, RollingForcing refers solely to training code, inference code, parameters, and weights made publicly available by Tencent in accordance with the License Terms of RollingForcing.
10
+
11
+ Terms of the License Terms of RollingForcing:
12
+ --------------------------------------------------------------------
13
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sublicense copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
14
+
15
+ - You agree to use RollingForcing only for academic purposes, and refrain from using it for any commercial or production purposes under any circumstances.
16
+
17
+ - The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
18
+
19
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
20
+
21
+
22
+
23
+ Dependencies and Licenses:
24
+
25
+ This open-source project, RollingForcing, builds upon the following open-source models and/or software components, each of which remains licensed under its original license. Certain models or software may include modifications made by Tencent (“Tencent Modifications”), which are Copyright (C) Tencent.
26
+
27
+ In case you believe there have been errors in the attribution below, you may submit the concerns to us for review and correction.
28
+
29
+ Open Source Model Licensed under the Apache-2.0:
30
+ --------------------------------------------------------------------
31
+ 1. Wan-AI/Wan2.1-T2V-1.3B
32
+ Copyright (c) 2025 Wan Team
33
+
34
+ Terms of the Apache-2.0:
35
+ --------------------------------------------------------------------
36
+ Apache License
37
+ Version 2.0, January 2004
38
+ http://www.apache.org/licenses/
39
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
40
+
41
+ Definitions.
42
+
43
+ "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
44
+
45
+ "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
46
+
47
+ "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
48
+
49
+ "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
50
+
51
+ "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
52
+
53
+ "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
54
+
55
+ "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
56
+
57
+ "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
58
+
59
+ "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
60
+
61
+ "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
62
+
63
+ Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
64
+
65
+ Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
66
+
67
+ Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
68
+
69
+ (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
70
+
71
+ Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
72
+
73
+ Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
74
+
75
+ Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
76
+
77
+ Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
78
+
79
+ Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
80
+
81
+ END OF TERMS AND CONDITIONS
README.md CHANGED
@@ -8,7 +8,7 @@ sdk_version: 5.49.1
8
  app_file: app.py
9
  pinned: false
10
  license: other
11
- short_description: 'Rolling Forcing: Autoregressive Long Video Diffusion in Real'
12
  ---
13
 
14
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
8
  app_file: app.py
9
  pinned: false
10
  license: other
11
+ short_description: 'Rolling Forcing: Autoregressive Long Video Diffusion in Real Time'
12
  ---
13
 
14
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
app.py ADDED
@@ -0,0 +1,187 @@
1
+ import os
2
+ import argparse
3
+ import time
4
+ from typing import Optional
5
+
6
+ import torch
7
+ from torchvision.io import write_video
8
+ from omegaconf import OmegaConf
9
+ from einops import rearrange
10
+ import gradio as gr
11
+
12
+ from pipeline import CausalInferencePipeline
13
+ from huggingface_hub import snapshot_download, hf_hub_download
14
+
15
+
16
+ # -----------------------------
17
+ # Globals (loaded once per process)
18
+ # -----------------------------
19
+ _PIPELINE: Optional[torch.nn.Module] = None
20
+ _DEVICE: Optional[torch.device] = None
21
+
22
+
23
+ def _ensure_gpu():
24
+ if not torch.cuda.is_available():
25
+ raise gr.Error("CUDA GPU is required to run this demo. Please run on a machine with an NVIDIA GPU.")
26
+ # Bind to GPU:0 by default
27
+ torch.cuda.set_device(0)
28
+
29
+
30
+ def _load_pipeline(config_path: str, checkpoint_path: Optional[str], use_ema: bool) -> torch.nn.Module:
31
+ global _PIPELINE, _DEVICE
32
+ if _PIPELINE is not None:
33
+ return _PIPELINE
34
+
35
+ _ensure_gpu()
36
+ _DEVICE = torch.device("cuda:0")
37
+
38
+ # Load and merge configs
39
+ config = OmegaConf.load(config_path)
40
+ default_config = OmegaConf.load("configs/default_config.yaml")
41
+ config = OmegaConf.merge(default_config, config)
42
+
43
+ # Choose pipeline type based on config
44
+ pipeline = CausalInferencePipeline(config, device=_DEVICE)
45
+
46
+
47
+ # Load checkpoint if provided
48
+ if checkpoint_path and os.path.exists(checkpoint_path):
49
+ state_dict = torch.load(checkpoint_path, map_location="cpu")
50
+ if use_ema and 'generator_ema' in state_dict:
51
+ state_dict_to_load = state_dict['generator_ema']
52
+ # Remove possible FSDP prefix
53
+ from collections import OrderedDict
54
+ new_state_dict = OrderedDict()
55
+ for k, v in state_dict_to_load.items():
56
+ new_state_dict[k.replace("_fsdp_wrapped_module.", "")] = v
57
+ state_dict_to_load = new_state_dict
58
+ else:
59
+ state_dict_to_load = state_dict.get('generator', state_dict)
60
+ pipeline.generator.load_state_dict(state_dict_to_load, strict=False)
61
+
62
+ # The codebase assumes bfloat16 on GPU
63
+ pipeline = pipeline.to(device=_DEVICE, dtype=torch.bfloat16)
64
+ pipeline.eval()
65
+
66
+ # Quick sanity check of the Wan model path so missing weights give a friendly error
67
+ wan_dir = os.path.join('wan_models', 'Wan2.1-T2V-1.3B')
68
+ if not os.path.isdir(wan_dir):
69
+ raise gr.Error(
70
+ "Wan2.1-T2V-1.3B not found at 'wan_models/Wan2.1-T2V-1.3B'.\n"
71
+ "Please download it first, e.g.:\n"
72
+ "huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir-use-symlinks False --local-dir wan_models/Wan2.1-T2V-1.3B"
73
+ )
74
+
75
+ _PIPELINE = pipeline
76
+ return _PIPELINE
77
+
78
+
79
+ def build_predict(config_path: str, checkpoint_path: Optional[str], output_dir: str, use_ema: bool):
80
+ os.makedirs(output_dir, exist_ok=True)
81
+
82
+ def predict(prompt: str, num_frames: int) -> str:
83
+ if not prompt or not prompt.strip():
84
+ raise gr.Error("Please enter a non-empty text prompt.")
85
+
86
+ num_frames = int(num_frames)
87
+ if num_frames % 3 != 0 or not (21 <= num_frames <= 252):
88
+ raise gr.Error("Number of frames must be a multiple of 3 between 21 and 252.")
89
+
90
+ pipeline = _load_pipeline(config_path, checkpoint_path, use_ema)
91
+
92
+ # Prepare inputs
93
+ prompts = [prompt.strip()]
94
+ noise = torch.randn([1, num_frames, 16, 60, 104], device=_DEVICE, dtype=torch.bfloat16)
95
+
96
+ torch.set_grad_enabled(False)
97
+ with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
98
+ video = pipeline.inference_rolling_forcing(
99
+ noise=noise,
100
+ text_prompts=prompts,
101
+ return_latents=False,
102
+ initial_latent=None,
103
+ )
104
+
105
+ # video: [B=1, T, C, H, W] in [0,1]
106
+ video = rearrange(video, 'b t c h w -> b t h w c')[0]
107
+ video_uint8 = (video * 255.0).clamp(0, 255).to(torch.uint8).cpu()
108
+
109
+ # Save to a unique filepath
110
+ safe_stub = prompt[:60].replace(' ', '_').replace('/', '_')
111
+ ts = int(time.time())
112
+ filepath = os.path.join(output_dir, f"{safe_stub or 'video'}_{ts}.mp4")
113
+ write_video(filepath, video_uint8, fps=16)
114
+ print(f"Saved generated video to {filepath}")
115
+
116
+ return filepath
117
+
118
+ return predict
119
+
120
+
121
+ def main():
122
+ parser = argparse.ArgumentParser()
123
+ parser.add_argument('--config_path', type=str, default='configs/rolling_forcing_dmd.yaml',
124
+ help='Path to the model config')
125
+ parser.add_argument('--checkpoint_path', type=str, default='checkpoints/rolling_forcing_dmd.pt',
126
+ help='Path to rolling forcing checkpoint (.pt). If missing, will run with base weights only if available.')
127
+ parser.add_argument('--output_dir', type=str, default='videos/gradio', help='Where to save generated videos')
128
+ parser.add_argument('--no_ema', action='store_true', help='Disable EMA weights when loading checkpoint')
129
+ args = parser.parse_args()
130
+
131
+
132
+ # Download checkpoint from HuggingFace if not present
133
+ # 1️⃣ Equivalent to:
134
+ # huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir wan_models/Wan2.1-T2V-1.3B
135
+ wan_model_dir = snapshot_download(
136
+ repo_id="Wan-AI/Wan2.1-T2V-1.3B",
137
+ local_dir="wan_models/Wan2.1-T2V-1.3B",
138
+ local_dir_use_symlinks=False, # same as --local-dir-use-symlinks False
139
+ )
140
+ print("Wan model downloaded to:", wan_model_dir)
141
+
142
+ # 2️⃣ Equivalent to:
143
+ # huggingface-cli download TencentARC/RollingForcing checkpoints/rolling_forcing_dmd.pt --local-dir .
144
+ rolling_ckpt_path = hf_hub_download(
145
+ repo_id="TencentARC/RollingForcing",
146
+ filename="checkpoints/rolling_forcing_dmd.pt",
147
+ local_dir=".", # where to store it
148
+ local_dir_use_symlinks=False,
149
+ )
150
+ print("RollingForcing checkpoint downloaded to:", rolling_ckpt_path)
151
+
152
+ predict = build_predict(
153
+ config_path=args.config_path,
154
+ checkpoint_path=args.checkpoint_path,
155
+ output_dir=args.output_dir,
156
+ use_ema=not args.no_ema,
157
+ )
158
+
159
+ demo = gr.Interface(
160
+ fn=predict,
161
+ inputs=[
162
+ gr.Textbox(label="Text Prompt", lines=2, placeholder="A cinematic shot of a girl dancing in the sunset."),
163
+ gr.Slider(label="Number of Latent Frames", minimum=21, maximum=252, step=3, value=21),
164
+ ],
165
+ outputs=gr.Video(label="Generated Video", format="mp4"),
166
+ title="Rolling Forcing: Autoregressive Long Video Diffusion in Real Time",
167
+ description=(
168
+ "Enter a prompt and generate a video using the Rolling Forcing pipeline.\n"
169
+ "**Note:** although Rolling Forcing generates videos autoregressivelty, current Gradio demo does not support streaming outputs, so the entire video will be generated before it is displayed.\n"
170
+ "\n"
171
+ "If you find this demo useful, please consider giving it a ⭐ star on [GitHub](https://github.com/TencentARC/RollingForcing)--your support is crucial for sustaining this open-source project. "
172
+ "You can also dive deeper by reading the [paper](https://arxiv.org/abs/2509.25161) or exploring the [project page](https://kunhao-liu.github.io/Rolling_Forcing_Webpage) for more details."
173
+ ),
174
+ allow_flagging='never',
175
+ )
176
+
177
+ try:
178
+ # Gradio <= 3.x
179
+ demo.queue(concurrency_count=1, max_size=2)
180
+ except TypeError:
181
+ # Gradio >= 4.x
182
+ demo.queue(max_size=2)
183
+ demo.launch(show_error=True)
184
+
185
+
186
+ if __name__ == "__main__":
187
+ main()
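
For quick local testing, the `predict` closure returned by `build_predict` can also be driven without the Gradio UI. A minimal sketch, assuming the Wan weights and the Rolling Forcing checkpoint are already at the default paths used in `main()` above (this snippet is not part of the committed file):

```python
# Minimal sketch: call the demo's predict function directly
# (assumes a CUDA GPU and that wan_models/ and checkpoints/ are populated as above).
from app import build_predict

predict = build_predict(
    config_path="configs/rolling_forcing_dmd.yaml",
    checkpoint_path="checkpoints/rolling_forcing_dmd.pt",
    output_dir="videos/gradio",
    use_ema=True,
)
# 21 latent frames is the slider default; the value must be a multiple of 3 in [21, 252].
video_path = predict("A cinematic shot of a girl dancing in the sunset.", 21)
print(video_path)
```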
configs/default_config.yaml ADDED
@@ -0,0 +1,20 @@
1
+ independent_first_frame: false
2
+ warp_denoising_step: false
3
+ weight_decay: 0.01
4
+ same_step_across_blocks: true
5
+ discriminator_lr_multiplier: 1.0
6
+ last_step_only: false
7
+ i2v: false
8
+ num_training_frames: 27
9
+ gc_interval: 100
10
+ context_noise: 0
11
+ causal: true
12
+
13
+ ckpt_step: 0
14
+ prompt_name: MovieGenVideoBench
15
+ prompt_path: prompts/MovieGenVideoBench.txt
16
+ eval_first_n: 64
17
+ num_samples: 1
18
+ height: 480
19
+ width: 832
20
+ num_frames: 81
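
The sizes above are consistent with the latent shapes used elsewhere in this commit (e.g. the `[*, 21, 16, 60, 104]` noise tensors in app.py and the config below). A small sanity sketch of that bookkeeping, assuming the usual Wan2.1 VAE compression factors of 8x in space and 4x in time (an assumption inferred from these shapes, not stated in this diff):

```python
# Hypothetical shape check; the VAE compression factors are assumed, not taken from this diff.
latent_frames, latent_h, latent_w = 21, 60, 104
assert (latent_frames - 1) * 4 + 1 == 81             # num_frames in this config
assert latent_h * 8 == 480 and latent_w * 8 == 832   # height / width in this config
```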
configs/rolling_forcing_dmd.yaml ADDED
@@ -0,0 +1,48 @@
1
+ generator_ckpt: checkpoints/ode_init.pt
2
+ generator_fsdp_wrap_strategy: size
3
+ real_score_fsdp_wrap_strategy: size
4
+ fake_score_fsdp_wrap_strategy: size
5
+ real_name: Wan2.1-T2V-14B
6
+ text_encoder_fsdp_wrap_strategy: size
7
+ denoising_step_list:
8
+ - 1000
9
+ - 800
10
+ - 600
11
+ - 400
12
+ - 200
13
+ warp_denoising_step: true # remove the "- 0" entry from denoising_step_list when warp_denoising_step is true
14
+ ts_schedule: false
15
+ num_train_timestep: 1000
16
+ timestep_shift: 5.0
17
+ guidance_scale: 3.0
18
+ denoising_loss_type: flow
19
+ mixed_precision: true
20
+ seed: 0
21
+ sharding_strategy: hybrid_full
22
+ lr: 1.5e-06
23
+ lr_critic: 4.0e-07
24
+ beta1: 0.0
25
+ beta2: 0.999
26
+ beta1_critic: 0.0
27
+ beta2_critic: 0.999
28
+ data_path: prompts/vidprom_filtered_extended.txt
29
+ batch_size: 1
30
+ ema_weight: 0.99
31
+ ema_start_step: 200
32
+ total_batch_size: 64
33
+ log_iters: 100
34
+ negative_prompt: '色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走'
35
+ dfake_gen_update_ratio: 5
36
+ image_or_video_shape:
37
+ - 1
38
+ - 21
39
+ - 16
40
+ - 60
41
+ - 104
42
+ distribution_loss: dmd
43
+ trainer: score_distillation
44
+ gradient_checkpointing: true
45
+ num_frame_per_block: 3
46
+ load_raw_video: false
47
+ model_kwargs:
48
+ timestep_shift: 5.0
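
Both app.py and inference.py consume this file by merging it over `configs/default_config.yaml`, so any key in the default config can be overridden here. A minimal sketch of that merge (mirroring the loading code in those scripts, not new functionality):

```python
# Sketch: how the two YAML files above are combined (same OmegaConf calls as app.py / inference.py).
from omegaconf import OmegaConf

default_config = OmegaConf.load("configs/default_config.yaml")
config = OmegaConf.merge(default_config, OmegaConf.load("configs/rolling_forcing_dmd.yaml"))
print(list(config.denoising_step_list))  # [1000, 800, 600, 400, 200]
print(config.num_frame_per_block)        # 3
```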
inference.py ADDED
@@ -0,0 +1,197 @@
1
+ import argparse
2
+ import torch
3
+ import os
4
+ from omegaconf import OmegaConf
5
+ from collections import OrderedDict
6
+ from tqdm import tqdm
7
+ from torchvision import transforms
8
+ from torchvision.io import write_video
9
+ from einops import rearrange
10
+ import torch.distributed as dist
11
+ import imageio
12
+ from torch.utils.data import DataLoader, SequentialSampler
13
+ from torch.utils.data.distributed import DistributedSampler
14
+
15
+ from pipeline import (
16
+ CausalDiffusionInferencePipeline,
17
+ CausalInferencePipeline
18
+ )
19
+ from utils.dataset import TextDataset, TextImagePairDataset
20
+ from utils.misc import set_seed
21
+
22
+ parser = argparse.ArgumentParser()
23
+ parser.add_argument("--config_path", type=str, help="Path to the config file")
24
+ parser.add_argument("--checkpoint_path", type=str, help="Path to the checkpoint folder")
25
+ parser.add_argument("--data_path", type=str, help="Path to the dataset")
26
+ parser.add_argument("--extended_prompt_path", type=str, help="Path to the extended prompt")
27
+ parser.add_argument("--output_folder", type=str, help="Output folder")
28
+ parser.add_argument("--num_output_frames", type=int, default=21,
29
+ help="Number of overlap frames between sliding windows")
30
+ parser.add_argument("--i2v", action="store_true", help="Whether to perform I2V (or T2V by default)")
31
+ parser.add_argument("--use_ema", action="store_true", help="Whether to use EMA parameters")
32
+ parser.add_argument("--seed", type=int, default=0, help="Random seed")
33
+ parser.add_argument("--num_samples", type=int, default=1, help="Number of samples to generate per prompt")
34
+ parser.add_argument("--save_with_index", action="store_true",
35
+ help="Whether to save the video using the index or prompt as the filename")
36
+ args = parser.parse_args()
37
+
38
+ # Initialize distributed inference
39
+ if "LOCAL_RANK" in os.environ:
40
+ dist.init_process_group(backend='nccl')
41
+ local_rank = int(os.environ["LOCAL_RANK"])
42
+ torch.cuda.set_device(local_rank)
43
+ device = torch.device(f"cuda:{local_rank}")
44
+ world_size = dist.get_world_size()
45
+ set_seed(args.seed + local_rank)
46
+ else:
47
+ device = torch.device("cuda")
48
+ local_rank = 0
49
+ world_size = 1
50
+ set_seed(args.seed)
51
+
52
+ torch.set_grad_enabled(False)
53
+
54
+ config = OmegaConf.load(args.config_path)
55
+ default_config = OmegaConf.load("configs/default_config.yaml")
56
+ config = OmegaConf.merge(default_config, config)
57
+
58
+ # Initialize pipeline
59
+ if hasattr(config, 'denoising_step_list'):
60
+ # Few-step inference
61
+ pipeline = CausalInferencePipeline(config, device=device)
62
+ else:
63
+ # Multi-step diffusion inference
64
+ pipeline = CausalDiffusionInferencePipeline(config, device=device)
65
+
66
+ if args.checkpoint_path:
67
+ state_dict = torch.load(args.checkpoint_path, map_location="cpu")
68
+ if args.use_ema:
69
+ state_dict_to_load = state_dict['generator_ema']
70
+ def remove_fsdp_prefix(state_dict):
71
+ new_state_dict = OrderedDict()
72
+ for key, value in state_dict.items():
73
+ if "_fsdp_wrapped_module." in key:
74
+ new_key = key.replace("_fsdp_wrapped_module.", "")
75
+ new_state_dict[new_key] = value
76
+ else:
77
+ new_state_dict[key] = value
78
+ return new_state_dict
79
+ state_dict_to_load = remove_fsdp_prefix(state_dict_to_load)
80
+ else:
81
+ state_dict_to_load = state_dict['generator']
82
+ pipeline.generator.load_state_dict(state_dict_to_load)
83
+
84
+ pipeline = pipeline.to(device=device, dtype=torch.bfloat16)
85
+
86
+ # Create dataset
87
+ if args.i2v:
88
+ assert not dist.is_initialized(), "I2V does not support distributed inference yet"
89
+ transform = transforms.Compose([
90
+ transforms.Resize((480, 832)),
91
+ transforms.ToTensor(),
92
+ transforms.Normalize([0.5], [0.5])
93
+ ])
94
+ dataset = TextImagePairDataset(args.data_path, transform=transform)
95
+ else:
96
+ dataset = TextDataset(prompt_path=args.data_path, extended_prompt_path=args.extended_prompt_path)
97
+ num_prompts = len(dataset)
98
+ print(f"Number of prompts: {num_prompts}")
99
+
100
+ if dist.is_initialized():
101
+ sampler = DistributedSampler(dataset, shuffle=False, drop_last=True)
102
+ else:
103
+ sampler = SequentialSampler(dataset)
104
+ dataloader = DataLoader(dataset, batch_size=1, sampler=sampler, num_workers=0, drop_last=False)
105
+
106
+ # Create output directory (only on main process to avoid race conditions)
107
+ if local_rank == 0:
108
+ os.makedirs(args.output_folder, exist_ok=True)
109
+
110
+ if dist.is_initialized():
111
+ dist.barrier()
112
+
113
+
114
+ def encode(self, videos: torch.Tensor) -> torch.Tensor:
115
+ device, dtype = videos[0].device, videos[0].dtype
116
+ scale = [self.mean.to(device=device, dtype=dtype),
117
+ 1.0 / self.std.to(device=device, dtype=dtype)]
118
+ output = [
119
+ self.model.encode(u.unsqueeze(0), scale).float().squeeze(0)
120
+ for u in videos
121
+ ]
122
+
123
+ output = torch.stack(output, dim=0)
124
+ return output
125
+
126
+
127
+ for i, batch_data in tqdm(enumerate(dataloader), disable=(local_rank != 0)):
128
+ idx = batch_data['idx'].item()
129
+
130
+ # For DataLoader batch_size=1, the batch_data is already a single item, but in a batch container
131
+ # Unpack the batch data for convenience
132
+ if isinstance(batch_data, dict):
133
+ batch = batch_data
134
+ elif isinstance(batch_data, list):
135
+ batch = batch_data[0] # First (and only) item in the batch
136
+
137
+ all_video = []
138
+ num_generated_frames = 0 # Number of generated (latent) frames
139
+
140
+ if args.i2v:
141
+ # For image-to-video, batch contains image and caption
142
+ prompt = batch['prompts'][0] # Get caption from batch
143
+ prompts = [prompt] * args.num_samples
144
+
145
+ # Process the image
146
+ image = batch['image'].squeeze(0).unsqueeze(0).unsqueeze(2).to(device=device, dtype=torch.bfloat16)
147
+
148
+ # Encode the input image as the first latent
149
+ initial_latent = pipeline.vae.encode_to_latent(image).to(device=device, dtype=torch.bfloat16)
150
+ initial_latent = initial_latent.repeat(args.num_samples, 1, 1, 1, 1)
151
+
152
+ sampled_noise = torch.randn(
153
+ [args.num_samples, args.num_output_frames - 1, 16, 60, 104], device=device, dtype=torch.bfloat16
154
+ )
155
+ else:
156
+ # For text-to-video, batch is just the text prompt
157
+ prompt = batch['prompts'][0]
158
+ extended_prompt = batch['extended_prompts'][0] if 'extended_prompts' in batch else None
159
+ if extended_prompt is not None:
160
+ prompts = [extended_prompt] * args.num_samples
161
+ else:
162
+ prompts = [prompt] * args.num_samples
163
+ initial_latent = None
164
+
165
+ sampled_noise = torch.randn(
166
+ [args.num_samples, args.num_output_frames, 16, 60, 104], device=device, dtype=torch.bfloat16
167
+ )
168
+
169
+ # Generate 81 frames
170
+ video, latents = pipeline.inference_rolling_forcing(
171
+ noise=sampled_noise,
172
+ text_prompts=prompts,
173
+ return_latents=True,
174
+ initial_latent=initial_latent,
175
+ )
176
+ current_video = rearrange(video, 'b t c h w -> b t h w c').cpu()
177
+ all_video.append(current_video)
178
+ num_generated_frames += latents.shape[1]
179
+
180
+ # Final output video
181
+ video = 255.0 * torch.cat(all_video, dim=1)
182
+
183
+ # Clear VAE cache
184
+ pipeline.vae.model.clear_cache()
185
+
186
+ # Save the video if the current prompt is not a dummy prompt
187
+ if idx < num_prompts:
188
+ model = "regular" if not args.use_ema else "ema"
189
+ for seed_idx in range(args.num_samples):
190
+ # All processes save their videos
191
+ if args.save_with_index:
192
+ output_path = os.path.join(args.output_folder, f'{idx}-{seed_idx}_{model}.mp4')
193
+ else:
194
+ output_path = os.path.join(args.output_folder, f'{prompt[:100]}-{seed_idx}.mp4')
195
+ write_video(output_path, video[seed_idx], fps=16)
196
+ # imageio.mimwrite(output_path, video[seed_idx], fps=16, quality=8, output_params=["-loglevel", "error"])
197
+
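
One hedged observation on the script above: the tensor passed to `write_video` is still floating point in `[0, 255]`, whereas app.py clamps and casts to `uint8` before saving. If the same explicit conversion were wanted here, a sketch would be (an optional hardening, not part of the committed script):

```python
# Optional sketch: mirror app.py's explicit uint8 conversion before write_video.
video_uint8 = video.clamp(0, 255).to(torch.uint8)
write_video(output_path, video_uint8[seed_idx], fps=16)
```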
model/__init__.py ADDED
@@ -0,0 +1,14 @@
1
+ from .diffusion import CausalDiffusion
2
+ from .causvid import CausVid
3
+ from .dmd import DMD
4
+ from .gan import GAN
5
+ from .sid import SiD
6
+ from .ode_regression import ODERegression
7
+ __all__ = [
8
+ "CausalDiffusion",
9
+ "CausVid",
10
+ "DMD",
11
+ "GAN",
12
+ "SiD",
13
+ "ODERegression"
14
+ ]
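
These re-exports let downstream code import the trainers from the package root, e.g. (a usage note, not part of the commit):

```python
# Usage sketch: the package root re-exports every class listed in __all__.
from model import DMD, CausalDiffusion  # equivalent to importing from model.dmd / model.diffusion
```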
model/base.py ADDED
@@ -0,0 +1,230 @@
1
+ from typing import Tuple
2
+ from einops import rearrange
3
+ from torch import nn
4
+ import torch.distributed as dist
5
+ import torch
6
+
7
+ from pipeline import RollingForcingTrainingPipeline
8
+ from utils.loss import get_denoising_loss
9
+ from utils.wan_wrapper import WanDiffusionWrapper, WanTextEncoder, WanVAEWrapper
10
+
11
+
12
+ class BaseModel(nn.Module):
13
+ def __init__(self, args, device):
14
+ super().__init__()
15
+ self._initialize_models(args, device)
16
+
17
+ self.device = device
18
+ self.args = args
19
+ self.dtype = torch.bfloat16 if args.mixed_precision else torch.float32
20
+ if hasattr(args, "denoising_step_list"):
21
+ self.denoising_step_list = torch.tensor(args.denoising_step_list, dtype=torch.long)
22
+ if args.warp_denoising_step:
23
+ timesteps = torch.cat((self.scheduler.timesteps.cpu(), torch.tensor([0], dtype=torch.float32)))
24
+ self.denoising_step_list = timesteps[1000 - self.denoising_step_list]
25
+
26
+ def _initialize_models(self, args, device):
27
+ self.real_model_name = getattr(args, "real_name", "Wan2.1-T2V-1.3B")
28
+ self.fake_model_name = getattr(args, "fake_name", "Wan2.1-T2V-1.3B")
29
+
30
+ self.generator = WanDiffusionWrapper(**getattr(args, "model_kwargs", {}), is_causal=True)
31
+ self.generator.model.requires_grad_(True)
32
+
33
+ self.real_score = WanDiffusionWrapper(model_name=self.real_model_name, is_causal=False)
34
+ self.real_score.model.requires_grad_(False)
35
+
36
+ self.fake_score = WanDiffusionWrapper(model_name=self.fake_model_name, is_causal=False)
37
+ self.fake_score.model.requires_grad_(True)
38
+
39
+ self.text_encoder = WanTextEncoder()
40
+ self.text_encoder.requires_grad_(False)
41
+
42
+ self.vae = WanVAEWrapper()
43
+ self.vae.requires_grad_(False)
44
+
45
+ self.scheduler = self.generator.get_scheduler()
46
+ self.scheduler.timesteps = self.scheduler.timesteps.to(device)
47
+
48
+ def _get_timestep(
49
+ self,
50
+ min_timestep: int,
51
+ max_timestep: int,
52
+ batch_size: int,
53
+ num_frame: int,
54
+ num_frame_per_block: int,
55
+ uniform_timestep: bool = False
56
+ ) -> torch.Tensor:
57
+ """
58
+ Randomly generate a timestep tensor based on the generator's task type. It uniformly samples a timestep
59
+ from the range [min_timestep, max_timestep], and returns a tensor of shape [batch_size, num_frame].
60
+ - If uniform_timestep, it will use the same timestep for all frames.
61
+ - If not uniform_timestep, it will use a different timestep for each block.
62
+ """
63
+ if uniform_timestep:
64
+ timestep = torch.randint(
65
+ min_timestep,
66
+ max_timestep,
67
+ [batch_size, 1],
68
+ device=self.device,
69
+ dtype=torch.long
70
+ ).repeat(1, num_frame)
71
+ return timestep
72
+ else:
73
+ timestep = torch.randint(
74
+ min_timestep,
75
+ max_timestep,
76
+ [batch_size, num_frame],
77
+ device=self.device,
78
+ dtype=torch.long
79
+ )
80
+ # make the noise level the same within every block
81
+ if self.independent_first_frame:
82
+ # the first frame is always kept the same
83
+ timestep_from_second = timestep[:, 1:]
84
+ timestep_from_second = timestep_from_second.reshape(
85
+ timestep_from_second.shape[0], -1, num_frame_per_block)
86
+ timestep_from_second[:, :, 1:] = timestep_from_second[:, :, 0:1]
87
+ timestep_from_second = timestep_from_second.reshape(
88
+ timestep_from_second.shape[0], -1)
89
+ timestep = torch.cat([timestep[:, 0:1], timestep_from_second], dim=1)
90
+ else:
91
+ timestep = timestep.reshape(
92
+ timestep.shape[0], -1, num_frame_per_block)
93
+ timestep[:, :, 1:] = timestep[:, :, 0:1]
94
+ timestep = timestep.reshape(timestep.shape[0], -1)
95
+ return timestep
96
+
97
+
98
+ class RollingForcingModel(BaseModel):
99
+ def __init__(self, args, device):
100
+ super().__init__(args, device)
101
+ self.denoising_loss_func = get_denoising_loss(args.denoising_loss_type)()
102
+
103
+ def _run_generator(
104
+ self,
105
+ image_or_video_shape,
106
+ conditional_dict: dict,
107
+ initial_latent: torch.tensor = None
108
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
109
+ """
110
+ Optionally simulate the generator's input from noise using backward simulation
111
+ and then run the generator for one-step.
112
+ Input:
113
+ - image_or_video_shape: a list containing the shape of the image or video [B, F, C, H, W].
114
+ - conditional_dict: a dictionary containing the conditional information (e.g. text embeddings, image embeddings).
115
+ - unconditional_dict: a dictionary containing the unconditional information (e.g. null/negative text embeddings, null/negative image embeddings).
116
+ - clean_latent: a tensor containing the clean latents [B, F, C, H, W]. Need to be passed when no backward simulation is used.
117
+ - initial_latent: a tensor containing the initial latents [B, F, C, H, W].
118
+ Output:
119
+ - pred_image: a tensor with shape [B, F, C, H, W].
120
+ - denoised_timestep: an integer
121
+ """
122
+ # Step 1: Sample noise and backward simulate the generator's input
123
+ assert getattr(self.args, "backward_simulation", True), "Backward simulation needs to be enabled"
124
+ if initial_latent is not None:
125
+ conditional_dict["initial_latent"] = initial_latent
126
+ if self.args.i2v:
127
+ noise_shape = [image_or_video_shape[0], image_or_video_shape[1] - 1, *image_or_video_shape[2:]]
128
+ else:
129
+ noise_shape = image_or_video_shape.copy()
130
+
131
+ # During training, the number of generated frames should be uniformly sampled from
132
+ # [21, self.num_training_frames], but still being a multiple of self.num_frame_per_block
133
+ min_num_frames = 20 if self.args.independent_first_frame else 21
134
+ max_num_frames = self.num_training_frames - 1 if self.args.independent_first_frame else self.num_training_frames
135
+ assert max_num_frames % self.num_frame_per_block == 0
136
+ assert min_num_frames % self.num_frame_per_block == 0
137
+ max_num_blocks = max_num_frames // self.num_frame_per_block
138
+ min_num_blocks = min_num_frames // self.num_frame_per_block
139
+ num_generated_blocks = torch.randint(min_num_blocks, max_num_blocks + 1, (1,), device=self.device)
140
+ dist.broadcast(num_generated_blocks, src=0)
141
+ num_generated_blocks = num_generated_blocks.item()
142
+ num_generated_frames = num_generated_blocks * self.num_frame_per_block
143
+ if self.args.independent_first_frame and initial_latent is None:
144
+ num_generated_frames += 1
145
+ min_num_frames += 1
146
+ # Sync num_generated_frames across all processes
147
+ noise_shape[1] = num_generated_frames
148
+
149
+ pred_image_or_video, denoised_timestep_from, denoised_timestep_to = self._consistency_backward_simulation(
150
+ noise=torch.randn(noise_shape,
151
+ device=self.device, dtype=self.dtype),
152
+ **conditional_dict,
153
+ )
154
+ # Slice last 21 frames
155
+ if pred_image_or_video.shape[1] > 21:
156
+ with torch.no_grad():
157
+ # Reencode to get image latent
158
+ latent_to_decode = pred_image_or_video[:, :-20, ...]
159
+ # Decode to video
160
+ pixels = self.vae.decode_to_pixel(latent_to_decode)
161
+ frame = pixels[:, -1:, ...].to(self.dtype)
162
+ frame = rearrange(frame, "b t c h w -> b c t h w")
163
+ # Encode frame to get image latent
164
+ image_latent = self.vae.encode_to_latent(frame).to(self.dtype)
165
+ pred_image_or_video_last_21 = torch.cat([image_latent, pred_image_or_video[:, -20:, ...]], dim=1)
166
+ else:
167
+ pred_image_or_video_last_21 = pred_image_or_video
168
+
169
+ if num_generated_frames != min_num_frames:
170
+ # Currently, we do not use gradient for the first chunk, since it contains image latents
171
+ gradient_mask = torch.ones_like(pred_image_or_video_last_21, dtype=torch.bool)
172
+ if self.args.independent_first_frame:
173
+ gradient_mask[:, :1] = False
174
+ else:
175
+ gradient_mask[:, :self.num_frame_per_block] = False
176
+ else:
177
+ gradient_mask = None
178
+
179
+ pred_image_or_video_last_21 = pred_image_or_video_last_21.to(self.dtype)
180
+ return pred_image_or_video_last_21, gradient_mask, denoised_timestep_from, denoised_timestep_to
181
+
182
+ def _consistency_backward_simulation(
183
+ self,
184
+ noise: torch.Tensor,
185
+ **conditional_dict: dict
186
+ ) -> torch.Tensor:
187
+ """
188
+ Simulate the generator's input from noise to avoid training/inference mismatch.
189
+ See Sec 4.5 of the DMD2 paper (https://arxiv.org/abs/2405.14867) for details.
190
+ Here we use the consistency sampler (https://arxiv.org/abs/2303.01469)
191
+ Input:
192
+ - noise: a tensor sampled from N(0, 1) with shape [B, F, C, H, W] where the number of frame is 1 for images.
193
+ - conditional_dict: a dictionary containing the conditional information (e.g. text embeddings, image embeddings).
194
+ Output:
195
+ - output: a tensor with shape [B, T, F, C, H, W].
196
+ T is the total number of timesteps. output[0] is a pure noise and output[i] and i>0
197
+ represents the x0 prediction at each timestep.
198
+ """
199
+ if self.inference_pipeline is None:
200
+ self._initialize_inference_pipeline()
201
+
202
+ infer_w_rolling = torch.rand(1, device=self.device) > 0.5
203
+ dist.broadcast(infer_w_rolling, src=0)
204
+
205
+ if infer_w_rolling:
206
+ return self.inference_pipeline.inference_with_rolling_forcing(
207
+ noise=noise, **conditional_dict
208
+ )
209
+ else:
210
+ return self.inference_pipeline.inference_with_self_forcing(
211
+ noise=noise, **conditional_dict
212
+ )
213
+
214
+ def _initialize_inference_pipeline(self):
215
+ """
216
+ Lazy initialize the inference pipeline during the first backward simulation run.
217
+ Here we encapsulate the inference code with a model-dependent outside function.
218
+ We pass our FSDP-wrapped modules into the pipeline to save memory.
219
+ """
220
+ self.inference_pipeline = RollingForcingTrainingPipeline(
221
+ denoising_step_list=self.denoising_step_list,
222
+ scheduler=self.scheduler,
223
+ generator=self.generator,
224
+ num_frame_per_block=self.num_frame_per_block,
225
+ independent_first_frame=self.args.independent_first_frame,
226
+ same_step_across_blocks=self.args.same_step_across_blocks,
227
+ last_step_only=self.args.last_step_only,
228
+ num_max_frames=self.num_training_frames,
229
+ context_noise=self.args.context_noise
230
+ )
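
To make the block-wise noise schedule in `_get_timestep` concrete, here is a toy reproduction of its non-uniform branch with `independent_first_frame=False`; the numbers are illustrative only and this sketch is not part of the committed file:

```python
# Toy sketch of the per-block timestep sharing performed in _get_timestep
# (uniform_timestep=False, independent_first_frame=False).
import torch

batch_size, num_frame, num_frame_per_block = 1, 9, 3
timestep = torch.randint(0, 1000, [batch_size, num_frame])
timestep = timestep.reshape(batch_size, -1, num_frame_per_block)
timestep[:, :, 1:] = timestep[:, :, 0:1]   # all frames in a block share the block's first timestep
timestep = timestep.reshape(batch_size, -1)
# e.g. tensor([[712, 712, 712,  88,  88,  88, 430, 430, 430]])
```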
model/causvid.py ADDED
@@ -0,0 +1,391 @@
1
+ import torch.nn.functional as F
2
+ from typing import Tuple
3
+ import torch
4
+
5
+ from model.base import BaseModel
6
+
7
+
8
+ class CausVid(BaseModel):
9
+ def __init__(self, args, device):
10
+ """
11
+ Initialize the DMD (Distribution Matching Distillation) module.
12
+ This class is self-contained and computes the generator and fake-score losses
13
+ in the forward pass.
14
+ """
15
+ super().__init__(args, device)
16
+ self.num_frame_per_block = getattr(args, "num_frame_per_block", 1)
17
+ self.num_training_frames = getattr(args, "num_training_frames", 21)
18
+
19
+ if self.num_frame_per_block > 1:
20
+ self.generator.model.num_frame_per_block = self.num_frame_per_block
21
+
22
+ self.independent_first_frame = getattr(args, "independent_first_frame", False)
23
+ if self.independent_first_frame:
24
+ self.generator.model.independent_first_frame = True
25
+ if args.gradient_checkpointing:
26
+ self.generator.enable_gradient_checkpointing()
27
+ self.fake_score.enable_gradient_checkpointing()
28
+
29
+ # Step 2: Initialize all dmd hyperparameters
30
+ self.num_train_timestep = args.num_train_timestep
31
+ self.min_step = int(0.02 * self.num_train_timestep)
32
+ self.max_step = int(0.98 * self.num_train_timestep)
33
+ if hasattr(args, "real_guidance_scale"):
34
+ self.real_guidance_scale = args.real_guidance_scale
35
+ self.fake_guidance_scale = args.fake_guidance_scale
36
+ else:
37
+ self.real_guidance_scale = args.guidance_scale
38
+ self.fake_guidance_scale = 0.0
39
+ self.timestep_shift = getattr(args, "timestep_shift", 1.0)
40
+ self.teacher_forcing = getattr(args, "teacher_forcing", False)
41
+
42
+ if getattr(self.scheduler, "alphas_cumprod", None) is not None:
43
+ self.scheduler.alphas_cumprod = self.scheduler.alphas_cumprod.to(device)
44
+ else:
45
+ self.scheduler.alphas_cumprod = None
46
+
47
+ def _compute_kl_grad(
48
+ self, noisy_image_or_video: torch.Tensor,
49
+ estimated_clean_image_or_video: torch.Tensor,
50
+ timestep: torch.Tensor,
51
+ conditional_dict: dict, unconditional_dict: dict,
52
+ normalization: bool = True
53
+ ) -> Tuple[torch.Tensor, dict]:
54
+ """
55
+ Compute the KL grad (eq 7 in https://arxiv.org/abs/2311.18828).
56
+ Input:
57
+ - noisy_image_or_video: a tensor with shape [B, F, C, H, W] where the number of frame is 1 for images.
58
+ - estimated_clean_image_or_video: a tensor with shape [B, F, C, H, W] representing the estimated clean image or video.
59
+ - timestep: a tensor with shape [B, F] containing the randomly generated timestep.
60
+ - conditional_dict: a dictionary containing the conditional information (e.g. text embeddings, image embeddings).
61
+ - unconditional_dict: a dictionary containing the unconditional information (e.g. null/negative text embeddings, null/negative image embeddings).
62
+ - normalization: a boolean indicating whether to normalize the gradient.
63
+ Output:
64
+ - kl_grad: a tensor representing the KL grad.
65
+ - kl_log_dict: a dictionary containing the intermediate tensors for logging.
66
+ """
67
+ # Step 1: Compute the fake score
68
+ _, pred_fake_image_cond = self.fake_score(
69
+ noisy_image_or_video=noisy_image_or_video,
70
+ conditional_dict=conditional_dict,
71
+ timestep=timestep
72
+ )
73
+
74
+ if self.fake_guidance_scale != 0.0:
75
+ _, pred_fake_image_uncond = self.fake_score(
76
+ noisy_image_or_video=noisy_image_or_video,
77
+ conditional_dict=unconditional_dict,
78
+ timestep=timestep
79
+ )
80
+ pred_fake_image = pred_fake_image_cond + (
81
+ pred_fake_image_cond - pred_fake_image_uncond
82
+ ) * self.fake_guidance_scale
83
+ else:
84
+ pred_fake_image = pred_fake_image_cond
85
+
86
+ # Step 2: Compute the real score
87
+ # We compute the conditional and unconditional prediction
88
+ # and add them together to achieve cfg (https://arxiv.org/abs/2207.12598)
89
+ _, pred_real_image_cond = self.real_score(
90
+ noisy_image_or_video=noisy_image_or_video,
91
+ conditional_dict=conditional_dict,
92
+ timestep=timestep
93
+ )
94
+
95
+ _, pred_real_image_uncond = self.real_score(
96
+ noisy_image_or_video=noisy_image_or_video,
97
+ conditional_dict=unconditional_dict,
98
+ timestep=timestep
99
+ )
100
+
101
+ pred_real_image = pred_real_image_cond + (
102
+ pred_real_image_cond - pred_real_image_uncond
103
+ ) * self.real_guidance_scale
104
+
105
+ # Step 3: Compute the DMD gradient (DMD paper eq. 7).
106
+ grad = (pred_fake_image - pred_real_image)
107
+
108
+ # TODO: Change the normalizer for causal teacher
109
+ if normalization:
110
+ # Step 4: Gradient normalization (DMD paper eq. 8).
111
+ p_real = (estimated_clean_image_or_video - pred_real_image)
112
+ normalizer = torch.abs(p_real).mean(dim=[1, 2, 3, 4], keepdim=True)
113
+ grad = grad / normalizer
114
+ grad = torch.nan_to_num(grad)
115
+
116
+ return grad, {
117
+ "dmdtrain_gradient_norm": torch.mean(torch.abs(grad)).detach(),
118
+ "timestep": timestep.detach()
119
+ }
120
+
121
+ def compute_distribution_matching_loss(
122
+ self,
123
+ image_or_video: torch.Tensor,
124
+ conditional_dict: dict,
125
+ unconditional_dict: dict,
126
+ gradient_mask: torch.Tensor = None,
127
+ ) -> Tuple[torch.Tensor, dict]:
128
+ """
129
+ Compute the DMD loss (eq 7 in https://arxiv.org/abs/2311.18828).
130
+ Input:
131
+ - image_or_video: a tensor with shape [B, F, C, H, W] where the number of frame is 1 for images.
132
+ - conditional_dict: a dictionary containing the conditional information (e.g. text embeddings, image embeddings).
133
+ - unconditional_dict: a dictionary containing the unconditional information (e.g. null/negative text embeddings, null/negative image embeddings).
134
+ - gradient_mask: a boolean tensor with the same shape as image_or_video indicating which pixels to compute the loss on.
135
+ Output:
136
+ - dmd_loss: a scalar tensor representing the DMD loss.
137
+ - dmd_log_dict: a dictionary containing the intermediate tensors for logging.
138
+ """
139
+ original_latent = image_or_video
140
+
141
+ batch_size, num_frame = image_or_video.shape[:2]
142
+
143
+ with torch.no_grad():
144
+ # Step 1: Randomly sample timestep based on the given schedule and corresponding noise
145
+ timestep = self._get_timestep(
146
+ 0,
147
+ self.num_train_timestep,
148
+ batch_size,
149
+ num_frame,
150
+ self.num_frame_per_block,
151
+ uniform_timestep=True
152
+ )
153
+
154
+ if self.timestep_shift > 1:
155
+ timestep = self.timestep_shift * \
156
+ (timestep / 1000) / \
157
+ (1 + (self.timestep_shift - 1) * (timestep / 1000)) * 1000
158
+ timestep = timestep.clamp(self.min_step, self.max_step)
159
+
160
+ noise = torch.randn_like(image_or_video)
161
+ noisy_latent = self.scheduler.add_noise(
162
+ image_or_video.flatten(0, 1),
163
+ noise.flatten(0, 1),
164
+ timestep.flatten(0, 1)
165
+ ).detach().unflatten(0, (batch_size, num_frame))
166
+
167
+ # Step 2: Compute the KL grad
168
+ grad, dmd_log_dict = self._compute_kl_grad(
169
+ noisy_image_or_video=noisy_latent,
170
+ estimated_clean_image_or_video=original_latent,
171
+ timestep=timestep,
172
+ conditional_dict=conditional_dict,
173
+ unconditional_dict=unconditional_dict
174
+ )
175
+
176
+ if gradient_mask is not None:
177
+ dmd_loss = 0.5 * F.mse_loss(original_latent.double(
178
+ )[gradient_mask], (original_latent.double() - grad.double()).detach()[gradient_mask], reduction="mean")
179
+ else:
180
+ dmd_loss = 0.5 * F.mse_loss(original_latent.double(
181
+ ), (original_latent.double() - grad.double()).detach(), reduction="mean")
182
+ return dmd_loss, dmd_log_dict
183
+
184
+ def _run_generator(
185
+ self,
186
+ image_or_video_shape,
187
+ conditional_dict: dict,
188
+ clean_latent: torch.tensor
189
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
190
+ """
191
+ Optionally simulate the generator's input from noise using backward simulation
192
+ and then run the generator for one-step.
193
+ Input:
194
+ - image_or_video_shape: a list containing the shape of the image or video [B, F, C, H, W].
195
+ - conditional_dict: a dictionary containing the conditional information (e.g. text embeddings, image embeddings).
196
+ - unconditional_dict: a dictionary containing the unconditional information (e.g. null/negative text embeddings, null/negative image embeddings).
197
+ - clean_latent: a tensor containing the clean latents [B, F, C, H, W]. Need to be passed when no backward simulation is used.
198
+ - initial_latent: a tensor containing the initial latents [B, F, C, H, W].
199
+ Output:
200
+ - pred_image: a tensor with shape [B, F, C, H, W].
201
+ """
202
+ simulated_noisy_input = []
203
+ for timestep in self.denoising_step_list:
204
+ noise = torch.randn(
205
+ image_or_video_shape, device=self.device, dtype=self.dtype)
206
+
207
+ noisy_timestep = timestep * torch.ones(
208
+ image_or_video_shape[:2], device=self.device, dtype=torch.long)
209
+
210
+ if timestep != 0:
211
+ noisy_image = self.scheduler.add_noise(
212
+ clean_latent.flatten(0, 1),
213
+ noise.flatten(0, 1),
214
+ noisy_timestep.flatten(0, 1)
215
+ ).unflatten(0, image_or_video_shape[:2])
216
+ else:
217
+ noisy_image = clean_latent
218
+
219
+ simulated_noisy_input.append(noisy_image)
220
+
221
+ simulated_noisy_input = torch.stack(simulated_noisy_input, dim=1)
222
+
223
+ # Step 2: Randomly sample a timestep and pick the corresponding input
224
+ index = self._get_timestep(
225
+ 0,
226
+ len(self.denoising_step_list),
227
+ image_or_video_shape[0],
228
+ image_or_video_shape[1],
229
+ self.num_frame_per_block,
230
+ uniform_timestep=False
231
+ )
232
+
233
+ # select the corresponding timestep's noisy input from the stacked tensor [B, T, F, C, H, W]
234
+ noisy_input = torch.gather(
235
+ simulated_noisy_input, dim=1,
236
+ index=index.reshape(index.shape[0], 1, index.shape[1], 1, 1, 1).expand(
237
+ -1, -1, -1, *image_or_video_shape[2:]).to(self.device)
238
+ ).squeeze(1)
239
+
240
+ timestep = self.denoising_step_list[index].to(self.device)
241
+
242
+ _, pred_image_or_video = self.generator(
243
+ noisy_image_or_video=noisy_input,
244
+ conditional_dict=conditional_dict,
245
+ timestep=timestep,
246
+ clean_x=clean_latent if self.teacher_forcing else None,
247
+ )
248
+
249
+ gradient_mask = None # timestep != 0
250
+
251
+ pred_image_or_video = pred_image_or_video.type_as(noisy_input)
252
+
253
+ return pred_image_or_video, gradient_mask
254
+
255
+ def generator_loss(
256
+ self,
257
+ image_or_video_shape,
258
+ conditional_dict: dict,
259
+ unconditional_dict: dict,
260
+ clean_latent: torch.Tensor,
261
+ initial_latent: torch.Tensor = None
262
+ ) -> Tuple[torch.Tensor, dict]:
263
+ """
264
+ Generate image/videos from noise and compute the DMD loss.
265
+ The noisy input to the generator is backward simulated.
266
+ This removes the need of any datasets during distillation.
267
+ See Sec 4.5 of the DMD2 paper (https://arxiv.org/abs/2405.14867) for details.
268
+ Input:
269
+ - image_or_video_shape: a list containing the shape of the image or video [B, F, C, H, W].
270
+ - conditional_dict: a dictionary containing the conditional information (e.g. text embeddings, image embeddings).
271
+ - unconditional_dict: a dictionary containing the unconditional information (e.g. null/negative text embeddings, null/negative image embeddings).
272
+ - clean_latent: a tensor containing the clean latents [B, F, C, H, W]. Need to be passed when no backward simulation is used.
273
+ Output:
274
+ - loss: a scalar tensor representing the generator loss.
275
+ - generator_log_dict: a dictionary containing the intermediate tensors for logging.
276
+ """
277
+ # Step 1: Run generator on backward simulated noisy input
278
+ pred_image, gradient_mask = self._run_generator(
279
+ image_or_video_shape=image_or_video_shape,
280
+ conditional_dict=conditional_dict,
281
+ clean_latent=clean_latent
282
+ )
283
+
284
+ # Step 2: Compute the DMD loss
285
+ dmd_loss, dmd_log_dict = self.compute_distribution_matching_loss(
286
+ image_or_video=pred_image,
287
+ conditional_dict=conditional_dict,
288
+ unconditional_dict=unconditional_dict,
289
+ gradient_mask=gradient_mask
290
+ )
291
+
292
+ # Step 3: TODO: Implement the GAN loss
293
+
294
+ return dmd_loss, dmd_log_dict
295
+
296
+ def critic_loss(
297
+ self,
298
+ image_or_video_shape,
299
+ conditional_dict: dict,
300
+ unconditional_dict: dict,
301
+ clean_latent: torch.Tensor,
302
+ initial_latent: torch.Tensor = None
303
+ ) -> Tuple[torch.Tensor, dict]:
304
+ """
305
+ Generate image/videos from noise and train the critic with generated samples.
306
+ The noisy input to the generator is backward simulated.
307
+ This removes the need of any datasets during distillation.
308
+ See Sec 4.5 of the DMD2 paper (https://arxiv.org/abs/2405.14867) for details.
309
+ Input:
310
+ - image_or_video_shape: a list containing the shape of the image or video [B, F, C, H, W].
311
+ - conditional_dict: a dictionary containing the conditional information (e.g. text embeddings, image embeddings).
312
+ - unconditional_dict: a dictionary containing the unconditional information (e.g. null/negative text embeddings, null/negative image embeddings).
313
+ - clean_latent: a tensor containing the clean latents [B, F, C, H, W]. It needs to be passed when no backward simulation is used.
314
+ Output:
315
+ - loss: a scalar tensor representing the critic's denoising loss.
316
+ - critic_log_dict: a dictionary containing the intermediate tensors for logging.
317
+ """
318
+
319
+ # Step 1: Run generator on backward simulated noisy input
320
+ with torch.no_grad():
321
+ generated_image, _ = self._run_generator(
322
+ image_or_video_shape=image_or_video_shape,
323
+ conditional_dict=conditional_dict,
324
+ clean_latent=clean_latent
325
+ )
326
+
327
+ # Step 2: Compute the fake prediction
328
+ critic_timestep = self._get_timestep(
329
+ 0,
330
+ self.num_train_timestep,
331
+ image_or_video_shape[0],
332
+ image_or_video_shape[1],
333
+ self.num_frame_per_block,
334
+ uniform_timestep=True
335
+ )
336
+
337
+ if self.timestep_shift > 1:
338
+ critic_timestep = self.timestep_shift * \
339
+ (critic_timestep / 1000) / (1 + (self.timestep_shift - 1) * (critic_timestep / 1000)) * 1000
340
+
341
+ critic_timestep = critic_timestep.clamp(self.min_step, self.max_step)
342
+
343
+ critic_noise = torch.randn_like(generated_image)
344
+ noisy_generated_image = self.scheduler.add_noise(
345
+ generated_image.flatten(0, 1),
346
+ critic_noise.flatten(0, 1),
347
+ critic_timestep.flatten(0, 1)
348
+ ).unflatten(0, image_or_video_shape[:2])
349
+
350
+ _, pred_fake_image = self.fake_score(
351
+ noisy_image_or_video=noisy_generated_image,
352
+ conditional_dict=conditional_dict,
353
+ timestep=critic_timestep
354
+ )
355
+
356
+ # Step 3: Compute the denoising loss for the fake critic
357
+ if self.args.denoising_loss_type == "flow":
358
+ from utils.wan_wrapper import WanDiffusionWrapper
359
+ flow_pred = WanDiffusionWrapper._convert_x0_to_flow_pred(
360
+ scheduler=self.scheduler,
361
+ x0_pred=pred_fake_image.flatten(0, 1),
362
+ xt=noisy_generated_image.flatten(0, 1),
363
+ timestep=critic_timestep.flatten(0, 1)
364
+ )
365
+ pred_fake_noise = None
366
+ else:
367
+ flow_pred = None
368
+ pred_fake_noise = self.scheduler.convert_x0_to_noise(
369
+ x0=pred_fake_image.flatten(0, 1),
370
+ xt=noisy_generated_image.flatten(0, 1),
371
+ timestep=critic_timestep.flatten(0, 1)
372
+ ).unflatten(0, image_or_video_shape[:2])
373
+
374
+ denoising_loss = self.denoising_loss_func(
375
+ x=generated_image.flatten(0, 1),
376
+ x_pred=pred_fake_image.flatten(0, 1),
377
+ noise=critic_noise.flatten(0, 1),
378
+ noise_pred=pred_fake_noise,
379
+ alphas_cumprod=self.scheduler.alphas_cumprod,
380
+ timestep=critic_timestep.flatten(0, 1),
381
+ flow_pred=flow_pred
382
+ )
383
+
384
+ # Step 4: TODO: Compute the GAN loss
385
+
386
+ # Step 5: Debugging Log
387
+ critic_log_dict = {
388
+ "critic_timestep": critic_timestep.detach()
389
+ }
390
+
391
+ return denoising_loss, critic_log_dict
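Note on the timestep shift used in `critic_loss` above: a minimal, standalone sketch of the same warp, where `shift`, `min_step`, and `max_step` are illustrative stand-ins for `self.timestep_shift`, `self.min_step`, and `self.max_step` (not values taken from the training configs).

```python
# Standalone sketch of the timestep-shift warp applied to critic_timestep above.
import torch

def shift_timestep(t: torch.Tensor, shift: float = 5.0,
                   min_step: int = 20, max_step: int = 980) -> torch.Tensor:
    """Warp timesteps in [0, 1000] toward the noisy end, then clamp."""
    if shift > 1:
        u = t / 1000.0
        t = shift * u / (1 + (shift - 1) * u) * 1000.0
    return t.clamp(min_step, max_step)

print(shift_timestep(torch.linspace(0, 1000, 6)))  # larger `shift` pushes samples toward t=1000
```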
model/diffusion.py ADDED
@@ -0,0 +1,125 @@
1
+ from typing import Tuple
2
+ import torch
3
+
4
+ from model.base import BaseModel
5
+ from utils.wan_wrapper import WanDiffusionWrapper, WanTextEncoder, WanVAEWrapper
6
+
7
+
8
+ class CausalDiffusion(BaseModel):
9
+ def __init__(self, args, device):
10
+ """
11
+ Initialize the Diffusion loss module.
12
+ """
13
+ super().__init__(args, device)
14
+ self.num_frame_per_block = getattr(args, "num_frame_per_block", 1)
15
+ if self.num_frame_per_block > 1:
16
+ self.generator.model.num_frame_per_block = self.num_frame_per_block
17
+ self.independent_first_frame = getattr(args, "independent_first_frame", False)
18
+ if self.independent_first_frame:
19
+ self.generator.model.independent_first_frame = True
20
+
21
+ if args.gradient_checkpointing:
22
+ self.generator.enable_gradient_checkpointing()
23
+
24
+ # Step 2: Initialize all hyperparameters
25
+ self.num_train_timestep = args.num_train_timestep
26
+ self.min_step = int(0.02 * self.num_train_timestep)
27
+ self.max_step = int(0.98 * self.num_train_timestep)
28
+ self.guidance_scale = args.guidance_scale
29
+ self.timestep_shift = getattr(args, "timestep_shift", 1.0)
30
+ self.teacher_forcing = getattr(args, "teacher_forcing", False)
31
+ # Noise augmentation in teacher forcing, we add small noise to clean context latents
32
+ self.noise_augmentation_max_timestep = getattr(args, "noise_augmentation_max_timestep", 0)
33
+
34
+ def _initialize_models(self, args):
35
+ self.generator = WanDiffusionWrapper(**getattr(args, "model_kwargs", {}), is_causal=True)
36
+ self.generator.model.requires_grad_(True)
37
+
38
+ self.text_encoder = WanTextEncoder()
39
+ self.text_encoder.requires_grad_(False)
40
+
41
+ self.vae = WanVAEWrapper()
42
+ self.vae.requires_grad_(False)
43
+
44
+ def generator_loss(
45
+ self,
46
+ image_or_video_shape,
47
+ conditional_dict: dict,
48
+ unconditional_dict: dict,
49
+ clean_latent: torch.Tensor,
50
+ initial_latent: torch.Tensor = None
51
+ ) -> Tuple[torch.Tensor, dict]:
52
+ """
53
+ Add noise to the clean latents and compute the causal diffusion (flow-matching) loss.
54
+ Unlike the distillation objectives, no backward simulation is performed here;
55
+ the loss is supervised directly by the provided clean latents,
56
+ optionally with teacher forcing on noise-augmented context latents.
57
+ Input:
58
+ - image_or_video_shape: a list containing the shape of the image or video [B, F, C, H, W].
59
+ - conditional_dict: a dictionary containing the conditional information (e.g. text embeddings, image embeddings).
60
+ - unconditional_dict: a dictionary containing the unconditional information (e.g. null/negative text embeddings, null/negative image embeddings).
61
+ - clean_latent: a tensor containing the clean latents [B, F, C, H, W]. It must always be provided for this loss.
62
+ Output:
63
+ - loss: a scalar tensor representing the generator loss.
64
+ - generator_log_dict: a dictionary containing the intermediate tensors for logging.
65
+ """
66
+ noise = torch.randn_like(clean_latent)
67
+ batch_size, num_frame = image_or_video_shape[:2]
68
+
69
+ # Step 2: Randomly sample a timestep and add noise to denoiser inputs
70
+ index = self._get_timestep(
71
+ 0,
72
+ self.scheduler.num_train_timesteps,
73
+ image_or_video_shape[0],
74
+ image_or_video_shape[1],
75
+ self.num_frame_per_block,
76
+ uniform_timestep=False
77
+ )
78
+ timestep = self.scheduler.timesteps[index].to(dtype=self.dtype, device=self.device)
79
+ noisy_latents = self.scheduler.add_noise(
80
+ clean_latent.flatten(0, 1),
81
+ noise.flatten(0, 1),
82
+ timestep.flatten(0, 1)
83
+ ).unflatten(0, (batch_size, num_frame))
84
+ training_target = self.scheduler.training_target(clean_latent, noise, timestep)
85
+
86
+ # Step 3: Noise augmentation, also add small noise to clean context latents
87
+ if self.noise_augmentation_max_timestep > 0:
88
+ index_clean_aug = self._get_timestep(
89
+ 0,
90
+ self.noise_augmentation_max_timestep,
91
+ image_or_video_shape[0],
92
+ image_or_video_shape[1],
93
+ self.num_frame_per_block,
94
+ uniform_timestep=False
95
+ )
96
+ timestep_clean_aug = self.scheduler.timesteps[index_clean_aug].to(dtype=self.dtype, device=self.device)
97
+ clean_latent_aug = self.scheduler.add_noise(
98
+ clean_latent.flatten(0, 1),
99
+ noise.flatten(0, 1),
100
+ timestep_clean_aug.flatten(0, 1)
101
+ ).unflatten(0, (batch_size, num_frame))
102
+ else:
103
+ clean_latent_aug = clean_latent
104
+ timestep_clean_aug = None
105
+
106
+ # Compute loss
107
+ flow_pred, x0_pred = self.generator(
108
+ noisy_image_or_video=noisy_latents,
109
+ conditional_dict=conditional_dict,
110
+ timestep=timestep,
111
+ clean_x=clean_latent_aug if self.teacher_forcing else None,
112
+ aug_t=timestep_clean_aug if self.teacher_forcing else None
113
+ )
114
+ # loss = torch.nn.functional.mse_loss(flow_pred.float(), training_target.float())
115
+ loss = torch.nn.functional.mse_loss(
116
+ flow_pred.float(), training_target.float(), reduction='none'
117
+ ).mean(dim=(2, 3, 4))
118
+ loss = loss * self.scheduler.training_weight(timestep).unflatten(0, (batch_size, num_frame))
119
+ loss = loss.mean()
120
+
121
+ log_dict = {
122
+ "x0": clean_latent.detach(),
123
+ "x0_pred": x0_pred.detach()
124
+ }
125
+ return loss, log_dict
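The loss above is a per-frame MSE against `scheduler.training_target`, reweighted by `scheduler.training_weight`. The sketch below illustrates the same per-frame reduction under the assumption of a rectified-flow target `noise - x0`; the real target and weighting come from the Wan scheduler, which is not part of this diff.

```python
# Hedged sketch: per-frame flow-matching MSE with optional per-frame weighting,
# assuming a rectified-flow target (noise - x0). Shapes are illustrative only.
import torch

def flow_matching_loss(x0, noise, flow_pred, frame_weight=None):
    """x0, noise, flow_pred: [B, F, C, H, W]; frame_weight: optional [B, F]."""
    target = noise - x0                                                           # assumed velocity target
    per_frame = (flow_pred.float() - target.float()).pow(2).mean(dim=(2, 3, 4))   # [B, F]
    if frame_weight is not None:                                                  # stands in for training_weight(t)
        per_frame = per_frame * frame_weight
    return per_frame.mean()

x0 = torch.randn(1, 3, 4, 8, 8)
noise = torch.randn_like(x0)
print(float(flow_matching_loss(x0, noise, flow_pred=torch.randn_like(x0))))
```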
model/dmd.py ADDED
@@ -0,0 +1,332 @@
1
+ from pipeline import RollingForcingTrainingPipeline
2
+ import torch.nn.functional as F
3
+ from typing import Optional, Tuple
4
+ import torch
5
+
6
+ from model.base import RollingForcingModel
7
+
8
+
9
+ class DMD(RollingForcingModel):
10
+ def __init__(self, args, device):
11
+ """
12
+ Initialize the DMD (Distribution Matching Distillation) module.
13
+ This class is self-contained and computes generator and fake score losses
14
+ in the forward pass.
15
+ """
16
+ super().__init__(args, device)
17
+ self.num_frame_per_block = getattr(args, "num_frame_per_block", 1)
18
+ self.same_step_across_blocks = getattr(args, "same_step_across_blocks", True)
19
+ self.num_training_frames = getattr(args, "num_training_frames", 21)
20
+
21
+ if self.num_frame_per_block > 1:
22
+ self.generator.model.num_frame_per_block = self.num_frame_per_block
23
+
24
+ self.independent_first_frame = getattr(args, "independent_first_frame", False)
25
+ if self.independent_first_frame:
26
+ self.generator.model.independent_first_frame = True
27
+ if args.gradient_checkpointing:
28
+ self.generator.enable_gradient_checkpointing()
29
+ self.fake_score.enable_gradient_checkpointing()
30
+
31
+ # this will be init later with fsdp-wrapped modules
32
+ self.inference_pipeline: RollingForcingTrainingPipeline = None
33
+
34
+ # Step 2: Initialize all dmd hyperparameters
35
+ self.num_train_timestep = args.num_train_timestep
36
+ self.min_step = int(0.02 * self.num_train_timestep)
37
+ self.max_step = int(0.98 * self.num_train_timestep)
38
+ if hasattr(args, "real_guidance_scale"):
39
+ self.real_guidance_scale = args.real_guidance_scale
40
+ self.fake_guidance_scale = args.fake_guidance_scale
41
+ else:
42
+ self.real_guidance_scale = args.guidance_scale
43
+ self.fake_guidance_scale = 0.0
44
+ self.timestep_shift = getattr(args, "timestep_shift", 1.0)
45
+ self.ts_schedule = getattr(args, "ts_schedule", True)
46
+ self.ts_schedule_max = getattr(args, "ts_schedule_max", False)
47
+ self.min_score_timestep = getattr(args, "min_score_timestep", 0)
48
+
49
+ if getattr(self.scheduler, "alphas_cumprod", None) is not None:
50
+ self.scheduler.alphas_cumprod = self.scheduler.alphas_cumprod.to(device)
51
+ else:
52
+ self.scheduler.alphas_cumprod = None
53
+
54
+ def _compute_kl_grad(
55
+ self, noisy_image_or_video: torch.Tensor,
56
+ estimated_clean_image_or_video: torch.Tensor,
57
+ timestep: torch.Tensor,
58
+ conditional_dict: dict, unconditional_dict: dict,
59
+ normalization: bool = True
60
+ ) -> Tuple[torch.Tensor, dict]:
61
+ """
62
+ Compute the KL grad (eq 7 in https://arxiv.org/abs/2311.18828).
63
+ Input:
64
+ - noisy_image_or_video: a tensor with shape [B, F, C, H, W] where the number of frames is 1 for images.
65
+ - estimated_clean_image_or_video: a tensor with shape [B, F, C, H, W] representing the estimated clean image or video.
66
+ - timestep: a tensor with shape [B, F] containing the randomly generated timestep.
67
+ - conditional_dict: a dictionary containing the conditional information (e.g. text embeddings, image embeddings).
68
+ - unconditional_dict: a dictionary containing the unconditional information (e.g. null/negative text embeddings, null/negative image embeddings).
69
+ - normalization: a boolean indicating whether to normalize the gradient.
70
+ Output:
71
+ - kl_grad: a tensor representing the KL grad.
72
+ - kl_log_dict: a dictionary containing the intermediate tensors for logging.
73
+ """
74
+ # Step 1: Compute the fake score
75
+ _, pred_fake_image_cond = self.fake_score(
76
+ noisy_image_or_video=noisy_image_or_video,
77
+ conditional_dict=conditional_dict,
78
+ timestep=timestep
79
+ )
80
+
81
+ if self.fake_guidance_scale != 0.0:
82
+ _, pred_fake_image_uncond = self.fake_score(
83
+ noisy_image_or_video=noisy_image_or_video,
84
+ conditional_dict=unconditional_dict,
85
+ timestep=timestep
86
+ )
87
+ pred_fake_image = pred_fake_image_cond + (
88
+ pred_fake_image_cond - pred_fake_image_uncond
89
+ ) * self.fake_guidance_scale
90
+ else:
91
+ pred_fake_image = pred_fake_image_cond
92
+
93
+ # Step 2: Compute the real score
94
+ # We compute the conditional and unconditional prediction
95
+ # and add them together to achieve cfg (https://arxiv.org/abs/2207.12598)
96
+ _, pred_real_image_cond = self.real_score(
97
+ noisy_image_or_video=noisy_image_or_video,
98
+ conditional_dict=conditional_dict,
99
+ timestep=timestep
100
+ )
101
+
102
+ _, pred_real_image_uncond = self.real_score(
103
+ noisy_image_or_video=noisy_image_or_video,
104
+ conditional_dict=unconditional_dict,
105
+ timestep=timestep
106
+ )
107
+
108
+ pred_real_image = pred_real_image_cond + (
109
+ pred_real_image_cond - pred_real_image_uncond
110
+ ) * self.real_guidance_scale
111
+
112
+ # Step 3: Compute the DMD gradient (DMD paper eq. 7).
113
+ grad = (pred_fake_image - pred_real_image)
114
+
115
+ # TODO: Change the normalizer for causal teacher
116
+ if normalization:
117
+ # Step 4: Gradient normalization (DMD paper eq. 8).
118
+ p_real = (estimated_clean_image_or_video - pred_real_image)
119
+ normalizer = torch.abs(p_real).mean(dim=[1, 2, 3, 4], keepdim=True)
120
+ grad = grad / normalizer
121
+ grad = torch.nan_to_num(grad)
122
+
123
+ return grad, {
124
+ "dmdtrain_gradient_norm": torch.mean(torch.abs(grad)).detach(),
125
+ "timestep": timestep.detach()
126
+ }
127
+
128
+ def compute_distribution_matching_loss(
129
+ self,
130
+ image_or_video: torch.Tensor,
131
+ conditional_dict: dict,
132
+ unconditional_dict: dict,
133
+ gradient_mask: Optional[torch.Tensor] = None,
134
+ denoised_timestep_from: int = 0,
135
+ denoised_timestep_to: int = 0
136
+ ) -> Tuple[torch.Tensor, dict]:
137
+ """
138
+ Compute the DMD loss (eq 7 in https://arxiv.org/abs/2311.18828).
139
+ Input:
140
+ - image_or_video: a tensor with shape [B, F, C, H, W] where the number of frames is 1 for images.
141
+ - conditional_dict: a dictionary containing the conditional information (e.g. text embeddings, image embeddings).
142
+ - unconditional_dict: a dictionary containing the unconditional information (e.g. null/negative text embeddings, null/negative image embeddings).
143
+ - gradient_mask: a boolean tensor with the same shape as image_or_video indicating which pixels to compute the loss on.
144
+ Output:
145
+ - dmd_loss: a scalar tensor representing the DMD loss.
146
+ - dmd_log_dict: a dictionary containing the intermediate tensors for logging.
147
+ """
148
+ original_latent = image_or_video
149
+
150
+ batch_size, num_frame = image_or_video.shape[:2]
151
+
152
+ with torch.no_grad():
153
+ # Step 1: Randomly sample timestep based on the given schedule and corresponding noise
154
+ min_timestep = denoised_timestep_to if self.ts_schedule and denoised_timestep_to is not None else self.min_score_timestep
155
+ max_timestep = denoised_timestep_from if self.ts_schedule_max and denoised_timestep_from is not None else self.num_train_timestep
156
+ timestep = self._get_timestep(
157
+ min_timestep,
158
+ max_timestep,
159
+ batch_size,
160
+ num_frame,
161
+ self.num_frame_per_block,
162
+ uniform_timestep=True
163
+ )
164
+
165
+ # TODO:should we change it to `timestep = self.scheduler.timesteps[timestep]`?
166
+ if self.timestep_shift > 1:
167
+ timestep = self.timestep_shift * \
168
+ (timestep / 1000) / \
169
+ (1 + (self.timestep_shift - 1) * (timestep / 1000)) * 1000
170
+ timestep = timestep.clamp(self.min_step, self.max_step)
171
+
172
+ noise = torch.randn_like(image_or_video)
173
+ noisy_latent = self.scheduler.add_noise(
174
+ image_or_video.flatten(0, 1),
175
+ noise.flatten(0, 1),
176
+ timestep.flatten(0, 1)
177
+ ).detach().unflatten(0, (batch_size, num_frame))
178
+
179
+ # Step 2: Compute the KL grad
180
+ grad, dmd_log_dict = self._compute_kl_grad(
181
+ noisy_image_or_video=noisy_latent,
182
+ estimated_clean_image_or_video=original_latent,
183
+ timestep=timestep,
184
+ conditional_dict=conditional_dict,
185
+ unconditional_dict=unconditional_dict
186
+ )
187
+
188
+ if gradient_mask is not None:
189
+ dmd_loss = 0.5 * F.mse_loss(original_latent.double(
190
+ )[gradient_mask], (original_latent.double() - grad.double()).detach()[gradient_mask], reduction="mean")
191
+ else:
192
+ dmd_loss = 0.5 * F.mse_loss(original_latent.double(
193
+ ), (original_latent.double() - grad.double()).detach(), reduction="mean")
194
+ return dmd_loss, dmd_log_dict
195
+
196
+ def generator_loss(
197
+ self,
198
+ image_or_video_shape,
199
+ conditional_dict: dict,
200
+ unconditional_dict: dict,
201
+ clean_latent: torch.Tensor,
202
+ initial_latent: torch.Tensor = None
203
+ ) -> Tuple[torch.Tensor, dict]:
204
+ """
205
+ Generate image/videos from noise and compute the DMD loss.
206
+ The noisy input to the generator is backward simulated.
207
+ This removes the need of any datasets during distillation.
208
+ See Sec 4.5 of the DMD2 paper (https://arxiv.org/abs/2405.14867) for details.
209
+ Input:
210
+ - image_or_video_shape: a list containing the shape of the image or video [B, F, C, H, W].
211
+ - conditional_dict: a dictionary containing the conditional information (e.g. text embeddings, image embeddings).
212
+ - unconditional_dict: a dictionary containing the unconditional information (e.g. null/negative text embeddings, null/negative image embeddings).
213
+ - clean_latent: a tensor containing the clean latents [B, F, C, H, W]. It needs to be passed when no backward simulation is used.
214
+ Output:
215
+ - loss: a scalar tensor representing the generator loss.
216
+ - generator_log_dict: a dictionary containing the intermediate tensors for logging.
217
+ """
218
+ # Step 1: Unroll generator to obtain fake videos
219
+ pred_image, gradient_mask, denoised_timestep_from, denoised_timestep_to = self._run_generator(
220
+ image_or_video_shape=image_or_video_shape,
221
+ conditional_dict=conditional_dict,
222
+ initial_latent=initial_latent
223
+ )
224
+
225
+ # Step 2: Compute the DMD loss
226
+ dmd_loss, dmd_log_dict = self.compute_distribution_matching_loss(
227
+ image_or_video=pred_image,
228
+ conditional_dict=conditional_dict,
229
+ unconditional_dict=unconditional_dict,
230
+ gradient_mask=gradient_mask,
231
+ denoised_timestep_from=denoised_timestep_from,
232
+ denoised_timestep_to=denoised_timestep_to
233
+ )
234
+
235
+ return dmd_loss, dmd_log_dict
236
+
237
+ def critic_loss(
238
+ self,
239
+ image_or_video_shape,
240
+ conditional_dict: dict,
241
+ unconditional_dict: dict,
242
+ clean_latent: torch.Tensor,
243
+ initial_latent: torch.Tensor = None
244
+ ) -> Tuple[torch.Tensor, dict]:
245
+ """
246
+ Generate image/videos from noise and train the critic with generated samples.
247
+ The noisy input to the generator is backward simulated.
248
+ This removes the need of any datasets during distillation.
249
+ See Sec 4.5 of the DMD2 paper (https://arxiv.org/abs/2405.14867) for details.
250
+ Input:
251
+ - image_or_video_shape: a list containing the shape of the image or video [B, F, C, H, W].
252
+ - conditional_dict: a dictionary containing the conditional information (e.g. text embeddings, image embeddings).
253
+ - unconditional_dict: a dictionary containing the unconditional information (e.g. null/negative text embeddings, null/negative image embeddings).
254
+ - clean_latent: a tensor containing the clean latents [B, F, C, H, W]. It needs to be passed when no backward simulation is used.
255
+ Output:
256
+ - loss: a scalar tensor representing the critic's denoising loss.
257
+ - critic_log_dict: a dictionary containing the intermediate tensors for logging.
258
+ """
259
+
260
+ # Step 1: Run generator on backward simulated noisy input
261
+ with torch.no_grad():
262
+ generated_image, _, denoised_timestep_from, denoised_timestep_to = self._run_generator(
263
+ image_or_video_shape=image_or_video_shape,
264
+ conditional_dict=conditional_dict,
265
+ initial_latent=initial_latent
266
+ )
267
+
268
+ # Step 2: Compute the fake prediction
269
+ min_timestep = denoised_timestep_to if self.ts_schedule and denoised_timestep_to is not None else self.min_score_timestep
270
+ max_timestep = denoised_timestep_from if self.ts_schedule_max and denoised_timestep_from is not None else self.num_train_timestep
271
+ critic_timestep = self._get_timestep(
272
+ min_timestep,
273
+ max_timestep,
274
+ image_or_video_shape[0],
275
+ image_or_video_shape[1],
276
+ self.num_frame_per_block,
277
+ uniform_timestep=True
278
+ )
279
+
280
+ if self.timestep_shift > 1:
281
+ critic_timestep = self.timestep_shift * \
282
+ (critic_timestep / 1000) / (1 + (self.timestep_shift - 1) * (critic_timestep / 1000)) * 1000
283
+
284
+ critic_timestep = critic_timestep.clamp(self.min_step, self.max_step)
285
+
286
+ critic_noise = torch.randn_like(generated_image)
287
+ noisy_generated_image = self.scheduler.add_noise(
288
+ generated_image.flatten(0, 1),
289
+ critic_noise.flatten(0, 1),
290
+ critic_timestep.flatten(0, 1)
291
+ ).unflatten(0, image_or_video_shape[:2])
292
+
293
+ _, pred_fake_image = self.fake_score(
294
+ noisy_image_or_video=noisy_generated_image,
295
+ conditional_dict=conditional_dict,
296
+ timestep=critic_timestep
297
+ )
298
+
299
+ # Step 3: Compute the denoising loss for the fake critic
300
+ if self.args.denoising_loss_type == "flow":
301
+ from utils.wan_wrapper import WanDiffusionWrapper
302
+ flow_pred = WanDiffusionWrapper._convert_x0_to_flow_pred(
303
+ scheduler=self.scheduler,
304
+ x0_pred=pred_fake_image.flatten(0, 1),
305
+ xt=noisy_generated_image.flatten(0, 1),
306
+ timestep=critic_timestep.flatten(0, 1)
307
+ )
308
+ pred_fake_noise = None
309
+ else:
310
+ flow_pred = None
311
+ pred_fake_noise = self.scheduler.convert_x0_to_noise(
312
+ x0=pred_fake_image.flatten(0, 1),
313
+ xt=noisy_generated_image.flatten(0, 1),
314
+ timestep=critic_timestep.flatten(0, 1)
315
+ ).unflatten(0, image_or_video_shape[:2])
316
+
317
+ denoising_loss = self.denoising_loss_func(
318
+ x=generated_image.flatten(0, 1),
319
+ x_pred=pred_fake_image.flatten(0, 1),
320
+ noise=critic_noise.flatten(0, 1),
321
+ noise_pred=pred_fake_noise,
322
+ alphas_cumprod=self.scheduler.alphas_cumprod,
323
+ timestep=critic_timestep.flatten(0, 1),
324
+ flow_pred=flow_pred
325
+ )
326
+
327
+ # Step 5: Debugging Log
328
+ critic_log_dict = {
329
+ "critic_timestep": critic_timestep.detach()
330
+ }
331
+
332
+ return denoising_loss, critic_log_dict
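The DMD objective in `compute_distribution_matching_loss` never backpropagates through the score networks: it rewrites the update as `0.5 * ||x - stopgrad(x - grad)||^2`, whose gradient with respect to `x` is exactly `grad`. A small self-contained check of that identity (using `reduction="sum"`; the file uses `"mean"`, which only rescales by the element count):

```python
# Verify that 0.5 * ||x - stopgrad(x - grad)||^2 delivers `grad` to x on backward.
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, requires_grad=True)
grad = torch.randn(2, 3)  # stand-in for the normalized (fake_score - real_score) difference

loss = 0.5 * F.mse_loss(x, (x - grad).detach(), reduction="sum")
loss.backward()
print(torch.allclose(x.grad, grad))  # True
```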
model/gan.py ADDED
@@ -0,0 +1,295 @@
1
+ import copy
2
+ from pipeline import RollingForcingTrainingPipeline
3
+ import torch.nn.functional as F
4
+ from typing import Tuple
5
+ import torch
6
+
7
+ from model.base import RollingForcingModel
8
+
9
+
10
+ class GAN(RollingForcingModel):
11
+ def __init__(self, args, device):
12
+ """
13
+ Initialize the GAN module.
14
+ This class is self-contained and computes generator and fake score losses
15
+ in the forward pass.
16
+ """
17
+ super().__init__(args, device)
18
+ self.num_frame_per_block = getattr(args, "num_frame_per_block", 1)
19
+ self.same_step_across_blocks = getattr(args, "same_step_across_blocks", True)
20
+ self.concat_time_embeddings = getattr(args, "concat_time_embeddings", False)
21
+ self.num_class = args.num_class
22
+ self.relativistic_discriminator = getattr(args, "relativistic_discriminator", False)
23
+
24
+ if self.num_frame_per_block > 1:
25
+ self.generator.model.num_frame_per_block = self.num_frame_per_block
26
+
27
+ self.fake_score.adding_cls_branch(
28
+ atten_dim=1536, num_class=args.num_class, time_embed_dim=1536 if self.concat_time_embeddings else 0)
29
+ self.fake_score.model.requires_grad_(True)
30
+
31
+ self.independent_first_frame = getattr(args, "independent_first_frame", False)
32
+ if self.independent_first_frame:
33
+ self.generator.model.independent_first_frame = True
34
+ if args.gradient_checkpointing:
35
+ self.generator.enable_gradient_checkpointing()
36
+ self.fake_score.enable_gradient_checkpointing()
37
+
38
+ # this will be init later with fsdp-wrapped modules
39
+ self.inference_pipeline: RollingForcingTrainingPipeline = None
40
+
41
+ # Step 2: Initialize all dmd hyperparameters
42
+ self.num_train_timestep = args.num_train_timestep
43
+ self.min_step = int(0.02 * self.num_train_timestep)
44
+ self.max_step = int(0.98 * self.num_train_timestep)
45
+ if hasattr(args, "real_guidance_scale"):
46
+ self.real_guidance_scale = args.real_guidance_scale
47
+ self.fake_guidance_scale = args.fake_guidance_scale
48
+ else:
49
+ self.real_guidance_scale = args.guidance_scale
50
+ self.fake_guidance_scale = 0.0
51
+ self.timestep_shift = getattr(args, "timestep_shift", 1.0)
52
+ self.critic_timestep_shift = getattr(args, "critic_timestep_shift", self.timestep_shift)
53
+ self.ts_schedule = getattr(args, "ts_schedule", True)
54
+ self.ts_schedule_max = getattr(args, "ts_schedule_max", False)
55
+ self.min_score_timestep = getattr(args, "min_score_timestep", 0)
56
+
57
+ self.gan_g_weight = getattr(args, "gan_g_weight", 1e-2)
58
+ self.gan_d_weight = getattr(args, "gan_d_weight", 1e-2)
59
+ self.r1_weight = getattr(args, "r1_weight", 0.0)
60
+ self.r2_weight = getattr(args, "r2_weight", 0.0)
61
+ self.r1_sigma = getattr(args, "r1_sigma", 0.01)
62
+ self.r2_sigma = getattr(args, "r2_sigma", 0.01)
63
+
64
+ if getattr(self.scheduler, "alphas_cumprod", None) is not None:
65
+ self.scheduler.alphas_cumprod = self.scheduler.alphas_cumprod.to(device)
66
+ else:
67
+ self.scheduler.alphas_cumprod = None
68
+
69
+ def _run_cls_pred_branch(self,
70
+ noisy_image_or_video: torch.Tensor,
71
+ conditional_dict: dict,
72
+ timestep: torch.Tensor) -> torch.Tensor:
73
+ """
74
+ Run the classifier prediction branch on the generated image or video.
75
+ Input:
76
+ - image_or_video: a tensor with shape [B, F, C, H, W].
77
+ Output:
78
+ - cls_pred: a tensor with shape [B, 1, 1, 1, 1] representing the feature map for classification.
79
+ """
80
+ _, _, noisy_logit = self.fake_score(
81
+ noisy_image_or_video=noisy_image_or_video,
82
+ conditional_dict=conditional_dict,
83
+ timestep=timestep,
84
+ classify_mode=True,
85
+ concat_time_embeddings=self.concat_time_embeddings
86
+ )
87
+
88
+ return noisy_logit
89
+
90
+ def generator_loss(
91
+ self,
92
+ image_or_video_shape,
93
+ conditional_dict: dict,
94
+ unconditional_dict: dict,
95
+ clean_latent: torch.Tensor,
96
+ initial_latent: torch.Tensor = None
97
+ ) -> Tuple[torch.Tensor, dict]:
98
+ """
99
+ Generate image/videos from noise and compute the GAN generator loss.
100
+ The noisy input to the generator is backward simulated.
101
+ This removes the need of any datasets during distillation.
102
+ See Sec 4.5 of the DMD2 paper (https://arxiv.org/abs/2405.14867) for details.
103
+ Input:
104
+ - image_or_video_shape: a list containing the shape of the image or video [B, F, C, H, W].
105
+ - conditional_dict: a dictionary containing the conditional information (e.g. text embeddings, image embeddings).
106
+ - unconditional_dict: a dictionary containing the unconditional information (e.g. null/negative text embeddings, null/negative image embeddings).
107
+ - clean_latent: a tensor containing the clean latents [B, F, C, H, W]. It needs to be passed when no backward simulation is used.
108
+ Output:
109
+ - loss: a scalar tensor representing the generator loss.
110
+ - generator_log_dict: a dictionary containing the intermediate tensors for logging.
111
+ """
112
+ # Step 1: Unroll generator to obtain fake videos
113
+ pred_image, gradient_mask, denoised_timestep_from, denoised_timestep_to = self._run_generator(
114
+ image_or_video_shape=image_or_video_shape,
115
+ conditional_dict=conditional_dict,
116
+ initial_latent=initial_latent
117
+ )
118
+
119
+ # Step 2: Get timestep and add noise to generated/real latents
120
+ min_timestep = denoised_timestep_to if self.ts_schedule and denoised_timestep_to is not None else self.min_score_timestep
121
+ max_timestep = denoised_timestep_from if self.ts_schedule_max and denoised_timestep_from is not None else self.num_train_timestep
122
+ critic_timestep = self._get_timestep(
123
+ min_timestep,
124
+ max_timestep,
125
+ image_or_video_shape[0],
126
+ image_or_video_shape[1],
127
+ self.num_frame_per_block,
128
+ uniform_timestep=True
129
+ )
130
+
131
+ if self.critic_timestep_shift > 1:
132
+ critic_timestep = self.critic_timestep_shift * \
133
+ (critic_timestep / 1000) / (1 + (self.critic_timestep_shift - 1) * (critic_timestep / 1000)) * 1000
134
+
135
+ critic_timestep = critic_timestep.clamp(self.min_step, self.max_step)
136
+
137
+ critic_noise = torch.randn_like(pred_image)
138
+ noisy_fake_latent = self.scheduler.add_noise(
139
+ pred_image.flatten(0, 1),
140
+ critic_noise.flatten(0, 1),
141
+ critic_timestep.flatten(0, 1)
142
+ ).unflatten(0, image_or_video_shape[:2])
143
+
144
+ # Step 4: Compute the real GAN discriminator loss
145
+ real_image_or_video = clean_latent.clone()
146
+ critic_noise = torch.randn_like(real_image_or_video)
147
+ noisy_real_latent = self.scheduler.add_noise(
148
+ real_image_or_video.flatten(0, 1),
149
+ critic_noise.flatten(0, 1),
150
+ critic_timestep.flatten(0, 1)
151
+ ).unflatten(0, image_or_video_shape[:2])
152
+
153
+ conditional_dict["prompt_embeds"] = torch.concatenate(
154
+ (conditional_dict["prompt_embeds"], conditional_dict["prompt_embeds"]), dim=0)
155
+ critic_timestep = torch.concatenate((critic_timestep, critic_timestep), dim=0)
156
+ noisy_latent = torch.concatenate((noisy_fake_latent, noisy_real_latent), dim=0)
157
+ _, _, noisy_logit = self.fake_score(
158
+ noisy_image_or_video=noisy_latent,
159
+ conditional_dict=conditional_dict,
160
+ timestep=critic_timestep,
161
+ classify_mode=True,
162
+ concat_time_embeddings=self.concat_time_embeddings
163
+ )
164
+ noisy_fake_logit, noisy_real_logit = noisy_logit.chunk(2, dim=0)
165
+
166
+ if not self.relativistic_discriminator:
167
+ gan_G_loss = F.softplus(-noisy_fake_logit.float()).mean() * self.gan_g_weight
168
+ else:
169
+ relative_fake_logit = noisy_fake_logit - noisy_real_logit
170
+ gan_G_loss = F.softplus(-relative_fake_logit.float()).mean() * self.gan_g_weight
171
+
172
+ return gan_G_loss
173
+
174
+ def critic_loss(
175
+ self,
176
+ image_or_video_shape,
177
+ conditional_dict: dict,
178
+ unconditional_dict: dict,
179
+ clean_latent: torch.Tensor,
180
+ real_image_or_video: torch.Tensor,
181
+ initial_latent: torch.Tensor = None
182
+ ) -> Tuple[torch.Tensor, dict]:
183
+ """
184
+ Generate image/videos from noise and train the critic with generated samples.
185
+ The noisy input to the generator is backward simulated.
186
+ This removes the need of any datasets during distillation.
187
+ See Sec 4.5 of the DMD2 paper (https://arxiv.org/abs/2405.14867) for details.
188
+ Input:
189
+ - image_or_video_shape: a list containing the shape of the image or video [B, F, C, H, W].
190
+ - conditional_dict: a dictionary containing the conditional information (e.g. text embeddings, image embeddings).
191
+ - unconditional_dict: a dictionary containing the unconditional information (e.g. null/negative text embeddings, null/negative image embeddings).
192
+ - clean_latent: a tensor containing the clean latents [B, F, C, H, W]. It needs to be passed when no backward simulation is used.
193
+ Output:
194
+ - loss: a tuple of scalar tensors (GAN discriminator loss, R1 penalty, R2 penalty).
195
+ - critic_log_dict: a dictionary containing the intermediate tensors for logging.
196
+ """
197
+
198
+ # Step 1: Run generator on backward simulated noisy input
199
+ with torch.no_grad():
200
+ generated_image, _, denoised_timestep_from, denoised_timestep_to, num_sim_steps = self._run_generator(
201
+ image_or_video_shape=image_or_video_shape,
202
+ conditional_dict=conditional_dict,
203
+ initial_latent=initial_latent
204
+ )
205
+
206
+ # Step 2: Get timestep and add noise to generated/real latents
207
+ min_timestep = denoised_timestep_to if self.ts_schedule and denoised_timestep_to is not None else self.min_score_timestep
208
+ max_timestep = denoised_timestep_from if self.ts_schedule_max and denoised_timestep_from is not None else self.num_train_timestep
209
+ critic_timestep = self._get_timestep(
210
+ min_timestep,
211
+ max_timestep,
212
+ image_or_video_shape[0],
213
+ image_or_video_shape[1],
214
+ self.num_frame_per_block,
215
+ uniform_timestep=True
216
+ )
217
+
218
+ if self.critic_timestep_shift > 1:
219
+ critic_timestep = self.critic_timestep_shift * \
220
+ (critic_timestep / 1000) / (1 + (self.critic_timestep_shift - 1) * (critic_timestep / 1000)) * 1000
221
+
222
+ critic_timestep = critic_timestep.clamp(self.min_step, self.max_step)
223
+
224
+ critic_noise = torch.randn_like(generated_image)
225
+ noisy_fake_latent = self.scheduler.add_noise(
226
+ generated_image.flatten(0, 1),
227
+ critic_noise.flatten(0, 1),
228
+ critic_timestep.flatten(0, 1)
229
+ ).unflatten(0, image_or_video_shape[:2])
230
+
231
+ # Step 4: Compute the real GAN discriminator loss
232
+ noisy_real_latent = self.scheduler.add_noise(
233
+ real_image_or_video.flatten(0, 1),
234
+ critic_noise.flatten(0, 1),
235
+ critic_timestep.flatten(0, 1)
236
+ ).unflatten(0, image_or_video_shape[:2])
237
+
238
+ conditional_dict_cloned = copy.deepcopy(conditional_dict)
239
+ conditional_dict_cloned["prompt_embeds"] = torch.concatenate(
240
+ (conditional_dict_cloned["prompt_embeds"], conditional_dict_cloned["prompt_embeds"]), dim=0)
241
+ _, _, noisy_logit = self.fake_score(
242
+ noisy_image_or_video=torch.concatenate((noisy_fake_latent, noisy_real_latent), dim=0),
243
+ conditional_dict=conditional_dict_cloned,
244
+ timestep=torch.concatenate((critic_timestep, critic_timestep), dim=0),
245
+ classify_mode=True,
246
+ concat_time_embeddings=self.concat_time_embeddings
247
+ )
248
+ noisy_fake_logit, noisy_real_logit = noisy_logit.chunk(2, dim=0)
249
+
250
+ if not self.relativistic_discriminator:
251
+ gan_D_loss = F.softplus(-noisy_real_logit.float()).mean() + F.softplus(noisy_fake_logit.float()).mean()
252
+ else:
253
+ relative_real_logit = noisy_real_logit - noisy_fake_logit
254
+ gan_D_loss = F.softplus(-relative_real_logit.float()).mean()
255
+ gan_D_loss = gan_D_loss * self.gan_d_weight
256
+
257
+ # R1 regularization
258
+ if self.r1_weight > 0.:
259
+ noisy_real_latent_perturbed = noisy_real_latent.clone()
260
+ epsilon_real = self.r1_sigma * torch.randn_like(noisy_real_latent_perturbed)
261
+ noisy_real_latent_perturbed = noisy_real_latent_perturbed + epsilon_real
262
+ noisy_real_logit_perturbed = self._run_cls_pred_branch(
263
+ noisy_image_or_video=noisy_real_latent_perturbed,
264
+ conditional_dict=conditional_dict,
265
+ timestep=critic_timestep
266
+ )
267
+
268
+ r1_grad = (noisy_real_logit_perturbed - noisy_real_logit) / self.r1_sigma
269
+ r1_loss = self.r1_weight * torch.mean((r1_grad)**2)
270
+ else:
271
+ r1_loss = torch.zeros_like(gan_D_loss)
272
+
273
+ # R2 regularization
274
+ if self.r2_weight > 0.:
275
+ noisy_fake_latent_perturbed = noisy_fake_latent.clone()
276
+ epsilon_generated = self.r2_sigma * torch.randn_like(noisy_fake_latent_perturbed)
277
+ noisy_fake_latent_perturbed = noisy_fake_latent_perturbed + epsilon_generated
278
+ noisy_fake_logit_perturbed = self._run_cls_pred_branch(
279
+ noisy_image_or_video=noisy_fake_latent_perturbed,
280
+ conditional_dict=conditional_dict,
281
+ timestep=critic_timestep
282
+ )
283
+
284
+ r2_grad = (noisy_fake_logit_perturbed - noisy_fake_logit) / self.r2_sigma
285
+ r2_loss = self.r2_weight * torch.mean((r2_grad)**2)
286
+ else:
287
+ r2_loss = torch.zeros_like(gan_D_loss)
288
+
289
+ critic_log_dict = {
290
+ "critic_timestep": critic_timestep.detach(),
291
+ 'noisy_real_logit': noisy_real_logit.detach(),
292
+ 'noisy_fake_logit': noisy_fake_logit.detach(),
293
+ }
294
+
295
+ return (gan_D_loss, r1_loss, r2_loss), critic_log_dict
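For reference, the discriminator above uses non-saturating softplus losses, with an optional relativistic pairing of real and fake logits; the `gan_g_weight` / `gan_d_weight` scaling and the finite-difference R1/R2 penalties are omitted in this illustrative sketch.

```python
# Illustrative softplus GAN losses matching the branches in GAN.generator_loss / critic_loss.
import torch
import torch.nn.functional as F

def gan_losses(fake_logit, real_logit, relativistic=False):
    if relativistic:
        g_loss = F.softplus(-(fake_logit - real_logit)).mean()
        d_loss = F.softplus(-(real_logit - fake_logit)).mean()
    else:
        g_loss = F.softplus(-fake_logit).mean()
        d_loss = F.softplus(-real_logit).mean() + F.softplus(fake_logit).mean()
    return g_loss, d_loss

fake, real = torch.randn(4, 1), torch.randn(4, 1)
print(gan_losses(fake, real))
print(gan_losses(fake, real, relativistic=True))
```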
model/ode_regression.py ADDED
@@ -0,0 +1,138 @@
1
+ import torch.nn.functional as F
2
+ from typing import Tuple
3
+ import torch
4
+
5
+ from model.base import BaseModel
6
+ from utils.wan_wrapper import WanDiffusionWrapper, WanTextEncoder, WanVAEWrapper
7
+
8
+
9
+ class ODERegression(BaseModel):
10
+ def __init__(self, args, device):
11
+ """
12
+ Initialize the ODERegression module.
13
+ This class is self-contained and computes generator losses
14
+ in the forward pass given precomputed ode solution pairs.
15
+ This class supports the ode regression loss for both causal and bidirectional models.
16
+ See Sec 4.3 of CausVid https://arxiv.org/abs/2412.07772 for details
17
+ """
18
+ super().__init__(args, device)
19
+
20
+ # Step 1: Initialize all models
21
+
22
+ self.generator = WanDiffusionWrapper(**getattr(args, "model_kwargs", {}), is_causal=True)
23
+ self.generator.model.requires_grad_(True)
24
+ if getattr(args, "generator_ckpt", False):
25
+ print(f"Loading pretrained generator from {args.generator_ckpt}")
26
+ state_dict = torch.load(args.generator_ckpt, map_location="cpu")[
27
+ 'generator']
28
+ self.generator.load_state_dict(
29
+ state_dict, strict=True
30
+ )
31
+
32
+ self.num_frame_per_block = getattr(args, "num_frame_per_block", 1)
33
+
34
+ if self.num_frame_per_block > 1:
35
+ self.generator.model.num_frame_per_block = self.num_frame_per_block
36
+
37
+ self.independent_first_frame = getattr(args, "independent_first_frame", False)
38
+ if self.independent_first_frame:
39
+ self.generator.model.independent_first_frame = True
40
+ if args.gradient_checkpointing:
41
+ self.generator.enable_gradient_checkpointing()
42
+
43
+ # Step 2: Initialize all hyperparameters
44
+ self.timestep_shift = getattr(args, "timestep_shift", 1.0)
45
+
46
+ def _initialize_models(self, args):
47
+ self.generator = WanDiffusionWrapper(**getattr(args, "model_kwargs", {}), is_causal=True)
48
+ self.generator.model.requires_grad_(True)
49
+
50
+ self.text_encoder = WanTextEncoder()
51
+ self.text_encoder.requires_grad_(False)
52
+
53
+ self.vae = WanVAEWrapper()
54
+ self.vae.requires_grad_(False)
55
+
56
+ @torch.no_grad()
57
+ def _prepare_generator_input(self, ode_latent: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
58
+ """
59
+ Given a tensor containing the whole ODE sampling trajectories,
60
+ randomly choose an intermediate timestep and return the latent as well as the corresponding timestep.
61
+ Input:
62
+ - ode_latent: a tensor containing the whole ODE sampling trajectories [batch_size, num_denoising_steps, num_frames, num_channels, height, width].
63
+ Output:
64
+ - noisy_input: a tensor containing the selected latent [batch_size, num_frames, num_channels, height, width].
65
+ - timestep: a tensor containing the corresponding timestep [batch_size].
66
+ """
67
+ batch_size, num_denoising_steps, num_frames, num_channels, height, width = ode_latent.shape
68
+
69
+ # Step 1: Randomly choose a timestep for each frame
70
+ index = self._get_timestep(
71
+ 0,
72
+ len(self.denoising_step_list),
73
+ batch_size,
74
+ num_frames,
75
+ self.num_frame_per_block,
76
+ uniform_timestep=False
77
+ )
78
+ if self.args.i2v:
79
+ index[:, 0] = len(self.denoising_step_list) - 1
80
+
81
+ noisy_input = torch.gather(
82
+ ode_latent, dim=1,
83
+ index=index.reshape(batch_size, 1, num_frames, 1, 1, 1).expand(
84
+ -1, -1, -1, num_channels, height, width).to(self.device)
85
+ ).squeeze(1)
86
+
87
+ timestep = self.denoising_step_list[index].to(self.device)
88
+
89
+ # if self.extra_noise_step > 0:
90
+ # random_timestep = torch.randint(0, self.extra_noise_step, [
91
+ # batch_size, num_frames], device=self.device, dtype=torch.long)
92
+ # perturbed_noisy_input = self.scheduler.add_noise(
93
+ # noisy_input.flatten(0, 1),
94
+ # torch.randn_like(noisy_input.flatten(0, 1)),
95
+ # random_timestep.flatten(0, 1)
96
+ # ).detach().unflatten(0, (batch_size, num_frames)).type_as(noisy_input)
97
+
98
+ # noisy_input[timestep == 0] = perturbed_noisy_input[timestep == 0]
99
+
100
+ return noisy_input, timestep
101
+
102
+ def generator_loss(self, ode_latent: torch.Tensor, conditional_dict: dict) -> Tuple[torch.Tensor, dict]:
103
+ """
104
+ Generate image/videos from noisy latents and compute the ODE regression loss.
105
+ Input:
106
+ - ode_latent: a tensor containing the ODE latents [batch_size, num_denoising_steps, num_frames, num_channels, height, width].
107
+ They are ordered from most noisy to clean latents.
108
+ - conditional_dict: a dictionary containing the conditional information (e.g. text embeddings, image embeddings).
109
+ Output:
110
+ - loss: a scalar tensor representing the generator loss.
111
+ - log_dict: a dictionary containing additional information for loss timestep breakdown.
112
+ """
113
+ # Step 1: Run generator on noisy latents
114
+ target_latent = ode_latent[:, -1]
115
+
116
+ noisy_input, timestep = self._prepare_generator_input(
117
+ ode_latent=ode_latent)
118
+
119
+ _, pred_image_or_video = self.generator(
120
+ noisy_image_or_video=noisy_input,
121
+ conditional_dict=conditional_dict,
122
+ timestep=timestep
123
+ )
124
+
125
+ # Step 2: Compute the regression loss
126
+ mask = timestep != 0
127
+
128
+ loss = F.mse_loss(
129
+ pred_image_or_video[mask], target_latent[mask], reduction="mean")
130
+
131
+ log_dict = {
132
+ "unnormalized_loss": F.mse_loss(pred_image_or_video, target_latent, reduction='none').mean(dim=[1, 2, 3, 4]).detach(),
133
+ "timestep": timestep.float().mean(dim=1).detach(),
134
+ "input": noisy_input.detach(),
135
+ "output": pred_image_or_video.detach(),
136
+ }
137
+
138
+ return loss, log_dict
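`_prepare_generator_input` selects, for every frame, one latent out of the stored ODE trajectory via `torch.gather`. A minimal shape-level sketch of that gather pattern (sizes are illustrative):

```python
# Pick one denoising step per frame from an ODE trajectory [B, S, F, C, H, W].
import torch

B, S, NF, C, H, W = 2, 4, 3, 2, 4, 4
ode_latent = torch.randn(B, S, NF, C, H, W)   # stored trajectory, noisy -> clean along S
index = torch.randint(0, S, (B, NF))          # per-frame denoising step index

noisy_input = torch.gather(
    ode_latent, dim=1,
    index=index.reshape(B, 1, NF, 1, 1, 1).expand(-1, -1, -1, C, H, W)
).squeeze(1)                                  # -> [B, NF, C, H, W]
print(noisy_input.shape)
```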
model/sid.py ADDED
@@ -0,0 +1,283 @@
1
+ from pipeline import RollingForcingTrainingPipeline
2
+ from typing import Optional, Tuple
3
+ import torch
4
+
5
+ from model.base import RollingForcingModel
6
+
7
+
8
+ class SiD(RollingForcingModel):
9
+ def __init__(self, args, device):
10
+ """
11
+ Initialize the SiD (Score identity Distillation) module.
12
+ This class is self-contained and computes generator and fake score losses
13
+ in the forward pass.
14
+ """
15
+ super().__init__(args, device)
16
+ self.num_frame_per_block = getattr(args, "num_frame_per_block", 1)
17
+
18
+ if self.num_frame_per_block > 1:
19
+ self.generator.model.num_frame_per_block = self.num_frame_per_block
20
+
21
+ if args.gradient_checkpointing:
22
+ self.generator.enable_gradient_checkpointing()
23
+ self.fake_score.enable_gradient_checkpointing()
24
+ self.real_score.enable_gradient_checkpointing()
25
+
26
+ # this will be init later with fsdp-wrapped modules
27
+ self.inference_pipeline: RollingForcingTrainingPipeline = None
28
+
29
+ # Step 2: Initialize all dmd hyperparameters
30
+ self.num_train_timestep = args.num_train_timestep
31
+ self.min_step = int(0.02 * self.num_train_timestep)
32
+ self.max_step = int(0.98 * self.num_train_timestep)
33
+ if hasattr(args, "real_guidance_scale"):
34
+ self.real_guidance_scale = args.real_guidance_scale
35
+ else:
36
+ self.real_guidance_scale = args.guidance_scale
37
+ self.timestep_shift = getattr(args, "timestep_shift", 1.0)
38
+ self.sid_alpha = getattr(args, "sid_alpha", 1.0)
39
+ self.ts_schedule = getattr(args, "ts_schedule", True)
40
+ self.ts_schedule_max = getattr(args, "ts_schedule_max", False)
41
+
42
+ if getattr(self.scheduler, "alphas_cumprod", None) is not None:
43
+ self.scheduler.alphas_cumprod = self.scheduler.alphas_cumprod.to(device)
44
+ else:
45
+ self.scheduler.alphas_cumprod = None
46
+
47
+ def compute_distribution_matching_loss(
48
+ self,
49
+ image_or_video: torch.Tensor,
50
+ conditional_dict: dict,
51
+ unconditional_dict: dict,
52
+ gradient_mask: Optional[torch.Tensor] = None,
53
+ denoised_timestep_from: int = 0,
54
+ denoised_timestep_to: int = 0
55
+ ) -> Tuple[torch.Tensor, dict]:
56
+ """
57
+ Compute the SiD variant of the distribution matching loss (cf. eq 7 in https://arxiv.org/abs/2311.18828).
58
+ Input:
59
+ - image_or_video: a tensor with shape [B, F, C, H, W] where the number of frames is 1 for images.
60
+ - conditional_dict: a dictionary containing the conditional information (e.g. text embeddings, image embeddings).
61
+ - unconditional_dict: a dictionary containing the unconditional information (e.g. null/negative text embeddings, null/negative image embeddings).
62
+ - gradient_mask: a boolean tensor with the same shape as image_or_video indicating which pixels to compute the loss on.
63
+ Output:
64
+ - dmd_loss: a scalar tensor representing the DMD loss.
65
+ - dmd_log_dict: a dictionary containing the intermediate tensors for logging.
66
+ """
67
+ original_latent = image_or_video
68
+
69
+ batch_size, num_frame = image_or_video.shape[:2]
70
+
71
+ # Step 1: Randomly sample timestep based on the given schedule and corresponding noise
72
+ min_timestep = denoised_timestep_to if self.ts_schedule and denoised_timestep_to is not None else self.min_score_timestep
73
+ max_timestep = denoised_timestep_from if self.ts_schedule_max and denoised_timestep_from is not None else self.num_train_timestep
74
+ timestep = self._get_timestep(
75
+ min_timestep,
76
+ max_timestep,
77
+ batch_size,
78
+ num_frame,
79
+ self.num_frame_per_block,
80
+ uniform_timestep=True
81
+ )
82
+
83
+ if self.timestep_shift > 1:
84
+ timestep = self.timestep_shift * \
85
+ (timestep / 1000) / \
86
+ (1 + (self.timestep_shift - 1) * (timestep / 1000)) * 1000
87
+ timestep = timestep.clamp(self.min_step, self.max_step)
88
+
89
+ noise = torch.randn_like(image_or_video)
90
+ noisy_latent = self.scheduler.add_noise(
91
+ image_or_video.flatten(0, 1),
92
+ noise.flatten(0, 1),
93
+ timestep.flatten(0, 1)
94
+ ).unflatten(0, (batch_size, num_frame))
95
+
96
+ # Step 2: SiD (May be wrap it?)
97
+ noisy_image_or_video = noisy_latent
98
+ # Step 2.1: Compute the fake score
99
+ _, pred_fake_image = self.fake_score(
100
+ noisy_image_or_video=noisy_image_or_video,
101
+ conditional_dict=conditional_dict,
102
+ timestep=timestep
103
+ )
104
+ # Step 2.2: Compute the real score
105
+ # We compute the conditional and unconditional prediction
106
+ # and add them together to achieve cfg (https://arxiv.org/abs/2207.12598)
107
+ # NOTE: This step may cause OOM issue, which can be addressed by the CFG-free technique
108
+
109
+ _, pred_real_image_cond = self.real_score(
110
+ noisy_image_or_video=noisy_image_or_video,
111
+ conditional_dict=conditional_dict,
112
+ timestep=timestep
113
+ )
114
+
115
+ _, pred_real_image_uncond = self.real_score(
116
+ noisy_image_or_video=noisy_image_or_video,
117
+ conditional_dict=unconditional_dict,
118
+ timestep=timestep
119
+ )
120
+
121
+ pred_real_image = pred_real_image_cond + (
122
+ pred_real_image_cond - pred_real_image_uncond
123
+ ) * self.real_guidance_scale
124
+
125
+ # Step 2.3: SiD Loss
126
+ # TODO: Add alpha
127
+ # TODO: Double?
128
+ sid_loss = (pred_real_image.double() - pred_fake_image.double()) * ((pred_real_image.double() - original_latent.double()) - self.sid_alpha * (pred_real_image.double() - pred_fake_image.double()))
129
+
130
+ # Step 2.4: Loss normalizer
131
+ with torch.no_grad():
132
+ p_real = (original_latent - pred_real_image)
133
+ normalizer = torch.abs(p_real).mean(dim=[1, 2, 3, 4], keepdim=True)
134
+ sid_loss = sid_loss / normalizer
135
+
136
+ sid_loss = torch.nan_to_num(sid_loss)
137
+ num_frame = sid_loss.shape[1]
138
+ sid_loss = sid_loss.mean()
139
+
140
+ sid_log_dict = {
141
+ "dmdtrain_gradient_norm": torch.zeros_like(sid_loss),
142
+ "timestep": timestep.detach()
143
+ }
144
+
145
+ return sid_loss, sid_log_dict
146
+
147
+ def generator_loss(
148
+ self,
149
+ image_or_video_shape,
150
+ conditional_dict: dict,
151
+ unconditional_dict: dict,
152
+ clean_latent: torch.Tensor,
153
+ initial_latent: torch.Tensor = None
154
+ ) -> Tuple[torch.Tensor, dict]:
155
+ """
156
+ Generate image/videos from noise and compute the DMD loss.
157
+ The noisy input to the generator is backward simulated.
158
+ This removes the need of any datasets during distillation.
159
+ See Sec 4.5 of the DMD2 paper (https://arxiv.org/abs/2405.14867) for details.
160
+ Input:
161
+ - image_or_video_shape: a list containing the shape of the image or video [B, F, C, H, W].
162
+ - conditional_dict: a dictionary containing the conditional information (e.g. text embeddings, image embeddings).
163
+ - unconditional_dict: a dictionary containing the unconditional information (e.g. null/negative text embeddings, null/negative image embeddings).
164
+ - clean_latent: a tensor containing the clean latents [B, F, C, H, W]. It needs to be passed when no backward simulation is used.
165
+ Output:
166
+ - loss: a scalar tensor representing the generator loss.
167
+ - generator_log_dict: a dictionary containing the intermediate tensors for logging.
168
+ """
169
+ # Step 1: Unroll generator to obtain fake videos
170
+ pred_image, gradient_mask, denoised_timestep_from, denoised_timestep_to = self._run_generator(
171
+ image_or_video_shape=image_or_video_shape,
172
+ conditional_dict=conditional_dict,
173
+ initial_latent=initial_latent
174
+ )
175
+
176
+ # Step 2: Compute the DMD loss
177
+ dmd_loss, dmd_log_dict = self.compute_distribution_matching_loss(
178
+ image_or_video=pred_image,
179
+ conditional_dict=conditional_dict,
180
+ unconditional_dict=unconditional_dict,
181
+ gradient_mask=gradient_mask,
182
+ denoised_timestep_from=denoised_timestep_from,
183
+ denoised_timestep_to=denoised_timestep_to
184
+ )
185
+
186
+ return dmd_loss, dmd_log_dict
187
+
188
+ def critic_loss(
189
+ self,
190
+ image_or_video_shape,
191
+ conditional_dict: dict,
192
+ unconditional_dict: dict,
193
+ clean_latent: torch.Tensor,
194
+ initial_latent: torch.Tensor = None
195
+ ) -> Tuple[torch.Tensor, dict]:
196
+ """
197
+ Generate image/videos from noise and train the critic with generated samples.
198
+ The noisy input to the generator is backward simulated.
199
+ This removes the need of any datasets during distillation.
200
+ See Sec 4.5 of the DMD2 paper (https://arxiv.org/abs/2405.14867) for details.
201
+ Input:
202
+ - image_or_video_shape: a list containing the shape of the image or video [B, F, C, H, W].
203
+ - conditional_dict: a dictionary containing the conditional information (e.g. text embeddings, image embeddings).
204
+ - unconditional_dict: a dictionary containing the unconditional information (e.g. null/negative text embeddings, null/negative image embeddings).
205
+ - clean_latent: a tensor containing the clean latents [B, F, C, H, W]. It needs to be passed when no backward simulation is used.
206
+ Output:
207
+ - loss: a scalar tensor representing the critic's denoising loss.
208
+ - critic_log_dict: a dictionary containing the intermediate tensors for logging.
209
+ """
210
+
211
+ # Step 1: Run generator on backward simulated noisy input
212
+ with torch.no_grad():
213
+ generated_image, _, denoised_timestep_from, denoised_timestep_to = self._run_generator(
214
+ image_or_video_shape=image_or_video_shape,
215
+ conditional_dict=conditional_dict,
216
+ initial_latent=initial_latent
217
+ )
218
+
219
+ # Step 2: Compute the fake prediction
220
+ min_timestep = denoised_timestep_to if self.ts_schedule and denoised_timestep_to is not None else self.min_score_timestep
221
+ max_timestep = denoised_timestep_from if self.ts_schedule_max and denoised_timestep_from is not None else self.num_train_timestep
222
+ critic_timestep = self._get_timestep(
223
+ min_timestep,
224
+ max_timestep,
225
+ image_or_video_shape[0],
226
+ image_or_video_shape[1],
227
+ self.num_frame_per_block,
228
+ uniform_timestep=True
229
+ )
230
+
231
+ if self.timestep_shift > 1:
232
+ critic_timestep = self.timestep_shift * \
233
+ (critic_timestep / 1000) / (1 + (self.timestep_shift - 1) * (critic_timestep / 1000)) * 1000
234
+
235
+ critic_timestep = critic_timestep.clamp(self.min_step, self.max_step)
236
+
237
+ critic_noise = torch.randn_like(generated_image)
238
+ noisy_generated_image = self.scheduler.add_noise(
239
+ generated_image.flatten(0, 1),
240
+ critic_noise.flatten(0, 1),
241
+ critic_timestep.flatten(0, 1)
242
+ ).unflatten(0, image_or_video_shape[:2])
243
+
244
+ _, pred_fake_image = self.fake_score(
245
+ noisy_image_or_video=noisy_generated_image,
246
+ conditional_dict=conditional_dict,
247
+ timestep=critic_timestep
248
+ )
249
+
250
+ # Step 3: Compute the denoising loss for the fake critic
251
+ if self.args.denoising_loss_type == "flow":
252
+ from utils.wan_wrapper import WanDiffusionWrapper
253
+ flow_pred = WanDiffusionWrapper._convert_x0_to_flow_pred(
254
+ scheduler=self.scheduler,
255
+ x0_pred=pred_fake_image.flatten(0, 1),
256
+ xt=noisy_generated_image.flatten(0, 1),
257
+ timestep=critic_timestep.flatten(0, 1)
258
+ )
259
+ pred_fake_noise = None
260
+ else:
261
+ flow_pred = None
262
+ pred_fake_noise = self.scheduler.convert_x0_to_noise(
263
+ x0=pred_fake_image.flatten(0, 1),
264
+ xt=noisy_generated_image.flatten(0, 1),
265
+ timestep=critic_timestep.flatten(0, 1)
266
+ ).unflatten(0, image_or_video_shape[:2])
267
+
268
+ denoising_loss = self.denoising_loss_func(
269
+ x=generated_image.flatten(0, 1),
270
+ x_pred=pred_fake_image.flatten(0, 1),
271
+ noise=critic_noise.flatten(0, 1),
272
+ noise_pred=pred_fake_noise,
273
+ alphas_cumprod=self.scheduler.alphas_cumprod,
274
+ timestep=critic_timestep.flatten(0, 1),
275
+ flow_pred=flow_pred
276
+ )
277
+
278
+ # Step 5: Debugging Log
279
+ critic_log_dict = {
280
+ "critic_timestep": critic_timestep.detach()
281
+ }
282
+
283
+ return denoising_loss, critic_log_dict
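The SiD objective assembled in `compute_distribution_matching_loss` above can be summarized as `(f_real - f_fake) * ((f_real - x) - alpha * (f_real - f_fake))`, normalized by the mean absolute residual `|x - f_real|`. A compact sketch with random stand-in tensors:

```python
# Hedged sketch of the SiD loss; x is the generator output, f_real / f_fake are the
# real / fake score networks' x0-predictions at the sampled timestep.
import torch

def sid_loss(x, f_real, f_fake, alpha=1.0):
    diff = f_real.double() - f_fake.double()
    loss = diff * ((f_real.double() - x.double()) - alpha * diff)
    normalizer = (x - f_real).abs().mean(dim=[1, 2, 3, 4], keepdim=True)
    return torch.nan_to_num(loss / normalizer).mean()

x, f_real, f_fake = (torch.randn(1, 3, 4, 8, 8) for _ in range(3))
print(float(sid_loss(x, f_real, f_fake)))
```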
pipeline/__init__.py ADDED
@@ -0,0 +1,13 @@
1
+ from .bidirectional_diffusion_inference import BidirectionalDiffusionInferencePipeline
2
+ from .bidirectional_inference import BidirectionalInferencePipeline
3
+ from .causal_diffusion_inference import CausalDiffusionInferencePipeline
4
+ from .rolling_forcing_inference import CausalInferencePipeline
5
+ from .rolling_forcing_training import RollingForcingTrainingPipeline
6
+
7
+ __all__ = [
8
+ "BidirectionalDiffusionInferencePipeline",
9
+ "BidirectionalInferencePipeline",
10
+ "CausalDiffusionInferencePipeline",
11
+ "CausalInferencePipeline",
12
+ "RollingForcingTrainingPipeline"
13
+ ]
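A hypothetical usage sketch of the exports above; `args` is assumed to be a config namespace carrying the fields the pipelines read later in this diff (e.g. `num_train_timestep`, `guidance_scale`, `negative_prompt`).

```python
# Hypothetical wiring only; model weights and the full config are not part of this diff.
import torch
from pipeline import BidirectionalDiffusionInferencePipeline

def build_inference_pipeline(args, device: str = "cuda"):
    pipe = BidirectionalDiffusionInferencePipeline(args, device=device)
    return pipe.to(device)
```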
pipeline/bidirectional_diffusion_inference.py ADDED
@@ -0,0 +1,110 @@
1
+ from tqdm import tqdm
2
+ from typing import List
3
+ import torch
4
+
5
+ from wan.utils.fm_solvers import FlowDPMSolverMultistepScheduler, get_sampling_sigmas, retrieve_timesteps
6
+ from wan.utils.fm_solvers_unipc import FlowUniPCMultistepScheduler
7
+ from utils.wan_wrapper import WanDiffusionWrapper, WanTextEncoder, WanVAEWrapper
8
+
9
+
10
+ class BidirectionalDiffusionInferencePipeline(torch.nn.Module):
11
+ def __init__(
12
+ self,
13
+ args,
14
+ device,
15
+ generator=None,
16
+ text_encoder=None,
17
+ vae=None
18
+ ):
19
+ super().__init__()
20
+ # Step 1: Initialize all models
21
+ self.generator = WanDiffusionWrapper(
22
+ **getattr(args, "model_kwargs", {}), is_causal=False) if generator is None else generator
23
+ self.text_encoder = WanTextEncoder() if text_encoder is None else text_encoder
24
+ self.vae = WanVAEWrapper() if vae is None else vae
25
+
26
+ # Step 2: Initialize scheduler
27
+ self.num_train_timesteps = args.num_train_timestep
28
+ self.sampling_steps = 50
29
+ self.sample_solver = 'unipc'
30
+ self.shift = 8.0
31
+
32
+ self.args = args
33
+
34
+ def inference(
35
+ self,
36
+ noise: torch.Tensor,
37
+ text_prompts: List[str],
38
+ return_latents=False
39
+ ) -> torch.Tensor:
40
+ """
41
+ Perform inference on the given noise and text prompts.
42
+ Inputs:
43
+ noise (torch.Tensor): The input noise tensor of shape
44
+ (batch_size, num_frames, num_channels, height, width).
45
+ text_prompts (List[str]): The list of text prompts.
46
+ Outputs:
47
+ video (torch.Tensor): The generated video tensor of shape
48
+ (batch_size, num_frames, num_channels, height, width). It is normalized to be in the range [0, 1].
49
+ """
50
+
51
+ conditional_dict = self.text_encoder(
52
+ text_prompts=text_prompts
53
+ )
54
+ unconditional_dict = self.text_encoder(
55
+ text_prompts=[self.args.negative_prompt] * len(text_prompts)
56
+ )
57
+
58
+ latents = noise
59
+
60
+ sample_scheduler = self._initialize_sample_scheduler(noise)
61
+ for _, t in enumerate(tqdm(sample_scheduler.timesteps)):
62
+ latent_model_input = latents
63
+ timestep = t * torch.ones([latents.shape[0], 21], device=noise.device, dtype=torch.float32)
64
+
65
+ flow_pred_cond, _ = self.generator(latent_model_input, conditional_dict, timestep)
66
+ flow_pred_uncond, _ = self.generator(latent_model_input, unconditional_dict, timestep)
67
+
68
+ flow_pred = flow_pred_uncond + self.args.guidance_scale * (
69
+ flow_pred_cond - flow_pred_uncond)
70
+
71
+ temp_x0 = sample_scheduler.step(
72
+ flow_pred.unsqueeze(0),
73
+ t,
74
+ latents.unsqueeze(0),
75
+ return_dict=False)[0]
76
+ latents = temp_x0.squeeze(0)
77
+
78
+ x0 = latents
79
+ video = self.vae.decode_to_pixel(x0)
80
+ video = (video * 0.5 + 0.5).clamp(0, 1)
81
+
82
+ del sample_scheduler
83
+
84
+ if return_latents:
85
+ return video, latents
86
+ else:
87
+ return video
88
+
89
+ def _initialize_sample_scheduler(self, noise):
90
+ if self.sample_solver == 'unipc':
91
+ sample_scheduler = FlowUniPCMultistepScheduler(
92
+ num_train_timesteps=self.num_train_timesteps,
93
+ shift=1,
94
+ use_dynamic_shifting=False)
95
+ sample_scheduler.set_timesteps(
96
+ self.sampling_steps, device=noise.device, shift=self.shift)
97
+ self.timesteps = sample_scheduler.timesteps
98
+ elif self.sample_solver == 'dpm++':
99
+ sample_scheduler = FlowDPMSolverMultistepScheduler(
100
+ num_train_timesteps=self.num_train_timesteps,
101
+ shift=1,
102
+ use_dynamic_shifting=False)
103
+ sampling_sigmas = get_sampling_sigmas(self.sampling_steps, self.shift)
104
+ self.timesteps, _ = retrieve_timesteps(
105
+ sample_scheduler,
106
+ device=noise.device,
107
+ sigmas=sampling_sigmas)
108
+ else:
109
+ raise NotImplementedError("Unsupported solver.")
110
+ return sample_scheduler
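
For reference, the classifier-free guidance combination used in the sampling loop above reduces to a single linear extrapolation from the unconditional prediction toward the conditional one. A minimal sketch (not part of the commit); the tensor shape and guidance scale are illustrative assumptions.

import torch

def apply_cfg(flow_pred_cond, flow_pred_uncond, guidance_scale):
    # Extrapolate from the unconditional prediction toward the conditional one.
    return flow_pred_uncond + guidance_scale * (flow_pred_cond - flow_pred_uncond)

cond = torch.randn(1, 21, 16, 8, 8)      # (batch, frames, channels, height, width), toy latent shape
uncond = torch.randn_like(cond)
print(apply_cfg(cond, uncond, guidance_scale=5.0).shape)
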
pipeline/bidirectional_inference.py ADDED
@@ -0,0 +1,71 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from typing import List
2
+ import torch
3
+
4
+ from utils.wan_wrapper import WanDiffusionWrapper, WanTextEncoder, WanVAEWrapper
5
+
6
+
7
+ class BidirectionalInferencePipeline(torch.nn.Module):
8
+ def __init__(
9
+ self,
10
+ args,
11
+ device,
12
+ generator=None,
13
+ text_encoder=None,
14
+ vae=None
15
+ ):
16
+ super().__init__()
17
+ # Step 1: Initialize all models
18
+ self.generator = WanDiffusionWrapper(
19
+ **getattr(args, "model_kwargs", {}), is_causal=False) if generator is None else generator
20
+ self.text_encoder = WanTextEncoder() if text_encoder is None else text_encoder
21
+ self.vae = WanVAEWrapper() if vae is None else vae
22
+
23
+ # Step 2: Initialize all bidirectional Wan hyperparameters
24
+ self.scheduler = self.generator.get_scheduler()
25
+ self.denoising_step_list = torch.tensor(
26
+ args.denoising_step_list, dtype=torch.long, device=device)
27
+ if self.denoising_step_list[-1] == 0:
28
+ self.denoising_step_list = self.denoising_step_list[:-1] # remove the zero timestep for inference
29
+ if args.warp_denoising_step:
30
+ timesteps = torch.cat((self.scheduler.timesteps.cpu(), torch.tensor([0], dtype=torch.float32)))
31
+ self.denoising_step_list = timesteps[1000 - self.denoising_step_list]
32
+
33
+ def inference(self, noise: torch.Tensor, text_prompts: List[str]) -> torch.Tensor:
34
+ """
35
+ Perform inference on the given noise and text prompts.
36
+ Inputs:
37
+ noise (torch.Tensor): The input noise tensor of shape
38
+ (batch_size, num_frames, num_channels, height, width).
39
+ text_prompts (List[str]): The list of text prompts.
40
+ Outputs:
41
+ video (torch.Tensor): The generated video tensor of shape
42
+ (batch_size, num_frames, num_channels, height, width). It is normalized to be in the range [0, 1].
43
+ """
44
+ conditional_dict = self.text_encoder(
45
+ text_prompts=text_prompts
46
+ )
47
+
48
+ # initial point
49
+ noisy_image_or_video = noise
50
+
51
+ # use the last n-1 timesteps to simulate the generator's input
52
+ for index, current_timestep in enumerate(self.denoising_step_list[:-1]):
53
+ _, pred_image_or_video = self.generator(
54
+ noisy_image_or_video=noisy_image_or_video,
55
+ conditional_dict=conditional_dict,
56
+ timestep=torch.ones(
57
+ noise.shape[:2], dtype=torch.long, device=noise.device) * current_timestep
58
+ ) # [B, F, C, H, W]
59
+
60
+ next_timestep = self.denoising_step_list[index + 1] * torch.ones(
61
+ noise.shape[:2], dtype=torch.long, device=noise.device)
62
+
63
+ noisy_image_or_video = self.scheduler.add_noise(
64
+ pred_image_or_video.flatten(0, 1),
65
+ torch.randn_like(pred_image_or_video.flatten(0, 1)),
66
+ next_timestep.flatten(0, 1)
67
+ ).unflatten(0, noise.shape[:2])
68
+
69
+ video = self.vae.decode_to_pixel(pred_image_or_video)
70
+ video = (video * 0.5 + 0.5).clamp(0, 1)
71
+ return video
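
The loop above follows the few-step distillation pattern: predict a clean latent at the current timestep, then re-noise it to the next (smaller) timestep before the next pass. A minimal sketch (not part of the commit); predict_x0 and add_noise are placeholder callables standing in for the generator and scheduler, not the project's API.

import torch

def few_step_sample(noise, denoising_step_list, predict_x0, add_noise):
    x, x0 = noise, None
    for i, t in enumerate(denoising_step_list[:-1]):
        x0 = predict_x0(x, t)                              # one-step clean-latent prediction at timestep t
        next_t = denoising_step_list[i + 1]
        x = add_noise(x0, torch.randn_like(x0), next_t)    # re-noise to the next, smaller timestep
    return x0

# toy usage with dummy stand-ins
dummy_predict = lambda x, t: x * 0.0
dummy_add_noise = lambda x0, eps, t: x0 + (t / 1000.0) * eps
out = few_step_sample(torch.randn(1, 21, 16, 8, 8), [1000, 750, 500, 250], dummy_predict, dummy_add_noise)
print(out.shape)
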
pipeline/causal_diffusion_inference.py ADDED
@@ -0,0 +1,342 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from tqdm import tqdm
2
+ from typing import List, Optional
3
+ import torch
4
+
5
+ from wan.utils.fm_solvers import FlowDPMSolverMultistepScheduler, get_sampling_sigmas, retrieve_timesteps
6
+ from wan.utils.fm_solvers_unipc import FlowUniPCMultistepScheduler
7
+ from utils.wan_wrapper import WanDiffusionWrapper, WanTextEncoder, WanVAEWrapper
8
+
9
+
10
+ class CausalDiffusionInferencePipeline(torch.nn.Module):
11
+ def __init__(
12
+ self,
13
+ args,
14
+ device,
15
+ generator=None,
16
+ text_encoder=None,
17
+ vae=None
18
+ ):
19
+ super().__init__()
20
+ # Step 1: Initialize all models
21
+ self.generator = WanDiffusionWrapper(
22
+ **getattr(args, "model_kwargs", {}), is_causal=True) if generator is None else generator
23
+ self.text_encoder = WanTextEncoder() if text_encoder is None else text_encoder
24
+ self.vae = WanVAEWrapper() if vae is None else vae
25
+
26
+ # Step 2: Initialize scheduler
27
+ self.num_train_timesteps = args.num_train_timestep
28
+ self.sampling_steps = 50
29
+ self.sample_solver = 'unipc'
30
+ self.shift = args.timestep_shift
31
+
32
+ self.num_transformer_blocks = 30
33
+ self.frame_seq_length = 1560
34
+
35
+ self.kv_cache_pos = None
36
+ self.kv_cache_neg = None
37
+ self.crossattn_cache_pos = None
38
+ self.crossattn_cache_neg = None
39
+ self.args = args
40
+ self.num_frame_per_block = getattr(args, "num_frame_per_block", 1)
41
+ self.independent_first_frame = args.independent_first_frame
42
+ self.local_attn_size = self.generator.model.local_attn_size
43
+
44
+ print(f"KV inference with {self.num_frame_per_block} frames per block")
45
+
46
+ if self.num_frame_per_block > 1:
47
+ self.generator.model.num_frame_per_block = self.num_frame_per_block
48
+
49
+ def inference(
50
+ self,
51
+ noise: torch.Tensor,
52
+ text_prompts: List[str],
53
+ initial_latent: Optional[torch.Tensor] = None,
54
+ return_latents: bool = False,
55
+ start_frame_index: Optional[int] = 0
56
+ ) -> torch.Tensor:
57
+ """
58
+ Perform inference on the given noise and text prompts.
59
+ Inputs:
60
+ noise (torch.Tensor): The input noise tensor of shape
61
+ (batch_size, num_output_frames, num_channels, height, width).
62
+ text_prompts (List[str]): The list of text prompts.
63
+ initial_latent (torch.Tensor): The initial latent tensor of shape
64
+ (batch_size, num_input_frames, num_channels, height, width).
65
+ If num_input_frames is 1, perform image to video.
66
+ If num_input_frames is greater than 1, perform video extension.
67
+ return_latents (bool): Whether to return the latents.
68
+ start_frame_index (int): In long video generation, where does the current window start?
69
+ Outputs:
70
+ video (torch.Tensor): The generated video tensor of shape
71
+ (batch_size, num_frames, num_channels, height, width). It is normalized to be in the range [0, 1].
72
+ """
73
+ batch_size, num_frames, num_channels, height, width = noise.shape
74
+ if not self.independent_first_frame or (self.independent_first_frame and initial_latent is not None):
75
+ # If the first frame is independent and the first frame is provided, then the number of frames in the
76
+ # noise should still be a multiple of num_frame_per_block
77
+ assert num_frames % self.num_frame_per_block == 0
78
+ num_blocks = num_frames // self.num_frame_per_block
79
+ elif self.independent_first_frame and initial_latent is None:
80
+ # Using a [1, 4, 4, 4, 4, 4] model to generate a video without image conditioning
81
+ assert (num_frames - 1) % self.num_frame_per_block == 0
82
+ num_blocks = (num_frames - 1) // self.num_frame_per_block
83
+ num_input_frames = initial_latent.shape[1] if initial_latent is not None else 0
84
+ num_output_frames = num_frames + num_input_frames # add the initial latent frames
85
+ conditional_dict = self.text_encoder(
86
+ text_prompts=text_prompts
87
+ )
88
+ unconditional_dict = self.text_encoder(
89
+ text_prompts=[self.args.negative_prompt] * len(text_prompts)
90
+ )
91
+
92
+ output = torch.zeros(
93
+ [batch_size, num_output_frames, num_channels, height, width],
94
+ device=noise.device,
95
+ dtype=noise.dtype
96
+ )
97
+
98
+ # Step 1: Initialize KV cache to all zeros
99
+ if self.kv_cache_pos is None:
100
+ self._initialize_kv_cache(
101
+ batch_size=batch_size,
102
+ dtype=noise.dtype,
103
+ device=noise.device
104
+ )
105
+ self._initialize_crossattn_cache(
106
+ batch_size=batch_size,
107
+ dtype=noise.dtype,
108
+ device=noise.device
109
+ )
110
+ else:
111
+ # reset cross attn cache
112
+ for block_index in range(self.num_transformer_blocks):
113
+ self.crossattn_cache_pos[block_index]["is_init"] = False
114
+ self.crossattn_cache_neg[block_index]["is_init"] = False
115
+ # reset kv cache
116
+ for block_index in range(len(self.kv_cache_pos)):
117
+ self.kv_cache_pos[block_index]["global_end_index"] = torch.tensor(
118
+ [0], dtype=torch.long, device=noise.device)
119
+ self.kv_cache_pos[block_index]["local_end_index"] = torch.tensor(
120
+ [0], dtype=torch.long, device=noise.device)
121
+ self.kv_cache_neg[block_index]["global_end_index"] = torch.tensor(
122
+ [0], dtype=torch.long, device=noise.device)
123
+ self.kv_cache_neg[block_index]["local_end_index"] = torch.tensor(
124
+ [0], dtype=torch.long, device=noise.device)
125
+
126
+ # Step 2: Cache context feature
127
+ current_start_frame = start_frame_index
128
+ cache_start_frame = 0
129
+ if initial_latent is not None:
130
+ timestep = torch.ones([batch_size, 1], device=noise.device, dtype=torch.int64) * 0
131
+ if self.independent_first_frame:
132
+ # Assume num_input_frames is 1 + self.num_frame_per_block * num_input_blocks
133
+ assert (num_input_frames - 1) % self.num_frame_per_block == 0
134
+ num_input_blocks = (num_input_frames - 1) // self.num_frame_per_block
135
+ output[:, :1] = initial_latent[:, :1]
136
+ self.generator(
137
+ noisy_image_or_video=initial_latent[:, :1],
138
+ conditional_dict=conditional_dict,
139
+ timestep=timestep * 0,
140
+ kv_cache=self.kv_cache_pos,
141
+ crossattn_cache=self.crossattn_cache_pos,
142
+ current_start=current_start_frame * self.frame_seq_length,
143
+ cache_start=cache_start_frame * self.frame_seq_length
144
+ )
145
+ self.generator(
146
+ noisy_image_or_video=initial_latent[:, :1],
147
+ conditional_dict=unconditional_dict,
148
+ timestep=timestep * 0,
149
+ kv_cache=self.kv_cache_neg,
150
+ crossattn_cache=self.crossattn_cache_neg,
151
+ current_start=current_start_frame * self.frame_seq_length,
152
+ cache_start=cache_start_frame * self.frame_seq_length
153
+ )
154
+ current_start_frame += 1
155
+ cache_start_frame += 1
156
+ else:
157
+ # Assume num_input_frames is self.num_frame_per_block * num_input_blocks
158
+ assert num_input_frames % self.num_frame_per_block == 0
159
+ num_input_blocks = num_input_frames // self.num_frame_per_block
160
+
161
+ for block_index in range(num_input_blocks):
162
+ current_ref_latents = \
163
+ initial_latent[:, cache_start_frame:cache_start_frame + self.num_frame_per_block]
164
+ output[:, cache_start_frame:cache_start_frame + self.num_frame_per_block] = current_ref_latents
165
+ self.generator(
166
+ noisy_image_or_video=current_ref_latents,
167
+ conditional_dict=conditional_dict,
168
+ timestep=timestep * 0,
169
+ kv_cache=self.kv_cache_pos,
170
+ crossattn_cache=self.crossattn_cache_pos,
171
+ current_start=current_start_frame * self.frame_seq_length,
172
+ cache_start=cache_start_frame * self.frame_seq_length
173
+ )
174
+ self.generator(
175
+ noisy_image_or_video=current_ref_latents,
176
+ conditional_dict=unconditional_dict,
177
+ timestep=timestep * 0,
178
+ kv_cache=self.kv_cache_neg,
179
+ crossattn_cache=self.crossattn_cache_neg,
180
+ current_start=current_start_frame * self.frame_seq_length,
181
+ cache_start=cache_start_frame * self.frame_seq_length
182
+ )
183
+ current_start_frame += self.num_frame_per_block
184
+ cache_start_frame += self.num_frame_per_block
185
+
186
+ # Step 3: Temporal denoising loop
187
+ all_num_frames = [self.num_frame_per_block] * num_blocks
188
+ if self.independent_first_frame and initial_latent is None:
189
+ all_num_frames = [1] + all_num_frames
190
+ for current_num_frames in all_num_frames:
191
+ noisy_input = noise[
192
+ :, cache_start_frame - num_input_frames:cache_start_frame + current_num_frames - num_input_frames]
193
+ latents = noisy_input
194
+
195
+ # Step 3.1: Spatial denoising loop
196
+ sample_scheduler = self._initialize_sample_scheduler(noise)
197
+ for _, t in enumerate(tqdm(sample_scheduler.timesteps)):
198
+ latent_model_input = latents
199
+ timestep = t * torch.ones(
200
+ [batch_size, current_num_frames], device=noise.device, dtype=torch.float32
201
+ )
202
+
203
+ flow_pred_cond, _ = self.generator(
204
+ noisy_image_or_video=latent_model_input,
205
+ conditional_dict=conditional_dict,
206
+ timestep=timestep,
207
+ kv_cache=self.kv_cache_pos,
208
+ crossattn_cache=self.crossattn_cache_pos,
209
+ current_start=current_start_frame * self.frame_seq_length,
210
+ cache_start=cache_start_frame * self.frame_seq_length
211
+ )
212
+ flow_pred_uncond, _ = self.generator(
213
+ noisy_image_or_video=latent_model_input,
214
+ conditional_dict=unconditional_dict,
215
+ timestep=timestep,
216
+ kv_cache=self.kv_cache_neg,
217
+ crossattn_cache=self.crossattn_cache_neg,
218
+ current_start=current_start_frame * self.frame_seq_length,
219
+ cache_start=cache_start_frame * self.frame_seq_length
220
+ )
221
+
222
+ flow_pred = flow_pred_uncond + self.args.guidance_scale * (
223
+ flow_pred_cond - flow_pred_uncond)
224
+
225
+ temp_x0 = sample_scheduler.step(
226
+ flow_pred,
227
+ t,
228
+ latents,
229
+ return_dict=False)[0]
230
+ latents = temp_x0
231
+ print(f"kv_cache['local_end_index']: {self.kv_cache_pos[0]['local_end_index']}")
232
+ print(f"kv_cache['global_end_index']: {self.kv_cache_pos[0]['global_end_index']}")
233
+
234
+ # Step 3.2: record the model's output
235
+ output[:, cache_start_frame:cache_start_frame + current_num_frames] = latents
236
+
237
+ # Step 3.3: rerun with timestep zero to update KV cache using clean context
238
+ self.generator(
239
+ noisy_image_or_video=latents,
240
+ conditional_dict=conditional_dict,
241
+ timestep=timestep * 0,
242
+ kv_cache=self.kv_cache_pos,
243
+ crossattn_cache=self.crossattn_cache_pos,
244
+ current_start=current_start_frame * self.frame_seq_length,
245
+ cache_start=cache_start_frame * self.frame_seq_length
246
+ )
247
+ self.generator(
248
+ noisy_image_or_video=latents,
249
+ conditional_dict=unconditional_dict,
250
+ timestep=timestep * 0,
251
+ kv_cache=self.kv_cache_neg,
252
+ crossattn_cache=self.crossattn_cache_neg,
253
+ current_start=current_start_frame * self.frame_seq_length,
254
+ cache_start=cache_start_frame * self.frame_seq_length
255
+ )
256
+
257
+ # Step 3.4: update the start and end frame indices
258
+ current_start_frame += current_num_frames
259
+ cache_start_frame += current_num_frames
260
+
261
+ # Step 4: Decode the output
262
+ video = self.vae.decode_to_pixel(output)
263
+ video = (video * 0.5 + 0.5).clamp(0, 1)
264
+
265
+ if return_latents:
266
+ return video, output
267
+ else:
268
+ return video
269
+
270
+ def _initialize_kv_cache(self, batch_size, dtype, device):
271
+ """
272
+ Initialize a Per-GPU KV cache for the Wan model.
273
+ """
274
+ kv_cache_pos = []
275
+ kv_cache_neg = []
276
+ if self.local_attn_size != -1:
277
+ # Use the local attention size to compute the KV cache size
278
+ kv_cache_size = self.local_attn_size * self.frame_seq_length
279
+ else:
280
+ # Use the default KV cache size
281
+ kv_cache_size = 32760
282
+
283
+ for _ in range(self.num_transformer_blocks):
284
+ kv_cache_pos.append({
285
+ "k": torch.zeros([batch_size, kv_cache_size, 12, 128], dtype=dtype, device=device),
286
+ "v": torch.zeros([batch_size, kv_cache_size, 12, 128], dtype=dtype, device=device),
287
+ "global_end_index": torch.tensor([0], dtype=torch.long, device=device),
288
+ "local_end_index": torch.tensor([0], dtype=torch.long, device=device)
289
+ })
290
+ kv_cache_neg.append({
291
+ "k": torch.zeros([batch_size, kv_cache_size, 12, 128], dtype=dtype, device=device),
292
+ "v": torch.zeros([batch_size, kv_cache_size, 12, 128], dtype=dtype, device=device),
293
+ "global_end_index": torch.tensor([0], dtype=torch.long, device=device),
294
+ "local_end_index": torch.tensor([0], dtype=torch.long, device=device)
295
+ })
296
+
297
+ self.kv_cache_pos = kv_cache_pos # always store the clean cache
298
+ self.kv_cache_neg = kv_cache_neg # always store the clean cache
299
+
300
+ def _initialize_crossattn_cache(self, batch_size, dtype, device):
301
+ """
302
+ Initialize a Per-GPU cross-attention cache for the Wan model.
303
+ """
304
+ crossattn_cache_pos = []
305
+ crossattn_cache_neg = []
306
+ for _ in range(self.num_transformer_blocks):
307
+ crossattn_cache_pos.append({
308
+ "k": torch.zeros([batch_size, 512, 12, 128], dtype=dtype, device=device),
309
+ "v": torch.zeros([batch_size, 512, 12, 128], dtype=dtype, device=device),
310
+ "is_init": False
311
+ })
312
+ crossattn_cache_neg.append({
313
+ "k": torch.zeros([batch_size, 512, 12, 128], dtype=dtype, device=device),
314
+ "v": torch.zeros([batch_size, 512, 12, 128], dtype=dtype, device=device),
315
+ "is_init": False
316
+ })
317
+
318
+ self.crossattn_cache_pos = crossattn_cache_pos # always store the clean cache
319
+ self.crossattn_cache_neg = crossattn_cache_neg # always store the clean cache
320
+
321
+ def _initialize_sample_scheduler(self, noise):
322
+ if self.sample_solver == 'unipc':
323
+ sample_scheduler = FlowUniPCMultistepScheduler(
324
+ num_train_timesteps=self.num_train_timesteps,
325
+ shift=1,
326
+ use_dynamic_shifting=False)
327
+ sample_scheduler.set_timesteps(
328
+ self.sampling_steps, device=noise.device, shift=self.shift)
329
+ self.timesteps = sample_scheduler.timesteps
330
+ elif self.sample_solver == 'dpm++':
331
+ sample_scheduler = FlowDPMSolverMultistepScheduler(
332
+ num_train_timesteps=self.num_train_timesteps,
333
+ shift=1,
334
+ use_dynamic_shifting=False)
335
+ sampling_sigmas = get_sampling_sigmas(self.sampling_steps, self.shift)
336
+ self.timesteps, _ = retrieve_timesteps(
337
+ sample_scheduler,
338
+ device=noise.device,
339
+ sigmas=sampling_sigmas)
340
+ else:
341
+ raise NotImplementedError("Unsupported solver.")
342
+ return sample_scheduler
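
The per-layer KV cache above is a fixed-size buffer indexed in units of frames, with 1560 tokens per latent frame, 12 heads of dimension 128, and 30 transformer layers; the default size of 32760 tokens corresponds to 21 cached frames. A minimal sketch (not part of the commit) of the sizing and the token-offset arithmetic; make_kv_cache and block_token_start are hypothetical helper names, and the toy sizes below are kept deliberately small.

import torch

FRAME_SEQ_LENGTH = 1560                  # tokens per latent frame (as hard-coded above)
NUM_HEADS, HEAD_DIM = 12, 128

def make_kv_cache(batch_size, num_layers, num_cached_frames, device="cpu", dtype=torch.float32):
    cache_size = num_cached_frames * FRAME_SEQ_LENGTH
    return [{
        "k": torch.zeros(batch_size, cache_size, NUM_HEADS, HEAD_DIM, dtype=dtype, device=device),
        "v": torch.zeros(batch_size, cache_size, NUM_HEADS, HEAD_DIM, dtype=dtype, device=device),
        "global_end_index": torch.tensor([0], dtype=torch.long, device=device),
        "local_end_index": torch.tensor([0], dtype=torch.long, device=device),
    } for _ in range(num_layers)]

def block_token_start(current_start_frame):
    # token offset passed as current_start / cache_start in the generator calls above
    return current_start_frame * FRAME_SEQ_LENGTH

cache = make_kv_cache(batch_size=1, num_layers=2, num_cached_frames=2)
print(len(cache), cache[0]["k"].shape, block_token_start(3))   # 2 torch.Size([1, 3120, 12, 128]) 4680
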
pipeline/rolling_forcing_inference.py ADDED
@@ -0,0 +1,372 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from typing import List, Optional
2
+ import torch
3
+
4
+ from utils.wan_wrapper import WanDiffusionWrapper, WanTextEncoder, WanVAEWrapper
5
+
6
+
7
+ class CausalInferencePipeline(torch.nn.Module):
8
+ def __init__(
9
+ self,
10
+ args,
11
+ device,
12
+ generator=None,
13
+ text_encoder=None,
14
+ vae=None
15
+ ):
16
+ super().__init__()
17
+ # Step 1: Initialize all models
18
+ self.generator = WanDiffusionWrapper(
19
+ **getattr(args, "model_kwargs", {}), is_causal=True) if generator is None else generator
20
+ self.text_encoder = WanTextEncoder() if text_encoder is None else text_encoder
21
+ self.vae = WanVAEWrapper() if vae is None else vae
22
+
23
+ # Step 2: Initialize all causal hyperparameters
24
+ self.scheduler = self.generator.get_scheduler()
25
+ self.denoising_step_list = torch.tensor(
26
+ args.denoising_step_list, dtype=torch.long)
27
+ if args.warp_denoising_step:
28
+ timesteps = torch.cat((self.scheduler.timesteps.cpu(), torch.tensor([0], dtype=torch.float32)))
29
+ self.denoising_step_list = timesteps[1000 - self.denoising_step_list]
30
+
31
+ self.num_transformer_blocks = 30
32
+ self.frame_seq_length = 1560
33
+
34
+ self.kv_cache_clean = None
35
+ self.args = args
36
+ self.num_frame_per_block = getattr(args, "num_frame_per_block", 1)
37
+ self.independent_first_frame = args.independent_first_frame
38
+ self.local_attn_size = self.generator.model.local_attn_size
39
+
40
+ print(f"KV inference with {self.num_frame_per_block} frames per block")
41
+
42
+ if self.num_frame_per_block > 1:
43
+ self.generator.model.num_frame_per_block = self.num_frame_per_block
44
+
45
+ def inference_rolling_forcing(
46
+ self,
47
+ noise: torch.Tensor,
48
+ text_prompts: List[str],
49
+ initial_latent: Optional[torch.Tensor] = None,
50
+ return_latents: bool = False,
51
+ profile: bool = False
52
+ ) -> torch.Tensor:
53
+ """
54
+ Perform inference on the given noise and text prompts.
55
+ Inputs:
56
+ noise (torch.Tensor): The input noise tensor of shape
57
+ (batch_size, num_output_frames, num_channels, height, width).
58
+ text_prompts (List[str]): The list of text prompts.
59
+ initial_latent (torch.Tensor): The initial latent tensor of shape
60
+ (batch_size, num_input_frames, num_channels, height, width).
61
+ If num_input_frames is 1, perform image to video.
62
+ If num_input_frames is greater than 1, perform video extension.
63
+ return_latents (bool): Whether to return the latents.
64
+ Outputs:
65
+ video (torch.Tensor): The generated video tensor of shape
66
+ (batch_size, num_output_frames, num_channels, height, width).
67
+ It is normalized to be in the range [0, 1].
68
+ """
69
+ batch_size, num_frames, num_channels, height, width = noise.shape
70
+ if not self.independent_first_frame or (self.independent_first_frame and initial_latent is not None):
71
+ # If the first frame is independent and the first frame is provided, then the number of frames in the
72
+ # noise should still be a multiple of num_frame_per_block
73
+ assert num_frames % self.num_frame_per_block == 0
74
+ num_blocks = num_frames // self.num_frame_per_block
75
+ else:
76
+ # Using a [1, 4, 4, 4, 4, 4, ...] model to generate a video without image conditioning
77
+ assert (num_frames - 1) % self.num_frame_per_block == 0
78
+ num_blocks = (num_frames - 1) // self.num_frame_per_block
79
+ num_input_frames = initial_latent.shape[1] if initial_latent is not None else 0
80
+ num_output_frames = num_frames + num_input_frames # add the initial latent frames
81
+ conditional_dict = self.text_encoder(
82
+ text_prompts=text_prompts
83
+ )
84
+
85
+ output = torch.zeros(
86
+ [batch_size, num_output_frames, num_channels, height, width],
87
+ device=noise.device,
88
+ dtype=noise.dtype
89
+ )
90
+
91
+ # Set up profiling if requested
92
+ if profile:
93
+ init_start = torch.cuda.Event(enable_timing=True)
94
+ init_end = torch.cuda.Event(enable_timing=True)
95
+ diffusion_start = torch.cuda.Event(enable_timing=True)
96
+ diffusion_end = torch.cuda.Event(enable_timing=True)
97
+ vae_start = torch.cuda.Event(enable_timing=True)
98
+ vae_end = torch.cuda.Event(enable_timing=True)
99
+ block_times = []
100
+ block_start = torch.cuda.Event(enable_timing=True)
101
+ block_end = torch.cuda.Event(enable_timing=True)
102
+ init_start.record()
103
+
104
+ # Step 1: Initialize KV cache to all zeros
105
+ if self.kv_cache_clean is None:
106
+ self._initialize_kv_cache(
107
+ batch_size=batch_size,
108
+ dtype=noise.dtype,
109
+ device=noise.device
110
+ )
111
+ self._initialize_crossattn_cache(
112
+ batch_size=batch_size,
113
+ dtype=noise.dtype,
114
+ device=noise.device
115
+ )
116
+ else:
117
+ # reset cross attn cache
118
+ for block_index in range(self.num_transformer_blocks):
119
+ self.crossattn_cache[block_index]["is_init"] = False
120
+ # reset kv cache
121
+ for block_index in range(len(self.kv_cache_clean)):
122
+ self.kv_cache_clean[block_index]["global_end_index"] = torch.tensor(
123
+ [0], dtype=torch.long, device=noise.device)
124
+ self.kv_cache_clean[block_index]["local_end_index"] = torch.tensor(
125
+ [0], dtype=torch.long, device=noise.device)
126
+
127
+ # Step 2: Cache context feature
+ current_start_frame = 0  # frame offset into the KV cache; must exist before the context-caching branch below
128
+ if initial_latent is not None:
129
+ timestep = torch.ones([batch_size, 1], device=noise.device, dtype=torch.int64) * 0
130
+ if self.independent_first_frame:
131
+ # Assume num_input_frames is 1 + self.num_frame_per_block * num_input_blocks
132
+ assert (num_input_frames - 1) % self.num_frame_per_block == 0
133
+ num_input_blocks = (num_input_frames - 1) // self.num_frame_per_block
134
+ output[:, :1] = initial_latent[:, :1]
135
+ self.generator(
136
+ noisy_image_or_video=initial_latent[:, :1],
137
+ conditional_dict=conditional_dict,
138
+ timestep=timestep * 0,
139
+ kv_cache=self.kv_cache_clean,
140
+ crossattn_cache=self.crossattn_cache,
141
+ current_start=current_start_frame * self.frame_seq_length,
142
+ )
143
+ current_start_frame += 1
144
+ else:
145
+ # Assume num_input_frames is self.num_frame_per_block * num_input_blocks
146
+ assert num_input_frames % self.num_frame_per_block == 0
147
+ num_input_blocks = num_input_frames // self.num_frame_per_block
148
+
149
+ for _ in range(num_input_blocks):
150
+ current_ref_latents = \
151
+ initial_latent[:, current_start_frame:current_start_frame + self.num_frame_per_block]
152
+ output[:, current_start_frame:current_start_frame + self.num_frame_per_block] = current_ref_latents
153
+ self.generator(
154
+ noisy_image_or_video=current_ref_latents,
155
+ conditional_dict=conditional_dict,
156
+ timestep=timestep * 0,
157
+ kv_cache=self.kv_cache_clean,
158
+ crossattn_cache=self.crossattn_cache,
159
+ current_start=current_start_frame * self.frame_seq_length,
160
+ )
161
+ current_start_frame += self.num_frame_per_block
162
+
163
+ if profile:
164
+ init_end.record()
165
+ torch.cuda.synchronize()
166
+ diffusion_start.record()
167
+
168
+ # implementing rolling forcing
169
+ # construct the rolling forcing windows
170
+ num_denoising_steps = len(self.denoising_step_list)
171
+ rolling_window_length_blocks = num_denoising_steps
172
+ window_start_blocks = []
173
+ window_end_blocks = []
174
+ window_num = num_blocks + rolling_window_length_blocks - 1
175
+
176
+ for window_index in range(window_num):
177
+ start_block = max(0, window_index - rolling_window_length_blocks + 1)
178
+ end_block = min(num_blocks - 1, window_index)
179
+ window_start_blocks.append(start_block)
180
+ window_end_blocks.append(end_block)
181
+
182
+ # init noisy cache
183
+ noisy_cache = torch.zeros(
184
+ [batch_size, num_output_frames, num_channels, height, width],
185
+ device=noise.device,
186
+ dtype=noise.dtype
187
+ )
188
+
189
+ # initialize the denoising timesteps, shared across windows
190
+ shared_timestep = torch.ones(
191
+ [batch_size, rolling_window_length_blocks * self.num_frame_per_block],
192
+ device=noise.device,
193
+ dtype=torch.float32)
194
+
195
+ for index, current_timestep in enumerate(reversed(self.denoising_step_list)): # from clean to noisy
196
+ shared_timestep[:, index * self.num_frame_per_block:(index + 1) * self.num_frame_per_block] *= current_timestep
197
+
198
+
199
+ # Denoising loop with rolling forcing
200
+ for window_index in range(window_num):
201
+
202
+ if profile:
203
+ block_start.record()
204
+
205
+ print('window_index:', window_index)
206
+ start_block = window_start_blocks[window_index]
207
+ end_block = window_end_blocks[window_index] # include
208
+ print(f"start_block: {start_block}, end_block: {end_block}")
209
+
210
+ current_start_frame = start_block * self.num_frame_per_block
211
+ current_end_frame = (end_block + 1) * self.num_frame_per_block # not include
212
+ current_num_frames = current_end_frame - current_start_frame
213
+
214
+ # noisy_input: previously denoised frames re-noised to their current level, plus fresh noise; only the last block is pure noise
215
+ if current_num_frames == rolling_window_length_blocks * self.num_frame_per_block or current_start_frame == 0:
216
+ noisy_input = torch.cat([
217
+ noisy_cache[:, current_start_frame : current_end_frame - self.num_frame_per_block],
218
+ noise[:, current_end_frame - self.num_frame_per_block : current_end_frame ]
219
+ ], dim=1)
220
+ else: # at the end of the video
221
+ noisy_input = noisy_cache[:, current_start_frame:current_end_frame]
222
+
223
+ # select the denoising timesteps for this window
224
+ if current_num_frames == rolling_window_length_blocks * self.num_frame_per_block:
225
+ current_timestep = shared_timestep
226
+ elif current_start_frame == 0:
227
+ current_timestep = shared_timestep[:,-current_num_frames:]
228
+ elif current_end_frame == num_frames:
229
+ current_timestep = shared_timestep[:,:current_num_frames]
230
+ else:
231
+ raise ValueError("Each window must span rolling_window_length_blocks * num_frame_per_block frames unless it is the first or last window of the video.")
232
+
233
+
234
+ # calling DiT
235
+ _, denoised_pred = self.generator(
236
+ noisy_image_or_video=noisy_input,
237
+ conditional_dict=conditional_dict,
238
+ timestep=current_timestep,
239
+ kv_cache=self.kv_cache_clean,
240
+ crossattn_cache=self.crossattn_cache,
241
+ current_start=current_start_frame * self.frame_seq_length
242
+ )
243
+
244
+ output[:, current_start_frame:current_end_frame] = denoised_pred
245
+
246
+
247
+ # update noisy_cache, which is detached from the computation graph
248
+ with torch.no_grad():
249
+ for block_idx in range(start_block, end_block + 1):
250
+
251
+ block_time_step = current_timestep[:,
252
+ (block_idx - start_block)*self.num_frame_per_block :
253
+ (block_idx - start_block+1)*self.num_frame_per_block].mean().item()
254
+ matches = torch.abs(self.denoising_step_list - block_time_step) < 1e-4
255
+ block_timestep_index = torch.nonzero(matches, as_tuple=True)[0]
256
+
257
+ if block_timestep_index == len(self.denoising_step_list) - 1:
258
+ continue
259
+
260
+ next_timestep = self.denoising_step_list[block_timestep_index + 1].to(noise.device)
261
+
262
+ noisy_cache[:, block_idx * self.num_frame_per_block:
263
+ (block_idx+1) * self.num_frame_per_block] = \
264
+ self.scheduler.add_noise(
265
+ denoised_pred.flatten(0, 1),
266
+ torch.randn_like(denoised_pred.flatten(0, 1)),
267
+ next_timestep * torch.ones(
268
+ [batch_size * current_num_frames], device=noise.device, dtype=torch.long)
269
+ ).unflatten(0, denoised_pred.shape[:2])[:, (block_idx - start_block)*self.num_frame_per_block:
270
+ (block_idx - start_block+1)*self.num_frame_per_block]
271
+
272
+
273
+ # rerun with timestep zero to update the clean cache, which is also detached from the computation graph
274
+ with torch.no_grad():
275
+ context_timestep = torch.ones_like(current_timestep) * self.args.context_noise
276
+ # # add context noise
277
+ # denoised_pred = self.scheduler.add_noise(
278
+ # denoised_pred.flatten(0, 1),
279
+ # torch.randn_like(denoised_pred.flatten(0, 1)),
280
+ # context_timestep * torch.ones(
281
+ # [batch_size * current_num_frames], device=noise.device, dtype=torch.long)
282
+ # ).unflatten(0, denoised_pred.shape[:2])
283
+
284
+ # only cache the first block
285
+ denoised_pred = denoised_pred[:,:self.num_frame_per_block]
286
+ context_timestep = context_timestep[:,:self.num_frame_per_block]
287
+ self.generator(
288
+ noisy_image_or_video=denoised_pred,
289
+ conditional_dict=conditional_dict,
290
+ timestep=context_timestep,
291
+ kv_cache=self.kv_cache_clean,
292
+ crossattn_cache=self.crossattn_cache,
293
+ current_start=current_start_frame * self.frame_seq_length,
294
+ updating_cache=True,
295
+ )
296
+
297
+ if profile:
298
+ block_end.record()
299
+ torch.cuda.synchronize()
300
+ block_time = block_start.elapsed_time(block_end)
301
+ block_times.append(block_time)
302
+
303
+
304
+ if profile:
305
+ # End diffusion timing and synchronize CUDA
306
+ diffusion_end.record()
307
+ torch.cuda.synchronize()
308
+ diffusion_time = diffusion_start.elapsed_time(diffusion_end)
309
+ init_time = init_start.elapsed_time(init_end)
310
+ vae_start.record()
311
+
312
+ # Step 4: Decode the output
313
+ video = self.vae.decode_to_pixel(output, use_cache=False)
314
+ video = (video * 0.5 + 0.5).clamp(0, 1)
315
+
316
+ if profile:
317
+ # End VAE timing and synchronize CUDA
318
+ vae_end.record()
319
+ torch.cuda.synchronize()
320
+ vae_time = vae_start.elapsed_time(vae_end)
321
+ total_time = init_time + diffusion_time + vae_time
322
+
323
+ print("Profiling results:")
324
+ print(f" - Initialization/caching time: {init_time:.2f} ms ({100 * init_time / total_time:.2f}%)")
325
+ print(f" - Diffusion generation time: {diffusion_time:.2f} ms ({100 * diffusion_time / total_time:.2f}%)")
326
+ for i, block_time in enumerate(block_times):
327
+ print(f" - Block {i} generation time: {block_time:.2f} ms ({100 * block_time / diffusion_time:.2f}% of diffusion)")
328
+ print(f" - VAE decoding time: {vae_time:.2f} ms ({100 * vae_time / total_time:.2f}%)")
329
+ print(f" - Total time: {total_time:.2f} ms")
330
+
331
+ if return_latents:
332
+ return video, output
333
+ else:
334
+ return video
335
+
336
+
337
+
338
+ def _initialize_kv_cache(self, batch_size, dtype, device):
339
+ """
340
+ Initialize a Per-GPU KV cache for the Wan model.
341
+ """
342
+ kv_cache_clean = []
343
+ # if self.local_attn_size != -1:
344
+ # # Use the local attention size to compute the KV cache size
345
+ # kv_cache_size = self.local_attn_size * self.frame_seq_length
346
+ # else:
347
+ # # Use the default KV cache size
348
+ kv_cache_size = 1560 * 24
349
+
350
+ for _ in range(self.num_transformer_blocks):
351
+ kv_cache_clean.append({
352
+ "k": torch.zeros([batch_size, kv_cache_size, 12, 128], dtype=dtype, device=device),
353
+ "v": torch.zeros([batch_size, kv_cache_size, 12, 128], dtype=dtype, device=device),
354
+ "global_end_index": torch.tensor([0], dtype=torch.long, device=device),
355
+ "local_end_index": torch.tensor([0], dtype=torch.long, device=device)
356
+ })
357
+
358
+ self.kv_cache_clean = kv_cache_clean # always store the clean cache
359
+
360
+ def _initialize_crossattn_cache(self, batch_size, dtype, device):
361
+ """
362
+ Initialize a Per-GPU cross-attention cache for the Wan model.
363
+ """
364
+ crossattn_cache = []
365
+
366
+ for _ in range(self.num_transformer_blocks):
367
+ crossattn_cache.append({
368
+ "k": torch.zeros([batch_size, 512, 12, 128], dtype=dtype, device=device),
369
+ "v": torch.zeros([batch_size, 512, 12, 128], dtype=dtype, device=device),
370
+ "is_init": False
371
+ })
372
+ self.crossattn_cache = crossattn_cache
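
The window bookkeeping above can be summarized in one pure function: with W = len(denoising_step_list) noise levels, window i covers blocks [max(0, i - W + 1), min(num_blocks - 1, i)] inclusive, so interior windows hold exactly W blocks, each sitting at a different noise level, while the first and last W - 1 windows are truncated. A minimal sketch (not part of the commit) mirroring that loop; build_rolling_windows is a hypothetical name.

def build_rolling_windows(num_blocks, num_denoising_steps):
    window_len = num_denoising_steps
    windows = []
    for window_index in range(num_blocks + window_len - 1):
        start_block = max(0, window_index - window_len + 1)
        end_block = min(num_blocks - 1, window_index)   # inclusive
        windows.append((start_block, end_block))
    return windows

print(build_rolling_windows(num_blocks=7, num_denoising_steps=4))
# [(0, 0), (0, 1), (0, 2), (0, 3), (1, 4), (2, 5), (3, 6), (4, 6), (5, 6), (6, 6)]
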
pipeline/rolling_forcing_training.py ADDED
@@ -0,0 +1,464 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from utils.wan_wrapper import WanDiffusionWrapper
2
+ from utils.scheduler import SchedulerInterface
3
+ from typing import List, Optional
4
+ import torch
5
+ import torch.distributed as dist
6
+
7
+
8
+ class RollingForcingTrainingPipeline:
9
+ def __init__(self,
10
+ denoising_step_list: List[int],
11
+ scheduler: SchedulerInterface,
12
+ generator: WanDiffusionWrapper,
13
+ num_frame_per_block=3,
14
+ independent_first_frame: bool = False,
15
+ same_step_across_blocks: bool = False,
16
+ last_step_only: bool = False,
17
+ num_max_frames: int = 21,
18
+ context_noise: int = 0,
19
+ **kwargs):
20
+ super().__init__()
21
+ self.scheduler = scheduler
22
+ self.generator = generator
23
+ self.denoising_step_list = denoising_step_list
24
+ if self.denoising_step_list[-1] == 0:
25
+ self.denoising_step_list = self.denoising_step_list[:-1] # remove the zero timestep for inference
26
+
27
+ # Wan specific hyperparameters
28
+ self.num_transformer_blocks = 30
29
+ self.frame_seq_length = 1560
30
+ self.num_frame_per_block = num_frame_per_block
31
+ self.context_noise = context_noise
32
+ self.i2v = False
33
+
34
+ self.kv_cache_clean = None
35
+ self.kv_cache2 = None
36
+ self.independent_first_frame = independent_first_frame
37
+ self.same_step_across_blocks = same_step_across_blocks
38
+ self.last_step_only = last_step_only
39
+ self.kv_cache_size = num_max_frames * self.frame_seq_length
40
+
41
+ def generate_and_sync_list(self, num_blocks, num_denoising_steps, device):
42
+ rank = dist.get_rank() if dist.is_initialized() else 0
43
+
44
+ if rank == 0:
45
+ # Generate random indices
46
+ indices = torch.randint(
47
+ low=0,
48
+ high=num_denoising_steps,
49
+ size=(num_blocks,),
50
+ device=device
51
+ )
52
+ if self.last_step_only:
53
+ indices = torch.ones_like(indices) * (num_denoising_steps - 1)
54
+ else:
55
+ indices = torch.empty(num_blocks, dtype=torch.long, device=device)
56
+
57
+ dist.broadcast(indices, src=0) # Broadcast the random indices to all ranks
58
+ return indices.tolist()
59
+
60
+ def generate_list(self, num_blocks, num_denoising_steps, device):
61
+
62
+ # Generate random indices
63
+ indices = torch.randint(
64
+ low=0,
65
+ high=num_denoising_steps,
66
+ size=(num_blocks,),
67
+ device=device
68
+ )
69
+ if self.last_step_only:
70
+ indices = torch.ones_like(indices) * (num_denoising_steps - 1)
71
+
72
+ return indices.tolist()
73
+
74
+
75
+ def inference_with_rolling_forcing(
76
+ self,
77
+ noise: torch.Tensor,
78
+ initial_latent: Optional[torch.Tensor] = None,
79
+ return_sim_step: bool = False,
80
+ **conditional_dict
81
+ ) -> torch.Tensor:
82
+ batch_size, num_frames, num_channels, height, width = noise.shape
83
+ if not self.independent_first_frame or (self.independent_first_frame and initial_latent is not None):
84
+ # If the first frame is independent and the first frame is provided, then the number of frames in the
85
+ # noise should still be a multiple of num_frame_per_block
86
+ assert num_frames % self.num_frame_per_block == 0
87
+ num_blocks = num_frames // self.num_frame_per_block
88
+ else:
89
+ # Using a [1, 4, 4, 4, 4, 4, ...] model to generate a video without image conditioning
90
+ assert (num_frames - 1) % self.num_frame_per_block == 0
91
+ num_blocks = (num_frames - 1) // self.num_frame_per_block
92
+ num_input_frames = initial_latent.shape[1] if initial_latent is not None else 0
93
+ num_output_frames = num_frames + num_input_frames # add the initial latent frames
94
+ output = torch.zeros(
95
+ [batch_size, num_output_frames, num_channels, height, width],
96
+ device=noise.device,
97
+ dtype=noise.dtype
98
+ )
99
+
100
+ # Step 1: Initialize KV cache to all zeros
101
+ self._initialize_kv_cache(
102
+ batch_size=batch_size, dtype=noise.dtype, device=noise.device
103
+ )
104
+ self._initialize_crossattn_cache(
105
+ batch_size=batch_size, dtype=noise.dtype, device=noise.device
106
+ )
107
+
108
+ # implementing rolling forcing
109
+ # construct the rolling forcing windows
110
+ num_denoising_steps = len(self.denoising_step_list)
111
+ rolling_window_length_blocks = num_denoising_steps
112
+ window_start_blocks = []
113
+ window_end_blocks = []
114
+ window_num = num_blocks + rolling_window_length_blocks - 1
115
+
116
+ for window_index in range(window_num):
117
+ start_block = max(0, window_index - rolling_window_length_blocks + 1)
118
+ end_block = min(num_blocks - 1, window_index)
119
+ window_start_blocks.append(start_block)
120
+ window_end_blocks.append(end_block)
121
+
122
+ # exit_flag indicates the window at which the model will backpropagate gradients.
123
+ exit_flag = torch.randint(high=rolling_window_length_blocks, device=noise.device, size=())
124
+ start_gradient_frame_index = num_output_frames - 21
125
+
126
+ # init noisy cache
127
+ noisy_cache = torch.zeros(
128
+ [batch_size, num_output_frames, num_channels, height, width],
129
+ device=noise.device,
130
+ dtype=noise.dtype
131
+ )
132
+
133
+ # initialize the denoising timesteps, shared across windows
134
+ shared_timestep = torch.ones(
135
+ [batch_size, rolling_window_length_blocks * self.num_frame_per_block],
136
+ device=noise.device,
137
+ dtype=torch.float32)
138
+
139
+ for index, current_timestep in enumerate(reversed(self.denoising_step_list)): # from clean to noisy
140
+ shared_timestep[:, index * self.num_frame_per_block:(index + 1) * self.num_frame_per_block] *= current_timestep
141
+
142
+
143
+ # Denoising loop with rolling forcing
144
+ for window_index in range(window_num):
145
+ start_block = window_start_blocks[window_index]
146
+ end_block = window_end_blocks[window_index] # include
147
+
148
+ current_start_frame = start_block * self.num_frame_per_block
149
+ current_end_frame = (end_block + 1) * self.num_frame_per_block # not include
150
+ current_num_frames = current_end_frame - current_start_frame
151
+
152
+ # noisy_input: previously denoised frames re-noised to their current level, plus fresh noise; only the last block is pure noise
153
+ if current_num_frames == rolling_window_length_blocks * self.num_frame_per_block or current_start_frame == 0:
154
+ noisy_input = torch.cat([
155
+ noisy_cache[:, current_start_frame : current_end_frame - self.num_frame_per_block],
156
+ noise[:, current_end_frame - self.num_frame_per_block : current_end_frame ]
157
+ ], dim=1)
158
+ else: # at the end of the video
159
+ noisy_input = noisy_cache[:, current_start_frame:current_end_frame].clone()
160
+
161
+ # select the denoising timesteps for this window
162
+ if current_num_frames == rolling_window_length_blocks * self.num_frame_per_block:
163
+ current_timestep = shared_timestep
164
+ elif current_start_frame == 0:
165
+ current_timestep = shared_timestep[:,-current_num_frames:]
166
+ elif current_end_frame == num_frames:
167
+ current_timestep = shared_timestep[:,:current_num_frames]
168
+ else:
169
+ raise ValueError("Each window must span rolling_window_length_blocks * num_frame_per_block frames unless it is the first or last window of the video.")
170
+
171
+ require_grad = window_index % rolling_window_length_blocks == exit_flag
172
+ if current_end_frame <= start_gradient_frame_index:
173
+ require_grad = False
174
+
175
+ # calling DiT
176
+ if not require_grad:
177
+ with torch.no_grad():
178
+ _, denoised_pred = self.generator(
179
+ noisy_image_or_video=noisy_input,
180
+ conditional_dict=conditional_dict,
181
+ timestep=current_timestep,
182
+ kv_cache=self.kv_cache_clean,
183
+ crossattn_cache=self.crossattn_cache,
184
+ current_start=current_start_frame * self.frame_seq_length
185
+ )
186
+ else:
187
+ _, denoised_pred = self.generator(
188
+ noisy_image_or_video=noisy_input,
189
+ conditional_dict=conditional_dict,
190
+ timestep=current_timestep,
191
+ kv_cache=self.kv_cache_clean,
192
+ crossattn_cache=self.crossattn_cache,
193
+ current_start=current_start_frame * self.frame_seq_length
194
+ )
195
+ output[:, current_start_frame:current_end_frame] = denoised_pred
196
+
197
+
198
+ # update noisy_cache, which is detached from the computation graph
199
+ with torch.no_grad():
200
+ for block_idx in range(start_block, end_block + 1):
201
+
202
+ block_time_step = current_timestep[:,
203
+ (block_idx - start_block)*self.num_frame_per_block :
204
+ (block_idx - start_block+1)*self.num_frame_per_block].mean().item()
205
+ matches = torch.abs(self.denoising_step_list - block_time_step) < 1e-4
206
+ block_timestep_index = torch.nonzero(matches, as_tuple=True)[0]
207
+
208
+ if block_timestep_index == len(self.denoising_step_list) - 1:
209
+ continue
210
+
211
+ next_timestep = self.denoising_step_list[block_timestep_index + 1].to(noise.device)
212
+
213
+ noisy_cache[:, block_idx * self.num_frame_per_block:
214
+ (block_idx+1) * self.num_frame_per_block] = \
215
+ self.scheduler.add_noise(
216
+ denoised_pred.flatten(0, 1),
217
+ torch.randn_like(denoised_pred.flatten(0, 1)),
218
+ next_timestep * torch.ones(
219
+ [batch_size * current_num_frames], device=noise.device, dtype=torch.long)
220
+ ).unflatten(0, denoised_pred.shape[:2])[:, (block_idx - start_block)*self.num_frame_per_block:
221
+ (block_idx - start_block+1)*self.num_frame_per_block]
222
+
223
+
224
+ # rerun with timestep zero to update the clean cache, which is also detached from the computation graph
225
+ with torch.no_grad():
226
+ context_timestep = torch.ones_like(current_timestep) * self.context_noise
227
+ # # add context noise
228
+ # denoised_pred = self.scheduler.add_noise(
229
+ # denoised_pred.flatten(0, 1),
230
+ # torch.randn_like(denoised_pred.flatten(0, 1)),
231
+ # context_timestep * torch.ones(
232
+ # [batch_size * current_num_frames], device=noise.device, dtype=torch.long)
233
+ # ).unflatten(0, denoised_pred.shape[:2])
234
+
235
+ # only cache the first block
236
+ denoised_pred = denoised_pred[:,:self.num_frame_per_block]
237
+ context_timestep = context_timestep[:,:self.num_frame_per_block]
238
+ self.generator(
239
+ noisy_image_or_video=denoised_pred,
240
+ conditional_dict=conditional_dict,
241
+ timestep=context_timestep,
242
+ kv_cache=self.kv_cache_clean,
243
+ crossattn_cache=self.crossattn_cache,
244
+ current_start=current_start_frame * self.frame_seq_length,
245
+ updating_cache=True,
246
+ )
247
+
248
+ # Step 3.5: Return the denoised timestep
249
+ # can ignore since not used
250
+ denoised_timestep_from, denoised_timestep_to = None, None
251
+
252
+ return output, denoised_timestep_from, denoised_timestep_to
253
+
254
+
255
+
256
+ def inference_with_self_forcing(
257
+ self,
258
+ noise: torch.Tensor,
259
+ initial_latent: Optional[torch.Tensor] = None,
260
+ return_sim_step: bool = False,
261
+ **conditional_dict
262
+ ) -> torch.Tensor:
263
+ batch_size, num_frames, num_channels, height, width = noise.shape
264
+ if not self.independent_first_frame or (self.independent_first_frame and initial_latent is not None):
265
+ # If the first frame is independent and the first frame is provided, then the number of frames in the
266
+ # noise should still be a multiple of num_frame_per_block
267
+ assert num_frames % self.num_frame_per_block == 0
268
+ num_blocks = num_frames // self.num_frame_per_block
269
+ else:
270
+ # Using a [1, 4, 4, 4, 4, 4, ...] model to generate a video without image conditioning
271
+ assert (num_frames - 1) % self.num_frame_per_block == 0
272
+ num_blocks = (num_frames - 1) // self.num_frame_per_block
273
+ num_input_frames = initial_latent.shape[1] if initial_latent is not None else 0
274
+ num_output_frames = num_frames + num_input_frames # add the initial latent frames
275
+ output = torch.zeros(
276
+ [batch_size, num_output_frames, num_channels, height, width],
277
+ device=noise.device,
278
+ dtype=noise.dtype
279
+ )
280
+
281
+ # Step 1: Initialize KV cache to all zeros
282
+ self._initialize_kv_cache(
283
+ batch_size=batch_size, dtype=noise.dtype, device=noise.device
284
+ )
285
+ self._initialize_crossattn_cache(
286
+ batch_size=batch_size, dtype=noise.dtype, device=noise.device
287
+ )
288
+ # if self.kv_cache_clean is None:
289
+ # self._initialize_kv_cache(
290
+ # batch_size=batch_size,
291
+ # dtype=noise.dtype,
292
+ # device=noise.device,
293
+ # )
294
+ # self._initialize_crossattn_cache(
295
+ # batch_size=batch_size,
296
+ # dtype=noise.dtype,
297
+ # device=noise.device
298
+ # )
299
+ # else:
300
+ # # reset cross attn cache
301
+ # for block_index in range(self.num_transformer_blocks):
302
+ # self.crossattn_cache[block_index]["is_init"] = False
303
+ # # reset kv cache
304
+ # for block_index in range(len(self.kv_cache_clean)):
305
+ # self.kv_cache_clean[block_index]["global_end_index"] = torch.tensor(
306
+ # [0], dtype=torch.long, device=noise.device)
307
+ # self.kv_cache_clean[block_index]["local_end_index"] = torch.tensor(
308
+ # [0], dtype=torch.long, device=noise.device)
309
+
310
+ # Step 2: Cache context feature
311
+ current_start_frame = 0
312
+ if initial_latent is not None:
313
+ timestep = torch.ones([batch_size, 1], device=noise.device, dtype=torch.int64) * 0
314
+ # Assume num_input_frames is 1 + self.num_frame_per_block * num_input_blocks
315
+ output[:, :1] = initial_latent
316
+ with torch.no_grad():
317
+ self.generator(
318
+ noisy_image_or_video=initial_latent,
319
+ conditional_dict=conditional_dict,
320
+ timestep=timestep * 0,
321
+ kv_cache=self.kv_cache_clean,
322
+ crossattn_cache=self.crossattn_cache,
323
+ current_start=current_start_frame * self.frame_seq_length
324
+ )
325
+ current_start_frame += 1
326
+
327
+ # Step 3: Temporal denoising loop
328
+ all_num_frames = [self.num_frame_per_block] * num_blocks
329
+ if self.independent_first_frame and initial_latent is None:
330
+ all_num_frames = [1] + all_num_frames
331
+ num_denoising_steps = len(self.denoising_step_list)
332
+ exit_flags = self.generate_and_sync_list(len(all_num_frames), num_denoising_steps, device=noise.device)
333
+ start_gradient_frame_index = num_output_frames - 21
334
+
335
+ # for block_index in range(num_blocks):
336
+ for block_index, current_num_frames in enumerate(all_num_frames):
337
+ noisy_input = noise[
338
+ :, current_start_frame - num_input_frames:current_start_frame + current_num_frames - num_input_frames]
339
+
340
+ # Step 3.1: Spatial denoising loop
341
+ for index, current_timestep in enumerate(self.denoising_step_list):
342
+ if self.same_step_across_blocks:
343
+ exit_flag = (index == exit_flags[0])
344
+ else:
345
+ exit_flag = (index == exit_flags[block_index]) # Only backprop at the randomly selected timestep (consistent across all ranks)
346
+ timestep = torch.ones(
347
+ [batch_size, current_num_frames],
348
+ device=noise.device,
349
+ dtype=torch.int64) * current_timestep
350
+
351
+ if not exit_flag:
352
+ with torch.no_grad():
353
+ _, denoised_pred = self.generator(
354
+ noisy_image_or_video=noisy_input,
355
+ conditional_dict=conditional_dict,
356
+ timestep=timestep,
357
+ kv_cache=self.kv_cache_clean,
358
+ crossattn_cache=self.crossattn_cache,
359
+ current_start=current_start_frame * self.frame_seq_length
360
+ )
361
+ next_timestep = self.denoising_step_list[index + 1]
362
+ noisy_input = self.scheduler.add_noise(
363
+ denoised_pred.flatten(0, 1),
364
+ torch.randn_like(denoised_pred.flatten(0, 1)),
365
+ next_timestep * torch.ones(
366
+ [batch_size * current_num_frames], device=noise.device, dtype=torch.long)
367
+ ).unflatten(0, denoised_pred.shape[:2])
368
+ else:
369
+ # for getting real output
370
+ # with torch.set_grad_enabled(current_start_frame >= start_gradient_frame_index):
371
+ if current_start_frame < start_gradient_frame_index:
372
+ with torch.no_grad():
373
+ _, denoised_pred = self.generator(
374
+ noisy_image_or_video=noisy_input,
375
+ conditional_dict=conditional_dict,
376
+ timestep=timestep,
377
+ kv_cache=self.kv_cache_clean,
378
+ crossattn_cache=self.crossattn_cache,
379
+ current_start=current_start_frame * self.frame_seq_length
380
+ )
381
+ else:
382
+ _, denoised_pred = self.generator(
383
+ noisy_image_or_video=noisy_input,
384
+ conditional_dict=conditional_dict,
385
+ timestep=timestep,
386
+ kv_cache=self.kv_cache_clean,
387
+ crossattn_cache=self.crossattn_cache,
388
+ current_start=current_start_frame * self.frame_seq_length
389
+ )
390
+ break
391
+
392
+ # Step 3.2: record the model's output
393
+ output[:, current_start_frame:current_start_frame + current_num_frames] = denoised_pred
394
+
395
+ # Step 3.3: rerun with timestep zero to update the cache
396
+ context_timestep = torch.ones_like(timestep) * self.context_noise
397
+ # add context noise
398
+ denoised_pred = self.scheduler.add_noise(
399
+ denoised_pred.flatten(0, 1),
400
+ torch.randn_like(denoised_pred.flatten(0, 1)),
401
+ context_timestep * torch.ones(
402
+ [batch_size * current_num_frames], device=noise.device, dtype=torch.long)
403
+ ).unflatten(0, denoised_pred.shape[:2])
404
+ with torch.no_grad():
405
+ self.generator(
406
+ noisy_image_or_video=denoised_pred,
407
+ conditional_dict=conditional_dict,
408
+ timestep=context_timestep,
409
+ kv_cache=self.kv_cache_clean,
410
+ crossattn_cache=self.crossattn_cache,
411
+ current_start=current_start_frame * self.frame_seq_length,
412
+ updating_cache=True,
413
+ )
414
+
415
+ # Step 3.4: update the start and end frame indices
416
+ current_start_frame += current_num_frames
417
+
418
+ # Step 3.5: Return the denoised timestep
419
+ if not self.same_step_across_blocks:
420
+ denoised_timestep_from, denoised_timestep_to = None, None
421
+ elif exit_flags[0] == len(self.denoising_step_list) - 1:
422
+ denoised_timestep_to = 0
423
+ denoised_timestep_from = 1000 - torch.argmin(
424
+ (self.scheduler.timesteps.cuda() - self.denoising_step_list[exit_flags[0]].cuda()).abs(), dim=0).item()
425
+ else:
426
+ denoised_timestep_to = 1000 - torch.argmin(
427
+ (self.scheduler.timesteps.cuda() - self.denoising_step_list[exit_flags[0] + 1].cuda()).abs(), dim=0).item()
428
+ denoised_timestep_from = 1000 - torch.argmin(
429
+ (self.scheduler.timesteps.cuda() - self.denoising_step_list[exit_flags[0]].cuda()).abs(), dim=0).item()
430
+
431
+ if return_sim_step:
432
+ return output, denoised_timestep_from, denoised_timestep_to, exit_flags[0] + 1
433
+
434
+ return output, denoised_timestep_from, denoised_timestep_to
435
+
436
+ def _initialize_kv_cache(self, batch_size, dtype, device):
437
+ """
438
+ Initialize a Per-GPU KV cache for the Wan model.
439
+ """
440
+ kv_cache_clean = []
441
+
442
+ for _ in range(self.num_transformer_blocks):
443
+ kv_cache_clean.append({
444
+ "k": torch.zeros([batch_size, self.kv_cache_size, 12, 128], dtype=dtype, device=device),
445
+ "v": torch.zeros([batch_size, self.kv_cache_size, 12, 128], dtype=dtype, device=device),
446
+ "global_end_index": torch.tensor([0], dtype=torch.long, device=device),
447
+ "local_end_index": torch.tensor([0], dtype=torch.long, device=device)
448
+ })
449
+
450
+ self.kv_cache_clean = kv_cache_clean # always store the clean cache
451
+
452
+ def _initialize_crossattn_cache(self, batch_size, dtype, device):
453
+ """
454
+ Initialize a Per-GPU cross-attention cache for the Wan model.
455
+ """
456
+ crossattn_cache = []
457
+
458
+ for _ in range(self.num_transformer_blocks):
459
+ crossattn_cache.append({
460
+ "k": torch.zeros([batch_size, 512, 12, 128], dtype=dtype, device=device),
461
+ "v": torch.zeros([batch_size, 512, 12, 128], dtype=dtype, device=device),
462
+ "is_init": False
463
+ })
464
+ self.crossattn_cache = crossattn_cache
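The block-wise denoising loop above walks each block through self.denoising_step_list: at every non-final step the current prediction is pushed to the next noise level with scheduler.add_noise, flattening the batch and frame dimensions so one timestep is applied per frame, then restoring the original shape. A minimal sketch of that flatten / re-noise / unflatten pattern is shown below; the linear toy_add_noise is an assumption for illustration and may not match the scheduler actually used by the repository.

    import torch

    def toy_add_noise(x0, noise, timesteps, num_train_timesteps=1000):
        # Assumed flow-matching style interpolation; the repository's scheduler may differ.
        sigma = (timesteps.float() / num_train_timesteps).view(-1, 1, 1, 1)
        return (1.0 - sigma) * x0 + sigma * noise

    batch, frames, ch, h, w = 2, 3, 16, 60, 104
    denoised_pred = torch.randn(batch, frames, ch, h, w)
    next_timestep = 400  # hypothetical entry of denoising_step_list

    flat = denoised_pred.flatten(0, 1)                               # [batch * frames, ch, h, w]
    t = torch.full([batch * frames], next_timestep, dtype=torch.long)
    noisy_input = toy_add_noise(flat, torch.randn_like(flat), t)
    noisy_input = noisy_input.unflatten(0, denoised_pred.shape[:2])  # back to [batch, frames, ch, h, w]
    print(noisy_input.shape)  # torch.Size([2, 3, 16, 60, 104])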
prompts/example_prompts.txt ADDED
@@ -0,0 +1,16 @@
1
+ A cinematic scene from a classic western movie, featuring a rugged man riding a powerful horse through the vast Gobi Desert at sunset. The man, dressed in a dusty cowboy hat and a worn leather jacket, reins tightly on the horse's neck as he gallops across the golden sands. The sun sets dramatically behind them, casting long shadows and warm hues across the landscape. The background is filled with rolling dunes and sparse, rocky outcrops, emphasizing the harsh beauty of the desert. A dynamic wide shot from a low angle, capturing both the man and the expansive desert vista.
2
+ A classic black-and-white photograph style image of an older man playing the piano. The man, with a weathered face and kind eyes, sits at an antique piano with his fingers gracefully moving over the keys. The lighting comes from the side, casting dramatic shadows on his face and emphasizing the texture of his hands. His posture is upright and focused, conveying a sense of deep concentration and passion for music. The background is blurred, revealing only hints of a cozy room with wooden floors and old furniture. A close-up shot from a slightly elevated angle, capturing both the man and the piano in detail.
3
+ A dramatic post-apocalyptic scene in the style of a horror film, featuring a skeleton wearing a colorful flower hat and oversized sunglasses dancing wildly in a sunlit meadow at sunset. The skeleton has a weathered and somewhat decayed appearance, with bones visible through tattered remnants of clothing. The dance is energetic and almost comical, with exaggerated movements. The background is a vivid blend of warm oranges and pinks, with tall grasses and wildflowers swaying in the breeze. The sky is painted with rich hues of orange and pink, casting long shadows across the landscape. A dynamic medium shot from a low angle, capturing the skeleton's animated dance.
4
+ A dynamic action scene in a modern gym, featuring a kangaroo wearing boxing gloves, engaged in an intense sparring session with a punching bag. The kangaroo has a muscular build and is positioned mid-punch, its front legs wrapped in red boxing gloves, eyes focused intently on the target. The background showcases a cluttered gym with heavy equipment and mats, creating a vivid and realistic setting. The kangaroo's movements are fluid and powerful, conveying both agility and strength. The scene captures a split-second moment of mid-action, with the kangaroo's tail swaying behind it. A high-angle shot emphasizing the kangaroo's dynamic pose and the surrounding gym environment.
5
+ A dynamic action shot in the style of a high-energy sports magazine spread, featuring a golden retriever sprinting with all its might after a red sports car speeding down the road. The dog's fur glistens in the sunlight, and its eyes are filled with determination and excitement. It leaps forward, its tail wagging wildly, while the car speeds away in the background, leaving a trail of dust. The background shows a busy city street with blurred cars and pedestrians, adding to the sense of urgency. The photo has a crisp, vibrant color palette and a high-resolution quality. A medium-long shot capturing the dog's full run.
6
+ A dynamic action shot in the style of a professional skateboard magazine, featuring a young male longboarder accelerating downhill. He is fully focused, his expression intense and determined, carving through tight turns with precision. His longboard glides smoothly over the pavement, creating a blur of motion. He wears a black longboard shirt, blue jeans, and white sneakers, with a backpack slung over one shoulder. His hair flows behind him as he moves, and he grips the board tightly with both hands. The background shows a scenic urban street with blurred buildings and trees, hinting at a lively cityscape. The photo captures the moment just after he exits a turn, with a slight bounce in the board and a sense of speed and agility. A medium shot with a slightly elevated camera angle.
7
+ A dynamic hip-hop dance scene in a vibrant urban style, featuring an Asian girl in a bright yellow T-shirt and white pants. She is mid-dance move, arms stretched out and feet rhythmically stepping, exuding energy and confidence. Her hair is tied up in a ponytail, and she has a mischievous smile on her face. The background shows a bustling city street with blurred reflections of tall buildings and passing cars. The scene captures the lively and energetic atmosphere of a hip-hop performance, with a slightly grainy texture. A medium shot from a low-angle perspective.
8
+ A dynamic tracking shot following a skateboarder performing a series of fluid tricks down a bustling city street. The skateboarder, wearing a black helmet and a colorful shirt, moves with grace and confidence, executing flips, grinds, and spins. The camera captures the skateboarder's fluid movements, capturing the essence of each trick with precision. The background showcases the urban environment, with tall buildings, busy traffic, and passersby in the distance. The lighting highlights the skateboarder's movements, creating a sense of speed and energy. The overall style is reminiscent of a skateboarding documentary, emphasizing the natural and dynamic nature of the tricks.
9
+ A handheld camera captures a dog running through a park with a joyful exploration, the camera following the dog closely and bouncing and tilting with its movements. The dog bounds through the grass, tail wagging excitedly, sniffing at flowers and chasing after butterflies. Its fur glistens in the sunlight, and its eyes sparkle with enthusiasm. The park is filled with trees and colorful blooms, and the background shows a blurred path leading into the distance. The camera angle changes dynamically, providing a sense of the dog's lively energy and the vibrant environment around it.
10
+ A handheld shot following a young child running through a field of tall grass, capturing the spontaneity and playfulness of their movements. The child has curly brown hair and a mischievous smile, arms swinging freely as they sprint across the green expanse. Their small feet kick up bits of grass and dirt, creating a trail behind them. The background features a blurred landscape with rolling hills and scattered wildflowers, bathed in warm sunlight. The photo has a natural, documentary-style quality, emphasizing the dynamic motion and joy of the moment. A dynamic handheld shot from a slightly elevated angle, following the child's energetic run.
11
+ A high-speed action shot of a cheetah in its natural habitat, sprinting at full speed while chasing its prey across the savanna. The cheetah's golden fur glistens under the bright African sun, and its muscular body is stretched out in a powerful run. Its sharp eyes focus intently on the fleeing antelope, and its distinctive black tear marks streak down its face. The background is a blurred landscape with tall grass swaying in the wind, and distant acacia trees. The cheetah's tail is raised high, and its paws leave deep prints in the soft earth. A dynamic mid-shot capturing the intense moment of pursuit.
12
+ A photograph in a soft, warm lighting style, capturing a young woman with a bright smile and a playful wink. She has long curly brown hair and warm hazel eyes, with a slightly flushed cheeks from laughter. She is dressed in a casual yet stylish outfit: a floral printed sundress with a flowy skirt and a fitted top. Her hands are on her hips, giving a casual pose. The background features a blurred outdoor garden setting with blooming flowers and greenery. A medium shot from a slightly above-the-shoulder angle, emphasizing her joyful expression and the natural movement of her face.
13
+ A poignant moment captured in a realistic photographic style, showing a middle-aged man with a rugged face and slightly tousled hair, his chin quivering with emotion as he says a heartfelt goodbye to a loved one. He wears a simple grey sweater and jeans, standing on a dewy grassy field under a clear blue sky, with fluffy white clouds in the background. The camera angle is slightly from below, emphasizing his sorrowful expression and the depth of his feelings. A medium shot with a soft focus on the man's face and a blurred background.
14
+ A realistic photo of a llama wearing colorful pajamas dancing energetically on a stage under vibrant disco lighting. The llama has large floppy ears and a playful expression, moving its legs in a lively dance. It wears a red and yellow striped pajama top and matching pajama pants, with a fluffy tail swaying behind it. The stage is adorned with glittering disco balls and colorful lights, casting a lively and joyful atmosphere. The background features blurred audience members and a backdrop with disco-themed decorations. A dynamic shot capturing the llama mid-dance from a slightly elevated angle.
15
+ An adorable kangaroo, dressed in a cute green dress with polka dots, is wearing a small sun hat perched on its head. The kangaroo takes a pleasant stroll through the bustling streets of Mumbai during a vibrant and colorful festival. The background is filled with lively festival-goers in traditional Indian attire, adorned with intricate henna designs and bright jewelry. The scene is filled with colorful decorations, vendors selling various items, and people dancing and singing. The kangaroo moves gracefully, hopping along the cobblestone streets, its tail swinging behind it. The camera angle captures the kangaroo from a slight overhead perspective, highlighting its joyful expression and the festive atmosphere. A medium shot with dynamic movement.
16
+ An atmospheric and dramatic arc shot around a lone tree standing in a vast, foggy field at dawn. The early morning light filters through the mist, casting a soft, warm glow on the tree and the surrounding landscape. The tree's branches stretch out against the backdrop of a gradually lightening sky, with the shadows shifting and changing as the sun rises. The field is dotted with tall grasses and scattered wildflowers, their silhouettes softened by the fog. The overall scene has a moody, ethereal quality, emphasizing the natural movement of the fog and the subtle changes in light and shadow. A dynamic arc shot capturing the transition from night to day.
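The prompts above are stored one per line; a minimal loader like the following is enough to turn the file into a list of prompts for inference (the repository's own TextDataset in utils/dataset.py may implement this differently).

    from pathlib import Path

    def load_prompts(path="prompts/example_prompts.txt"):
        # One prompt per line; blank lines are skipped.
        lines = Path(path).read_text(encoding="utf-8").splitlines()
        return [line.strip() for line in lines if line.strip()]

    prompts = load_prompts()
    print(len(prompts), "prompts loaded")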
requirements.txt ADDED
@@ -0,0 +1,45 @@
1
+ torch==2.5.1
2
+ torchvision==0.20.1
3
+ torchaudio==2.5.1
4
+ opencv-python>=4.9.0.80
5
+ diffusers==0.31.0
6
+ transformers>=4.49.0
7
+ tokenizers>=0.20.3
8
+ accelerate>=1.1.1
9
+ tqdm
10
+ imageio
11
+ easydict
12
+ ftfy
13
+ dashscope
14
+ imageio-ffmpeg
15
+ numpy==1.24.4
16
+ wandb
17
+ omegaconf
18
+ einops
19
+ av==13.1.0
20
+ opencv-python
21
+ open_clip_torch
22
+ starlette
23
+ pycocotools
24
+ lmdb
25
+ matplotlib
26
+ sentencepiece
27
+ pydantic==2.10.6
28
+ scikit-image
29
+ huggingface_hub
30
+ dominate
31
+ nvidia-pyindex
32
+ nvidia-tensorrt
33
+ pycuda
34
+ onnx
35
+ onnxruntime
36
+ onnxscript
37
+ onnxconverter_common
38
+ flask
39
+ flask-socketio
40
+ torchao
41
+ tensorboard
42
+ ninja
43
+ packaging
44
+ --no-build-isolation
45
+ flash-attn
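Note that --no-build-isolation appears as a bare line above; some pip versions do not accept that flag inside a requirements file, so flash-attn is commonly installed in a separate "pip install flash-attn --no-build-isolation" step once the pinned torch build is already present. The optional pre-flight check below is not part of the repository; it only reports whether the key pinned packages are installed.

    from importlib.metadata import version, PackageNotFoundError

    # Hypothetical sanity check; distribution names mirror requirements.txt.
    for dist in ["torch", "torchvision", "diffusers", "transformers", "flash-attn"]:
        try:
            print(f"{dist}=={version(dist)}")
        except PackageNotFoundError:
            print(f"{dist} is not installed")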
train.py ADDED
@@ -0,0 +1,45 @@
1
+ import argparse
2
+ import os
3
+ from omegaconf import OmegaConf
4
+
5
+ from trainer import DiffusionTrainer, GANTrainer, ODETrainer, ScoreDistillationTrainer
6
+
7
+
8
+ def main():
9
+ parser = argparse.ArgumentParser()
10
+ parser.add_argument("--config_path", type=str, required=True)
11
+ parser.add_argument("--no_save", action="store_true")
12
+ parser.add_argument("--no_visualize", action="store_true")
13
+ parser.add_argument("--logdir", type=str, default="", help="Path to the directory to save logs")
14
+ parser.add_argument("--wandb-save-dir", type=str, default="", help="Path to the directory to save wandb logs")
15
+ parser.add_argument("--disable-wandb", default=False, action="store_true")
16
+
17
+ args = parser.parse_args()
18
+
19
+ config = OmegaConf.load(args.config_path)
20
+ default_config = OmegaConf.load("configs/default_config.yaml")
21
+ config = OmegaConf.merge(default_config, config)
22
+ config.no_save = args.no_save
23
+ config.no_visualize = args.no_visualize
24
+
25
+ # get the filename of config_path
26
+ config_name = os.path.basename(args.config_path).split(".")[0]
27
+ config.config_name = config_name
28
+ config.logdir = args.logdir
29
+ config.wandb_save_dir = args.wandb_save_dir
30
+ config.disable_wandb = args.disable_wandb
31
+
32
+ if config.trainer == "diffusion":
33
+ trainer = DiffusionTrainer(config)
34
+ elif config.trainer == "gan":
35
+ trainer = GANTrainer(config)
36
+ elif config.trainer == "ode":
37
+ trainer = ODETrainer(config)
38
+ elif config.trainer == "score_distillation":
39
+ trainer = ScoreDistillationTrainer(config)
+ else:
+ raise ValueError(f"Unknown trainer: {config.trainer}")
40
+ trainer.train()
41
+
42
+
43
+
44
+ if __name__ == "__main__":
45
+ main()
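train.py merges the experiment config on top of configs/default_config.yaml, so any key present in both files takes its value from the experiment config. A short sketch of that precedence with OmegaConf; the config contents here are made up for illustration.

    from omegaconf import OmegaConf

    default_cfg = OmegaConf.create({"trainer": "diffusion", "lr": 1e-5, "batch_size": 1})
    exp_cfg = OmegaConf.create({"trainer": "score_distillation", "lr": 2e-6})

    config = OmegaConf.merge(default_cfg, exp_cfg)  # later configs win on conflicting keys
    print(config.trainer, config.lr, config.batch_size)  # score_distillation 2e-06 1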
trainer/__init__.py ADDED
@@ -0,0 +1,11 @@
1
+ from .diffusion import Trainer as DiffusionTrainer
2
+ from .gan import Trainer as GANTrainer
3
+ from .ode import Trainer as ODETrainer
4
+ from .distillation import Trainer as ScoreDistillationTrainer
5
+
6
+ __all__ = [
7
+ "DiffusionTrainer",
8
+ "GANTrainer",
9
+ "ODETrainer",
10
+ "ScoreDistillationTrainer"
11
+ ]
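train.py selects one of these four trainers with an if/elif chain on config.trainer. An equivalent, easily extended alternative is a small registry keyed by the same strings; this is only a design sketch, not how the repository currently dispatches.

    from trainer import DiffusionTrainer, GANTrainer, ODETrainer, ScoreDistillationTrainer

    TRAINERS = {
        "diffusion": DiffusionTrainer,
        "gan": GANTrainer,
        "ode": ODETrainer,
        "score_distillation": ScoreDistillationTrainer,
    }

    def build_trainer(config):
        # Look up the trainer class by name and instantiate it with the merged config.
        try:
            return TRAINERS[config.trainer](config)
        except KeyError:
            raise ValueError(f"Unknown trainer: {config.trainer}") from None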
trainer/diffusion.py ADDED
@@ -0,0 +1,265 @@
1
+ import gc
2
+ import logging
3
+
4
+ from model import CausalDiffusion
5
+ from utils.dataset import ShardingLMDBDataset, cycle
6
+ from utils.misc import set_seed
7
+ import torch.distributed as dist
8
+ from omegaconf import OmegaConf
9
+ import torch
10
+ import wandb
11
+ import time
12
+ import os
13
+
14
+ from utils.distributed import EMA_FSDP, barrier, fsdp_wrap, fsdp_state_dict, launch_distributed_job
15
+
16
+
17
+ class Trainer:
18
+ def __init__(self, config):
19
+ self.config = config
20
+ self.step = 0
21
+
22
+ # Step 1: Initialize the distributed training environment (rank, seed, dtype, logging etc.)
23
+ torch.backends.cuda.matmul.allow_tf32 = True
24
+ torch.backends.cudnn.allow_tf32 = True
25
+
26
+ launch_distributed_job()
27
+ global_rank = dist.get_rank()
28
+
29
+ self.dtype = torch.bfloat16 if config.mixed_precision else torch.float32
30
+ self.device = torch.cuda.current_device()
31
+ self.is_main_process = global_rank == 0
32
+ self.causal = config.causal
33
+ self.disable_wandb = config.disable_wandb
34
+
35
+ # use a random seed for the training
36
+ if config.seed == 0:
37
+ random_seed = torch.randint(0, 10000000, (1,), device=self.device)
38
+ dist.broadcast(random_seed, src=0)
39
+ config.seed = random_seed.item()
40
+
41
+ set_seed(config.seed + global_rank)
42
+
43
+ if self.is_main_process and not self.disable_wandb:
44
+ wandb.login(host=config.wandb_host, key=config.wandb_key)
45
+ wandb.init(
46
+ config=OmegaConf.to_container(config, resolve=True),
47
+ name=config.config_name,
48
+ mode="online",
49
+ entity=config.wandb_entity,
50
+ project=config.wandb_project,
51
+ dir=config.wandb_save_dir
52
+ )
53
+
54
+ self.output_path = config.logdir
55
+
56
+ # Step 2: Initialize the model and optimizer
57
+ self.model = CausalDiffusion(config, device=self.device)
58
+ self.model.generator = fsdp_wrap(
59
+ self.model.generator,
60
+ sharding_strategy=config.sharding_strategy,
61
+ mixed_precision=config.mixed_precision,
62
+ wrap_strategy=config.generator_fsdp_wrap_strategy
63
+ )
64
+
65
+ self.model.text_encoder = fsdp_wrap(
66
+ self.model.text_encoder,
67
+ sharding_strategy=config.sharding_strategy,
68
+ mixed_precision=config.mixed_precision,
69
+ wrap_strategy=config.text_encoder_fsdp_wrap_strategy
70
+ )
71
+
72
+ if not config.no_visualize or config.load_raw_video:
73
+ self.model.vae = self.model.vae.to(
74
+ device=self.device, dtype=torch.bfloat16 if config.mixed_precision else torch.float32)
75
+
76
+ self.generator_optimizer = torch.optim.AdamW(
77
+ [param for param in self.model.generator.parameters()
78
+ if param.requires_grad],
79
+ lr=config.lr,
80
+ betas=(config.beta1, config.beta2),
81
+ weight_decay=config.weight_decay
82
+ )
83
+
84
+ # Step 3: Initialize the dataloader
85
+ dataset = ShardingLMDBDataset(config.data_path, max_pair=int(1e8))
86
+ sampler = torch.utils.data.distributed.DistributedSampler(
87
+ dataset, shuffle=True, drop_last=True)
88
+ dataloader = torch.utils.data.DataLoader(
89
+ dataset,
90
+ batch_size=config.batch_size,
91
+ sampler=sampler,
92
+ num_workers=8)
93
+
94
+ if dist.get_rank() == 0:
95
+ print("DATASET SIZE %d" % len(dataset))
96
+ self.dataloader = cycle(dataloader)
97
+
98
+ ##############################################################################################################
99
+ # 6. Set up EMA parameter containers
100
+ rename_param = (
101
+ lambda name: name.replace("_fsdp_wrapped_module.", "")
102
+ .replace("_checkpoint_wrapped_module.", "")
103
+ .replace("_orig_mod.", "")
104
+ )
105
+ self.name_to_trainable_params = {}
106
+ for n, p in self.model.generator.named_parameters():
107
+ if not p.requires_grad:
108
+ continue
109
+
110
+ renamed_n = rename_param(n)
111
+ self.name_to_trainable_params[renamed_n] = p
112
+ ema_weight = config.ema_weight
113
+ self.generator_ema = None
114
+ if (ema_weight is not None) and (ema_weight > 0.0):
115
+ print(f"Setting up EMA with weight {ema_weight}")
116
+ self.generator_ema = EMA_FSDP(self.model.generator, decay=ema_weight)
117
+
118
+ ##############################################################################################################
119
+ # 7. (If resuming) Load the model and optimizer, lr_scheduler, ema's statedicts
120
+ if getattr(config, "generator_ckpt", False):
121
+ print(f"Loading pretrained generator from {config.generator_ckpt}")
122
+ state_dict = torch.load(config.generator_ckpt, map_location="cpu")
123
+ if "generator" in state_dict:
124
+ state_dict = state_dict["generator"]
125
+ elif "model" in state_dict:
126
+ state_dict = state_dict["model"]
127
+ self.model.generator.load_state_dict(
128
+ state_dict, strict=True
129
+ )
130
+
131
+ ##############################################################################################################
132
+
133
+ # Let's delete EMA params for early steps to save some computes at training and inference
134
+ if self.step < config.ema_start_step:
135
+ self.generator_ema = None
136
+
137
+ self.max_grad_norm = 10.0
138
+ self.previous_time = None
139
+
140
+ def save(self):
141
+ print("Start gathering distributed model states...")
142
+ generator_state_dict = fsdp_state_dict(
143
+ self.model.generator)
144
+
145
+ if self.config.ema_start_step < self.step:
146
+ state_dict = {
147
+ "generator": generator_state_dict,
148
+ "generator_ema": self.generator_ema.state_dict(),
149
+ }
150
+ else:
151
+ state_dict = {
152
+ "generator": generator_state_dict,
153
+ }
154
+
155
+ if self.is_main_process:
156
+ os.makedirs(os.path.join(self.output_path,
157
+ f"checkpoint_model_{self.step:06d}"), exist_ok=True)
158
+ torch.save(state_dict, os.path.join(self.output_path,
159
+ f"checkpoint_model_{self.step:06d}", "model.pt"))
160
+ print("Model saved to", os.path.join(self.output_path,
161
+ f"checkpoint_model_{self.step:06d}", "model.pt"))
162
+
163
+ def train_one_step(self, batch):
164
+ self.log_iters = 1
165
+
166
+ if self.step % 20 == 0:
167
+ torch.cuda.empty_cache()
168
+
169
+ # Step 1: Get the next batch of text prompts
170
+ text_prompts = batch["prompts"]
171
+ if not self.config.load_raw_video: # precomputed latent
172
+ clean_latent = batch["ode_latent"][:, -1].to(
173
+ device=self.device, dtype=self.dtype)
174
+ else: # encode raw video to latent
175
+ frames = batch["frames"].to(
176
+ device=self.device, dtype=self.dtype)
177
+ with torch.no_grad():
178
+ clean_latent = self.model.vae.encode_to_latent(
179
+ frames).to(device=self.device, dtype=self.dtype)
180
+ image_latent = clean_latent[:, 0:1, ]
181
+
182
+ batch_size = len(text_prompts)
183
+ image_or_video_shape = list(self.config.image_or_video_shape)
184
+ image_or_video_shape[0] = batch_size
185
+
186
+ # Step 2: Extract the conditional infos
187
+ with torch.no_grad():
188
+ conditional_dict = self.model.text_encoder(
189
+ text_prompts=text_prompts)
190
+
191
+ if not getattr(self, "unconditional_dict", None):
192
+ unconditional_dict = self.model.text_encoder(
193
+ text_prompts=[self.config.negative_prompt] * batch_size)
194
+ unconditional_dict = {k: v.detach()
195
+ for k, v in unconditional_dict.items()}
196
+ self.unconditional_dict = unconditional_dict # cache the unconditional_dict
197
+ else:
198
+ unconditional_dict = self.unconditional_dict
199
+
200
+ # Step 3: Train the generator
201
+ generator_loss, log_dict = self.model.generator_loss(
202
+ image_or_video_shape=image_or_video_shape,
203
+ conditional_dict=conditional_dict,
204
+ unconditional_dict=unconditional_dict,
205
+ clean_latent=clean_latent,
206
+ initial_latent=image_latent
207
+ )
208
+ self.generator_optimizer.zero_grad()
209
+ generator_loss.backward()
210
+ generator_grad_norm = self.model.generator.clip_grad_norm_(
211
+ self.max_grad_norm)
212
+ self.generator_optimizer.step()
213
+
214
+ # Increment the step since we finished gradient update
215
+ self.step += 1
216
+
217
+ wandb_loss_dict = {
218
+ "generator_loss": generator_loss.item(),
219
+ "generator_grad_norm": generator_grad_norm.item(),
220
+ }
221
+
222
+ # Step 4: Logging
223
+ if self.is_main_process:
224
+ if not self.disable_wandb:
225
+ wandb.log(wandb_loss_dict, step=self.step)
226
+
227
+ if self.step % self.config.gc_interval == 0:
228
+ if dist.get_rank() == 0:
229
+ logging.info("DistGarbageCollector: Running GC.")
230
+ gc.collect()
231
+
232
+ # Step 5. Create EMA params
233
+ # TODO: Implement EMA
234
+
235
+ def generate_video(self, pipeline, prompts, image=None):
236
+ batch_size = len(prompts)
237
+ sampled_noise = torch.randn(
238
+ [batch_size, 21, 16, 60, 104], device="cuda", dtype=self.dtype
239
+ )
240
+ video, _ = pipeline.inference(
241
+ noise=sampled_noise,
242
+ text_prompts=prompts,
243
+ return_latents=True
244
+ )
245
+ current_video = video.permute(0, 1, 3, 4, 2).cpu().numpy() * 255.0
246
+ return current_video
247
+
248
+ def train(self):
249
+ while True:
250
+ batch = next(self.dataloader)
251
+ self.train_one_step(batch)
252
+ if (not self.config.no_save) and self.step % self.config.log_iters == 0:
253
+ torch.cuda.empty_cache()
254
+ self.save()
255
+ torch.cuda.empty_cache()
256
+
257
+ barrier()
258
+ if self.is_main_process:
259
+ current_time = time.time()
260
+ if self.previous_time is None:
261
+ self.previous_time = current_time
262
+ else:
263
+ if not self.disable_wandb:
264
+ wandb.log({"per iteration time": current_time - self.previous_time}, step=self.step)
265
+ self.previous_time = current_time
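The training loop draws batches from cycle(dataloader) so it can run indefinitely without tracking epochs. That helper lives in utils/dataset.py and is not shown in this diff; a typical implementation looks like the sketch below (the repository's version may additionally reset the DistributedSampler epoch when the loader is exhausted).

    def cycle(dataloader):
        # Yield batches forever by restarting the dataloader when it is exhausted.
        while True:
            for batch in dataloader:
                yield batch

    # usage inside the trainer:
    #   self.dataloader = cycle(dataloader)
    #   batch = next(self.dataloader)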
trainer/distillation.py ADDED
@@ -0,0 +1,398 @@
1
+ import gc
2
+ import logging
3
+
4
+ from utils.dataset import ShardingLMDBDataset, cycle
5
+ from utils.dataset import TextDataset
6
+ from utils.distributed import EMA_FSDP, fsdp_wrap, fsdp_state_dict, launch_distributed_job
7
+ from utils.misc import (
8
+ set_seed,
9
+ merge_dict_list
10
+ )
11
+ import torch.distributed as dist
12
+ from omegaconf import OmegaConf
13
+ from model import CausVid, DMD, SiD
14
+ import torch
15
+ from torch.utils.tensorboard import SummaryWriter
16
+ import time
17
+ import os
18
+
19
+
20
+ class Trainer:
21
+ def __init__(self, config):
22
+ self.config = config
23
+ self.step = 0
24
+
25
+ # Step 1: Initialize the distributed training environment (rank, seed, dtype, logging etc.)
26
+ torch.backends.cuda.matmul.allow_tf32 = True
27
+ torch.backends.cudnn.allow_tf32 = True
28
+
29
+ launch_distributed_job()
30
+ global_rank = dist.get_rank()
31
+ self.world_size = dist.get_world_size()
32
+
33
+ self.dtype = torch.bfloat16 if config.mixed_precision else torch.float32
34
+ self.device = torch.cuda.current_device()
35
+ self.is_main_process = global_rank == 0
36
+ self.causal = config.causal
37
+
38
+ # use a random seed for the training
39
+ if config.seed == 0:
40
+ random_seed = torch.randint(0, 10000000, (1,), device=self.device)
41
+ dist.broadcast(random_seed, src=0)
42
+ config.seed = random_seed.item()
43
+
44
+ set_seed(config.seed + global_rank)
45
+
46
+ if self.is_main_process:
47
+ self.writer = SummaryWriter(
48
+ log_dir=os.path.join(config.logdir, "tensorboard"),
49
+ flush_secs=10
50
+ )
51
+
52
+ self.output_path = config.logdir
53
+
54
+ # Step 2: Initialize the model and optimizer
55
+ if config.distribution_loss == "causvid":
56
+ self.model = CausVid(config, device=self.device)
57
+ elif config.distribution_loss == "dmd":
58
+ self.model = DMD(config, device=self.device)
59
+ elif config.distribution_loss == "sid":
60
+ self.model = SiD(config, device=self.device)
61
+ else:
62
+ raise ValueError("Invalid distribution matching loss")
63
+
64
+ # Save pretrained model state_dicts to CPU
65
+ self.fake_score_state_dict_cpu = self.model.fake_score.state_dict()
66
+
67
+ self.model.generator = fsdp_wrap(
68
+ self.model.generator,
69
+ sharding_strategy=config.sharding_strategy,
70
+ mixed_precision=config.mixed_precision,
71
+ wrap_strategy=config.generator_fsdp_wrap_strategy
72
+ )
73
+
74
+ self.model.real_score = fsdp_wrap(
75
+ self.model.real_score,
76
+ sharding_strategy=config.sharding_strategy,
77
+ mixed_precision=config.mixed_precision,
78
+ wrap_strategy=config.real_score_fsdp_wrap_strategy
79
+ )
80
+
81
+ self.model.fake_score = fsdp_wrap(
82
+ self.model.fake_score,
83
+ sharding_strategy=config.sharding_strategy,
84
+ mixed_precision=config.mixed_precision,
85
+ wrap_strategy=config.fake_score_fsdp_wrap_strategy
86
+ )
87
+
88
+ self.model.text_encoder = fsdp_wrap(
89
+ self.model.text_encoder,
90
+ sharding_strategy=config.sharding_strategy,
91
+ mixed_precision=config.mixed_precision,
92
+ wrap_strategy=config.text_encoder_fsdp_wrap_strategy,
93
+ cpu_offload=getattr(config, "text_encoder_cpu_offload", False)
94
+ )
95
+
96
+ if not config.no_visualize or config.load_raw_video:
97
+ self.model.vae = self.model.vae.to(
98
+ device=self.device, dtype=torch.bfloat16 if config.mixed_precision else torch.float32)
99
+
100
+ self.generator_optimizer = torch.optim.AdamW(
101
+ [param for param in self.model.generator.parameters()
102
+ if param.requires_grad],
103
+ lr=config.lr,
104
+ betas=(config.beta1, config.beta2),
105
+ weight_decay=config.weight_decay
106
+ )
107
+
108
+ self.critic_optimizer = torch.optim.AdamW(
109
+ [param for param in self.model.fake_score.parameters()
110
+ if param.requires_grad],
111
+ lr=config.lr_critic if hasattr(config, "lr_critic") else config.lr,
112
+ betas=(config.beta1_critic, config.beta2_critic),
113
+ weight_decay=config.weight_decay
114
+ )
115
+
116
+ # Step 3: Initialize the dataloader
117
+ if self.config.i2v:
118
+ dataset = ShardingLMDBDataset(config.data_path, max_pair=int(1e8))
119
+ else:
120
+ dataset = TextDataset(config.data_path)
121
+ sampler = torch.utils.data.distributed.DistributedSampler(
122
+ dataset, shuffle=True, drop_last=True)
123
+ dataloader = torch.utils.data.DataLoader(
124
+ dataset,
125
+ batch_size=config.batch_size,
126
+ sampler=sampler,
127
+ num_workers=8)
128
+
129
+ if dist.get_rank() == 0:
130
+ print("DATASET SIZE %d" % len(dataset))
131
+ self.dataloader = cycle(dataloader)
132
+
133
+ ##############################################################################################################
134
+ # 6. Set up EMA parameter containers
135
+ rename_param = (
136
+ lambda name: name.replace("_fsdp_wrapped_module.", "")
137
+ .replace("_checkpoint_wrapped_module.", "")
138
+ .replace("_orig_mod.", "")
139
+ )
140
+ self.name_to_trainable_params = {}
141
+ for n, p in self.model.generator.named_parameters():
142
+ if not p.requires_grad:
143
+ continue
144
+
145
+ renamed_n = rename_param(n)
146
+ self.name_to_trainable_params[renamed_n] = p
147
+ ema_weight = config.ema_weight
148
+ self.generator_ema = None
149
+ if (ema_weight is not None) and (ema_weight > 0.0):
150
+ print(f"Setting up EMA with weight {ema_weight}")
151
+ self.generator_ema = EMA_FSDP(self.model.generator, decay=ema_weight)
152
+
153
+ ##############################################################################################################
154
+ # 7. (If resuming) Load the model and optimizer, lr_scheduler, ema's statedicts
155
+ if getattr(config, "generator_ckpt", False):
156
+ print(f"Loading pretrained generator from {config.generator_ckpt}")
157
+ state_dict = torch.load(config.generator_ckpt, map_location="cpu")
158
+ if "generator" in state_dict:
159
+ state_dict = state_dict["generator"]
160
+ elif "model" in state_dict:
161
+ state_dict = state_dict["model"]
162
+ self.model.generator.load_state_dict(
163
+ state_dict, strict=True
164
+ )
165
+
166
+ ##############################################################################################################
167
+
168
+ # Let's delete EMA params for early steps to save some computes at training and inference
169
+ if self.step < config.ema_start_step:
170
+ self.generator_ema = None
171
+
172
+ self.max_grad_norm_generator = getattr(config, "max_grad_norm_generator", 10.0)
173
+ self.max_grad_norm_critic = getattr(config, "max_grad_norm_critic", 10.0)
174
+ self.previous_time = None
175
+
176
+ def save(self):
177
+ print("Start gathering distributed model states...")
178
+ generator_state_dict = fsdp_state_dict(
179
+ self.model.generator)
180
+ critic_state_dict = fsdp_state_dict(
181
+ self.model.fake_score)
182
+
183
+ if self.config.ema_start_step < self.step:
184
+ state_dict = {
185
+ "generator": generator_state_dict,
186
+ "critic": critic_state_dict,
187
+ "generator_ema": self.generator_ema.state_dict(),
188
+ }
189
+ else:
190
+ state_dict = {
191
+ "generator": generator_state_dict,
192
+ "critic": critic_state_dict,
193
+ }
194
+
195
+ if self.is_main_process:
196
+ os.makedirs(os.path.join(self.output_path,
197
+ f"checkpoint_model_{self.step:06d}"), exist_ok=True)
198
+ torch.save(state_dict, os.path.join(self.output_path,
199
+ f"checkpoint_model_{self.step:06d}", "model.pt"))
200
+ print("Model saved to", os.path.join(self.output_path,
201
+ f"checkpoint_model_{self.step:06d}", "model.pt"))
202
+
203
+ def fwdbwd_one_step(self, batch, train_generator):
204
+ self.model.eval() # prevent any randomness (e.g. dropout)
205
+
206
+ if self.step % 20 == 0:
207
+ torch.cuda.empty_cache()
208
+
209
+ # Step 1: Get the next batch of text prompts
210
+ text_prompts = batch["prompts"]
211
+ if self.config.i2v:
212
+ clean_latent = None
213
+ image_latent = batch["ode_latent"][:, -1][:, 0:1, ].to(
214
+ device=self.device, dtype=self.dtype)
215
+ else:
216
+ clean_latent = None
217
+ image_latent = None
218
+
219
+ batch_size = len(text_prompts)
220
+ image_or_video_shape = list(self.config.image_or_video_shape)
221
+ image_or_video_shape[0] = batch_size
222
+
223
+ # Step 2: Extract the conditional infos
224
+ with torch.no_grad():
225
+ conditional_dict = self.model.text_encoder(
226
+ text_prompts=text_prompts)
227
+
228
+ if not getattr(self, "unconditional_dict", None):
229
+ unconditional_dict = self.model.text_encoder(
230
+ text_prompts=[self.config.negative_prompt] * batch_size)
231
+ unconditional_dict = {k: v.detach()
232
+ for k, v in unconditional_dict.items()}
233
+ self.unconditional_dict = unconditional_dict # cache the unconditional_dict
234
+ else:
235
+ unconditional_dict = self.unconditional_dict
236
+
237
+ # Step 3: Store gradients for the generator (if training the generator)
238
+ if train_generator:
239
+ generator_loss, generator_log_dict = self.model.generator_loss(
240
+ image_or_video_shape=image_or_video_shape,
241
+ conditional_dict=conditional_dict,
242
+ unconditional_dict=unconditional_dict,
243
+ clean_latent=clean_latent,
244
+ initial_latent=image_latent if self.config.i2v else None
245
+ )
246
+
247
+ generator_loss.backward()
248
+ generator_grad_norm = self.model.generator.clip_grad_norm_(
249
+ self.max_grad_norm_generator)
250
+
251
+ generator_log_dict.update({"generator_loss": generator_loss,
252
+ "generator_grad_norm": generator_grad_norm})
253
+
254
+ return generator_log_dict
255
+ else:
256
+ generator_log_dict = {}
257
+
258
+ # Step 4: Store gradients for the critic (if training the critic)
259
+ critic_loss, critic_log_dict = self.model.critic_loss(
260
+ image_or_video_shape=image_or_video_shape,
261
+ conditional_dict=conditional_dict,
262
+ unconditional_dict=unconditional_dict,
263
+ clean_latent=clean_latent,
264
+ initial_latent=image_latent if self.config.i2v else None
265
+ )
266
+
267
+ critic_loss.backward()
268
+ critic_grad_norm = self.model.fake_score.clip_grad_norm_(
269
+ self.max_grad_norm_critic)
270
+
271
+ critic_log_dict.update({"critic_loss": critic_loss,
272
+ "critic_grad_norm": critic_grad_norm})
273
+
274
+ return critic_log_dict
275
+
276
+ def generate_video(self, pipeline, prompts, image=None):
277
+ batch_size = len(prompts)
278
+ if image is not None:
279
+ image = image.squeeze(0).unsqueeze(0).unsqueeze(2).to(device="cuda", dtype=torch.bfloat16)
280
+
281
+ # Encode the input image as the first latent
282
+ initial_latent = pipeline.vae.encode_to_latent(image).to(device="cuda", dtype=torch.bfloat16)
283
+ initial_latent = initial_latent.repeat(batch_size, 1, 1, 1, 1)
284
+ sampled_noise = torch.randn(
285
+ [batch_size, self.model.num_training_frames - 1, 16, 60, 104],
286
+ device="cuda",
287
+ dtype=self.dtype
288
+ )
289
+ else:
290
+ initial_latent = None
291
+ sampled_noise = torch.randn(
292
+ [batch_size, self.model.num_training_frames, 16, 60, 104],
293
+ device="cuda",
294
+ dtype=self.dtype
295
+ )
296
+
297
+ video, _ = pipeline.inference(
298
+ noise=sampled_noise,
299
+ text_prompts=prompts,
300
+ return_latents=True,
301
+ initial_latent=initial_latent
302
+ )
303
+ current_video = video.permute(0, 1, 3, 4, 2).cpu().numpy() * 255.0
304
+ return current_video
305
+
306
+ def train(self):
307
+ start_step = self.step
308
+
309
+ while True:
310
+ TRAIN_GENERATOR = self.step % self.config.dfake_gen_update_ratio == 0
311
+
312
+ # Train the generator
313
+ if TRAIN_GENERATOR:
314
+ self.generator_optimizer.zero_grad(set_to_none=True)
315
+ extras_list = []
316
+ batch = next(self.dataloader)
317
+ extra = self.fwdbwd_one_step(batch, True)
318
+ extras_list.append(extra)
319
+ generator_log_dict = merge_dict_list(extras_list)
320
+ self.generator_optimizer.step()
321
+ if self.generator_ema is not None:
322
+ self.generator_ema.update(self.model.generator)
323
+
324
+ # Train the critic
325
+ self.critic_optimizer.zero_grad(set_to_none=True)
326
+ extras_list = []
327
+ batch = next(self.dataloader)
328
+ extra = self.fwdbwd_one_step(batch, False)
329
+ extras_list.append(extra)
330
+ critic_log_dict = merge_dict_list(extras_list)
331
+ self.critic_optimizer.step()
332
+
333
+ # Increment the step since we finished gradient update
334
+ self.step += 1
335
+
336
+ # Create EMA params (if not already created)
337
+ if (self.step >= self.config.ema_start_step) and \
338
+ (self.generator_ema is None) and (self.config.ema_weight > 0):
339
+ self.generator_ema = EMA_FSDP(self.model.generator, decay=self.config.ema_weight)
340
+
341
+ # Save the model
342
+ if (not self.config.no_save) and (self.step - start_step) > 0 and self.step % self.config.log_iters == 0:
343
+ torch.cuda.empty_cache()
344
+ self.save()
345
+ torch.cuda.empty_cache()
346
+
347
+ # Logging
348
+ if self.is_main_process:
349
+
350
+ if TRAIN_GENERATOR:
351
+ self.writer.add_scalar(
352
+ "generator_loss",
353
+ generator_log_dict["generator_loss"].mean().item(),
354
+ self.step
355
+ )
356
+ self.writer.add_scalar(
357
+ "generator_grad_norm",
358
+ generator_log_dict["generator_grad_norm"].mean().item(),
359
+ self.step
360
+ )
361
+ self.writer.add_scalar(
362
+ "dmdtrain_gradient_norm",
363
+ generator_log_dict["dmdtrain_gradient_norm"].mean().item(),
364
+ self.step
365
+ )
366
+
367
+ self.writer.add_scalar(
368
+ "critic_loss",
369
+ critic_log_dict["critic_loss"].mean().item(),
370
+ self.step
371
+ )
372
+ self.writer.add_scalar(
373
+ "critic_grad_norm",
374
+ critic_log_dict["critic_grad_norm"].mean().item(),
375
+ self.step
376
+ )
377
+
378
+ if self.step % self.config.gc_interval == 0:
379
+ if dist.get_rank() == 0:
380
+ logging.info("DistGarbageCollector: Running GC.")
381
+ gc.collect()
382
+ torch.cuda.empty_cache()
383
+
384
+ if self.is_main_process:
385
+ current_time = time.time()
386
+ if self.previous_time is None:
387
+ self.previous_time = current_time
388
+ else:
389
+ self.writer.add_scalar(
390
+ "per iteration time",
391
+ current_time - self.previous_time,
392
+ self.step
393
+ )
394
+ print(
395
+ f"Step {self.step} | "
396
+ f"Iteration time: {current_time - self.previous_time:.2f} seconds | "
397
+ )
398
+ self.previous_time = current_time
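In this trainer, dfake_gen_update_ratio controls how often the generator is updated relative to the critic: the generator step runs only when the current step is a multiple of the ratio, while the critic is updated every iteration. A stripped-down sketch of that schedule, with the actual model and optimizer calls stubbed out:

    def training_schedule(num_steps, dfake_gen_update_ratio=5):
        # Yields which networks would be updated at each step.
        for step in range(num_steps):
            train_generator = step % dfake_gen_update_ratio == 0
            yield step, "generator+critic" if train_generator else "critic"

    print([phase for _, phase in training_schedule(10)])
    # ['generator+critic', 'critic', 'critic', 'critic', 'critic', 'generator+critic', ...]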
trainer/gan.py ADDED
@@ -0,0 +1,464 @@
1
+ import gc
2
+ import logging
3
+
4
+ from utils.dataset import ShardingLMDBDataset, cycle
5
+ from utils.distributed import EMA_FSDP, fsdp_wrap, fsdp_state_dict, launch_distributed_job
6
+ from utils.misc import (
7
+ set_seed,
8
+ merge_dict_list
9
+ )
10
+ import torch.distributed as dist
11
+ from omegaconf import OmegaConf
12
+ from model import GAN
13
+ import torch
14
+ import wandb
15
+ import time
16
+ import os
17
+
18
+
19
+ class Trainer:
20
+ def __init__(self, config):
21
+ self.config = config
22
+ self.step = 0
23
+
24
+ # Step 1: Initialize the distributed training environment (rank, seed, dtype, logging etc.)
25
+ torch.backends.cuda.matmul.allow_tf32 = True
26
+ torch.backends.cudnn.allow_tf32 = True
27
+
28
+ launch_distributed_job()
29
+ global_rank = dist.get_rank()
30
+ self.world_size = dist.get_world_size()
31
+
32
+ self.dtype = torch.bfloat16 if config.mixed_precision else torch.float32
33
+ self.device = torch.cuda.current_device()
34
+ self.is_main_process = global_rank == 0
35
+ self.causal = config.causal
36
+ self.disable_wandb = config.disable_wandb
37
+
38
+ # Configuration for discriminator warmup
39
+ self.discriminator_warmup_steps = getattr(config, "discriminator_warmup_steps", 0)
40
+ self.in_discriminator_warmup = self.step < self.discriminator_warmup_steps
41
+ if self.in_discriminator_warmup and self.is_main_process:
42
+ print(f"Starting with discriminator warmup for {self.discriminator_warmup_steps} steps")
43
+ self.loss_scale = getattr(config, "loss_scale", 1.0)
44
+
45
+ # use a random seed for the training
46
+ if config.seed == 0:
47
+ random_seed = torch.randint(0, 10000000, (1,), device=self.device)
48
+ dist.broadcast(random_seed, src=0)
49
+ config.seed = random_seed.item()
50
+
51
+ set_seed(config.seed + global_rank)
52
+
53
+ if self.is_main_process and not self.disable_wandb:
54
+ wandb.login(host=config.wandb_host, key=config.wandb_key)
55
+ wandb.init(
56
+ config=OmegaConf.to_container(config, resolve=True),
57
+ name=config.config_name,
58
+ mode="online",
59
+ entity=config.wandb_entity,
60
+ project=config.wandb_project,
61
+ dir=config.wandb_save_dir
62
+ )
63
+
64
+ self.output_path = config.logdir
65
+
66
+ # Step 2: Initialize the model and optimizer
67
+ self.model = GAN(config, device=self.device)
68
+
69
+ self.model.generator = fsdp_wrap(
70
+ self.model.generator,
71
+ sharding_strategy=config.sharding_strategy,
72
+ mixed_precision=config.mixed_precision,
73
+ wrap_strategy=config.generator_fsdp_wrap_strategy
74
+ )
75
+
76
+ self.model.fake_score = fsdp_wrap(
77
+ self.model.fake_score,
78
+ sharding_strategy=config.sharding_strategy,
79
+ mixed_precision=config.mixed_precision,
80
+ wrap_strategy=config.fake_score_fsdp_wrap_strategy
81
+ )
82
+
83
+ self.model.text_encoder = fsdp_wrap(
84
+ self.model.text_encoder,
85
+ sharding_strategy=config.sharding_strategy,
86
+ mixed_precision=config.mixed_precision,
87
+ wrap_strategy=config.text_encoder_fsdp_wrap_strategy,
88
+ cpu_offload=getattr(config, "text_encoder_cpu_offload", False)
89
+ )
90
+
91
+ if not config.no_visualize or config.load_raw_video:
92
+ self.model.vae = self.model.vae.to(
93
+ device=self.device, dtype=torch.bfloat16 if config.mixed_precision else torch.float32)
94
+
95
+ self.generator_optimizer = torch.optim.AdamW(
96
+ [param for param in self.model.generator.parameters()
97
+ if param.requires_grad],
98
+ lr=config.gen_lr,
99
+ betas=(config.beta1, config.beta2)
100
+ )
101
+
102
+ # Create separate parameter groups for the fake_score network
103
+ # One group for parameters with "_cls_pred_branch" or "_gan_ca_blocks" in the name
104
+ # and another group for all other parameters
105
+ fake_score_params = []
106
+ discriminator_params = []
107
+
108
+ for name, param in self.model.fake_score.named_parameters():
109
+ if param.requires_grad:
110
+ if "_cls_pred_branch" in name or "_gan_ca_blocks" in name:
111
+ discriminator_params.append(param)
112
+ else:
113
+ fake_score_params.append(param)
114
+
115
+ # Use the special learning rate for the special parameter group
116
+ # and the default critic learning rate for other parameters
117
+ self.critic_param_groups = [
118
+ {'params': fake_score_params, 'lr': config.critic_lr},
119
+ {'params': discriminator_params, 'lr': config.critic_lr * config.discriminator_lr_multiplier}
120
+ ]
121
+ if self.in_discriminator_warmup:
122
+ self.critic_optimizer = torch.optim.AdamW(
123
+ self.critic_param_groups,
124
+ betas=(0.9, config.beta2_critic)
125
+ )
126
+ else:
127
+ self.critic_optimizer = torch.optim.AdamW(
128
+ self.critic_param_groups,
129
+ betas=(config.beta1_critic, config.beta2_critic)
130
+ )
131
+
132
+ # Step 3: Initialize the dataloader
133
+ self.data_path = config.data_path
134
+ dataset = ShardingLMDBDataset(config.data_path, max_pair=int(1e8))
135
+ sampler = torch.utils.data.distributed.DistributedSampler(
136
+ dataset, shuffle=True, drop_last=True)
137
+ dataloader = torch.utils.data.DataLoader(
138
+ dataset,
139
+ batch_size=config.batch_size,
140
+ sampler=sampler,
141
+ num_workers=8)
142
+
143
+ if dist.get_rank() == 0:
144
+ print("DATASET SIZE %d" % len(dataset))
145
+
146
+ self.dataloader = cycle(dataloader)
147
+
148
+ ##############################################################################################################
149
+ # 6. Set up EMA parameter containers
150
+ rename_param = (
151
+ lambda name: name.replace("_fsdp_wrapped_module.", "")
152
+ .replace("_checkpoint_wrapped_module.", "")
153
+ .replace("_orig_mod.", "")
154
+ )
155
+ self.name_to_trainable_params = {}
156
+ for n, p in self.model.generator.named_parameters():
157
+ if not p.requires_grad:
158
+ continue
159
+
160
+ renamed_n = rename_param(n)
161
+ self.name_to_trainable_params[renamed_n] = p
162
+ ema_weight = config.ema_weight
163
+ self.generator_ema = None
164
+ if (ema_weight is not None) and (ema_weight > 0.0):
165
+ print(f"Setting up EMA with weight {ema_weight}")
166
+ self.generator_ema = EMA_FSDP(self.model.generator, decay=ema_weight)
167
+
168
+ ##############################################################################################################
169
+ # 7. (If resuming) Load the model and optimizer, lr_scheduler, ema's statedicts
170
+ if getattr(config, "generator_ckpt", False):
171
+ print(f"Loading pretrained generator from {config.generator_ckpt}")
172
+ state_dict = torch.load(config.generator_ckpt, map_location="cpu")
173
+ if "generator" in state_dict:
174
+ state_dict = state_dict["generator"]
175
+ elif "model" in state_dict:
176
+ state_dict = state_dict["model"]
177
+ self.model.generator.load_state_dict(
178
+ state_dict, strict=True
179
+ )
180
+ if hasattr(config, "load"):
181
+ resume_ckpt_path_critic = os.path.join(config.load, "critic")
182
+ resume_ckpt_path_generator = os.path.join(config.load, "generator")
183
+ else:
184
+ resume_ckpt_path_critic = "none"
185
+ resume_ckpt_path_generator = "none"
186
+
187
+ _, _ = self.checkpointer_critic.try_best_load(
188
+ resume_ckpt_path=resume_ckpt_path_critic,
189
+ )
190
+ self.step, _ = self.checkpointer_generator.try_best_load(
191
+ resume_ckpt_path=resume_ckpt_path_generator,
192
+ force_start_w_ema=config.force_start_w_ema,
193
+ force_reset_zero_step=config.force_reset_zero_step,
194
+ force_reinit_ema=config.force_reinit_ema,
195
+ skip_optimizer_scheduler=config.skip_optimizer_scheduler,
196
+ )
197
+
198
+ ##############################################################################################################
199
+
200
+ # Let's delete EMA params for early steps to save some computes at training and inference
201
+ if self.step < config.ema_start_step:
202
+ self.generator_ema = None
203
+
204
+ self.max_grad_norm_generator = getattr(config, "max_grad_norm_generator", 10.0)
205
+ self.max_grad_norm_critic = getattr(config, "max_grad_norm_critic", 10.0)
206
+ self.previous_time = None
207
+
208
+ def save(self):
209
+ print("Start gathering distributed model states...")
210
+ generator_state_dict = fsdp_state_dict(
211
+ self.model.generator)
212
+ critic_state_dict = fsdp_state_dict(
213
+ self.model.fake_score)
214
+
215
+ if self.config.ema_start_step < self.step:
216
+ state_dict = {
217
+ "generator": generator_state_dict,
218
+ "critic": critic_state_dict,
219
+ "generator_ema": self.generator_ema.state_dict(),
220
+ }
221
+ else:
222
+ state_dict = {
223
+ "generator": generator_state_dict,
224
+ "critic": critic_state_dict,
225
+ }
226
+
227
+ if self.is_main_process:
228
+ os.makedirs(os.path.join(self.output_path,
229
+ f"checkpoint_model_{self.step:06d}"), exist_ok=True)
230
+ torch.save(state_dict, os.path.join(self.output_path,
231
+ f"checkpoint_model_{self.step:06d}", "model.pt"))
232
+ print("Model saved to", os.path.join(self.output_path,
233
+ f"checkpoint_model_{self.step:06d}", "model.pt"))
234
+
235
+ def fwdbwd_one_step(self, batch, train_generator):
236
+ self.model.eval() # prevent any randomness (e.g. dropout)
237
+
238
+ if self.step % 20 == 0:
239
+ torch.cuda.empty_cache()
240
+
241
+ # Step 1: Get the next batch of text prompts
242
+ text_prompts = batch["prompts"]
243
+ if "ode_latent" in batch:
244
+ clean_latent = batch["ode_latent"][:, -1].to(device=self.device, dtype=self.dtype)
245
+ else:
246
+ frames = batch["frames"].to(device=self.device, dtype=self.dtype)
247
+ with torch.no_grad():
248
+ clean_latent = self.model.vae.encode_to_latent(
249
+ frames).to(device=self.device, dtype=self.dtype)
250
+
251
+ image_latent = clean_latent[:, 0:1, ]
252
+
253
+ batch_size = len(text_prompts)
254
+ image_or_video_shape = list(self.config.image_or_video_shape)
255
+ image_or_video_shape[0] = batch_size
256
+
257
+ # Step 2: Extract the conditional infos
258
+ with torch.no_grad():
259
+ conditional_dict = self.model.text_encoder(
260
+ text_prompts=text_prompts)
261
+
262
+ if not getattr(self, "unconditional_dict", None):
263
+ unconditional_dict = self.model.text_encoder(
264
+ text_prompts=[self.config.negative_prompt] * batch_size)
265
+ unconditional_dict = {k: v.detach()
266
+ for k, v in unconditional_dict.items()}
267
+ self.unconditional_dict = unconditional_dict # cache the unconditional_dict
268
+ else:
269
+ unconditional_dict = self.unconditional_dict
270
+
271
+ mini_bs, full_bs = (
272
+ batch["mini_bs"],
273
+ batch["full_bs"],
274
+ )
275
+
276
+ # Step 3: Store gradients for the generator (if training the generator)
277
+ if train_generator:
278
+ gan_G_loss = self.model.generator_loss(
279
+ image_or_video_shape=image_or_video_shape,
280
+ conditional_dict=conditional_dict,
281
+ unconditional_dict=unconditional_dict,
282
+ clean_latent=clean_latent,
283
+ initial_latent=image_latent if self.config.i2v else None
284
+ )
285
+
286
+ loss_ratio = mini_bs * self.world_size / full_bs
287
+ total_loss = gan_G_loss * loss_ratio * self.loss_scale
288
+
289
+ total_loss.backward()
290
+ generator_grad_norm = self.model.generator.clip_grad_norm_(
291
+ self.max_grad_norm_generator)
292
+
293
+ generator_log_dict = {"generator_grad_norm": generator_grad_norm,
294
+ "gan_G_loss": gan_G_loss}
295
+
296
+ return generator_log_dict
297
+ else:
298
+ generator_log_dict = {}
299
+
300
+ # Step 4: Store gradients for the critic (if training the critic)
301
+ (gan_D_loss, r1_loss, r2_loss), critic_log_dict = self.model.critic_loss(
302
+ image_or_video_shape=image_or_video_shape,
303
+ conditional_dict=conditional_dict,
304
+ unconditional_dict=unconditional_dict,
305
+ clean_latent=clean_latent,
306
+ real_image_or_video=clean_latent,
307
+ initial_latent=image_latent if self.config.i2v else None
308
+ )
309
+
310
+ loss_ratio = mini_bs * dist.get_world_size() / full_bs
311
+ total_loss = (gan_D_loss + 0.5 * (r1_loss + r2_loss)) * loss_ratio * self.loss_scale
312
+
313
+ total_loss.backward()
314
+ critic_grad_norm = self.model.fake_score.clip_grad_norm_(
315
+ self.max_grad_norm_critic)
316
+
317
+ critic_log_dict.update({"critic_grad_norm": critic_grad_norm,
318
+ "gan_D_loss": gan_D_loss,
319
+ "r1_loss": r1_loss,
320
+ "r2_loss": r2_loss})
321
+
322
+ return critic_log_dict
323
+
324
+ def generate_video(self, pipeline, prompts, image=None):
325
+ batch_size = len(prompts)
326
+ sampled_noise = torch.randn(
327
+ [batch_size, 21, 16, 60, 104], device="cuda", dtype=self.dtype
328
+ )
329
+ video, _ = pipeline.inference(
330
+ noise=sampled_noise,
331
+ text_prompts=prompts,
332
+ return_latents=True
333
+ )
334
+ current_video = video.permute(0, 1, 3, 4, 2).cpu().numpy() * 255.0
335
+ return current_video
336
+
337
+ def train(self):
338
+ start_step = self.step
339
+
340
+ while True:
341
+ if self.step == self.discriminator_warmup_steps and self.discriminator_warmup_steps != 0:
342
+ print("Resetting critic optimizer")
343
+ del self.critic_optimizer
344
+ torch.cuda.empty_cache()
345
+ # Create new optimizers
346
+ self.critic_optimizer = torch.optim.AdamW(
347
+ self.critic_param_groups,
348
+ betas=(self.config.beta1_critic, self.config.beta2_critic)
349
+ )
350
+ # Update checkpointer references
351
+ self.checkpointer_critic.optimizer = self.critic_optimizer
352
+ # Check if we're in the discriminator warmup phase
353
+ self.in_discriminator_warmup = self.step < self.discriminator_warmup_steps
354
+
355
+ # Only update generator and critic outside the warmup phase
356
+ TRAIN_GENERATOR = not self.in_discriminator_warmup and self.step % self.config.dfake_gen_update_ratio == 0
357
+
358
+ # Train the generator (only outside warmup phase)
359
+ if TRAIN_GENERATOR:
360
+ self.model.fake_score.requires_grad_(False)
361
+ self.model.generator.requires_grad_(True)
362
+ self.generator_optimizer.zero_grad(set_to_none=True)
363
+ extras_list = []
364
+ for ii, mini_batch in enumerate(self.dataloader.next()):
365
+ extra = self.fwdbwd_one_step(mini_batch, True)
366
+ extras_list.append(extra)
367
+ generator_log_dict = merge_dict_list(extras_list)
368
+ self.generator_optimizer.step()
369
+ if self.generator_ema is not None:
370
+ self.generator_ema.update(self.model.generator)
371
+ else:
372
+ generator_log_dict = {}
373
+
374
+ # Train the critic/discriminator
375
+ if self.in_discriminator_warmup:
376
+ # During warmup, only allow gradient for discriminator params
377
+ self.model.generator.requires_grad_(False)
378
+ self.model.fake_score.requires_grad_(False)
379
+
380
+ # Enable gradient only for discriminator params
381
+ for name, param in self.model.fake_score.named_parameters():
382
+ if "_cls_pred_branch" in name or "_gan_ca_blocks" in name:
383
+ param.requires_grad_(True)
384
+ else:
385
+ # Normal training mode
386
+ self.model.generator.requires_grad_(False)
387
+ self.model.fake_score.requires_grad_(True)
388
+
389
+ self.critic_optimizer.zero_grad(set_to_none=True)
390
+ extras_list = []
391
+ batch = next(self.dataloader)
392
+ extra = self.fwdbwd_one_step(batch, False)
393
+ extras_list.append(extra)
394
+ critic_log_dict = merge_dict_list(extras_list)
395
+ self.critic_optimizer.step()
396
+
397
+ # Increment the step since we finished gradient update
398
+ self.step += 1
399
+
400
+ # If we just finished warmup, print a message
401
+ if self.is_main_process and self.step == self.discriminator_warmup_steps:
402
+ print(f"Finished discriminator warmup after {self.discriminator_warmup_steps} steps")
403
+
404
+ # Create EMA params (if not already created)
405
+ if (self.step >= self.config.ema_start_step) and \
406
+ (self.generator_ema is None) and (self.config.ema_weight > 0):
407
+ self.generator_ema = EMA_FSDP(self.model.generator, decay=self.config.ema_weight)
408
+
409
+ # Save the model
410
+ if (not self.config.no_save) and (self.step - start_step) > 0 and self.step % self.config.log_iters == 0:
411
+ torch.cuda.empty_cache()
412
+ self.save()
413
+ torch.cuda.empty_cache()
414
+
415
+ # Logging
416
+ wandb_loss_dict = {
417
+ "generator_grad_norm": generator_log_dict["generator_grad_norm"],
418
+ "critic_grad_norm": critic_log_dict["critic_grad_norm"],
419
+ "real_logit": critic_log_dict["noisy_real_logit"],
420
+ "fake_logit": critic_log_dict["noisy_fake_logit"],
421
+ "r1_loss": critic_log_dict["r1_loss"],
422
+ "r2_loss": critic_log_dict["r2_loss"],
423
+ }
424
+ if TRAIN_GENERATOR:
425
+ wandb_loss_dict.update({
426
+ "generator_grad_norm": generator_log_dict["generator_grad_norm"],
427
+ })
428
+ self.all_gather_dict(wandb_loss_dict)
429
+ wandb_loss_dict["diff_logit"] = wandb_loss_dict["real_logit"] - wandb_loss_dict["fake_logit"]
430
+ wandb_loss_dict["reg_loss"] = 0.5 * (wandb_loss_dict["r1_loss"] + wandb_loss_dict["r2_loss"])
431
+
432
+ if self.is_main_process:
433
+ if self.in_discriminator_warmup:
434
+ warmup_status = f"[WARMUP {self.step}/{self.discriminator_warmup_steps}] Training only discriminator params"
435
+ print(warmup_status)
436
+ if not self.disable_wandb:
437
+ wandb_loss_dict.update({"warmup_status": 1.0})
438
+
439
+ if not self.disable_wandb:
440
+ wandb.log(wandb_loss_dict, step=self.step)
441
+
442
+ if self.step % self.config.gc_interval == 0:
443
+ if dist.get_rank() == 0:
444
+ logging.info("DistGarbageCollector: Running GC.")
445
+ gc.collect()
446
+ torch.cuda.empty_cache()
447
+
448
+ if self.is_main_process:
449
+ current_time = time.time()
450
+ if self.previous_time is None:
451
+ self.previous_time = current_time
452
+ else:
453
+ if not self.disable_wandb:
454
+ wandb.log({"per iteration time": current_time - self.previous_time}, step=self.step)
455
+ self.previous_time = current_time
456
+
457
+ def all_gather_dict(self, target_dict):
458
+ for key, value in target_dict.items():
459
+ gathered_value = torch.zeros(
460
+ [self.world_size, *value.shape],
461
+ dtype=value.dtype, device=self.device)
462
+ dist.all_gather_into_tensor(gathered_value, value)
463
+ avg_value = gathered_value.mean().item()
464
+ target_dict[key] = avg_value
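
The `all_gather_dict` helper above reduces each logged scalar to a cross-rank average before it reaches wandb. Below is a minimal, self-contained sketch of the same pattern, not part of the commit; it assumes an already-initialized process group, and `average_scalar_across_ranks` is an illustrative name.

import torch
import torch.distributed as dist

def average_scalar_across_ranks(value: torch.Tensor) -> float:
    # value: a 0-dim tensor on the current device, one per rank
    world_size = dist.get_world_size()
    gathered = torch.zeros([world_size, *value.shape], dtype=value.dtype, device=value.device)
    dist.all_gather_into_tensor(gathered, value)   # every rank receives all per-rank values
    return gathered.mean().item()                  # same reduction as all_gather_dict above
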
trainer/ode.py ADDED
@@ -0,0 +1,242 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import gc
2
+ import logging
3
+ from utils.dataset import ODERegressionLMDBDataset, cycle
4
+ from model import ODERegression
5
+ from collections import defaultdict
6
+ from utils.misc import (
7
+ set_seed
8
+ )
9
+ import torch.distributed as dist
10
+ from omegaconf import OmegaConf
11
+ import torch
12
+ import wandb
13
+ import time
14
+ import os
15
+
16
+ from utils.distributed import barrier, fsdp_wrap, fsdp_state_dict, launch_distributed_job
17
+
18
+
19
+ class Trainer:
20
+ def __init__(self, config):
21
+ self.config = config
22
+ self.step = 0
23
+
24
+ # Step 1: Initialize the distributed training environment (rank, seed, dtype, logging etc.)
25
+ torch.backends.cuda.matmul.allow_tf32 = True
26
+ torch.backends.cudnn.allow_tf32 = True
27
+
28
+ launch_distributed_job()
29
+ global_rank = dist.get_rank()
30
+ self.world_size = dist.get_world_size()
31
+
32
+ self.dtype = torch.bfloat16 if config.mixed_precision else torch.float32
33
+ self.device = torch.cuda.current_device()
34
+ self.is_main_process = global_rank == 0
35
+ self.disable_wandb = config.disable_wandb
36
+
37
+ # use a random seed for the training
38
+ if config.seed == 0:
39
+ random_seed = torch.randint(0, 10000000, (1,), device=self.device)
40
+ dist.broadcast(random_seed, src=0)
41
+ config.seed = random_seed.item()
42
+
43
+ set_seed(config.seed + global_rank)
44
+
45
+ if self.is_main_process and not self.disable_wandb:
46
+ wandb.login(host=config.wandb_host, key=config.wandb_key)
47
+ wandb.init(
48
+ config=OmegaConf.to_container(config, resolve=True),
49
+ name=config.config_name,
50
+ mode="online",
51
+ entity=config.wandb_entity,
52
+ project=config.wandb_project,
53
+ dir=config.wandb_save_dir
54
+ )
55
+
56
+ self.output_path = config.logdir
57
+
58
+ # Step 2: Initialize the model and optimizer
59
+
60
+ assert config.distribution_loss == "ode", "Only ODE loss is supported for ODE training"
61
+ self.model = ODERegression(config, device=self.device)
62
+
63
+ self.model.generator = fsdp_wrap(
64
+ self.model.generator,
65
+ sharding_strategy=config.sharding_strategy,
66
+ mixed_precision=config.mixed_precision,
67
+ wrap_strategy=config.generator_fsdp_wrap_strategy
68
+ )
69
+ self.model.text_encoder = fsdp_wrap(
70
+ self.model.text_encoder,
71
+ sharding_strategy=config.sharding_strategy,
72
+ mixed_precision=config.mixed_precision,
73
+ wrap_strategy=config.text_encoder_fsdp_wrap_strategy,
74
+ cpu_offload=getattr(config, "text_encoder_cpu_offload", False)
75
+ )
76
+
77
+ if not config.no_visualize or config.load_raw_video:
78
+ self.model.vae = self.model.vae.to(
79
+ device=self.device, dtype=torch.bfloat16 if config.mixed_precision else torch.float32)
80
+
81
+ self.generator_optimizer = torch.optim.AdamW(
82
+ [param for param in self.model.generator.parameters()
83
+ if param.requires_grad],
84
+ lr=config.lr,
85
+ betas=(config.beta1, config.beta2),
86
+ weight_decay=config.weight_decay
87
+ )
88
+
89
+ # Step 3: Initialize the dataloader
90
+ dataset = ODERegressionLMDBDataset(
91
+ config.data_path, max_pair=getattr(config, "max_pair", int(1e8)))
92
+ sampler = torch.utils.data.distributed.DistributedSampler(
93
+ dataset, shuffle=True, drop_last=True)
94
+ dataloader = torch.utils.data.DataLoader(
95
+ dataset, batch_size=config.batch_size, sampler=sampler, num_workers=8)
96
+ total_batch_size = getattr(config, "total_batch_size", None)
97
+ if total_batch_size is not None:
98
+ assert total_batch_size == config.batch_size * self.world_size, "Gradient accumulation is not supported for ODE training"
99
+ self.dataloader = cycle(dataloader)
100
+
101
+ self.step = 0
102
+
103
+ ##############################################################################################################
104
+ # Step 4: (If resuming) load the pretrained generator state dict
105
+ if getattr(config, "generator_ckpt", False):
106
+ print(f"Loading pretrained generator from {config.generator_ckpt}")
107
+ state_dict = torch.load(config.generator_ckpt, map_location="cpu")[
108
+ 'generator']
109
+ self.model.generator.load_state_dict(
110
+ state_dict, strict=True
111
+ )
112
+
113
+ ##############################################################################################################
114
+
115
+ self.max_grad_norm = 10.0
116
+ self.previous_time = None
117
+
118
+ def save(self):
119
+ print("Start gathering distributed model states...")
120
+ generator_state_dict = fsdp_state_dict(
121
+ self.model.generator)
122
+ state_dict = {
123
+ "generator": generator_state_dict
124
+ }
125
+
126
+ if self.is_main_process:
127
+ os.makedirs(os.path.join(self.output_path,
128
+ f"checkpoint_model_{self.step:06d}"), exist_ok=True)
129
+ torch.save(state_dict, os.path.join(self.output_path,
130
+ f"checkpoint_model_{self.step:06d}", "model.pt"))
131
+ print("Model saved to", os.path.join(self.output_path,
132
+ f"checkpoint_model_{self.step:06d}", "model.pt"))
133
+
134
+ def train_one_step(self):
135
+ VISUALIZE = self.step % 100 == 0
136
+ self.model.eval() # prevent any randomness (e.g. dropout)
137
+
138
+ # Step 1: Get the next batch of text prompts
139
+ batch = next(self.dataloader)
140
+ text_prompts = batch["prompts"]
141
+ ode_latent = batch["ode_latent"].to(
142
+ device=self.device, dtype=self.dtype)
143
+
144
+ # Step 2: Extract the conditional infos
145
+ with torch.no_grad():
146
+ conditional_dict = self.model.text_encoder(
147
+ text_prompts=text_prompts)
148
+
149
+ # Step 3: Train the generator
150
+ generator_loss, log_dict = self.model.generator_loss(
151
+ ode_latent=ode_latent,
152
+ conditional_dict=conditional_dict
153
+ )
154
+
155
+ unnormalized_loss = log_dict["unnormalized_loss"]
156
+ timestep = log_dict["timestep"]
157
+
158
+ if self.world_size > 1:
159
+ gathered_unnormalized_loss = torch.zeros(
160
+ [self.world_size, *unnormalized_loss.shape],
161
+ dtype=unnormalized_loss.dtype, device=self.device)
162
+ gathered_timestep = torch.zeros(
163
+ [self.world_size, *timestep.shape],
164
+ dtype=timestep.dtype, device=self.device)
165
+
166
+ dist.all_gather_into_tensor(
167
+ gathered_unnormalized_loss, unnormalized_loss)
168
+ dist.all_gather_into_tensor(gathered_timestep, timestep)
169
+ else:
170
+ gathered_unnormalized_loss = unnormalized_loss
171
+ gathered_timestep = timestep
172
+
173
+ loss_breakdown = defaultdict(list)
174
+ stats = {}
175
+
176
+ for index, t in enumerate(timestep):
177
+ loss_breakdown[str(int(t.item()) // 250 * 250)].append(
178
+ unnormalized_loss[index].item())
179
+
180
+ for key_t in loss_breakdown.keys():
181
+ stats["loss_at_time_" + key_t] = sum(loss_breakdown[key_t]) / \
182
+ len(loss_breakdown[key_t])
183
+
184
+ self.generator_optimizer.zero_grad()
185
+ generator_loss.backward()
186
+ generator_grad_norm = self.model.generator.clip_grad_norm_(
187
+ self.max_grad_norm)
188
+ self.generator_optimizer.step()
189
+
190
+ # Step 4: Visualization
191
+ if VISUALIZE and not self.config.no_visualize and not self.config.disable_wandb and self.is_main_process:
192
+ # Visualize the input, output, and ground truth
193
+ input = log_dict["input"]
194
+ output = log_dict["output"]
195
+ ground_truth = ode_latent[:, -1]
196
+
197
+ input_video = self.model.vae.decode_to_pixel(input)
198
+ output_video = self.model.vae.decode_to_pixel(output)
199
+ ground_truth_video = self.model.vae.decode_to_pixel(ground_truth)
200
+ input_video = 255.0 * (input_video.cpu().numpy() * 0.5 + 0.5)
201
+ output_video = 255.0 * (output_video.cpu().numpy() * 0.5 + 0.5)
202
+ ground_truth_video = 255.0 * (ground_truth_video.cpu().numpy() * 0.5 + 0.5)
203
+
204
+ # Visualize the input, output, and ground truth
205
+ wandb.log({
206
+ "input": wandb.Video(input_video, caption="Input", fps=16, format="mp4"),
207
+ "output": wandb.Video(output_video, caption="Output", fps=16, format="mp4"),
208
+ "ground_truth": wandb.Video(ground_truth_video, caption="Ground Truth", fps=16, format="mp4"),
209
+ }, step=self.step)
210
+
211
+ # Step 5: Logging
212
+ if self.is_main_process and not self.disable_wandb:
213
+ wandb_loss_dict = {
214
+ "generator_loss": generator_loss.item(),
215
+ "generator_grad_norm": generator_grad_norm.item(),
216
+ **stats
217
+ }
218
+ wandb.log(wandb_loss_dict, step=self.step)
219
+
220
+ if self.step % self.config.gc_interval == 0:
221
+ if dist.get_rank() == 0:
222
+ logging.info("DistGarbageCollector: Running GC.")
223
+ gc.collect()
224
+
225
+ def train(self):
226
+ while True:
227
+ self.train_one_step()
228
+ if (not self.config.no_save) and self.step % self.config.log_iters == 0:
229
+ self.save()
230
+ torch.cuda.empty_cache()
231
+
232
+ barrier()
233
+ if self.is_main_process:
234
+ current_time = time.time()
235
+ if self.previous_time is None:
236
+ self.previous_time = current_time
237
+ else:
238
+ if not self.disable_wandb:
239
+ wandb.log({"per iteration time": current_time - self.previous_time}, step=self.step)
240
+ self.previous_time = current_time
241
+
242
+ self.step += 1
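
`train_one_step` above buckets the unnormalized regression loss into 250-wide timestep bins before logging the `loss_at_time_*` statistics. A small sketch of that bookkeeping in isolation (the function name is illustrative, not from the commit):

from collections import defaultdict
import torch

def bucket_losses_by_timestep(unnormalized_loss: torch.Tensor, timestep: torch.Tensor, bin_width: int = 250):
    # both inputs are 1-D tensors of equal length; returns {"loss_at_time_<bin>": mean loss}
    buckets = defaultdict(list)
    for loss_value, t in zip(unnormalized_loss.tolist(), timestep.tolist()):
        buckets[int(t) // bin_width * bin_width].append(loss_value)
    return {f"loss_at_time_{k}": sum(v) / len(v) for k, v in buckets.items()}

stats = bucket_losses_by_timestep(torch.rand(8), torch.randint(0, 1000, (8,)).float())
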
utils/dataset.py ADDED
@@ -0,0 +1,220 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from utils.lmdb import get_array_shape_from_lmdb, retrieve_row_from_lmdb
2
+ from torch.utils.data import Dataset
3
+ import numpy as np
4
+ import torch
5
+ import lmdb
6
+ import json
7
+ from pathlib import Path
8
+ from PIL import Image
9
+ import os
10
+
11
+
12
+ class TextDataset(Dataset):
13
+ def __init__(self, prompt_path, extended_prompt_path=None):
14
+ with open(prompt_path, encoding="utf-8") as f:
15
+ self.prompt_list = [line.rstrip() for line in f]
16
+
17
+ if extended_prompt_path is not None:
18
+ with open(extended_prompt_path, encoding="utf-8") as f:
19
+ self.extended_prompt_list = [line.rstrip() for line in f]
20
+ assert len(self.extended_prompt_list) == len(self.prompt_list)
21
+ else:
22
+ self.extended_prompt_list = None
23
+
24
+ def __len__(self):
25
+ return len(self.prompt_list)
26
+
27
+ def __getitem__(self, idx):
28
+ batch = {
29
+ "prompts": self.prompt_list[idx],
30
+ "idx": idx,
31
+ }
32
+ if self.extended_prompt_list is not None:
33
+ batch["extended_prompts"] = self.extended_prompt_list[idx]
34
+ return batch
35
+
36
+
37
+ class ODERegressionLMDBDataset(Dataset):
38
+ def __init__(self, data_path: str, max_pair: int = int(1e8)):
39
+ self.env = lmdb.open(data_path, readonly=True,
40
+ lock=False, readahead=False, meminit=False)
41
+
42
+ self.latents_shape = get_array_shape_from_lmdb(self.env, 'latents')
43
+ self.max_pair = max_pair
44
+
45
+ def __len__(self):
46
+ return min(self.latents_shape[0], self.max_pair)
47
+
48
+ def __getitem__(self, idx):
49
+ """
50
+ Outputs:
51
+ - prompts: List of Strings
52
+ - latents: Tensor of shape (num_denoising_steps, num_frames, num_channels, height, width). It is ordered from pure noise to clean image.
53
+ """
54
+ latents = retrieve_row_from_lmdb(
55
+ self.env,
56
+ "latents", np.float16, idx, shape=self.latents_shape[1:]
57
+ )
58
+
59
+ if len(latents.shape) == 4:
60
+ latents = latents[None, ...]
61
+
62
+ prompts = retrieve_row_from_lmdb(
63
+ self.env,
64
+ "prompts", str, idx
65
+ )
66
+ return {
67
+ "prompts": prompts,
68
+ "ode_latent": torch.tensor(latents, dtype=torch.float32)
69
+ }
70
+
71
+
72
+ class ShardingLMDBDataset(Dataset):
73
+ def __init__(self, data_path: str, max_pair: int = int(1e8)):
74
+ self.envs = []
75
+ self.index = []
76
+
77
+ for fname in sorted(os.listdir(data_path)):
78
+ path = os.path.join(data_path, fname)
79
+ env = lmdb.open(path,
80
+ readonly=True,
81
+ lock=False,
82
+ readahead=False,
83
+ meminit=False)
84
+ self.envs.append(env)
85
+
86
+ self.latents_shape = [None] * len(self.envs)
87
+ for shard_id, env in enumerate(self.envs):
88
+ self.latents_shape[shard_id] = get_array_shape_from_lmdb(env, 'latents')
89
+ for local_i in range(self.latents_shape[shard_id][0]):
90
+ self.index.append((shard_id, local_i))
91
+
92
+ # print("shard_id ", shard_id, " local_i ", local_i)
93
+
94
+ self.max_pair = max_pair
95
+
96
+ def __len__(self):
97
+ return len(self.index)
98
+
99
+ def __getitem__(self, idx):
100
+ """
101
+ Outputs:
102
+ - prompts: List of Strings
103
+ - latents: Tensor of shape (num_denoising_steps, num_frames, num_channels, height, width). It is ordered from pure noise to clean image.
104
+ """
105
+ shard_id, local_idx = self.index[idx]
106
+
107
+ latents = retrieve_row_from_lmdb(
108
+ self.envs[shard_id],
109
+ "latents", np.float16, local_idx,
110
+ shape=self.latents_shape[shard_id][1:]
111
+ )
112
+
113
+ if len(latents.shape) == 4:
114
+ latents = latents[None, ...]
115
+
116
+ prompts = retrieve_row_from_lmdb(
117
+ self.envs[shard_id],
118
+ "prompts", str, local_idx
119
+ )
120
+
121
+ return {
122
+ "prompts": prompts,
123
+ "ode_latent": torch.tensor(latents, dtype=torch.float32)
124
+ }
125
+
126
+
127
+ class TextImagePairDataset(Dataset):
128
+ def __init__(
129
+ self,
130
+ data_dir,
131
+ transform=None,
132
+ eval_first_n=-1,
133
+ pad_to_multiple_of=None
134
+ ):
135
+ """
136
+ Args:
137
+ data_dir (str): Path to the directory containing:
138
+ - target_crop_info_*.json (metadata file)
139
+ - */ (subdirectory containing images with matching aspect ratio)
140
+ transform (callable, optional): Optional transform to be applied on the image
141
+ """
142
+ self.transform = transform
143
+ data_dir = Path(data_dir)
144
+
145
+ # Find the metadata JSON file
146
+ metadata_files = list(data_dir.glob('target_crop_info_*.json'))
147
+ if not metadata_files:
148
+ raise FileNotFoundError(f"No metadata file found in {data_dir}")
149
+ if len(metadata_files) > 1:
150
+ raise ValueError(f"Multiple metadata files found in {data_dir}")
151
+
152
+ metadata_path = metadata_files[0]
153
+ # Extract aspect ratio from metadata filename (e.g. target_crop_info_26-15.json -> 26-15)
154
+ aspect_ratio = metadata_path.stem.split('_')[-1]
155
+
156
+ # Use aspect ratio subfolder for images
157
+ self.image_dir = data_dir / aspect_ratio
158
+ if not self.image_dir.exists():
159
+ raise FileNotFoundError(f"Image directory not found: {self.image_dir}")
160
+
161
+ # Load metadata
162
+ with open(metadata_path, 'r') as f:
163
+ self.metadata = json.load(f)
164
+
165
+ eval_first_n = eval_first_n if eval_first_n != -1 else len(self.metadata)
166
+ self.metadata = self.metadata[:eval_first_n]
167
+
168
+ # Verify all images exist
169
+ for item in self.metadata:
170
+ image_path = self.image_dir / item['file_name']
171
+ if not image_path.exists():
172
+ raise FileNotFoundError(f"Image not found: {image_path}")
173
+
174
+ self.dummy_prompt = "DUMMY PROMPT"
175
+ self.pre_pad_len = len(self.metadata)
176
+ if pad_to_multiple_of is not None and len(self.metadata) % pad_to_multiple_of != 0:
177
+ # Duplicate the last entry
178
+ self.metadata += [self.metadata[-1]] * (
179
+ pad_to_multiple_of - len(self.metadata) % pad_to_multiple_of
180
+ )
181
+
182
+ def __len__(self):
183
+ return len(self.metadata)
184
+
185
+ def __getitem__(self, idx):
186
+ """
187
+ Returns:
188
+ dict: A dictionary containing:
189
+ - image: PIL Image
190
+ - caption: str
191
+ - target_bbox: list of int [x1, y1, x2, y2]
192
+ - target_ratio: str
193
+ - type: str
194
+ - origin_size: tuple of int (width, height)
195
+ """
196
+ item = self.metadata[idx]
197
+
198
+ # Load image
199
+ image_path = self.image_dir / item['file_name']
200
+ image = Image.open(image_path).convert('RGB')
201
+
202
+ # Apply transform if specified
203
+ if self.transform:
204
+ image = self.transform(image)
205
+
206
+ return {
207
+ 'image': image,
208
+ 'prompts': item['caption'],
209
+ 'target_bbox': item['target_crop']['target_bbox'],
210
+ 'target_ratio': item['target_crop']['target_ratio'],
211
+ 'type': item['type'],
212
+ 'origin_size': (item['origin_width'], item['origin_height']),
213
+ 'idx': idx
214
+ }
215
+
216
+
217
+ def cycle(dl):
218
+ while True:
219
+ for data in dl:
220
+ yield data
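
The `cycle` generator at the end of utils/dataset.py is what lets the trainers call `next(self.dataloader)` indefinitely. A minimal usage sketch with `TextDataset` (the prompt file path is hypothetical):

from torch.utils.data import DataLoader
from utils.dataset import TextDataset, cycle

dataset = TextDataset("prompts.txt")                 # one prompt per line (assumed to exist)
loader = cycle(DataLoader(dataset, batch_size=4, shuffle=True))
batch = next(loader)                                 # {"prompts": [...4 strings...], "idx": tensor([...])}
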
utils/distributed.py ADDED
@@ -0,0 +1,125 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from datetime import timedelta
2
+ from functools import partial
3
+ import os
4
+ import torch
5
+ import torch.distributed as dist
6
+ from torch.distributed.fsdp import FullStateDictConfig, FullyShardedDataParallel as FSDP, MixedPrecision, ShardingStrategy, StateDictType
7
+ from torch.distributed.fsdp.api import CPUOffload
8
+ from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy, transformer_auto_wrap_policy
9
+
10
+
11
+ def fsdp_state_dict(model):
12
+ fsdp_fullstate_save_policy = FullStateDictConfig(
13
+ offload_to_cpu=True, rank0_only=True
14
+ )
15
+ with FSDP.state_dict_type(
16
+ model, StateDictType.FULL_STATE_DICT, fsdp_fullstate_save_policy
17
+ ):
18
+ checkpoint = model.state_dict()
19
+
20
+ return checkpoint
21
+
22
+
23
+ def fsdp_wrap(module, sharding_strategy="full", mixed_precision=False, wrap_strategy="size", min_num_params=int(5e7), transformer_module=None, cpu_offload=False):
24
+ if mixed_precision:
25
+ mixed_precision_policy = MixedPrecision(
26
+ param_dtype=torch.bfloat16,
27
+ reduce_dtype=torch.float32,
28
+ buffer_dtype=torch.float32,
29
+ cast_forward_inputs=False
30
+ )
31
+ else:
32
+ mixed_precision_policy = None
33
+
34
+ if wrap_strategy == "transformer":
35
+ auto_wrap_policy = partial(
36
+ transformer_auto_wrap_policy,
37
+ transformer_layer_cls=transformer_module
38
+ )
39
+ elif wrap_strategy == "size":
40
+ auto_wrap_policy = partial(
41
+ size_based_auto_wrap_policy,
42
+ min_num_params=min_num_params
43
+ )
44
+ else:
45
+ raise ValueError(f"Invalid wrap strategy: {wrap_strategy}")
46
+
47
+ os.environ["NCCL_CROSS_NIC"] = "1"
48
+
49
+ sharding_strategy = {
50
+ "full": ShardingStrategy.FULL_SHARD,
51
+ "hybrid_full": ShardingStrategy.HYBRID_SHARD,
52
+ "hybrid_zero2": ShardingStrategy._HYBRID_SHARD_ZERO2,
53
+ "no_shard": ShardingStrategy.NO_SHARD,
54
+ }[sharding_strategy]
55
+
56
+ module = FSDP(
57
+ module,
58
+ auto_wrap_policy=auto_wrap_policy,
59
+ sharding_strategy=sharding_strategy,
60
+ mixed_precision=mixed_precision_policy,
61
+ device_id=torch.cuda.current_device(),
62
+ limit_all_gathers=True,
63
+ use_orig_params=True,
64
+ cpu_offload=CPUOffload(offload_params=cpu_offload),
65
+ sync_module_states=False # Load ckpt on rank 0 and sync to other ranks
66
+ )
67
+ return module
68
+
69
+
70
+ def barrier():
71
+ if dist.is_initialized():
72
+ dist.barrier()
73
+
74
+
75
+ def launch_distributed_job(backend: str = "nccl"):
76
+ rank = int(os.environ["RANK"])
77
+ local_rank = int(os.environ["LOCAL_RANK"])
78
+ world_size = int(os.environ["WORLD_SIZE"])
79
+ host = os.environ["MASTER_ADDR"]
80
+ port = int(os.environ["MASTER_PORT"])
81
+
82
+ if ":" in host: # IPv6
83
+ init_method = f"tcp://[{host}]:{port}"
84
+ else: # IPv4
85
+ init_method = f"tcp://{host}:{port}"
86
+ dist.init_process_group(rank=rank, world_size=world_size, backend=backend,
87
+ init_method=init_method, timeout=timedelta(minutes=30))
88
+ torch.cuda.set_device(local_rank)
89
+
90
+
91
+ class EMA_FSDP:
92
+ def __init__(self, fsdp_module: torch.nn.Module, decay: float = 0.999):
93
+ self.decay = decay
94
+ self.shadow = {}
95
+ self._init_shadow(fsdp_module)
96
+
97
+ @torch.no_grad()
98
+ def _init_shadow(self, fsdp_module):
99
+ from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
100
+ with FSDP.summon_full_params(fsdp_module, writeback=False):
101
+ for n, p in fsdp_module.module.named_parameters():
102
+ self.shadow[n] = p.detach().clone().float().cpu()
103
+
104
+ @torch.no_grad()
105
+ def update(self, fsdp_module):
106
+ d = self.decay
107
+ from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
108
+ with FSDP.summon_full_params(fsdp_module, writeback=False):
109
+ for n, p in fsdp_module.module.named_parameters():
110
+ self.shadow[n].mul_(d).add_(p.detach().float().cpu(), alpha=1. - d)
111
+
112
+ # Optional helpers ---------------------------------------------------
113
+ def state_dict(self):
114
+ return self.shadow # picklable
115
+
116
+ def load_state_dict(self, sd):
117
+ self.shadow = {k: v.clone() for k, v in sd.items()}
118
+
119
+ def copy_to(self, fsdp_module):
120
+ # load EMA weights into an (unwrapped) copy of the generator
121
+ from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
122
+ with FSDP.summon_full_params(fsdp_module, writeback=True):
123
+ for n, p in fsdp_module.module.named_parameters():
124
+ if n in self.shadow:
125
+ p.data.copy_(self.shadow[n].to(device=p.device, dtype=p.dtype))
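
`EMA_FSDP` keeps a CPU-side float32 shadow of the generator and folds fresh weights in with a standard exponential moving average. The update rule, shown on a plain tensor for clarity (this mirrors the `mul_/add_` line in `update` above, nothing more):

import torch

decay = 0.999
shadow = torch.zeros(4)          # stands in for one shadow entry
param = torch.ones(4)            # stands in for the live parameter
shadow.mul_(decay).add_(param, alpha=1.0 - decay)   # shadow <- decay*shadow + (1-decay)*param
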
utils/lmdb.py ADDED
@@ -0,0 +1,72 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import numpy as np
2
+
3
+
4
+ def get_array_shape_from_lmdb(env, array_name):
5
+ with env.begin() as txn:
6
+ image_shape = txn.get(f"{array_name}_shape".encode()).decode()
7
+ image_shape = tuple(map(int, image_shape.split()))
8
+ return image_shape
9
+
10
+
11
+ def store_arrays_to_lmdb(env, arrays_dict, start_index=0):
12
+ """
13
+ Store rows of multiple numpy arrays in a single LMDB.
14
+ Each row is stored separately with a naming convention.
15
+ """
16
+ with env.begin(write=True) as txn:
17
+ for array_name, array in arrays_dict.items():
18
+ for i, row in enumerate(array):
19
+ # Convert row to bytes
20
+ if isinstance(row, str):
21
+ row_bytes = row.encode()
22
+ else:
23
+ row_bytes = row.tobytes()
24
+
25
+ data_key = f'{array_name}_{start_index + i}_data'.encode()
26
+
27
+ txn.put(data_key, row_bytes)
28
+
29
+
30
+ def process_data_dict(data_dict, seen_prompts):
31
+ output_dict = {}
32
+
33
+ all_videos = []
34
+ all_prompts = []
35
+ for prompt, video in data_dict.items():
36
+ if prompt in seen_prompts:
37
+ continue
38
+ else:
39
+ seen_prompts.add(prompt)
40
+
41
+ video = video.half().numpy()
42
+ all_videos.append(video)
43
+ all_prompts.append(prompt)
44
+
45
+ if len(all_videos) == 0:
46
+ return {"latents": np.array([]), "prompts": np.array([])}
47
+
48
+ all_videos = np.concatenate(all_videos, axis=0)
49
+
50
+ output_dict['latents'] = all_videos
51
+ output_dict['prompts'] = np.array(all_prompts)
52
+
53
+ return output_dict
54
+
55
+
56
+ def retrieve_row_from_lmdb(lmdb_env, array_name, dtype, row_index, shape=None):
57
+ """
58
+ Retrieve a specific row from a specific array in the LMDB.
59
+ """
60
+ data_key = f'{array_name}_{row_index}_data'.encode()
61
+
62
+ with lmdb_env.begin() as txn:
63
+ row_bytes = txn.get(data_key)
64
+
65
+ if dtype == str:
66
+ array = row_bytes.decode()
67
+ else:
68
+ array = np.frombuffer(row_bytes, dtype=dtype)
69
+
70
+ if shape is not None and len(shape) > 0:
71
+ array = array.reshape(shape)
72
+ return array
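
The LMDB helpers rely on a simple key layout: each row is stored under `<name>_<i>_data`, and the reader expects an additional `<name>_shape` entry holding the space-separated array shape. Note that `store_arrays_to_lmdb` does not write the shape entry, so the writer has to add it. A round-trip sketch under those assumptions (the database path is made up):

import lmdb
import numpy as np
from utils.lmdb import store_arrays_to_lmdb, get_array_shape_from_lmdb, retrieve_row_from_lmdb

env = lmdb.open("/tmp/toy_lmdb", map_size=1 << 30)
latents = np.random.randn(2, 4, 16, 8, 8).astype(np.float16)
store_arrays_to_lmdb(env, {"latents": latents})                      # writes latents_0_data, latents_1_data
with env.begin(write=True) as txn:                                   # write the shape entry by hand
    txn.put(b"latents_shape", " ".join(map(str, latents.shape)).encode())

shape = get_array_shape_from_lmdb(env, "latents")                    # (2, 4, 16, 8, 8)
row = retrieve_row_from_lmdb(env, "latents", np.float16, 0, shape=shape[1:])
assert row.shape == shape[1:]
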
utils/loss.py ADDED
@@ -0,0 +1,81 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from abc import ABC, abstractmethod
2
+ import torch
3
+
4
+
5
+ class DenoisingLoss(ABC):
6
+ @abstractmethod
7
+ def __call__(
8
+ self, x: torch.Tensor, x_pred: torch.Tensor,
9
+ noise: torch.Tensor, noise_pred: torch.Tensor,
10
+ alphas_cumprod: torch.Tensor,
11
+ timestep: torch.Tensor,
12
+ **kwargs
13
+ ) -> torch.Tensor:
14
+ """
15
+ Base class for denoising loss.
16
+ Input:
17
+ - x: the clean data with shape [B, F, C, H, W]
18
+ - x_pred: the predicted clean data with shape [B, F, C, H, W]
19
+ - noise: the noise with shape [B, F, C, H, W]
20
+ - noise_pred: the predicted noise with shape [B, F, C, H, W]
21
+ - alphas_cumprod: the cumulative product of alphas (defining the noise schedule) with shape [T]
22
+ - timestep: the current timestep with shape [B, F]
23
+ """
24
+ pass
25
+
26
+
27
+ class X0PredLoss(DenoisingLoss):
28
+ def __call__(
29
+ self, x: torch.Tensor, x_pred: torch.Tensor,
30
+ noise: torch.Tensor, noise_pred: torch.Tensor,
31
+ alphas_cumprod: torch.Tensor,
32
+ timestep: torch.Tensor,
33
+ **kwargs
34
+ ) -> torch.Tensor:
35
+ return torch.mean((x - x_pred) ** 2)
36
+
37
+
38
+ class VPredLoss(DenoisingLoss):
39
+ def __call__(
40
+ self, x: torch.Tensor, x_pred: torch.Tensor,
41
+ noise: torch.Tensor, noise_pred: torch.Tensor,
42
+ alphas_cumprod: torch.Tensor,
43
+ timestep: torch.Tensor,
44
+ **kwargs
45
+ ) -> torch.Tensor:
46
+ weights = 1 / (1 - alphas_cumprod[timestep].reshape(*timestep.shape, 1, 1, 1))
47
+ return torch.mean(weights * (x - x_pred) ** 2)
48
+
49
+
50
+ class NoisePredLoss(DenoisingLoss):
51
+ def __call__(
52
+ self, x: torch.Tensor, x_pred: torch.Tensor,
53
+ noise: torch.Tensor, noise_pred: torch.Tensor,
54
+ alphas_cumprod: torch.Tensor,
55
+ timestep: torch.Tensor,
56
+ **kwargs
57
+ ) -> torch.Tensor:
58
+ return torch.mean((noise - noise_pred) ** 2)
59
+
60
+
61
+ class FlowPredLoss(DenoisingLoss):
62
+ def __call__(
63
+ self, x: torch.Tensor, x_pred: torch.Tensor,
64
+ noise: torch.Tensor, noise_pred: torch.Tensor,
65
+ alphas_cumprod: torch.Tensor,
66
+ timestep: torch.Tensor,
67
+ **kwargs
68
+ ) -> torch.Tensor:
69
+ return torch.mean((kwargs["flow_pred"] - (noise - x)) ** 2)
70
+
71
+
72
+ NAME_TO_CLASS = {
73
+ "x0": X0PredLoss,
74
+ "v": VPredLoss,
75
+ "noise": NoisePredLoss,
76
+ "flow": FlowPredLoss
77
+ }
78
+
79
+
80
+ def get_denoising_loss(loss_type: str) -> DenoisingLoss:
81
+ return NAME_TO_CLASS[loss_type]
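
`get_denoising_loss` returns the loss class rather than an instance, so callers instantiate it before use. A short sketch (arguments a given loss ignores can be passed as None):

import torch
from utils.loss import get_denoising_loss

loss_fn = get_denoising_loss("x0")()                 # -> X0PredLoss instance
x = torch.randn(1, 3, 4, 8, 8)
x_pred = x + 0.1 * torch.randn_like(x)
loss = loss_fn(x=x, x_pred=x_pred, noise=None, noise_pred=None,
               alphas_cumprod=None, timestep=None)   # mean squared error on x0
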
utils/misc.py ADDED
@@ -0,0 +1,39 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import numpy as np
2
+ import random
3
+ import torch
4
+
5
+
6
+ def set_seed(seed: int, deterministic: bool = False):
7
+ """
8
+ Helper function for reproducible behavior to set the seed in `random`, `numpy`, `torch`.
9
+
10
+ Args:
11
+ seed (`int`):
12
+ The seed to set.
13
+ deterministic (`bool`, *optional*, defaults to `False`):
14
+ Whether to use deterministic algorithms where available. Can slow down training.
15
+ """
16
+ random.seed(seed)
17
+ np.random.seed(seed)
18
+ torch.manual_seed(seed)
19
+ torch.cuda.manual_seed_all(seed)
20
+
21
+ if deterministic:
22
+ torch.use_deterministic_algorithms(True)
23
+
24
+
25
+ def merge_dict_list(dict_list):
26
+ if len(dict_list) == 1:
27
+ return dict_list[0]
28
+
29
+ merged_dict = {}
30
+ for k, v in dict_list[0].items():
31
+ if isinstance(v, torch.Tensor):
32
+ if v.ndim == 0:
33
+ merged_dict[k] = torch.stack([d[k] for d in dict_list], dim=0)
34
+ else:
35
+ merged_dict[k] = torch.cat([d[k] for d in dict_list], dim=0)
36
+ else:
37
+ # for non-tensor values, we just copy the value from the first item
38
+ merged_dict[k] = v
39
+ return merged_dict
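
`merge_dict_list` treats scalar (0-dim) tensors and batched tensors differently: scalars are stacked along a new leading dimension, everything else is concatenated along dim 0. A tiny check of that behavior:

import torch
from utils.misc import merge_dict_list

logs = [{"loss": torch.tensor(0.5), "latent": torch.randn(2, 4)},
        {"loss": torch.tensor(0.7), "latent": torch.randn(2, 4)}]
merged = merge_dict_list(logs)
assert merged["loss"].shape == (2,)      # stacked scalars
assert merged["latent"].shape == (4, 4)  # concatenated along dim 0
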
utils/scheduler.py ADDED
@@ -0,0 +1,194 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from abc import abstractmethod, ABC
2
+ import torch
3
+
4
+
5
+ class SchedulerInterface(ABC):
6
+ """
7
+ Base class for diffusion noise schedule.
8
+ """
9
+ alphas_cumprod: torch.Tensor # [T], alphas for defining the noise schedule
10
+
11
+ @abstractmethod
12
+ def add_noise(
13
+ self, clean_latent: torch.Tensor,
14
+ noise: torch.Tensor, timestep: torch.Tensor
15
+ ):
16
+ """
17
+ Diffusion forward corruption process.
18
+ Input:
19
+ - clean_latent: the clean latent with shape [B, C, H, W]
20
+ - noise: the noise with shape [B, C, H, W]
21
+ - timestep: the timestep with shape [B]
22
+ Output: the corrupted latent with shape [B, C, H, W]
23
+ """
24
+ pass
25
+
26
+ def convert_x0_to_noise(
27
+ self, x0: torch.Tensor, xt: torch.Tensor,
28
+ timestep: torch.Tensor
29
+ ) -> torch.Tensor:
30
+ """
31
+ Convert the diffusion network's x0 prediction to noise prediction.
32
+ x0: the predicted clean data with shape [B, C, H, W]
33
+ xt: the input noisy data with shape [B, C, H, W]
34
+ timestep: the timestep with shape [B]
35
+
36
+ noise = (xt-sqrt(alpha_t)*x0) / sqrt(beta_t) (eq 11 in https://arxiv.org/abs/2311.18828)
37
+ """
38
+ # use higher precision for calculations
39
+ original_dtype = x0.dtype
40
+ x0, xt, alphas_cumprod = map(
41
+ lambda x: x.double().to(x0.device), [x0, xt,
42
+ self.alphas_cumprod]
43
+ )
44
+
45
+ alpha_prod_t = alphas_cumprod[timestep].reshape(-1, 1, 1, 1)
46
+ beta_prod_t = 1 - alpha_prod_t
47
+
48
+ noise_pred = (xt - alpha_prod_t **
49
+ (0.5) * x0) / beta_prod_t ** (0.5)
50
+ return noise_pred.to(original_dtype)
51
+
52
+ def convert_noise_to_x0(
53
+ self, noise: torch.Tensor, xt: torch.Tensor,
54
+ timestep: torch.Tensor
55
+ ) -> torch.Tensor:
56
+ """
57
+ Convert the diffusion network's noise prediction to x0 prediction.
58
+ noise: the predicted noise with shape [B, C, H, W]
59
+ xt: the input noisy data with shape [B, C, H, W]
60
+ timestep: the timestep with shape [B]
61
+
62
+ x0 = (x_t - sqrt(beta_t) * noise) / sqrt(alpha_t) (eq 11 in https://arxiv.org/abs/2311.18828)
63
+ """
64
+ # use higher precision for calculations
65
+ original_dtype = noise.dtype
66
+ noise, xt, alphas_cumprod = map(
67
+ lambda x: x.double().to(noise.device), [noise, xt,
68
+ self.alphas_cumprod]
69
+ )
70
+ alpha_prod_t = alphas_cumprod[timestep].reshape(-1, 1, 1, 1)
71
+ beta_prod_t = 1 - alpha_prod_t
72
+
73
+ x0_pred = (xt - beta_prod_t **
74
+ (0.5) * noise) / alpha_prod_t ** (0.5)
75
+ return x0_pred.to(original_dtype)
76
+
77
+ def convert_velocity_to_x0(
78
+ self, velocity: torch.Tensor, xt: torch.Tensor,
79
+ timestep: torch.Tensor
80
+ ) -> torch.Tensor:
81
+ """
82
+ Convert the diffusion network's velocity prediction to x0 prediction.
83
+ velocity: the predicted velocity with shape [B, C, H, W]
84
+ xt: the input noisy data with shape [B, C, H, W]
85
+ timestep: the timestep with shape [B]
86
+
87
+ v = sqrt(alpha_t) * noise - sqrt(beta_t) * x0
88
+ noise = (xt-sqrt(alpha_t)*x0) / sqrt(beta_t)
89
+ given v, x_t, we have
90
+ x0 = sqrt(alpha_t) * x_t - sqrt(beta_t) * v
91
+ see derivations https://chatgpt.com/share/679fb6c8-3a30-8008-9b0e-d1ae892dac56
92
+ """
93
+ # use higher precision for calculations
94
+ original_dtype = velocity.dtype
95
+ velocity, xt, alphas_cumprod = map(
96
+ lambda x: x.double().to(velocity.device), [velocity, xt,
97
+ self.alphas_cumprod]
98
+ )
99
+ alpha_prod_t = alphas_cumprod[timestep].reshape(-1, 1, 1, 1)
100
+ beta_prod_t = 1 - alpha_prod_t
101
+
102
+ x0_pred = (alpha_prod_t ** 0.5) * xt - (beta_prod_t ** 0.5) * velocity
103
+ return x0_pred.to(original_dtype)
104
+
105
+
106
+ class FlowMatchScheduler():
107
+
108
+ def __init__(self, num_inference_steps=100, num_train_timesteps=1000, shift=3.0, sigma_max=1.0, sigma_min=0.003 / 1.002, inverse_timesteps=False, extra_one_step=False, reverse_sigmas=False):
109
+ self.num_train_timesteps = num_train_timesteps
110
+ self.shift = shift
111
+ self.sigma_max = sigma_max
112
+ self.sigma_min = sigma_min
113
+ self.inverse_timesteps = inverse_timesteps
114
+ self.extra_one_step = extra_one_step
115
+ self.reverse_sigmas = reverse_sigmas
116
+ self.set_timesteps(num_inference_steps)
117
+
118
+ def set_timesteps(self, num_inference_steps=100, denoising_strength=1.0, training=False):
119
+ sigma_start = self.sigma_min + \
120
+ (self.sigma_max - self.sigma_min) * denoising_strength
121
+ if self.extra_one_step:
122
+ self.sigmas = torch.linspace(
123
+ sigma_start, self.sigma_min, num_inference_steps + 1)[:-1]
124
+ else:
125
+ self.sigmas = torch.linspace(
126
+ sigma_start, self.sigma_min, num_inference_steps)
127
+ if self.inverse_timesteps:
128
+ self.sigmas = torch.flip(self.sigmas, dims=[0])
129
+ self.sigmas = self.shift * self.sigmas / \
130
+ (1 + (self.shift - 1) * self.sigmas)
131
+ if self.reverse_sigmas:
132
+ self.sigmas = 1 - self.sigmas
133
+ self.timesteps = self.sigmas * self.num_train_timesteps
134
+ if training:
135
+ x = self.timesteps
136
+ y = torch.exp(-2 * ((x - num_inference_steps / 2) /
137
+ num_inference_steps) ** 2)
138
+ y_shifted = y - y.min()
139
+ bsmntw_weighing = y_shifted * \
140
+ (num_inference_steps / y_shifted.sum())
141
+ self.linear_timesteps_weights = bsmntw_weighing
142
+
143
+ def step(self, model_output, timestep, sample, to_final=False):
144
+ if timestep.ndim == 2:
145
+ timestep = timestep.flatten(0, 1)
146
+ self.sigmas = self.sigmas.to(model_output.device)
147
+ self.timesteps = self.timesteps.to(model_output.device)
148
+ timestep_id = torch.argmin(
149
+ (self.timesteps.unsqueeze(0) - timestep.unsqueeze(1)).abs(), dim=1)
150
+ sigma = self.sigmas[timestep_id].reshape(-1, 1, 1, 1)
151
+ if to_final or (timestep_id + 1 >= len(self.timesteps)).any():
152
+ sigma_ = 1 if (
153
+ self.inverse_timesteps or self.reverse_sigmas) else 0
154
+ else:
155
+ sigma_ = self.sigmas[timestep_id + 1].reshape(-1, 1, 1, 1)
156
+ prev_sample = sample + model_output * (sigma_ - sigma)
157
+ return prev_sample
158
+
159
+ def add_noise(self, original_samples, noise, timestep):
160
+ """
161
+ Diffusion forward corruption process.
162
+ Input:
163
+ - clean_latent: the clean latent with shape [B*T, C, H, W]
164
+ - noise: the noise with shape [B*T, C, H, W]
165
+ - timestep: the timestep with shape [B*T]
166
+ Output: the corrupted latent with shape [B*T, C, H, W]
167
+ """
168
+ if timestep.ndim == 2:
169
+ timestep = timestep.flatten(0, 1)
170
+ self.sigmas = self.sigmas.to(noise.device)
171
+ self.timesteps = self.timesteps.to(noise.device)
172
+ timestep_id = torch.argmin(
173
+ (self.timesteps.unsqueeze(0) - timestep.unsqueeze(1)).abs(), dim=1)
174
+ sigma = self.sigmas[timestep_id].reshape(-1, 1, 1, 1)
175
+ sample = (1 - sigma) * original_samples + sigma * noise
176
+ return sample.type_as(noise)
177
+
178
+ def training_target(self, sample, noise, timestep):
179
+ target = noise - sample
180
+ return target
181
+
182
+ def training_weight(self, timestep):
183
+ """
184
+ Input:
185
+ - timestep: the timestep with shape [B*T]
186
+ Output: the corresponding weighting [B*T]
187
+ """
188
+ if timestep.ndim == 2:
189
+ timestep = timestep.flatten(0, 1)
190
+ self.linear_timesteps_weights = self.linear_timesteps_weights.to(timestep.device)
191
+ timestep_id = torch.argmin(
192
+ (self.timesteps.unsqueeze(1) - timestep.unsqueeze(0)).abs(), dim=0)
193
+ weights = self.linear_timesteps_weights[timestep_id]
194
+ return weights
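
`FlowMatchScheduler` implements the flow-matching corruption x_t = (1 - sigma_t) * x0 + sigma_t * noise with training target noise - x0. A minimal sketch of the forward pass (shapes follow the [B*T, C, H, W] convention in the docstrings; the constructor arguments mirror those used by WanDiffusionWrapper in the next file):

import torch
from utils.scheduler import FlowMatchScheduler

scheduler = FlowMatchScheduler(shift=8.0, sigma_min=0.0, extra_one_step=True)
scheduler.set_timesteps(1000, training=True)

x0 = torch.randn(2, 16, 8, 8)                             # clean latents, [B*T, C, H, W]
noise = torch.randn_like(x0)
timestep = scheduler.timesteps[torch.randint(0, 1000, (2,))]

xt = scheduler.add_noise(x0, noise, timestep)             # (1 - sigma_t) * x0 + sigma_t * noise
target = scheduler.training_target(x0, noise, timestep)   # noise - x0
weight = scheduler.training_weight(timestep)              # per-sample loss weighting
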
utils/wan_wrapper.py ADDED
@@ -0,0 +1,313 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import types
2
+ from typing import List, Optional
3
+ import torch
4
+ from torch import nn
5
+
6
+ from utils.scheduler import SchedulerInterface, FlowMatchScheduler
7
+ from wan.modules.tokenizers import HuggingfaceTokenizer
8
+ from wan.modules.model import WanModel, RegisterTokens, GanAttentionBlock
9
+ from wan.modules.vae import _video_vae
10
+ from wan.modules.t5 import umt5_xxl
11
+ from wan.modules.causal_model import CausalWanModel
12
+
13
+
14
+ class WanTextEncoder(torch.nn.Module):
15
+ def __init__(self) -> None:
16
+ super().__init__()
17
+
18
+ self.text_encoder = umt5_xxl(
19
+ encoder_only=True,
20
+ return_tokenizer=False,
21
+ dtype=torch.float32,
22
+ device=torch.device('cpu')
23
+ ).eval().requires_grad_(False)
24
+ self.text_encoder.load_state_dict(
25
+ torch.load("wan_models/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth",
26
+ map_location='cpu', weights_only=False)
27
+ )
28
+
29
+ self.tokenizer = HuggingfaceTokenizer(
30
+ name="wan_models/Wan2.1-T2V-1.3B/google/umt5-xxl/", seq_len=512, clean='whitespace')
31
+
32
+ @property
33
+ def device(self):
34
+ # Assume we are always on GPU
35
+ return torch.cuda.current_device()
36
+
37
+ def forward(self, text_prompts: List[str]) -> dict:
38
+ ids, mask = self.tokenizer(
39
+ text_prompts, return_mask=True, add_special_tokens=True)
40
+ ids = ids.to(self.device)
41
+ mask = mask.to(self.device)
42
+ seq_lens = mask.gt(0).sum(dim=1).long()
43
+ context = self.text_encoder(ids, mask)
44
+
45
+ for u, v in zip(context, seq_lens):
46
+ u[v:] = 0.0 # set padding to 0.0
47
+
48
+ return {
49
+ "prompt_embeds": context
50
+ }
51
+
52
+
53
+ class WanVAEWrapper(torch.nn.Module):
54
+ def __init__(self):
55
+ super().__init__()
56
+ mean = [
57
+ -0.7571, -0.7089, -0.9113, 0.1075, -0.1745, 0.9653, -0.1517, 1.5508,
58
+ 0.4134, -0.0715, 0.5517, -0.3632, -0.1922, -0.9497, 0.2503, -0.2921
59
+ ]
60
+ std = [
61
+ 2.8184, 1.4541, 2.3275, 2.6558, 1.2196, 1.7708, 2.6052, 2.0743,
62
+ 3.2687, 2.1526, 2.8652, 1.5579, 1.6382, 1.1253, 2.8251, 1.9160
63
+ ]
64
+ self.mean = torch.tensor(mean, dtype=torch.float32)
65
+ self.std = torch.tensor(std, dtype=torch.float32)
66
+
67
+ # init model
68
+ self.model = _video_vae(
69
+ pretrained_path="wan_models/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth",
70
+ z_dim=16,
71
+ ).eval().requires_grad_(False)
72
+
73
+ def encode_to_latent(self, pixel: torch.Tensor) -> torch.Tensor:
74
+ # pixel: [batch_size, num_channels, num_frames, height, width]
75
+ device, dtype = pixel.device, pixel.dtype
76
+ scale = [self.mean.to(device=device, dtype=dtype),
77
+ 1.0 / self.std.to(device=device, dtype=dtype)]
78
+
79
+ output = [
80
+ self.model.encode(u.unsqueeze(0), scale).float().squeeze(0)
81
+ for u in pixel
82
+ ]
83
+ output = torch.stack(output, dim=0)
84
+ # from [batch_size, num_channels, num_frames, height, width]
85
+ # to [batch_size, num_frames, num_channels, height, width]
86
+ output = output.permute(0, 2, 1, 3, 4)
87
+ return output
88
+
89
+ def decode_to_pixel(self, latent: torch.Tensor, use_cache: bool = False) -> torch.Tensor:
90
+ # from [batch_size, num_frames, num_channels, height, width]
91
+ # to [batch_size, num_channels, num_frames, height, width]
92
+ zs = latent.permute(0, 2, 1, 3, 4)
93
+ if use_cache:
94
+ assert latent.shape[0] == 1, "Batch size must be 1 when using cache"
95
+
96
+ device, dtype = latent.device, latent.dtype
97
+ scale = [self.mean.to(device=device, dtype=dtype),
98
+ 1.0 / self.std.to(device=device, dtype=dtype)]
99
+
100
+ if use_cache:
101
+ decode_function = self.model.cached_decode
102
+ else:
103
+ decode_function = self.model.decode
104
+
105
+ output = []
106
+ for u in zs:
107
+ output.append(decode_function(u.unsqueeze(0), scale).float().clamp_(-1, 1).squeeze(0))
108
+ output = torch.stack(output, dim=0)
109
+ # from [batch_size, num_channels, num_frames, height, width]
110
+ # to [batch_size, num_frames, num_channels, height, width]
111
+ output = output.permute(0, 2, 1, 3, 4)
112
+ return output
113
+
114
+
115
+ class WanDiffusionWrapper(torch.nn.Module):
116
+ def __init__(
117
+ self,
118
+ model_name="Wan2.1-T2V-1.3B",
119
+ timestep_shift=8.0,
120
+ is_causal=False,
121
+ local_attn_size=-1,
122
+ sink_size=0
123
+ ):
124
+ super().__init__()
125
+
126
+ if is_causal:
127
+ self.model = CausalWanModel.from_pretrained(
128
+ f"wan_models/{model_name}/", local_attn_size=local_attn_size, sink_size=sink_size)
129
+ else:
130
+ self.model = WanModel.from_pretrained(f"wan_models/{model_name}/")
131
+ self.model.eval()
132
+
133
+ # For non-causal diffusion, all frames share the same timestep
134
+ self.uniform_timestep = not is_causal
135
+
136
+ self.scheduler = FlowMatchScheduler(
137
+ shift=timestep_shift, sigma_min=0.0, extra_one_step=True
138
+ )
139
+ self.scheduler.set_timesteps(1000, training=True)
140
+
141
+ self.seq_len = 32760 # [1, 21, 16, 60, 104]
142
+ self.post_init()
143
+
144
+ def enable_gradient_checkpointing(self) -> None:
145
+ self.model.enable_gradient_checkpointing()
146
+
147
+ def adding_cls_branch(self, atten_dim=1536, num_class=4, time_embed_dim=0) -> None:
148
+ # NOTE: this is hard-coded for the Wan2.1-T2V-1.3B architecture for now.
149
+ self._cls_pred_branch = nn.Sequential(
150
+ # Input: [B, 384, 21, 60, 104]
151
+ nn.LayerNorm(atten_dim * 3 + time_embed_dim),
152
+ nn.Linear(atten_dim * 3 + time_embed_dim, 1536),
153
+ nn.SiLU(),
154
+ nn.Linear(atten_dim, num_class)
155
+ )
156
+ self._cls_pred_branch.requires_grad_(True)
157
+ num_registers = 3
158
+ self._register_tokens = RegisterTokens(num_registers=num_registers, dim=atten_dim)
159
+ self._register_tokens.requires_grad_(True)
160
+
161
+ gan_ca_blocks = []
162
+ for _ in range(num_registers):
163
+ block = GanAttentionBlock()
164
+ gan_ca_blocks.append(block)
165
+ self._gan_ca_blocks = nn.ModuleList(gan_ca_blocks)
166
+ self._gan_ca_blocks.requires_grad_(True)
167
+ # self.has_cls_branch = True
168
+
169
+ def _convert_flow_pred_to_x0(self, flow_pred: torch.Tensor, xt: torch.Tensor, timestep: torch.Tensor) -> torch.Tensor:
170
+ """
171
+ Convert flow matching's prediction to x0 prediction.
172
+ flow_pred: the prediction with shape [B, C, H, W]
173
+ xt: the input noisy data with shape [B, C, H, W]
174
+ timestep: the timestep with shape [B]
175
+
176
+ pred = noise - x0
177
+ x_t = (1-sigma_t) * x0 + sigma_t * noise
178
+ we have x0 = x_t - sigma_t * pred
179
+ see derivations https://chatgpt.com/share/67bf8589-3d04-8008-bc6e-4cf1a24e2d0e
180
+ """
181
+ # use higher precision for calculations
182
+ original_dtype = flow_pred.dtype
183
+ flow_pred, xt, sigmas, timesteps = map(
184
+ lambda x: x.double().to(flow_pred.device), [flow_pred, xt,
185
+ self.scheduler.sigmas,
186
+ self.scheduler.timesteps]
187
+ )
188
+
189
+ timestep_id = torch.argmin(
190
+ (timesteps.unsqueeze(0) - timestep.unsqueeze(1)).abs(), dim=1)
191
+ sigma_t = sigmas[timestep_id].reshape(-1, 1, 1, 1)
192
+ x0_pred = xt - sigma_t * flow_pred
193
+ return x0_pred.to(original_dtype)
194
+
195
+ @staticmethod
196
+ def _convert_x0_to_flow_pred(scheduler, x0_pred: torch.Tensor, xt: torch.Tensor, timestep: torch.Tensor) -> torch.Tensor:
197
+ """
198
+ Convert x0 prediction to flow matching's prediction.
199
+ x0_pred: the x0 prediction with shape [B, C, H, W]
200
+ xt: the input noisy data with shape [B, C, H, W]
201
+ timestep: the timestep with shape [B]
202
+
203
+ pred = (x_t - x_0) / sigma_t
204
+ """
205
+ # use higher precision for calculations
206
+ original_dtype = x0_pred.dtype
207
+ x0_pred, xt, sigmas, timesteps = map(
208
+ lambda x: x.double().to(x0_pred.device), [x0_pred, xt,
209
+ scheduler.sigmas,
210
+ scheduler.timesteps]
211
+ )
212
+ timestep_id = torch.argmin(
213
+ (timesteps.unsqueeze(0) - timestep.unsqueeze(1)).abs(), dim=1)
214
+ sigma_t = sigmas[timestep_id].reshape(-1, 1, 1, 1)
215
+ flow_pred = (xt - x0_pred) / sigma_t
216
+ return flow_pred.to(original_dtype)
217
+
218
+ def forward(
219
+ self,
220
+ noisy_image_or_video: torch.Tensor, conditional_dict: dict,
221
+ timestep: torch.Tensor, kv_cache: Optional[List[dict]] = None,
222
+ crossattn_cache: Optional[List[dict]] = None,
223
+ current_start: Optional[int] = None,
224
+ classify_mode: Optional[bool] = False,
225
+ concat_time_embeddings: Optional[bool] = False,
226
+ clean_x: Optional[torch.Tensor] = None,
227
+ aug_t: Optional[torch.Tensor] = None,
228
+ cache_start: Optional[int] = None,
229
+ updating_cache: Optional[bool] = False
230
+ ) -> torch.Tensor:
231
+ prompt_embeds = conditional_dict["prompt_embeds"]
232
+
233
+ # [B, F] -> [B]
234
+ if self.uniform_timestep:
235
+ input_timestep = timestep[:, 0]
236
+ else:
237
+ input_timestep = timestep
238
+
239
+ logits = None
240
+ # X0 prediction
241
+ if kv_cache is not None:
242
+ flow_pred = self.model(
243
+ noisy_image_or_video.permute(0, 2, 1, 3, 4),
244
+ t=input_timestep, context=prompt_embeds,
245
+ seq_len=self.seq_len,
246
+ kv_cache=kv_cache,
247
+ crossattn_cache=crossattn_cache,
248
+ current_start=current_start,
249
+ cache_start=cache_start,
250
+ updating_cache=updating_cache
251
+ ).permute(0, 2, 1, 3, 4)
252
+ else:
253
+ if clean_x is not None:
254
+ # teacher forcing
255
+ flow_pred = self.model(
256
+ noisy_image_or_video.permute(0, 2, 1, 3, 4),
257
+ t=input_timestep, context=prompt_embeds,
258
+ seq_len=self.seq_len,
259
+ clean_x=clean_x.permute(0, 2, 1, 3, 4),
260
+ aug_t=aug_t,
261
+ ).permute(0, 2, 1, 3, 4)
262
+ else:
263
+ if classify_mode:
264
+ flow_pred, logits = self.model(
265
+ noisy_image_or_video.permute(0, 2, 1, 3, 4),
266
+ t=input_timestep, context=prompt_embeds,
267
+ seq_len=self.seq_len,
268
+ classify_mode=True,
269
+ register_tokens=self._register_tokens,
270
+ cls_pred_branch=self._cls_pred_branch,
271
+ gan_ca_blocks=self._gan_ca_blocks,
272
+ concat_time_embeddings=concat_time_embeddings
273
+ )
274
+ flow_pred = flow_pred.permute(0, 2, 1, 3, 4)
275
+ else:
276
+ flow_pred = self.model(
277
+ noisy_image_or_video.permute(0, 2, 1, 3, 4),
278
+ t=input_timestep, context=prompt_embeds,
279
+ seq_len=self.seq_len
280
+ ).permute(0, 2, 1, 3, 4)
281
+
282
+ pred_x0 = self._convert_flow_pred_to_x0(
283
+ flow_pred=flow_pred.flatten(0, 1),
284
+ xt=noisy_image_or_video.flatten(0, 1),
285
+ timestep=timestep.flatten(0, 1)
286
+ ).unflatten(0, flow_pred.shape[:2])
287
+
288
+ if logits is not None:
289
+ return flow_pred, pred_x0, logits
290
+
291
+ return flow_pred, pred_x0
292
+
293
+ def get_scheduler(self) -> SchedulerInterface:
294
+ """
295
+ Update the current scheduler with the interface's static method
296
+ """
297
+ scheduler = self.scheduler
298
+ scheduler.convert_x0_to_noise = types.MethodType(
299
+ SchedulerInterface.convert_x0_to_noise, scheduler)
300
+ scheduler.convert_noise_to_x0 = types.MethodType(
301
+ SchedulerInterface.convert_noise_to_x0, scheduler)
302
+ scheduler.convert_velocity_to_x0 = types.MethodType(
303
+ SchedulerInterface.convert_velocity_to_x0, scheduler)
304
+ self.scheduler = scheduler
305
+ return scheduler
306
+
307
+ def post_init(self):
308
+ """
309
+ A few custom initialization steps that should be called after the object is created.
310
+ Currently, the only one we have is to bind a few methods to scheduler.
311
+ We can gradually add more methods here if needed.
312
+ """
313
+ self.get_scheduler()
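
The two static conversions in `WanDiffusionWrapper` are exact inverses for a given sigma_t, and under the flow-matching parameterization the true flow equals noise - x0. A quick numeric check of those identities with plain tensors (no model involved):

import torch

sigma_t = torch.tensor(0.37)                 # an arbitrary noise level
x0 = torch.randn(1, 16, 8, 8)
noise = torch.randn_like(x0)
xt = (1 - sigma_t) * x0 + sigma_t * noise    # forward corruption

flow = (xt - x0) / sigma_t                   # as in _convert_x0_to_flow_pred
assert torch.allclose(flow, noise - x0, atol=1e-5)
assert torch.allclose(xt - sigma_t * flow, x0, atol=1e-5)   # as in _convert_flow_pred_to_x0
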
wan/README.md ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ Code in this folder is modified from https://github.com/Wan-Video/Wan2.1
2
+ Apache-2.0 License
wan/__init__.py ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ from . import configs, distributed, modules
2
+ from .image2video import WanI2V
3
+ from .text2video import WanT2V
wan/configs/__init__.py ADDED
@@ -0,0 +1,42 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
2
+ from .wan_t2v_14B import t2v_14B
3
+ from .wan_t2v_1_3B import t2v_1_3B
4
+ from .wan_i2v_14B import i2v_14B
5
+ import copy
6
+ import os
7
+
8
+ os.environ['TOKENIZERS_PARALLELISM'] = 'false'
9
+
10
+
11
+ # the config of t2i_14B is the same as t2v_14B
12
+ t2i_14B = copy.deepcopy(t2v_14B)
13
+ t2i_14B.__name__ = 'Config: Wan T2I 14B'
14
+
15
+ WAN_CONFIGS = {
16
+ 't2v-14B': t2v_14B,
17
+ 't2v-1.3B': t2v_1_3B,
18
+ 'i2v-14B': i2v_14B,
19
+ 't2i-14B': t2i_14B,
20
+ }
21
+
22
+ SIZE_CONFIGS = {
23
+ '720*1280': (720, 1280),
24
+ '1280*720': (1280, 720),
25
+ '480*832': (480, 832),
26
+ '832*480': (832, 480),
27
+ '1024*1024': (1024, 1024),
28
+ }
29
+
30
+ MAX_AREA_CONFIGS = {
31
+ '720*1280': 720 * 1280,
32
+ '1280*720': 1280 * 720,
33
+ '480*832': 480 * 832,
34
+ '832*480': 832 * 480,
35
+ }
36
+
37
+ SUPPORTED_SIZES = {
38
+ 't2v-14B': ('720*1280', '1280*720', '480*832', '832*480'),
39
+ 't2v-1.3B': ('480*832', '832*480'),
40
+ 'i2v-14B': ('720*1280', '1280*720', '480*832', '832*480'),
41
+ 't2i-14B': tuple(SIZE_CONFIGS.keys()),
42
+ }
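
The config registry maps task names to EasyDict configs and resolution strings to pixel sizes. A short selection sketch (importing through `wan.configs` assumes the package's heavier dependencies are installed, since `wan/__init__.py` pulls in the full pipeline):

from wan.configs import WAN_CONFIGS, SIZE_CONFIGS, SUPPORTED_SIZES

task, size = "t2v-1.3B", "832*480"
assert size in SUPPORTED_SIZES[task], f"{size} is not supported for {task}"
cfg = WAN_CONFIGS[task]                 # EasyDict with dim=1536, num_layers=30, ...
dims = SIZE_CONFIGS[size]               # (832, 480), in the same order as the key
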
wan/configs/shared_config.py ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
2
+ import torch
3
+ from easydict import EasyDict
4
+
5
+ # ------------------------ Wan shared config ------------------------#
6
+ wan_shared_cfg = EasyDict()
7
+
8
+ # t5
9
+ wan_shared_cfg.t5_model = 'umt5_xxl'
10
+ wan_shared_cfg.t5_dtype = torch.bfloat16
11
+ wan_shared_cfg.text_len = 512
12
+
13
+ # transformer
14
+ wan_shared_cfg.param_dtype = torch.bfloat16
15
+
16
+ # inference
17
+ wan_shared_cfg.num_train_timesteps = 1000
18
+ wan_shared_cfg.sample_fps = 16
19
+ wan_shared_cfg.sample_neg_prompt = '色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走'
wan/configs/wan_i2v_14B.py ADDED
@@ -0,0 +1,35 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
2
+ import torch
3
+ from easydict import EasyDict
4
+
5
+ from .shared_config import wan_shared_cfg
6
+
7
+ # ------------------------ Wan I2V 14B ------------------------#
8
+
9
+ i2v_14B = EasyDict(__name__='Config: Wan I2V 14B')
10
+ i2v_14B.update(wan_shared_cfg)
11
+
12
+ i2v_14B.t5_checkpoint = 'models_t5_umt5-xxl-enc-bf16.pth'
13
+ i2v_14B.t5_tokenizer = 'google/umt5-xxl'
14
+
15
+ # clip
16
+ i2v_14B.clip_model = 'clip_xlm_roberta_vit_h_14'
17
+ i2v_14B.clip_dtype = torch.float16
18
+ i2v_14B.clip_checkpoint = 'models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth'
19
+ i2v_14B.clip_tokenizer = 'xlm-roberta-large'
20
+
21
+ # vae
22
+ i2v_14B.vae_checkpoint = 'Wan2.1_VAE.pth'
23
+ i2v_14B.vae_stride = (4, 8, 8)
24
+
25
+ # transformer
26
+ i2v_14B.patch_size = (1, 2, 2)
27
+ i2v_14B.dim = 5120
28
+ i2v_14B.ffn_dim = 13824
29
+ i2v_14B.freq_dim = 256
30
+ i2v_14B.num_heads = 40
31
+ i2v_14B.num_layers = 40
32
+ i2v_14B.window_size = (-1, -1)
33
+ i2v_14B.qk_norm = True
34
+ i2v_14B.cross_attn_norm = True
35
+ i2v_14B.eps = 1e-6
wan/configs/wan_t2v_14B.py ADDED
@@ -0,0 +1,29 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
2
+ from easydict import EasyDict
3
+
4
+ from .shared_config import wan_shared_cfg
5
+
6
+ # ------------------------ Wan T2V 14B ------------------------#
7
+
8
+ t2v_14B = EasyDict(__name__='Config: Wan T2V 14B')
9
+ t2v_14B.update(wan_shared_cfg)
10
+
11
+ # t5
12
+ t2v_14B.t5_checkpoint = 'models_t5_umt5-xxl-enc-bf16.pth'
13
+ t2v_14B.t5_tokenizer = 'google/umt5-xxl'
14
+
15
+ # vae
16
+ t2v_14B.vae_checkpoint = 'Wan2.1_VAE.pth'
17
+ t2v_14B.vae_stride = (4, 8, 8)
18
+
19
+ # transformer
20
+ t2v_14B.patch_size = (1, 2, 2)
21
+ t2v_14B.dim = 5120
22
+ t2v_14B.ffn_dim = 13824
23
+ t2v_14B.freq_dim = 256
24
+ t2v_14B.num_heads = 40
25
+ t2v_14B.num_layers = 40
26
+ t2v_14B.window_size = (-1, -1)
27
+ t2v_14B.qk_norm = True
28
+ t2v_14B.cross_attn_norm = True
29
+ t2v_14B.eps = 1e-6
wan/configs/wan_t2v_1_3B.py ADDED
@@ -0,0 +1,29 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
2
+ from easydict import EasyDict
3
+
4
+ from .shared_config import wan_shared_cfg
5
+
6
+ # ------------------------ Wan T2V 1.3B ------------------------#
7
+
8
+ t2v_1_3B = EasyDict(__name__='Config: Wan T2V 1.3B')
9
+ t2v_1_3B.update(wan_shared_cfg)
10
+
11
+ # t5
12
+ t2v_1_3B.t5_checkpoint = 'models_t5_umt5-xxl-enc-bf16.pth'
13
+ t2v_1_3B.t5_tokenizer = 'google/umt5-xxl'
14
+
15
+ # vae
16
+ t2v_1_3B.vae_checkpoint = 'Wan2.1_VAE.pth'
17
+ t2v_1_3B.vae_stride = (4, 8, 8)
18
+
19
+ # transformer
20
+ t2v_1_3B.patch_size = (1, 2, 2)
21
+ t2v_1_3B.dim = 1536
22
+ t2v_1_3B.ffn_dim = 8960
23
+ t2v_1_3B.freq_dim = 256
24
+ t2v_1_3B.num_heads = 12
25
+ t2v_1_3B.num_layers = 30
26
+ t2v_1_3B.window_size = (-1, -1)
27
+ t2v_1_3B.qk_norm = True
28
+ t2v_1_3B.cross_attn_norm = True
29
+ t2v_1_3B.eps = 1e-6
wan/distributed/__init__.py ADDED
File without changes
wan/distributed/fsdp.py ADDED
@@ -0,0 +1,33 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
2
+ from functools import partial
3
+
4
+ import torch
5
+ from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
6
+ from torch.distributed.fsdp import MixedPrecision, ShardingStrategy
7
+ from torch.distributed.fsdp.wrap import lambda_auto_wrap_policy
8
+
9
+
10
+ def shard_model(
11
+ model,
12
+ device_id,
13
+ param_dtype=torch.bfloat16,
14
+ reduce_dtype=torch.float32,
15
+ buffer_dtype=torch.float32,
16
+ process_group=None,
17
+ sharding_strategy=ShardingStrategy.FULL_SHARD,
18
+ sync_module_states=True,
19
+ ):
20
+ model = FSDP(
21
+ module=model,
22
+ process_group=process_group,
23
+ sharding_strategy=sharding_strategy,
24
+ auto_wrap_policy=partial(
25
+ lambda_auto_wrap_policy, lambda_fn=lambda m: m in model.blocks),
26
+ mixed_precision=MixedPrecision(
27
+ param_dtype=param_dtype,
28
+ reduce_dtype=reduce_dtype,
29
+ buffer_dtype=buffer_dtype),
30
+ device_id=device_id,
31
+ use_orig_params=True,
32
+ sync_module_states=sync_module_states)
33
+ return model
wan/distributed/xdit_context_parallel.py ADDED
@@ -0,0 +1,192 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
2
+ import torch
3
+ import torch.cuda.amp as amp
4
+ from xfuser.core.distributed import (get_sequence_parallel_rank,
5
+ get_sequence_parallel_world_size,
6
+ get_sp_group)
7
+ from xfuser.core.long_ctx_attention import xFuserLongContextAttention
8
+
9
+ from ..modules.model import sinusoidal_embedding_1d
10
+
11
+
12
+ def pad_freqs(original_tensor, target_len):
13
+ seq_len, s1, s2 = original_tensor.shape
14
+ pad_size = target_len - seq_len
15
+ padding_tensor = torch.ones(
16
+ pad_size,
17
+ s1,
18
+ s2,
19
+ dtype=original_tensor.dtype,
20
+ device=original_tensor.device)
21
+ padded_tensor = torch.cat([original_tensor, padding_tensor], dim=0)
22
+ return padded_tensor
23
+
24
+
25
+ @amp.autocast(enabled=False)
26
+ def rope_apply(x, grid_sizes, freqs):
27
+ """
28
+ x: [B, L, N, C].
29
+ grid_sizes: [B, 3].
30
+ freqs: [M, C // 2].
31
+ """
32
+ s, n, c = x.size(1), x.size(2), x.size(3) // 2
33
+ # split freqs
34
+ freqs = freqs.split([c - 2 * (c // 3), c // 3, c // 3], dim=1)
35
+
36
+ # loop over samples
37
+ output = []
38
+ for i, (f, h, w) in enumerate(grid_sizes.tolist()):
39
+ seq_len = f * h * w
40
+
41
+ # precompute multipliers
42
+ x_i = torch.view_as_complex(x[i, :s].to(torch.float64).reshape(
43
+ s, n, -1, 2))
44
+ freqs_i = torch.cat([
45
+ freqs[0][:f].view(f, 1, 1, -1).expand(f, h, w, -1),
46
+ freqs[1][:h].view(1, h, 1, -1).expand(f, h, w, -1),
47
+ freqs[2][:w].view(1, 1, w, -1).expand(f, h, w, -1)
48
+ ],
49
+ dim=-1).reshape(seq_len, 1, -1)
50
+
51
+ # apply rotary embedding
52
+ sp_size = get_sequence_parallel_world_size()
53
+ sp_rank = get_sequence_parallel_rank()
54
+ freqs_i = pad_freqs(freqs_i, s * sp_size)
55
+ s_per_rank = s
56
+ freqs_i_rank = freqs_i[(sp_rank * s_per_rank):((sp_rank + 1) *
57
+ s_per_rank), :, :]
58
+ x_i = torch.view_as_real(x_i * freqs_i_rank).flatten(2)
59
+ x_i = torch.cat([x_i, x[i, s:]])
60
+
61
+ # append to collection
62
+ output.append(x_i)
63
+ return torch.stack(output).float()
64
+
65
+
66
+ def usp_dit_forward(
67
+ self,
68
+ x,
69
+ t,
70
+ context,
71
+ seq_len,
72
+ clip_fea=None,
73
+ y=None,
74
+ ):
75
+ """
76
+ x: A list of videos each with shape [C, T, H, W].
77
+ t: [B].
78
+ context: A list of text embeddings each with shape [L, C].
79
+ """
80
+ if self.model_type == 'i2v':
81
+ assert clip_fea is not None and y is not None
82
+ # params
83
+ device = self.patch_embedding.weight.device
84
+ if self.freqs.device != device:
85
+ self.freqs = self.freqs.to(device)
86
+
87
+ if y is not None:
88
+ x = [torch.cat([u, v], dim=0) for u, v in zip(x, y)]
89
+
90
+ # embeddings
91
+ x = [self.patch_embedding(u.unsqueeze(0)) for u in x]
92
+ grid_sizes = torch.stack(
93
+ [torch.tensor(u.shape[2:], dtype=torch.long) for u in x])
94
+ x = [u.flatten(2).transpose(1, 2) for u in x]
95
+ seq_lens = torch.tensor([u.size(1) for u in x], dtype=torch.long)
96
+ assert seq_lens.max() <= seq_len
97
+ x = torch.cat([
98
+ torch.cat([u, u.new_zeros(1, seq_len - u.size(1), u.size(2))], dim=1)
99
+ for u in x
100
+ ])
101
+
102
+ # time embeddings
103
+ with amp.autocast(dtype=torch.float32):
104
+ e = self.time_embedding(
105
+ sinusoidal_embedding_1d(self.freq_dim, t).float())
106
+ e0 = self.time_projection(e).unflatten(1, (6, self.dim))
107
+ assert e.dtype == torch.float32 and e0.dtype == torch.float32
108
+
109
+ # context
110
+ context_lens = None
111
+ context = self.text_embedding(
112
+ torch.stack([
113
+ torch.cat([u, u.new_zeros(self.text_len - u.size(0), u.size(1))])
114
+ for u in context
115
+ ]))
116
+
117
+ if clip_fea is not None:
118
+ context_clip = self.img_emb(clip_fea) # bs x 257 x dim
119
+ context = torch.concat([context_clip, context], dim=1)
120
+
121
+ # arguments
122
+ kwargs = dict(
123
+ e=e0,
124
+ seq_lens=seq_lens,
125
+ grid_sizes=grid_sizes,
126
+ freqs=self.freqs,
127
+ context=context,
128
+ context_lens=context_lens)
129
+
130
+ # Context Parallel
131
+ x = torch.chunk(
132
+ x, get_sequence_parallel_world_size(),
133
+ dim=1)[get_sequence_parallel_rank()]
134
+
135
+ for block in self.blocks:
136
+ x = block(x, **kwargs)
137
+
138
+ # head
139
+ x = self.head(x, e)
140
+
141
+ # Context Parallel
142
+ x = get_sp_group().all_gather(x, dim=1)
143
+
144
+ # unpatchify
145
+ x = self.unpatchify(x, grid_sizes)
146
+ return [u.float() for u in x]
147
+
148
+
149
+ def usp_attn_forward(self,
150
+ x,
151
+ seq_lens,
152
+ grid_sizes,
153
+ freqs,
154
+ dtype=torch.bfloat16):
155
+ b, s, n, d = *x.shape[:2], self.num_heads, self.head_dim
156
+ half_dtypes = (torch.float16, torch.bfloat16)
157
+
158
+ def half(x):
159
+ return x if x.dtype in half_dtypes else x.to(dtype)
160
+
161
+ # query, key, value function
162
+ def qkv_fn(x):
163
+ q = self.norm_q(self.q(x)).view(b, s, n, d)
164
+ k = self.norm_k(self.k(x)).view(b, s, n, d)
165
+ v = self.v(x).view(b, s, n, d)
166
+ return q, k, v
167
+
168
+ q, k, v = qkv_fn(x)
169
+ q = rope_apply(q, grid_sizes, freqs)
170
+ k = rope_apply(k, grid_sizes, freqs)
171
+
172
+ # TODO: We should use unpadded q, k, v for attention.
173
+ # k_lens = seq_lens // get_sequence_parallel_world_size()
174
+ # if k_lens is not None:
175
+ # q = torch.cat([u[:l] for u, l in zip(q, k_lens)]).unsqueeze(0)
176
+ # k = torch.cat([u[:l] for u, l in zip(k, k_lens)]).unsqueeze(0)
177
+ # v = torch.cat([u[:l] for u, l in zip(v, k_lens)]).unsqueeze(0)
178
+
179
+ x = xFuserLongContextAttention()(
180
+ None,
181
+ query=half(q),
182
+ key=half(k),
183
+ value=half(v),
184
+ window_size=self.window_size)
185
+
186
+ # TODO: padding after attention.
187
+ # x = torch.cat([x, x.new_zeros(b, s - x.size(1), n, d)], dim=1)
188
+
189
+ # output
190
+ x = x.flatten(2)
191
+ x = self.o(x)
192
+ return x
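A small worked example of the frequency split used by `rope_apply` above; a head dimension of 128 is assumed, matching the 1.3B config (dim=1536, num_heads=12):

c = 128 // 2                                      # half the per-head dimension
split_sizes = [c - 2 * (c // 3), c // 3, c // 3]  # temporal, height, width groups
assert split_sizes == [22, 21, 21] and sum(split_sizes) == c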
wan/image2video.py ADDED
@@ -0,0 +1,347 @@
1
+ # Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
2
+ import gc
3
+ import logging
4
+ import math
5
+ import os
6
+ import random
7
+ import sys
8
+ import types
9
+ from contextlib import contextmanager
10
+ from functools import partial
11
+
12
+ import numpy as np
13
+ import torch
14
+ import torch.cuda.amp as amp
15
+ import torch.distributed as dist
16
+ import torchvision.transforms.functional as TF
17
+ from tqdm import tqdm
18
+
19
+ from .distributed.fsdp import shard_model
20
+ from .modules.clip import CLIPModel
21
+ from .modules.model import WanModel
22
+ from .modules.t5 import T5EncoderModel
23
+ from .modules.vae import WanVAE
24
+ from .utils.fm_solvers import (FlowDPMSolverMultistepScheduler,
25
+ get_sampling_sigmas, retrieve_timesteps)
26
+ from .utils.fm_solvers_unipc import FlowUniPCMultistepScheduler
27
+
28
+
29
+ class WanI2V:
30
+
31
+ def __init__(
32
+ self,
33
+ config,
34
+ checkpoint_dir,
35
+ device_id=0,
36
+ rank=0,
37
+ t5_fsdp=False,
38
+ dit_fsdp=False,
39
+ use_usp=False,
40
+ t5_cpu=False,
41
+ init_on_cpu=True,
42
+ ):
43
+ r"""
44
+ Initializes the image-to-video generation model components.
45
+
46
+ Args:
47
+ config (EasyDict):
48
+ Object containing model parameters initialized from config.py
49
+ checkpoint_dir (`str`):
50
+ Path to directory containing model checkpoints
51
+ device_id (`int`, *optional*, defaults to 0):
52
+ Id of target GPU device
53
+ rank (`int`, *optional*, defaults to 0):
54
+ Process rank for distributed training
55
+ t5_fsdp (`bool`, *optional*, defaults to False):
56
+ Enable FSDP sharding for T5 model
57
+ dit_fsdp (`bool`, *optional*, defaults to False):
58
+ Enable FSDP sharding for DiT model
59
+ use_usp (`bool`, *optional*, defaults to False):
60
+ Enable distribution strategy of USP.
61
+ t5_cpu (`bool`, *optional*, defaults to False):
62
+ Whether to place T5 model on CPU. Only works without t5_fsdp.
63
+ init_on_cpu (`bool`, *optional*, defaults to True):
64
+ Enable initializing Transformer Model on CPU. Only works without FSDP or USP.
65
+ """
66
+ self.device = torch.device(f"cuda:{device_id}")
67
+ self.config = config
68
+ self.rank = rank
69
+ self.use_usp = use_usp
70
+ self.t5_cpu = t5_cpu
71
+
72
+ self.num_train_timesteps = config.num_train_timesteps
73
+ self.param_dtype = config.param_dtype
74
+
75
+ shard_fn = partial(shard_model, device_id=device_id)
76
+ self.text_encoder = T5EncoderModel(
77
+ text_len=config.text_len,
78
+ dtype=config.t5_dtype,
79
+ device=torch.device('cpu'),
80
+ checkpoint_path=os.path.join(checkpoint_dir, config.t5_checkpoint),
81
+ tokenizer_path=os.path.join(checkpoint_dir, config.t5_tokenizer),
82
+ shard_fn=shard_fn if t5_fsdp else None,
83
+ )
84
+
85
+ self.vae_stride = config.vae_stride
86
+ self.patch_size = config.patch_size
87
+ self.vae = WanVAE(
88
+ vae_pth=os.path.join(checkpoint_dir, config.vae_checkpoint),
89
+ device=self.device)
90
+
91
+ self.clip = CLIPModel(
92
+ dtype=config.clip_dtype,
93
+ device=self.device,
94
+ checkpoint_path=os.path.join(checkpoint_dir,
95
+ config.clip_checkpoint),
96
+ tokenizer_path=os.path.join(checkpoint_dir, config.clip_tokenizer))
97
+
98
+ logging.info(f"Creating WanModel from {checkpoint_dir}")
99
+ self.model = WanModel.from_pretrained(checkpoint_dir)
100
+ self.model.eval().requires_grad_(False)
101
+
102
+ if t5_fsdp or dit_fsdp or use_usp:
103
+ init_on_cpu = False
104
+
105
+ if use_usp:
106
+ from xfuser.core.distributed import \
107
+ get_sequence_parallel_world_size
108
+
109
+ from .distributed.xdit_context_parallel import (usp_attn_forward,
110
+ usp_dit_forward)
111
+ for block in self.model.blocks:
112
+ block.self_attn.forward = types.MethodType(
113
+ usp_attn_forward, block.self_attn)
114
+ self.model.forward = types.MethodType(usp_dit_forward, self.model)
115
+ self.sp_size = get_sequence_parallel_world_size()
116
+ else:
117
+ self.sp_size = 1
118
+
119
+ if dist.is_initialized():
120
+ dist.barrier()
121
+ if dit_fsdp:
122
+ self.model = shard_fn(self.model)
123
+ else:
124
+ if not init_on_cpu:
125
+ self.model.to(self.device)
126
+
127
+ self.sample_neg_prompt = config.sample_neg_prompt
128
+
129
+ def generate(self,
130
+ input_prompt,
131
+ img,
132
+ max_area=720 * 1280,
133
+ frame_num=81,
134
+ shift=5.0,
135
+ sample_solver='unipc',
136
+ sampling_steps=40,
137
+ guide_scale=5.0,
138
+ n_prompt="",
139
+ seed=-1,
140
+ offload_model=True):
141
+ r"""
142
+ Generates video frames from input image and text prompt using diffusion process.
143
+
144
+ Args:
145
+ input_prompt (`str`):
146
+ Text prompt for content generation.
147
+ img (PIL.Image.Image):
148
+ Input image tensor. Shape: [3, H, W]
149
+ max_area (`int`, *optional*, defaults to 720*1280):
150
+ Maximum pixel area for latent space calculation. Controls video resolution scaling
151
+ frame_num (`int`, *optional*, defaults to 81):
152
+ How many frames to sample from a video. The number should be 4n+1
153
+ shift (`float`, *optional*, defaults to 5.0):
154
+ Noise schedule shift parameter. Affects temporal dynamics
155
+ [NOTE]: If you want to generate a 480p video, it is recommended to set the shift value to 3.0.
156
+ sample_solver (`str`, *optional*, defaults to 'unipc'):
157
+ guide_scale (`float`, *optional*, defaults to 5.0):
158
+ sampling_steps (`int`, *optional*, defaults to 40):
159
+ Number of diffusion sampling steps. Higher values improve quality but slow generation
160
+ guide_scale (`float`, *optional*, defaults 5.0):
161
+ Classifier-free guidance scale. Controls prompt adherence vs. creativity
162
+ n_prompt (`str`, *optional*, defaults to ""):
163
+ Negative prompt for content exclusion. If not given, use `config.sample_neg_prompt`
164
+ seed (`int`, *optional*, defaults to -1):
165
+ Random seed for noise generation. If -1, use random seed
166
+ offload_model (`bool`, *optional*, defaults to True):
167
+ If True, offloads models to CPU during generation to save VRAM
168
+
169
+ Returns:
170
+ torch.Tensor:
171
+ Generated video frames tensor. Dimensions: (C, N, H, W) where:
172
+ - C: Color channels (3 for RGB)
173
+ - N: Number of frames (81)
174
+ - H: Frame height (from max_area)
175
+ - W: Frame width (from max_area)
176
+ """
177
+ img = TF.to_tensor(img).sub_(0.5).div_(0.5).to(self.device)
178
+
179
+ F = frame_num
180
+ h, w = img.shape[1:]
181
+ aspect_ratio = h / w
182
+ lat_h = round(
183
+ np.sqrt(max_area * aspect_ratio) // self.vae_stride[1] //
184
+ self.patch_size[1] * self.patch_size[1])
185
+ lat_w = round(
186
+ np.sqrt(max_area / aspect_ratio) // self.vae_stride[2] //
187
+ self.patch_size[2] * self.patch_size[2])
188
+ h = lat_h * self.vae_stride[1]
189
+ w = lat_w * self.vae_stride[2]
190
+
191
+ max_seq_len = ((F - 1) // self.vae_stride[0] + 1) * lat_h * lat_w // (
192
+ self.patch_size[1] * self.patch_size[2])
193
+ max_seq_len = int(math.ceil(max_seq_len / self.sp_size)) * self.sp_size
194
+
195
+ seed = seed if seed >= 0 else random.randint(0, sys.maxsize)
196
+ seed_g = torch.Generator(device=self.device)
197
+ seed_g.manual_seed(seed)
198
+ noise = torch.randn(
199
+ 16,
200
+ 21,
201
+ lat_h,
202
+ lat_w,
203
+ dtype=torch.float32,
204
+ generator=seed_g,
205
+ device=self.device)
206
+
207
+ msk = torch.ones(1, 81, lat_h, lat_w, device=self.device)
208
+ msk[:, 1:] = 0
209
+ msk = torch.concat([
210
+ torch.repeat_interleave(msk[:, 0:1], repeats=4, dim=1), msk[:, 1:]
211
+ ],
212
+ dim=1)
213
+ msk = msk.view(1, msk.shape[1] // 4, 4, lat_h, lat_w)
214
+ msk = msk.transpose(1, 2)[0]
215
+
216
+ if n_prompt == "":
217
+ n_prompt = self.sample_neg_prompt
218
+
219
+ # preprocess
220
+ if not self.t5_cpu:
221
+ self.text_encoder.model.to(self.device)
222
+ context = self.text_encoder([input_prompt], self.device)
223
+ context_null = self.text_encoder([n_prompt], self.device)
224
+ if offload_model:
225
+ self.text_encoder.model.cpu()
226
+ else:
227
+ context = self.text_encoder([input_prompt], torch.device('cpu'))
228
+ context_null = self.text_encoder([n_prompt], torch.device('cpu'))
229
+ context = [t.to(self.device) for t in context]
230
+ context_null = [t.to(self.device) for t in context_null]
231
+
232
+ self.clip.model.to(self.device)
233
+ clip_context = self.clip.visual([img[:, None, :, :]])
234
+ if offload_model:
235
+ self.clip.model.cpu()
236
+
237
+ y = self.vae.encode([
238
+ torch.concat([
239
+ torch.nn.functional.interpolate(
240
+ img[None].cpu(), size=(h, w), mode='bicubic').transpose(
241
+ 0, 1),
242
+ torch.zeros(3, 80, h, w)
243
+ ],
244
+ dim=1).to(self.device)
245
+ ])[0]
246
+ y = torch.concat([msk, y])
247
+
248
+ @contextmanager
249
+ def noop_no_sync():
250
+ yield
251
+
252
+ no_sync = getattr(self.model, 'no_sync', noop_no_sync)
253
+
254
+ # evaluation mode
255
+ with amp.autocast(dtype=self.param_dtype), torch.no_grad(), no_sync():
256
+
257
+ if sample_solver == 'unipc':
258
+ sample_scheduler = FlowUniPCMultistepScheduler(
259
+ num_train_timesteps=self.num_train_timesteps,
260
+ shift=1,
261
+ use_dynamic_shifting=False)
262
+ sample_scheduler.set_timesteps(
263
+ sampling_steps, device=self.device, shift=shift)
264
+ timesteps = sample_scheduler.timesteps
265
+ elif sample_solver == 'dpm++':
266
+ sample_scheduler = FlowDPMSolverMultistepScheduler(
267
+ num_train_timesteps=self.num_train_timesteps,
268
+ shift=1,
269
+ use_dynamic_shifting=False)
270
+ sampling_sigmas = get_sampling_sigmas(sampling_steps, shift)
271
+ timesteps, _ = retrieve_timesteps(
272
+ sample_scheduler,
273
+ device=self.device,
274
+ sigmas=sampling_sigmas)
275
+ else:
276
+ raise NotImplementedError("Unsupported solver.")
277
+
278
+ # sample videos
279
+ latent = noise
280
+
281
+ arg_c = {
282
+ 'context': [context[0]],
283
+ 'clip_fea': clip_context,
284
+ 'seq_len': max_seq_len,
285
+ 'y': [y],
286
+ }
287
+
288
+ arg_null = {
289
+ 'context': context_null,
290
+ 'clip_fea': clip_context,
291
+ 'seq_len': max_seq_len,
292
+ 'y': [y],
293
+ }
294
+
295
+ if offload_model:
296
+ torch.cuda.empty_cache()
297
+
298
+ self.model.to(self.device)
299
+ for _, t in enumerate(tqdm(timesteps)):
300
+ latent_model_input = [latent.to(self.device)]
301
+ timestep = [t]
302
+
303
+ timestep = torch.stack(timestep).to(self.device)
304
+
305
+ noise_pred_cond = self.model(
306
+ latent_model_input, t=timestep, **arg_c)[0].to(
307
+ torch.device('cpu') if offload_model else self.device)
308
+ if offload_model:
309
+ torch.cuda.empty_cache()
310
+ noise_pred_uncond = self.model(
311
+ latent_model_input, t=timestep, **arg_null)[0].to(
312
+ torch.device('cpu') if offload_model else self.device)
313
+ if offload_model:
314
+ torch.cuda.empty_cache()
315
+ noise_pred = noise_pred_uncond + guide_scale * (
316
+ noise_pred_cond - noise_pred_uncond)
317
+
318
+ latent = latent.to(
319
+ torch.device('cpu') if offload_model else self.device)
320
+
321
+ temp_x0 = sample_scheduler.step(
322
+ noise_pred.unsqueeze(0),
323
+ t,
324
+ latent.unsqueeze(0),
325
+ return_dict=False,
326
+ generator=seed_g)[0]
327
+ latent = temp_x0.squeeze(0)
328
+
329
+ x0 = [latent.to(self.device)]
330
+ del latent_model_input, timestep
331
+
332
+ if offload_model:
333
+ self.model.cpu()
334
+ torch.cuda.empty_cache()
335
+
336
+ if self.rank == 0:
337
+ videos = self.vae.decode(x0)
338
+
339
+ del noise, latent
340
+ del sample_scheduler
341
+ if offload_model:
342
+ gc.collect()
343
+ torch.cuda.synchronize()
344
+ if dist.is_initialized():
345
+ dist.barrier()
346
+
347
+ return videos[0] if self.rank == 0 else None
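A hedged usage sketch for `WanI2V.generate`; here `config` stands in for an i2v EasyDict in the style of the t2v configs above (extended with the clip_* fields referenced in `__init__`), and the checkpoint path is illustrative:

from PIL import Image
from wan.image2video import WanI2V

# config: assumed EasyDict carrying the t5/vae/clip checkpoints and transformer fields
pipe = WanI2V(config=config, checkpoint_dir="checkpoints/Wan2.1-I2V-14B-480P")
video = pipe.generate(
    "a corgi surfing at sunset",
    Image.open("input.jpg"),
    max_area=480 * 832,
    frame_num=81,
    shift=3.0,              # the docstring above recommends 3.0 for 480p
    sampling_steps=40)
# rank 0 receives a (C, N, H, W) tensor; other ranks receive None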
wan/modules/__init__.py ADDED
@@ -0,0 +1,16 @@
1
+ from .attention import flash_attention
2
+ from .model import WanModel
3
+ from .t5 import T5Decoder, T5Encoder, T5EncoderModel, T5Model
4
+ from .tokenizers import HuggingfaceTokenizer
5
+ from .vae import WanVAE
6
+
7
+ __all__ = [
8
+ 'WanVAE',
9
+ 'WanModel',
10
+ 'T5Model',
11
+ 'T5Encoder',
12
+ 'T5Decoder',
13
+ 'T5EncoderModel',
14
+ 'HuggingfaceTokenizer',
15
+ 'flash_attention',
16
+ ]
wan/modules/attention.py ADDED
@@ -0,0 +1,185 @@
1
+ # Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
2
+ import torch
3
+
4
+ try:
5
+ import flash_attn_interface
6
+
7
+ def is_hopper_gpu():
8
+ if not torch.cuda.is_available():
9
+ return False
10
+ device_name = torch.cuda.get_device_name(0).lower()
11
+ return "h100" in device_name or "hopper" in device_name
12
+ FLASH_ATTN_3_AVAILABLE = is_hopper_gpu()
13
+ except ModuleNotFoundError:
14
+ FLASH_ATTN_3_AVAILABLE = False
15
+
16
+ try:
17
+ import flash_attn
18
+ FLASH_ATTN_2_AVAILABLE = True
19
+ except ModuleNotFoundError:
20
+ FLASH_ATTN_2_AVAILABLE = False
21
+
22
+ # FLASH_ATTN_3_AVAILABLE = False
23
+
24
+ import warnings
25
+
26
+ __all__ = [
27
+ 'flash_attention',
28
+ 'attention',
29
+ ]
30
+
31
+
32
+ def flash_attention(
33
+ q,
34
+ k,
35
+ v,
36
+ q_lens=None,
37
+ k_lens=None,
38
+ dropout_p=0.,
39
+ softmax_scale=None,
40
+ q_scale=None,
41
+ causal=False,
42
+ window_size=(-1, -1),
43
+ deterministic=False,
44
+ dtype=torch.bfloat16,
45
+ version=None,
46
+ ):
47
+ """
48
+ q: [B, Lq, Nq, C1].
49
+ k: [B, Lk, Nk, C1].
50
+ v: [B, Lk, Nk, C2]. Nq must be divisible by Nk.
51
+ q_lens: [B].
52
+ k_lens: [B].
53
+ dropout_p: float. Dropout probability.
54
+ softmax_scale: float. The scaling of QK^T before applying softmax.
55
+ causal: bool. Whether to apply causal attention mask.
56
+ window_size: (left, right). If not (-1, -1), apply sliding window local attention.
57
+ deterministic: bool. If True, slightly slower and uses more memory.
58
+ dtype: torch.dtype. Applied when the dtype of q/k/v is not float16/bfloat16.
59
+ """
60
+ half_dtypes = (torch.float16, torch.bfloat16)
61
+ assert dtype in half_dtypes
62
+ assert q.device.type == 'cuda' and q.size(-1) <= 256
63
+
64
+ # params
65
+ b, lq, lk, out_dtype = q.size(0), q.size(1), k.size(1), q.dtype
66
+
67
+ def half(x):
68
+ return x if x.dtype in half_dtypes else x.to(dtype)
69
+
70
+ # preprocess query
71
+ if q_lens is None:
72
+ q = half(q.flatten(0, 1))
73
+ q_lens = torch.tensor(
74
+ [lq] * b, dtype=torch.int32).to(
75
+ device=q.device, non_blocking=True)
76
+ else:
77
+ q = half(torch.cat([u[:v] for u, v in zip(q, q_lens)]))
78
+
79
+ # preprocess key, value
80
+ if k_lens is None:
81
+ k = half(k.flatten(0, 1))
82
+ v = half(v.flatten(0, 1))
83
+ k_lens = torch.tensor(
84
+ [lk] * b, dtype=torch.int32).to(
85
+ device=k.device, non_blocking=True)
86
+ else:
87
+ k = half(torch.cat([u[:v] for u, v in zip(k, k_lens)]))
88
+ v = half(torch.cat([u[:v] for u, v in zip(v, k_lens)]))
89
+
90
+ q = q.to(v.dtype)
91
+ k = k.to(v.dtype)
92
+
93
+ if q_scale is not None:
94
+ q = q * q_scale
95
+
96
+ if version is not None and version == 3 and not FLASH_ATTN_3_AVAILABLE:
97
+ warnings.warn(
98
+ 'Flash attention 3 is not available, use flash attention 2 instead.'
99
+ )
100
+
101
+ # apply attention
102
+ if (version is None or version == 3) and FLASH_ATTN_3_AVAILABLE:
103
+ # Note: dropout_p, window_size are not supported in FA3 now.
104
+ x = flash_attn_interface.flash_attn_varlen_func(
105
+ q=q,
106
+ k=k,
107
+ v=v,
108
+ cu_seqlens_q=torch.cat([q_lens.new_zeros([1]), q_lens]).cumsum(
109
+ 0, dtype=torch.int32).to(q.device, non_blocking=True),
110
+ cu_seqlens_k=torch.cat([k_lens.new_zeros([1]), k_lens]).cumsum(
111
+ 0, dtype=torch.int32).to(q.device, non_blocking=True),
112
+ max_seqlen_q=lq,
113
+ max_seqlen_k=lk,
114
+ softmax_scale=softmax_scale,
115
+ causal=causal,
116
+ deterministic=deterministic)[0].unflatten(0, (b, lq))
117
+ else:
118
+ assert FLASH_ATTN_2_AVAILABLE
119
+ x = flash_attn.flash_attn_varlen_func(
120
+ q=q,
121
+ k=k,
122
+ v=v,
123
+ cu_seqlens_q=torch.cat([q_lens.new_zeros([1]), q_lens]).cumsum(
124
+ 0, dtype=torch.int32).to(q.device, non_blocking=True),
125
+ cu_seqlens_k=torch.cat([k_lens.new_zeros([1]), k_lens]).cumsum(
126
+ 0, dtype=torch.int32).to(q.device, non_blocking=True),
127
+ max_seqlen_q=lq,
128
+ max_seqlen_k=lk,
129
+ dropout_p=dropout_p,
130
+ softmax_scale=softmax_scale,
131
+ causal=causal,
132
+ window_size=window_size,
133
+ deterministic=deterministic).unflatten(0, (b, lq))
134
+
135
+ # output
136
+ return x.type(out_dtype)
137
+
138
+
139
+ def attention(
140
+ q,
141
+ k,
142
+ v,
143
+ q_lens=None,
144
+ k_lens=None,
145
+ dropout_p=0.,
146
+ softmax_scale=None,
147
+ q_scale=None,
148
+ causal=False,
149
+ window_size=(-1, -1),
150
+ deterministic=False,
151
+ dtype=torch.bfloat16,
152
+ fa_version=None,
153
+ ):
154
+ if FLASH_ATTN_2_AVAILABLE or FLASH_ATTN_3_AVAILABLE:
155
+ return flash_attention(
156
+ q=q,
157
+ k=k,
158
+ v=v,
159
+ q_lens=q_lens,
160
+ k_lens=k_lens,
161
+ dropout_p=dropout_p,
162
+ softmax_scale=softmax_scale,
163
+ q_scale=q_scale,
164
+ causal=causal,
165
+ window_size=window_size,
166
+ deterministic=deterministic,
167
+ dtype=dtype,
168
+ version=fa_version,
169
+ )
170
+ else:
171
+ if q_lens is not None or k_lens is not None:
172
+ warnings.warn(
173
+ 'Padding mask is disabled when using scaled_dot_product_attention. It can have a significant impact on performance.'
174
+ )
175
+ attn_mask = None
176
+
177
+ q = q.transpose(1, 2).to(dtype)
178
+ k = k.transpose(1, 2).to(dtype)
179
+ v = v.transpose(1, 2).to(dtype)
180
+
181
+ out = torch.nn.functional.scaled_dot_product_attention(
182
+ q, k, v, attn_mask=attn_mask, is_causal=causal, dropout_p=dropout_p)
183
+
184
+ out = out.transpose(1, 2).contiguous()
185
+ return out
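A hedged smoke test for the `attention` wrapper above; it requires a CUDA device and dispatches to FlashAttention when installed, otherwise to `scaled_dot_product_attention`:

import torch
from wan.modules.attention import attention

q = torch.randn(1, 128, 12, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 128, 12, 128, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 128, 12, 128, device="cuda", dtype=torch.bfloat16)
out = attention(q, k, v, causal=True)
print(out.shape)            # torch.Size([1, 128, 12, 128])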
wan/modules/causal_model.py ADDED
@@ -0,0 +1,1127 @@
1
+ from wan.modules.attention import attention
2
+ from wan.modules.model import (
3
+ WanRMSNorm,
4
+ rope_apply,
5
+ WanLayerNorm,
6
+ WAN_CROSSATTENTION_CLASSES,
7
+ rope_params,
8
+ MLPProj,
9
+ sinusoidal_embedding_1d
10
+ )
11
+ from torch.nn.attention.flex_attention import create_block_mask, flex_attention
12
+ from diffusers.configuration_utils import ConfigMixin, register_to_config
13
+ # from torch.nn.attention.flex_attention import BlockMask
14
+ from diffusers.models.modeling_utils import ModelMixin
15
+ import torch.nn as nn
16
+ import torch
17
+ import math
18
+ import torch.distributed as dist
19
+
20
+ # the Wan 1.3B model has an unusual channel / head configuration and requires max-autotune to work with flex_attention
21
+ # see https://github.com/pytorch/pytorch/issues/133254
22
+ # change to default for other models
23
+ # flex_attention = torch.compile(
24
+ # flex_attention, dynamic=False, mode="max-autotune-no-cudagraphs")
25
+
26
+
27
+ def causal_rope_apply(x, grid_sizes, freqs, start_frame=0):
28
+ n, c = x.size(2), x.size(3) // 2
29
+
30
+ # split freqs
31
+ freqs = freqs.split([c - 2 * (c // 3), c // 3, c // 3], dim=1)
32
+
33
+ # loop over samples
34
+ output = []
35
+
36
+ for i, (f, h, w) in enumerate(grid_sizes.tolist()):
37
+ seq_len = f * h * w
38
+
39
+ # precompute multipliers
40
+ x_i = torch.view_as_complex(x[i, :seq_len].to(torch.float64).reshape(
41
+ seq_len, n, -1, 2))
42
+ freqs_i = torch.cat([
43
+ freqs[0][start_frame:start_frame + f].view(f, 1, 1, -1).expand(f, h, w, -1),
44
+ freqs[1][:h].view(1, h, 1, -1).expand(f, h, w, -1),
45
+ freqs[2][:w].view(1, 1, w, -1).expand(f, h, w, -1)
46
+ ],
47
+ dim=-1).reshape(seq_len, 1, -1)
48
+
49
+ # apply rotary embedding
50
+ x_i = torch.view_as_real(x_i * freqs_i).flatten(2)
51
+ x_i = torch.cat([x_i, x[i, seq_len:]])
52
+
53
+ # append to collection
54
+ output.append(x_i)
55
+ return torch.stack(output).type_as(x)
56
+
57
+
58
+ class CausalWanSelfAttention(nn.Module):
59
+
60
+ def __init__(self,
61
+ dim,
62
+ num_heads,
63
+ local_attn_size=-1,
64
+ sink_size=1,
65
+ qk_norm=True,
66
+ eps=1e-6):
67
+ assert dim % num_heads == 0
68
+ super().__init__()
69
+ self.dim = dim
70
+ self.num_heads = num_heads
71
+ self.head_dim = dim // num_heads
72
+ self.local_attn_size = local_attn_size
73
+ self.qk_norm = qk_norm
74
+ self.eps = eps
75
+ self.frame_length = 1560
76
+ self.max_attention_size = 21 * self.frame_length
77
+ self.block_length = 3 * self.frame_length
78
+
79
+ # layers
80
+ self.q = nn.Linear(dim, dim)
81
+ self.k = nn.Linear(dim, dim)
82
+ self.v = nn.Linear(dim, dim)
83
+ self.o = nn.Linear(dim, dim)
84
+ self.norm_q = WanRMSNorm(dim, eps=eps) if qk_norm else nn.Identity()
85
+ self.norm_k = WanRMSNorm(dim, eps=eps) if qk_norm else nn.Identity()
86
+
87
+ def forward(
88
+ self,
89
+ x,
90
+ seq_lens,
91
+ grid_sizes,
92
+ freqs,
93
+ block_mask,
94
+ kv_cache=None,
95
+ current_start=0,
96
+ cache_start=None,
97
+ updating_cache=False
98
+ ):
99
+ r"""
100
+ Args:
101
+ x(Tensor): Shape [B, L, num_heads, C / num_heads]
102
+ seq_lens(Tensor): Shape [B]
103
+ grid_sizes(Tensor): Shape [B, 3], the second dimension contains (F, H, W)
104
+ freqs(Tensor): Rope freqs, shape [1024, C / num_heads / 2]
105
+ block_mask (BlockMask)
106
+ """
107
+ b, s, n, d = *x.shape[:2], self.num_heads, self.head_dim
108
+ if cache_start is None:
109
+ cache_start = current_start
110
+
111
+ # query, key, value function
112
+ def qkv_fn(x):
113
+ q = self.norm_q(self.q(x)).view(b, s, n, d) # [B, L, 12, 128]
114
+ k = self.norm_k(self.k(x)).view(b, s, n, d) # [B, L, 12, 128]
115
+ v = self.v(x).view(b, s, n, d) # [B, L, 12, 128]
116
+ return q, k, v
117
+
118
+ q, k, v = qkv_fn(x)
119
+
120
+ if kv_cache is None:
121
+ # check whether this is teacher-forcing training (sequence holds clean + noisy halves)
122
+ is_tf = (s == seq_lens[0].item() * 2)
123
+ if is_tf:
124
+ q_chunk = torch.chunk(q, 2, dim=1)
125
+ k_chunk = torch.chunk(k, 2, dim=1)
126
+ roped_query = []
127
+ roped_key = []
128
+ # rope should be same for clean and noisy parts
129
+ for ii in range(2):
130
+ rq = rope_apply(q_chunk[ii], grid_sizes, freqs).type_as(v)
131
+ rk = rope_apply(k_chunk[ii], grid_sizes, freqs).type_as(v)
132
+ roped_query.append(rq)
133
+ roped_key.append(rk)
134
+
135
+ roped_query = torch.cat(roped_query, dim=1)
136
+ roped_key = torch.cat(roped_key, dim=1)
137
+
138
+ padded_length = math.ceil(q.shape[1] / 128) * 128 - q.shape[1]
139
+ padded_roped_query = torch.cat(
140
+ [roped_query,
141
+ torch.zeros([q.shape[0], padded_length, q.shape[2], q.shape[3]],
142
+ device=q.device, dtype=v.dtype)],
143
+ dim=1
144
+ )
145
+
146
+ padded_roped_key = torch.cat(
147
+ [roped_key, torch.zeros([k.shape[0], padded_length, k.shape[2], k.shape[3]],
148
+ device=k.device, dtype=v.dtype)],
149
+ dim=1
150
+ )
151
+
152
+ padded_v = torch.cat(
153
+ [v, torch.zeros([v.shape[0], padded_length, v.shape[2], v.shape[3]],
154
+ device=v.device, dtype=v.dtype)],
155
+ dim=1
156
+ )
157
+
158
+ x = flex_attention(
159
+ query=padded_roped_query.transpose(2, 1),
160
+ key=padded_roped_key.transpose(2, 1),
161
+ value=padded_v.transpose(2, 1),
162
+ block_mask=block_mask
163
+ )[:, :, :-padded_length].transpose(2, 1)
164
+
165
+ else:
166
+ roped_query = rope_apply(q, grid_sizes, freqs).type_as(v)
167
+ roped_key = rope_apply(k, grid_sizes, freqs).type_as(v)
168
+
169
+ padded_length = math.ceil(q.shape[1] / 128) * 128 - q.shape[1]
170
+ padded_roped_query = torch.cat(
171
+ [roped_query,
172
+ torch.zeros([q.shape[0], padded_length, q.shape[2], q.shape[3]],
173
+ device=q.device, dtype=v.dtype)],
174
+ dim=1
175
+ )
176
+
177
+ padded_roped_key = torch.cat(
178
+ [roped_key, torch.zeros([k.shape[0], padded_length, k.shape[2], k.shape[3]],
179
+ device=k.device, dtype=v.dtype)],
180
+ dim=1
181
+ )
182
+
183
+ padded_v = torch.cat(
184
+ [v, torch.zeros([v.shape[0], padded_length, v.shape[2], v.shape[3]],
185
+ device=v.device, dtype=v.dtype)],
186
+ dim=1
187
+ )
188
+
189
+ x = flex_attention(
190
+ query=padded_roped_query.transpose(2, 1),
191
+ key=padded_roped_key.transpose(2, 1),
192
+ value=padded_v.transpose(2, 1),
193
+ block_mask=block_mask
194
+ )[:, :, :-padded_length].transpose(2, 1)
195
+ else:
196
+ frame_seqlen = math.prod(grid_sizes[0][1:]).item()
197
+ current_start_frame = current_start // frame_seqlen
198
+ roped_query = causal_rope_apply(
199
+ q, grid_sizes, freqs, start_frame=current_start_frame).type_as(v) # [B, L, 12, 128]
200
+ roped_key = causal_rope_apply(
201
+ k, grid_sizes, freqs, start_frame=current_start_frame).type_as(v) # [B, L, 12, 128]
202
+
203
+ grid_sizes_one_block = grid_sizes.clone()
204
+ grid_sizes_one_block[:,0] = 3
205
+
206
+ # only caching the first block
207
+ cache_end = cache_start + self.block_length
208
+ num_new_tokens = cache_end - kv_cache["global_end_index"].item()
209
+ kv_cache_size = kv_cache["k"].shape[1]
210
+
211
+ sink_tokens = 1 * self.block_length # we keep the first block in the cache
212
+
213
+ if (num_new_tokens > 0) and (
214
+ num_new_tokens + kv_cache["local_end_index"].item() > kv_cache_size):
215
+ num_evicted_tokens = num_new_tokens + kv_cache["local_end_index"].item() - kv_cache_size
216
+ num_rolled_tokens = kv_cache["local_end_index"].item() - num_evicted_tokens - sink_tokens
217
+ kv_cache["k"][:, sink_tokens:sink_tokens + num_rolled_tokens] = \
218
+ kv_cache["k"][:, sink_tokens + num_evicted_tokens:sink_tokens + num_evicted_tokens + num_rolled_tokens].clone()
219
+ kv_cache["v"][:, sink_tokens:sink_tokens + num_rolled_tokens] = \
220
+ kv_cache["v"][:, sink_tokens + num_evicted_tokens:sink_tokens + num_evicted_tokens + num_rolled_tokens].clone()
221
+
222
+ local_end_index = kv_cache["local_end_index"].item() + cache_end - \
223
+ kv_cache["global_end_index"].item() - num_evicted_tokens
224
+ local_start_index = local_end_index - self.block_length
225
+ kv_cache["k"][:, local_start_index:local_end_index] = roped_key[:, :self.block_length]
226
+ kv_cache["v"][:, local_start_index:local_end_index] = v[:, :self.block_length]
227
+ else:
228
+ local_end_index = kv_cache["local_end_index"].item() + cache_end - kv_cache["global_end_index"].item()
229
+ local_start_index = local_end_index - self.block_length
230
+ if local_start_index == 0: # first block is not roped in the cache
231
+ kv_cache["k"][:, local_start_index:local_end_index] = k[:, :self.block_length]
232
+ else:
233
+ kv_cache["k"][:, local_start_index:local_end_index] = roped_key[:, :self.block_length]
234
+
235
+ kv_cache["v"][:, local_start_index:local_end_index] = v[:, :self.block_length]
236
+
237
+ if num_new_tokens > 0: # prevent updating when caching clean frame
238
+ kv_cache["global_end_index"].fill_(cache_end)
239
+ kv_cache["local_end_index"].fill_(local_end_index)
240
+
241
+ if local_start_index == 0:
242
+ # no kv attn with cache
243
+ x = attention(
244
+ roped_query,
245
+ roped_key,
246
+ v)
247
+ else:
248
+ if updating_cache: # updating working cache with clean frame
249
+ extract_cache_end = local_end_index
250
+ extract_cache_start = max(0, local_end_index-self.max_attention_size)
251
+ working_cache_key = kv_cache["k"][:, extract_cache_start:extract_cache_end].clone()
252
+ working_cache_v = kv_cache["v"][:, extract_cache_start:extract_cache_end]
253
+
254
+ if extract_cache_start == 0: # rope the global first block in working cache
255
+ working_cache_key[:,:self.block_length] = causal_rope_apply(
256
+ working_cache_key[:,:self.block_length], grid_sizes_one_block, freqs, start_frame=0).type_as(v)
257
+
258
+ x = attention(
259
+ roped_query,
260
+ working_cache_key,
261
+ working_cache_v
262
+ )
263
+
264
+ else:
265
+ # 1. extract working cache
266
+ # calculate the length of working cache
267
+ query_length = roped_query.shape[1]
268
+ working_cache_max_length = self.max_attention_size - query_length - self.block_length
269
+
270
+ extract_cache_end = local_start_index
271
+ extract_cache_start = max(self.block_length, local_start_index - working_cache_max_length) # working cache does not include the first anchor block
272
+ working_cache_key = kv_cache["k"][:, extract_cache_start:extract_cache_end]
273
+ working_cache_v = kv_cache["v"][:, extract_cache_start:extract_cache_end]
274
+
275
+ # 2. extract anchor cache, roped as the past frame
276
+ working_cache_frame_length = working_cache_key.shape[1] // self.frame_length
277
+ rope_start_frame = current_start_frame - working_cache_frame_length - 3
278
+
279
+ anchor_cache_key = causal_rope_apply(
280
+ kv_cache["k"][:, :self.block_length], grid_sizes_one_block, freqs, start_frame=rope_start_frame).type_as(v)
281
+ anchor_cache_v = kv_cache["v"][:, :self.block_length]
282
+
283
+ # 3. attention with working cache and anchor cache
284
+ input_key = torch.cat([
285
+ anchor_cache_key,
286
+ working_cache_key,
287
+ roped_key
288
+ ], dim=1)
289
+
290
+ input_v = torch.cat([
291
+ anchor_cache_v,
292
+ working_cache_v,
293
+ v
294
+ ], dim=1)
295
+
296
+ x = attention(
297
+ roped_query,
298
+ input_key,
299
+ input_v
300
+ )
301
+
302
+
303
+ # output
304
+ x = x.flatten(2)
305
+ x = self.o(x)
306
+ return x
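A worked example of the cache-rolling arithmetic above, under the assumption that the KV cache is allocated to exactly the 21-frame attention window:

frame_length = 1560
block_length = 3 * frame_length            # 4680 tokens per 3-frame block
kv_cache_size = 21 * frame_length          # assumption: cache == attention window
sink_tokens = block_length                 # the first (anchor) block is never evicted

# a full cache receives one new 3-frame block:
local_end_index = kv_cache_size
num_new_tokens = block_length
num_evicted = num_new_tokens + local_end_index - kv_cache_size    # 4680 evicted
num_rolled = local_end_index - num_evicted - sink_tokens          # 23400 shifted left
assert (num_evicted, num_rolled) == (4680, 23400)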
307
+
308
+
309
+ class CausalWanAttentionBlock(nn.Module):
310
+
311
+ def __init__(self,
312
+ cross_attn_type,
313
+ dim,
314
+ ffn_dim,
315
+ num_heads,
316
+ local_attn_size=-1,
317
+ sink_size=0,
318
+ qk_norm=True,
319
+ cross_attn_norm=False,
320
+ eps=1e-6):
321
+ super().__init__()
322
+ self.dim = dim
323
+ self.ffn_dim = ffn_dim
324
+ self.num_heads = num_heads
325
+ self.local_attn_size = local_attn_size
326
+ self.qk_norm = qk_norm
327
+ self.cross_attn_norm = cross_attn_norm
328
+ self.eps = eps
329
+
330
+ # layers
331
+ self.norm1 = WanLayerNorm(dim, eps)
332
+ self.self_attn = CausalWanSelfAttention(dim, num_heads, local_attn_size, sink_size, qk_norm, eps)
333
+ self.norm3 = WanLayerNorm(
334
+ dim, eps,
335
+ elementwise_affine=True) if cross_attn_norm else nn.Identity()
336
+ self.cross_attn = WAN_CROSSATTENTION_CLASSES[cross_attn_type](dim,
337
+ num_heads,
338
+ (-1, -1),
339
+ qk_norm,
340
+ eps)
341
+ self.norm2 = WanLayerNorm(dim, eps)
342
+ self.ffn = nn.Sequential(
343
+ nn.Linear(dim, ffn_dim), nn.GELU(approximate='tanh'),
344
+ nn.Linear(ffn_dim, dim))
345
+
346
+ # modulation
347
+ self.modulation = nn.Parameter(torch.randn(1, 6, dim) / dim**0.5)
348
+
349
+ def forward(
350
+ self,
351
+ x,
352
+ e,
353
+ seq_lens,
354
+ grid_sizes,
355
+ freqs,
356
+ context,
357
+ context_lens,
358
+ block_mask,
359
+ updating_cache=False,
360
+ kv_cache=None,
361
+ crossattn_cache=None,
362
+ current_start=0,
363
+ cache_start=None
364
+ ):
365
+ r"""
366
+ Args:
367
+ x(Tensor): Shape [B, L, C]
368
+ e(Tensor): Shape [B, F, 6, C]
369
+ seq_lens(Tensor): Shape [B], length of each sequence in batch
370
+ grid_sizes(Tensor): Shape [B, 3], the second dimension contains (F, H, W)
371
+ freqs(Tensor): Rope freqs, shape [1024, C / num_heads / 2]
372
+ """
373
+ num_frames, frame_seqlen = e.shape[1], x.shape[1] // e.shape[1]
374
+ # assert e.dtype == torch.float32
375
+ # with amp.autocast(dtype=torch.float32):
376
+ e = (self.modulation.unsqueeze(1) + e).chunk(6, dim=2)
377
+ # assert e[0].dtype == torch.float32
378
+
379
+ # self-attention
380
+ y = self.self_attn(
381
+ (self.norm1(x).unflatten(dim=1, sizes=(num_frames, frame_seqlen)) * (1 + e[1]) + e[0]).flatten(1, 2),
382
+ seq_lens, grid_sizes,
383
+ freqs, block_mask, kv_cache, current_start, cache_start, updating_cache=updating_cache)
384
+
385
+ # with amp.autocast(dtype=torch.float32):
386
+ x = x + (y.unflatten(dim=1, sizes=(num_frames, frame_seqlen)) * e[2]).flatten(1, 2)
387
+
388
+ # cross-attention & ffn function
389
+ def cross_attn_ffn(x, context, context_lens, e, crossattn_cache=None):
390
+ x = x + self.cross_attn(self.norm3(x), context,
391
+ context_lens, crossattn_cache=crossattn_cache)
392
+ y = self.ffn(
393
+ (self.norm2(x).unflatten(dim=1, sizes=(num_frames,
394
+ frame_seqlen)) * (1 + e[4]) + e[3]).flatten(1, 2)
395
+ )
396
+ # with amp.autocast(dtype=torch.float32):
397
+ x = x + (y.unflatten(dim=1, sizes=(num_frames,
398
+ frame_seqlen)) * e[5]).flatten(1, 2)
399
+ return x
400
+
401
+ x = cross_attn_ffn(x, context, context_lens, e, crossattn_cache)
402
+ return x
403
+
404
+
405
+ class CausalHead(nn.Module):
406
+
407
+ def __init__(self, dim, out_dim, patch_size, eps=1e-6):
408
+ super().__init__()
409
+ self.dim = dim
410
+ self.out_dim = out_dim
411
+ self.patch_size = patch_size
412
+ self.eps = eps
413
+
414
+ # layers
415
+ out_dim = math.prod(patch_size) * out_dim
416
+ self.norm = WanLayerNorm(dim, eps)
417
+ self.head = nn.Linear(dim, out_dim)
418
+
419
+ # modulation
420
+ self.modulation = nn.Parameter(torch.randn(1, 2, dim) / dim**0.5)
421
+
422
+ def forward(self, x, e):
423
+ r"""
424
+ Args:
425
+ x(Tensor): Shape [B, L1, C]
426
+ e(Tensor): Shape [B, F, 1, C]
427
+ """
428
+ # assert e.dtype == torch.float32
429
+ # with amp.autocast(dtype=torch.float32):
430
+ num_frames, frame_seqlen = e.shape[1], x.shape[1] // e.shape[1]
431
+ e = (self.modulation.unsqueeze(1) + e).chunk(2, dim=2)
432
+ x = (self.head(self.norm(x).unflatten(dim=1, sizes=(num_frames, frame_seqlen)) * (1 + e[1]) + e[0]))
433
+ return x
434
+
435
+
436
+ class CausalWanModel(ModelMixin, ConfigMixin):
437
+ r"""
438
+ Wan diffusion backbone supporting both text-to-video and image-to-video.
439
+ """
440
+
441
+ ignore_for_config = [
442
+ 'patch_size', 'cross_attn_norm', 'qk_norm', 'text_dim'
443
+ ]
444
+ _no_split_modules = ['WanAttentionBlock']
445
+ _supports_gradient_checkpointing = True
446
+
447
+ @register_to_config
448
+ def __init__(self,
449
+ model_type='t2v',
450
+ patch_size=(1, 2, 2),
451
+ text_len=512,
452
+ in_dim=16,
453
+ dim=2048,
454
+ ffn_dim=8192,
455
+ freq_dim=256,
456
+ text_dim=4096,
457
+ out_dim=16,
458
+ num_heads=16,
459
+ num_layers=32,
460
+ local_attn_size=-1,
461
+ sink_size=0,
462
+ qk_norm=True,
463
+ cross_attn_norm=True,
464
+ eps=1e-6):
465
+ r"""
466
+ Initialize the diffusion model backbone.
467
+
468
+ Args:
469
+ model_type (`str`, *optional*, defaults to 't2v'):
470
+ Model variant - 't2v' (text-to-video) or 'i2v' (image-to-video)
471
+ patch_size (`tuple`, *optional*, defaults to (1, 2, 2)):
472
+ 3D patch dimensions for video embedding (t_patch, h_patch, w_patch)
473
+ text_len (`int`, *optional*, defaults to 512):
474
+ Fixed length for text embeddings
475
+ in_dim (`int`, *optional*, defaults to 16):
476
+ Input video channels (C_in)
477
+ dim (`int`, *optional*, defaults to 2048):
478
+ Hidden dimension of the transformer
479
+ ffn_dim (`int`, *optional*, defaults to 8192):
480
+ Intermediate dimension in feed-forward network
481
+ freq_dim (`int`, *optional*, defaults to 256):
482
+ Dimension for sinusoidal time embeddings
483
+ text_dim (`int`, *optional*, defaults to 4096):
484
+ Input dimension for text embeddings
485
+ out_dim (`int`, *optional*, defaults to 16):
486
+ Output video channels (C_out)
487
+ num_heads (`int`, *optional*, defaults to 16):
488
+ Number of attention heads
489
+ num_layers (`int`, *optional*, defaults to 32):
490
+ Number of transformer blocks
491
+ local_attn_size (`int`, *optional*, defaults to -1):
492
+ Window size for temporal local attention (-1 indicates global attention)
493
+ sink_size (`int`, *optional*, defaults to 0):
494
+ Size of the attention sink, we keep the first `sink_size` frames unchanged when rolling the KV cache
495
+ qk_norm (`bool`, *optional*, defaults to True):
496
+ Enable query/key normalization
497
+ cross_attn_norm (`bool`, *optional*, defaults to True):
498
+ Enable cross-attention normalization
499
+ eps (`float`, *optional*, defaults to 1e-6):
500
+ Epsilon value for normalization layers
501
+ """
502
+
503
+ super().__init__()
504
+
505
+ assert model_type in ['t2v', 'i2v']
506
+ self.model_type = model_type
507
+
508
+ self.patch_size = patch_size
509
+ self.text_len = text_len
510
+ self.in_dim = in_dim
511
+ self.dim = dim
512
+ self.ffn_dim = ffn_dim
513
+ self.freq_dim = freq_dim
514
+ self.text_dim = text_dim
515
+ self.out_dim = out_dim
516
+ self.num_heads = num_heads
517
+ self.num_layers = num_layers
518
+ self.local_attn_size = local_attn_size
519
+ self.qk_norm = qk_norm
520
+ self.cross_attn_norm = cross_attn_norm
521
+ self.eps = eps
522
+
523
+ # embeddings
524
+ self.patch_embedding = nn.Conv3d(
525
+ in_dim, dim, kernel_size=patch_size, stride=patch_size)
526
+ self.text_embedding = nn.Sequential(
527
+ nn.Linear(text_dim, dim), nn.GELU(approximate='tanh'),
528
+ nn.Linear(dim, dim))
529
+
530
+ self.time_embedding = nn.Sequential(
531
+ nn.Linear(freq_dim, dim), nn.SiLU(), nn.Linear(dim, dim))
532
+ self.time_projection = nn.Sequential(
533
+ nn.SiLU(), nn.Linear(dim, dim * 6))
534
+
535
+ # blocks
536
+ cross_attn_type = 't2v_cross_attn' if model_type == 't2v' else 'i2v_cross_attn'
537
+ self.blocks = nn.ModuleList([
538
+ CausalWanAttentionBlock(cross_attn_type, dim, ffn_dim, num_heads,
539
+ local_attn_size, sink_size, qk_norm, cross_attn_norm, eps)
540
+ for _ in range(num_layers)
541
+ ])
542
+
543
+ # head
544
+ self.head = CausalHead(dim, out_dim, patch_size, eps)
545
+
546
+ # buffers (don't use register_buffer otherwise dtype will be changed in to())
547
+ assert (dim % num_heads) == 0 and (dim // num_heads) % 2 == 0
548
+ d = dim // num_heads
549
+ self.freqs = torch.cat([
550
+ rope_params(1024, d - 4 * (d // 6)),
551
+ rope_params(1024, 2 * (d // 6)),
552
+ rope_params(1024, 2 * (d // 6))
553
+ ],
554
+ dim=1)
555
+
556
+ if model_type == 'i2v':
557
+ self.img_emb = MLPProj(1280, dim)
558
+
559
+ # initialize weights
560
+ self.init_weights()
561
+
562
+ self.gradient_checkpointing = False
563
+
564
+ self.block_mask = None
565
+
566
+ self.num_frame_per_block = 1
567
+ self.independent_first_frame = False
568
+
569
+ def _set_gradient_checkpointing(self, module, value=False):
570
+ self.gradient_checkpointing = value
571
+
572
+ @staticmethod
573
+ def _prepare_blockwise_causal_attn_mask(
574
+ device: torch.device | str, num_frames: int = 21,
575
+ frame_seqlen: int = 1560, num_frame_per_block=1, local_attn_size=-1
576
+ ):
577
+ """
578
+ we will divide the token sequence into the following format
579
+ [1 latent frame] [1 latent frame] ... [1 latent frame]
580
+ We use flexattention to construct the attention mask
581
+ """
582
+ total_length = num_frames * frame_seqlen
583
+
584
+ # we do right padding to get to a multiple of 128
585
+ padded_length = math.ceil(total_length / 128) * 128 - total_length
586
+
587
+ ends = torch.zeros(total_length + padded_length,
588
+ device=device, dtype=torch.long)
589
+
590
+ # Block-wise causal mask will attend to all elements that are before the end of the current chunk
591
+ frame_indices = torch.arange(
592
+ start=0,
593
+ end=total_length,
594
+ step=frame_seqlen * num_frame_per_block,
595
+ device=device
596
+ )
597
+
598
+ for tmp in frame_indices:
599
+ ends[tmp:tmp + frame_seqlen * num_frame_per_block] = tmp + \
600
+ frame_seqlen * num_frame_per_block
601
+
602
+ def attention_mask(b, h, q_idx, kv_idx):
603
+ if local_attn_size == -1:
604
+ return (kv_idx < ends[q_idx]) | (q_idx == kv_idx)
605
+ else:
606
+ return ((kv_idx < ends[q_idx]) & (kv_idx >= (ends[q_idx] - local_attn_size * frame_seqlen))) | (q_idx == kv_idx)
607
+ # return ((kv_idx < total_length) & (q_idx < total_length)) | (q_idx == kv_idx) # bidirectional mask
608
+
609
+ block_mask = create_block_mask(attention_mask, B=None, H=None, Q_LEN=total_length + padded_length,
610
+ KV_LEN=total_length + padded_length, _compile=False, device=device)
611
+
612
+ import torch.distributed as dist
613
+ if not dist.is_initialized() or dist.get_rank() == 0:
614
+ print(
615
+ f" cache a block wise causal mask with block size of {num_frame_per_block} frames")
616
+ print(block_mask)
617
+
618
+ # import imageio
619
+ # import numpy as np
620
+ # from torch.nn.attention.flex_attention import create_mask
621
+
622
+ # mask = create_mask(attention_mask, B=None, H=None, Q_LEN=total_length +
623
+ # padded_length, KV_LEN=total_length + padded_length, device=device)
624
+ # import cv2
625
+ # mask = cv2.resize(mask[0, 0].cpu().float().numpy(), (1024, 1024))
626
+ # imageio.imwrite("mask_%d.jpg" % (0), np.uint8(255. * mask))
627
+
628
+ return block_mask
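A toy illustration of the block-wise causal predicate above, shrunk to frame_seqlen=4 and num_frame_per_block=2 so the pattern is visible:

import torch

frame_seqlen, num_frame_per_block, num_frames = 4, 2, 4
total = num_frames * frame_seqlen
ends = torch.zeros(total, dtype=torch.long)
for s in range(0, total, frame_seqlen * num_frame_per_block):
    ends[s:s + frame_seqlen * num_frame_per_block] = s + frame_seqlen * num_frame_per_block
# with global attention, token q may attend to kv iff kv_idx < ends[q_idx]
print(ends.tolist())   # [8]*8 + [16]*8: two causal blocks of two frames each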
629
+
630
+ @staticmethod
631
+ def _prepare_teacher_forcing_mask(
632
+ device: torch.device | str, num_frames: int = 21,
633
+ frame_seqlen: int = 1560, num_frame_per_block=1
634
+ ):
635
+ """
636
+ we will divide the token sequence into the following format
637
+ [1 latent frame] [1 latent frame] ... [1 latent frame]
638
+ We use flexattention to construct the attention mask
639
+ """
640
+ # debug
641
+ DEBUG = False
642
+ if DEBUG:
643
+ num_frames = 9
644
+ frame_seqlen = 256
645
+
646
+ total_length = num_frames * frame_seqlen * 2
647
+
648
+ # we do right padding to get to a multiple of 128
649
+ padded_length = math.ceil(total_length / 128) * 128 - total_length
650
+
651
+ clean_ends = num_frames * frame_seqlen
652
+ # for clean context frames, we can construct their flex attention mask based on a [start, end] interval
653
+ context_ends = torch.zeros(total_length + padded_length, device=device, dtype=torch.long)
654
+ # for noisy frames, we need two intervals to construct the flex attention mask [context_start, context_end] [noisy_start, noisy_end]
655
+ noise_context_starts = torch.zeros(total_length + padded_length, device=device, dtype=torch.long)
656
+ noise_context_ends = torch.zeros(total_length + padded_length, device=device, dtype=torch.long)
657
+ noise_noise_starts = torch.zeros(total_length + padded_length, device=device, dtype=torch.long)
658
+ noise_noise_ends = torch.zeros(total_length + padded_length, device=device, dtype=torch.long)
659
+
660
+ # Block-wise causal mask will attend to all elements that are before the end of the current chunk
661
+ attention_block_size = frame_seqlen * num_frame_per_block
662
+ frame_indices = torch.arange(
663
+ start=0,
664
+ end=num_frames * frame_seqlen,
665
+ step=attention_block_size,
666
+ device=device, dtype=torch.long
667
+ )
668
+
669
+ # attention for clean context frames
670
+ for start in frame_indices:
671
+ context_ends[start:start + attention_block_size] = start + attention_block_size
672
+
673
+ noisy_image_start_list = torch.arange(
674
+ num_frames * frame_seqlen, total_length,
675
+ step=attention_block_size,
676
+ device=device, dtype=torch.long
677
+ )
678
+ noisy_image_end_list = noisy_image_start_list + attention_block_size
679
+
680
+ # attention for noisy frames
681
+ for block_index, (start, end) in enumerate(zip(noisy_image_start_list, noisy_image_end_list)):
682
+ # attend to noisy tokens within the same block
683
+ noise_noise_starts[start:end] = start
684
+ noise_noise_ends[start:end] = end
685
+ # attend to context tokens in previous blocks
686
+ # noise_context_starts[start:end] = 0
687
+ noise_context_ends[start:end] = block_index * attention_block_size
688
+
689
+ def attention_mask(b, h, q_idx, kv_idx):
690
+ # first design the mask for clean frames
691
+ clean_mask = (q_idx < clean_ends) & (kv_idx < context_ends[q_idx])
692
+ # then design the mask for noisy frames
693
+ # noisy frames attend to all preceding clean frames + themselves
694
+ C1 = (kv_idx < noise_noise_ends[q_idx]) & (kv_idx >= noise_noise_starts[q_idx])
695
+ C2 = (kv_idx < noise_context_ends[q_idx]) & (kv_idx >= noise_context_starts[q_idx])
696
+ noise_mask = (q_idx >= clean_ends) & (C1 | C2)
697
+
698
+ eye_mask = q_idx == kv_idx
699
+ return eye_mask | clean_mask | noise_mask
700
+
701
+ block_mask = create_block_mask(attention_mask, B=None, H=None, Q_LEN=total_length + padded_length,
702
+ KV_LEN=total_length + padded_length, _compile=False, device=device)
703
+
704
+ if DEBUG:
705
+ print(block_mask)
706
+ import imageio
707
+ import numpy as np
708
+ from torch.nn.attention.flex_attention import create_mask
709
+
710
+ mask = create_mask(attention_mask, B=None, H=None, Q_LEN=total_length +
711
+ padded_length, KV_LEN=total_length + padded_length, device=device)
712
+ import cv2
713
+ mask = cv2.resize(mask[0, 0].cpu().float().numpy(), (1024, 1024))
714
+ imageio.imwrite("mask_%d.jpg" % (0), np.uint8(255. * mask))
715
+
716
+ return block_mask
717
+
718
+ @staticmethod
719
+ def _prepare_blockwise_causal_attn_mask_i2v(
720
+ device: torch.device | str, num_frames: int = 21,
721
+ frame_seqlen: int = 1560, num_frame_per_block=4, local_attn_size=-1
722
+ ):
723
+ """
724
+ we will divide the token sequence into the following format
725
+ [1 latent frame] [N latent frame] ... [N latent frame]
726
+ The first frame is separated out to support I2V generation
727
+ We use flexattention to construct the attention mask
728
+ """
729
+ total_length = num_frames * frame_seqlen
730
+
731
+ # we do right padding to get to a multiple of 128
732
+ padded_length = math.ceil(total_length / 128) * 128 - total_length
733
+
734
+ ends = torch.zeros(total_length + padded_length,
735
+ device=device, dtype=torch.long)
736
+
737
+ # special handling for the first frame
738
+ ends[:frame_seqlen] = frame_seqlen
739
+
740
+ # Block-wise causal mask will attend to all elements that are before the end of the current chunk
741
+ frame_indices = torch.arange(
742
+ start=frame_seqlen,
743
+ end=total_length,
744
+ step=frame_seqlen * num_frame_per_block,
745
+ device=device
746
+ )
747
+
748
+ for idx, tmp in enumerate(frame_indices):
749
+ ends[tmp:tmp + frame_seqlen * num_frame_per_block] = tmp + \
750
+ frame_seqlen * num_frame_per_block
751
+
752
+ def attention_mask(b, h, q_idx, kv_idx):
753
+ if local_attn_size == -1:
754
+ return (kv_idx < ends[q_idx]) | (q_idx == kv_idx)
755
+ else:
756
+ return ((kv_idx < ends[q_idx]) & (kv_idx >= (ends[q_idx] - local_attn_size * frame_seqlen))) | \
757
+ (q_idx == kv_idx)
758
+
759
+ block_mask = create_block_mask(attention_mask, B=None, H=None, Q_LEN=total_length + padded_length,
760
+ KV_LEN=total_length + padded_length, _compile=False, device=device)
761
+
762
+ if not dist.is_initialized() or dist.get_rank() == 0:
763
+ print(
764
+ f" cache a block wise causal mask with block size of {num_frame_per_block} frames")
765
+ print(block_mask)
766
+
767
+ # import imageio
768
+ # import numpy as np
769
+ # from torch.nn.attention.flex_attention import create_mask
770
+
771
+ # mask = create_mask(attention_mask, B=None, H=None, Q_LEN=total_length +
772
+ # padded_length, KV_LEN=total_length + padded_length, device=device)
773
+ # import cv2
774
+ # mask = cv2.resize(mask[0, 0].cpu().float().numpy(), (1024, 1024))
775
+ # imageio.imwrite("mask_%d.jpg" % (0), np.uint8(255. * mask))
776
+
777
+ return block_mask
778
+
779
+ def _forward_inference(
780
+ self,
781
+ x,
782
+ t,
783
+ context,
784
+ seq_len,
785
+ updating_cache=False,
786
+ clip_fea=None,
787
+ y=None,
788
+ kv_cache: dict = None,
789
+ crossattn_cache: dict = None,
790
+ current_start: int = 0,
791
+ cache_start: int = 0,
792
+ ):
793
+ r"""
794
+ Run the diffusion model with kv caching.
795
+ See Algorithm 2 of CausVid paper https://arxiv.org/abs/2412.07772 for details.
796
+ This function is run num_frame times.
797
+ It processes the latent frames one by one (1560 tokens each).
798
+
799
+ Args:
800
+ x (List[Tensor]):
801
+ List of input video tensors, each with shape [C_in, F, H, W]
802
+ t (Tensor):
803
+ Diffusion timesteps tensor of shape [B]
804
+ context (List[Tensor]):
805
+ List of text embeddings each with shape [L, C]
806
+ seq_len (`int`):
807
+ Maximum sequence length for positional encoding
808
+ clip_fea (Tensor, *optional*):
809
+ CLIP image features for image-to-video mode
810
+ y (List[Tensor], *optional*):
811
+ Conditional video inputs for image-to-video mode, same shape as x
812
+
813
+ Returns:
814
+ List[Tensor]:
815
+ List of denoised video tensors with original input shapes [C_out, F, H / 8, W / 8]
816
+ """
817
+
+        if self.model_type == 'i2v':
+            assert clip_fea is not None and y is not None
+        # params
+        device = self.patch_embedding.weight.device
+        if self.freqs.device != device:
+            self.freqs = self.freqs.to(device)
+
+        if y is not None:
+            x = [torch.cat([u, v], dim=0) for u, v in zip(x, y)]
+
+        # embeddings
+        x = [self.patch_embedding(u.unsqueeze(0)) for u in x]
+        grid_sizes = torch.stack(
+            [torch.tensor(u.shape[2:], dtype=torch.long) for u in x])
+        x = [u.flatten(2).transpose(1, 2) for u in x]
+        seq_lens = torch.tensor([u.size(1) for u in x], dtype=torch.long)
+        assert seq_lens.max() <= seq_len
+        x = torch.cat(x)
+        """
+        torch.cat([
+            torch.cat([u, u.new_zeros(1, seq_len - u.size(1), u.size(2))],
+                      dim=1) for u in x
+        ])
+        """
+
+        # time embeddings
+        # with amp.autocast(dtype=torch.float32):
+        e = self.time_embedding(
+            sinusoidal_embedding_1d(self.freq_dim, t.flatten()).type_as(x))
+        e0 = self.time_projection(e).unflatten(
+            1, (6, self.dim)).unflatten(dim=0, sizes=t.shape)
+        # assert e.dtype == torch.float32 and e0.dtype == torch.float32
+
+        # context
+        context_lens = None
+        context = self.text_embedding(
+            torch.stack([
+                torch.cat(
+                    [u, u.new_zeros(self.text_len - u.size(0), u.size(1))])
+                for u in context
+            ]))
+
+        if clip_fea is not None:
+            context_clip = self.img_emb(clip_fea)  # bs x 257 x dim
+            context = torch.concat([context_clip, context], dim=1)
+
+        # arguments
+        kwargs = dict(
+            e=e0,
+            seq_lens=seq_lens,
+            grid_sizes=grid_sizes,
+            freqs=self.freqs,
+            context=context,
+            context_lens=context_lens,
+            block_mask=self.block_mask,
+            updating_cache=updating_cache,
+        )
+
+        def create_custom_forward(module):
+            def custom_forward(*inputs, **kwargs):
+                return module(*inputs, **kwargs)
+            return custom_forward
+
+        for block_index, block in enumerate(self.blocks):
+            if torch.is_grad_enabled() and self.gradient_checkpointing:
+                kwargs.update(
+                    {
+                        "kv_cache": kv_cache[block_index],
+                        "current_start": current_start,
+                        "cache_start": cache_start
+                    }
+                )
+                x = torch.utils.checkpoint.checkpoint(
+                    create_custom_forward(block),
+                    x, **kwargs,
+                    use_reentrant=False,
+                )
+            else:
+                kwargs.update(
+                    {
+                        "kv_cache": kv_cache[block_index],
+                        "crossattn_cache": crossattn_cache[block_index],
+                        "current_start": current_start,
+                        "cache_start": cache_start
+                    }
+                )
+                x = block(x, **kwargs)
+
+        # head
+        x = self.head(x, e.unflatten(dim=0, sizes=t.shape).unsqueeze(2))
+        # unpatchify
+        x = self.unpatchify(x, grid_sizes)
+        return torch.stack(x)
+
+    def _forward_train(
+        self,
+        x,
+        t,
+        context,
+        seq_len,
+        clean_x=None,
+        aug_t=None,
+        clip_fea=None,
+        y=None,
+    ):
+        r"""
+        Forward pass through the diffusion model.
+
+        Args:
+            x (List[Tensor]):
+                List of input video tensors, each with shape [C_in, F, H, W]
+            t (Tensor):
+                Diffusion timesteps tensor of shape [B]
+            context (List[Tensor]):
+                List of text embeddings each with shape [L, C]
+            seq_len (`int`):
+                Maximum sequence length for positional encoding
+            clean_x (List[Tensor], *optional*):
+                Clean context latents for teacher forcing, same shape as x;
+                prepended to x along the token dimension and dropped from the output
+            aug_t (Tensor, *optional*):
+                Timesteps used for the time embedding of clean_x (defaults to zeros)
+            clip_fea (Tensor, *optional*):
+                CLIP image features for image-to-video mode
+            y (List[Tensor], *optional*):
+                Conditional video inputs for image-to-video mode, same shape as x
+
+        Returns:
+            List[Tensor]:
+                List of denoised video tensors with original input shapes [C_out, F, H / 8, W / 8]
+        """
+        if self.model_type == 'i2v':
+            assert clip_fea is not None and y is not None
+        # params
+        device = self.patch_embedding.weight.device
+        if self.freqs.device != device:
+            self.freqs = self.freqs.to(device)
+
+        # Construct blockwise causal attn mask
+        if self.block_mask is None:
+            if clean_x is not None:
+                if self.independent_first_frame:
+                    raise NotImplementedError()
+                else:
+                    self.block_mask = self._prepare_teacher_forcing_mask(
+                        device, num_frames=x.shape[2],
+                        frame_seqlen=x.shape[-2] * x.shape[-1] // (self.patch_size[1] * self.patch_size[2]),
+                        num_frame_per_block=self.num_frame_per_block
+                    )
+            else:
+                if self.independent_first_frame:
+                    self.block_mask = self._prepare_blockwise_causal_attn_mask_i2v(
+                        device, num_frames=x.shape[2],
+                        frame_seqlen=x.shape[-2] * x.shape[-1] // (self.patch_size[1] * self.patch_size[2]),
+                        num_frame_per_block=self.num_frame_per_block,
+                        local_attn_size=self.local_attn_size
+                    )
+                else:
+                    self.block_mask = self._prepare_blockwise_causal_attn_mask(
+                        device, num_frames=x.shape[2],
+                        frame_seqlen=x.shape[-2] * x.shape[-1] // (self.patch_size[1] * self.patch_size[2]),
+                        num_frame_per_block=self.num_frame_per_block,
+                        local_attn_size=self.local_attn_size
+                    )
+
+        if y is not None:
+            x = [torch.cat([u, v], dim=0) for u, v in zip(x, y)]
+
+        # embeddings
+        x = [self.patch_embedding(u.unsqueeze(0)) for u in x]
+
+        grid_sizes = torch.stack(
+            [torch.tensor(u.shape[2:], dtype=torch.long) for u in x])
+        x = [u.flatten(2).transpose(1, 2) for u in x]
+
+        seq_lens = torch.tensor([u.size(1) for u in x], dtype=torch.long)
+        assert seq_lens.max() <= seq_len
+        x = torch.cat([
+            torch.cat([u, u.new_zeros(1, seq_lens[0] - u.size(1), u.size(2))],
+                      dim=1) for u in x
+        ])
+
+        # time embeddings
+        # with amp.autocast(dtype=torch.float32):
+        e = self.time_embedding(
+            sinusoidal_embedding_1d(self.freq_dim, t.flatten()).type_as(x))
+        e0 = self.time_projection(e).unflatten(
+            1, (6, self.dim)).unflatten(dim=0, sizes=t.shape)
+        # assert e.dtype == torch.float32 and e0.dtype == torch.float32
+
+        # context
+        context_lens = None
+        context = self.text_embedding(
+            torch.stack([
+                torch.cat(
+                    [u, u.new_zeros(self.text_len - u.size(0), u.size(1))])
+                for u in context
+            ]))
+
+        if clip_fea is not None:
+            context_clip = self.img_emb(clip_fea)  # bs x 257 x dim
+            context = torch.concat([context_clip, context], dim=1)
+
+        if clean_x is not None:
+            clean_x = [self.patch_embedding(u.unsqueeze(0)) for u in clean_x]
+            clean_x = [u.flatten(2).transpose(1, 2) for u in clean_x]
+
+            seq_lens_clean = torch.tensor([u.size(1) for u in clean_x], dtype=torch.long)
+            assert seq_lens_clean.max() <= seq_len
+            clean_x = torch.cat([
+                torch.cat([u, u.new_zeros(1, seq_lens_clean[0] - u.size(1), u.size(2))], dim=1) for u in clean_x
+            ])
+
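+            # Teacher-forcing layout: the clean context tokens are prepended to
+            # the noisy tokens along the sequence dimension and get their own
+            # time embeddings (from aug_t, zeros by default); after the
+            # transformer blocks the clean half of the sequence is discarded.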
+            x = torch.cat([clean_x, x], dim=1)
+            if aug_t is None:
+                aug_t = torch.zeros_like(t)
+            e_clean = self.time_embedding(
+                sinusoidal_embedding_1d(self.freq_dim, aug_t.flatten()).type_as(x))
+            e0_clean = self.time_projection(e_clean).unflatten(
+                1, (6, self.dim)).unflatten(dim=0, sizes=t.shape)
+            e0 = torch.cat([e0_clean, e0], dim=1)
+
+        # arguments
+        kwargs = dict(
+            e=e0,
+            seq_lens=seq_lens,
+            grid_sizes=grid_sizes,
+            freqs=self.freqs,
+            context=context,
+            context_lens=context_lens,
+            block_mask=self.block_mask)
+
+        def create_custom_forward(module):
+            def custom_forward(*inputs, **kwargs):
+                return module(*inputs, **kwargs)
+            return custom_forward
+
+        for block in self.blocks:
+            if torch.is_grad_enabled() and self.gradient_checkpointing:
+                x = torch.utils.checkpoint.checkpoint(
+                    create_custom_forward(block),
+                    x, **kwargs,
+                    use_reentrant=False,
+                )
+            else:
+                x = block(x, **kwargs)
+
+        if clean_x is not None:
+            x = x[:, x.shape[1] // 2:]
+
+        # head
+        x = self.head(x, e.unflatten(dim=0, sizes=t.shape).unsqueeze(2))
+
+        # unpatchify
+        x = self.unpatchify(x, grid_sizes)
+        return torch.stack(x)
+
+    def forward(
+        self,
+        *args,
+        **kwargs
+    ):
+        if kwargs.get('kv_cache', None) is not None:
+            return self._forward_inference(*args, **kwargs)
+        else:
+            return self._forward_train(*args, **kwargs)
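+
+    # Usage sketch (illustrative; the variable names and cache construction are
+    # assumptions, not part of this file). A call without a kv_cache goes
+    # through _forward_train, while passing a kv_cache routes the call to the
+    # KV-cached _forward_inference path:
+    #
+    #   pred = model(noisy_latents, t=timesteps, context=text_embs,
+    #                seq_len=seq_len)  # -> _forward_train
+    #
+    #   pred = model(noisy_block, t=timesteps, context=text_embs,
+    #                seq_len=seq_len, kv_cache=kv_cache,
+    #                crossattn_cache=crossattn_cache,
+    #                current_start=start_token,
+    #                cache_start=cache_start)  # -> _forward_inference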
+
+    def unpatchify(self, x, grid_sizes):
+        r"""
+        Reconstruct video tensors from patch embeddings.
+
+        Args:
+            x (List[Tensor]):
+                List of patchified features, each with shape [L, C_out * prod(patch_size)]
+            grid_sizes (Tensor):
+                Original spatial-temporal grid dimensions before patching,
+                shape [B, 3] (3 dimensions correspond to F_patches, H_patches, W_patches)
+
+        Returns:
+            List[Tensor]:
+                Reconstructed video tensors with shape [C_out, F, H / 8, W / 8]
+        """
+
+        c = self.out_dim
+        out = []
+        for u, v in zip(x, grid_sizes.tolist()):
+            u = u[:math.prod(v)].view(*v, *self.patch_size, c)
+            u = torch.einsum('fhwpqrc->cfphqwr', u)
+            u = u.reshape(c, *[i * j for i, j in zip(v, self.patch_size)])
+            out.append(u)
+        return out
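+
+    # Shape walk-through (assuming a patch_size of (1, 2, 2); the actual value
+    # comes from the model config): for a grid_sizes row v = (F, H', W'), each
+    # u of shape [F*H'*W', C_out*1*2*2] is viewed as (F, H', W', 1, 2, 2, C_out),
+    # permuted by the einsum to (C_out, F, 1, H', 2, W', 2), and collapsed to
+    # (C_out, F*1, H'*2, W'*2).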
+
+    def init_weights(self):
+        r"""
+        Initialize model parameters using Xavier initialization.
+        """
+
+        # basic init
+        for m in self.modules():
+            if isinstance(m, nn.Linear):
+                nn.init.xavier_uniform_(m.weight)
+                if m.bias is not None:
+                    nn.init.zeros_(m.bias)
+
+        # init embeddings
+        nn.init.xavier_uniform_(self.patch_embedding.weight.flatten(1))
+        for m in self.text_embedding.modules():
+            if isinstance(m, nn.Linear):
+                nn.init.normal_(m.weight, std=.02)
+        for m in self.time_embedding.modules():
+            if isinstance(m, nn.Linear):
+                nn.init.normal_(m.weight, std=.02)
+
+        # init output layer
+        nn.init.zeros_(self.head.head.weight)