LinB203 committed on
Commit
a220803
1 Parent(s): bb863bd
This view is limited to 50 files because the commit contains too many changes. See the raw diff for the full change set.
Files changed (50)
  1. .gitignore +15 -0
  2. LICENSE +21 -0
  3. README.md +1 -1
  4. docker/LICENSE +21 -0
  5. docker/README.md +87 -0
  6. docker/build_docker.png +0 -0
  7. docker/docker_build.sh +8 -0
  8. docker/docker_run.sh +45 -0
  9. docker/dockerfile.base +24 -0
  10. docker/packages.txt +3 -0
  11. docker/ports.txt +1 -0
  12. docker/postinstallscript.sh +3 -0
  13. docker/requirements.txt +40 -0
  14. docker/run_docker.png +0 -0
  15. docker/setup_env.sh +11 -0
  16. docs/CausalVideoVAE.md +36 -0
  17. docs/Contribution_Guidelines.md +87 -0
  18. docs/Data.md +35 -0
  19. docs/EVAL.md +110 -0
  20. docs/VQVAE.md +57 -0
  21. examples/get_latents_std.py +38 -0
  22. examples/prompt_list_0.txt +16 -0
  23. examples/rec_image.py +57 -0
  24. examples/rec_imvi_vae.py +159 -0
  25. examples/rec_video.py +120 -0
  26. examples/rec_video_ae.py +120 -0
  27. examples/rec_video_vae.py +274 -0
  28. opensora/__init__.py +1 -0
  29. opensora/dataset/__init__.py +99 -0
  30. opensora/dataset/extract_feature_dataset.py +64 -0
  31. opensora/dataset/feature_datasets.py +213 -0
  32. opensora/dataset/landscope.py +90 -0
  33. opensora/dataset/sky_datasets.py +128 -0
  34. opensora/dataset/t2v_datasets.py +111 -0
  35. opensora/dataset/transform.py +489 -0
  36. opensora/dataset/ucf101.py +80 -0
  37. opensora/eval/cal_flolpips.py +83 -0
  38. opensora/eval/cal_fvd.py +85 -0
  39. opensora/eval/cal_lpips.py +97 -0
  40. opensora/eval/cal_psnr.py +84 -0
  41. opensora/eval/cal_ssim.py +113 -0
  42. opensora/eval/eval_clip_score.py +225 -0
  43. opensora/eval/eval_common_metric.py +224 -0
  44. opensora/eval/flolpips/correlation/correlation.py +397 -0
  45. opensora/eval/flolpips/flolpips.py +308 -0
  46. opensora/eval/flolpips/pretrained_networks.py +180 -0
  47. opensora/eval/flolpips/pwcnet.py +344 -0
  48. opensora/eval/flolpips/utils.py +95 -0
  49. opensora/eval/fvd/styleganv/fvd.py +90 -0
  50. opensora/eval/fvd/styleganv/i3d_torchscript.pt +3 -0
.gitignore ADDED
@@ -0,0 +1,15 @@
+ ucf101_stride4x4x4
+ __pycache__
+ *.mp4
+ .ipynb_checkpoints
+ *.pth
+ UCF-101/
+ results/
+ vae
+ build/
+ opensora.egg-info/
+ wandb/
+ .idea
+ *.ipynb
+ *.jpg
+ *.mp3
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2024 PKU-YUAN's Group (袁粒课题组-北大信工)
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -4,7 +4,7 @@ emoji: 🦀
  colorFrom: indigo
  colorTo: red
  sdk: gradio
- sdk_version: 4.25.0
+ sdk_version: 3.37.0
  app_file: app.py
  pinned: false
  license: mit
docker/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2024 SimonLee
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
docker/README.md ADDED
@@ -0,0 +1,87 @@
+ # Docker4ML
+
+ Useful Docker scripts for ML development.
+ [https://github.com/SimonLeeGit/Docker4ML](https://github.com/SimonLeeGit/Docker4ML)
+
+ ## Build Docker Image
+
+ ```bash
+ bash docker_build.sh
+ ```
+
+ ![build_docker](build_docker.png)
+
+ ## Run Docker Container as a Development Environment
+
+ ```bash
+ bash docker_run.sh
+ ```
+
+ ![run_docker](run_docker.png)
+
+ ## Custom Docker Config
+
+ ### Config [setup_env.sh](./setup_env.sh)
+
+ You can modify this file to customize your settings.
+
+ ```bash
+ TAG=ml:dev
+ BASE_TAG=nvcr.io/nvidia/pytorch:23.12-py3
+ ```
+
+ #### TAG
+
+ The tag of the Docker image you build; set it to whatever you like.
+
+ #### BASE_TAG
+
+ The base Docker image tag for your built image; here we use the NVIDIA PyTorch images.
+ You can check the available tags at [https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags)
+
+ You can also use another Docker image as the base, such as [ubuntu](https://hub.docker.com/_/ubuntu/tags).
+
+ #### USER_NAME
+
+ The user name used inside the Docker container.
+
+ #### USER_PASSWD
+
+ The user password used inside the Docker container.
+
+ ### Config [requirements.txt](./requirements.txt)
+
+ You can add the Python libraries you want installed by default here.
+
+ ```txt
+ transformers==4.27.1
+ ```
+
+ By default, the base image already ships with a number of libraries; you can check them at [https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-24-01.html](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-24-01.html)
+
+ ### Config [packages.txt](./packages.txt)
+
+ You can add the apt-get packages you want installed by default here.
+
+ ```txt
+ wget
+ curl
+ git
+ ```
+
+ ### Config [ports.txt](./ports.txt)
+
+ You can list the ports to expose from the Docker container here.
+
+ ```txt
+ -p 6006:6006
+ -p 8080:8080
+ ```
+
+ ### Config [postinstallscript.sh](./postinstallscript.sh)
+
+ You can add a custom script to run while building the Docker image.
+
+ ## Q&A
+
+ If you run into any problems, please contact <simonlee235@gmail.com>.
docker/build_docker.png ADDED
docker/docker_build.sh ADDED
@@ -0,0 +1,8 @@
+ #!/usr/bin/env bash
+
+ WORK_DIR=$(dirname "$(readlink -f "$0")")
+ cd $WORK_DIR
+
+ source setup_env.sh
+
+ docker build -t $TAG --build-arg BASE_TAG=$BASE_TAG --build-arg USER_NAME=$USER_NAME --build-arg USER_PASSWD=$USER_PASSWD . -f dockerfile.base
docker/docker_run.sh ADDED
@@ -0,0 +1,45 @@
+ #!/usr/bin/env bash
+
+ WORK_DIR=$(dirname "$(readlink -f "$0")")
+ source $WORK_DIR/setup_env.sh
+
+ RUNNING_IDS="$(docker ps --filter ancestor=$TAG --format "{{.ID}}")"
+
+ if [ -n "$RUNNING_IDS" ]; then
+     # Initialize an array to hold the container IDs
+     declare -a container_ids=($RUNNING_IDS)
+
+     # Get the first container ID using array indexing
+     ID=${container_ids[0]}
+
+     # Print the first container ID
+     echo ' '
+     echo "The running container ID is: $ID, enter it!"
+ else
+     echo ' '
+     echo "No running container found, starting a new one!"
+
+     # Run a new docker container instance
+     ID=$(docker run \
+         --rm \
+         --gpus all \
+         -itd \
+         --ipc=host \
+         --ulimit memlock=-1 \
+         --ulimit stack=67108864 \
+         -e DISPLAY=$DISPLAY \
+         -v /tmp/.X11-unix/:/tmp/.X11-unix/ \
+         -v $PWD:/home/$USER_NAME/workspace \
+         -w /home/$USER_NAME/workspace \
+         $(cat $WORK_DIR/ports.txt) \
+         $TAG)
+ fi
+
+ docker logs $ID
+
+ echo ' '
+ echo ' '
+ echo '========================================='
+ echo ' '
+
+ docker exec -it $ID bash
docker/dockerfile.base ADDED
@@ -0,0 +1,24 @@
+ ARG BASE_TAG
+ FROM ${BASE_TAG}
+ ARG USER_NAME=myuser
+ ARG USER_PASSWD=111111
+ ARG DEBIAN_FRONTEND=noninteractive
+
+ # Pre-install packages, pip install requirements and run the post-install script.
+ COPY packages.txt .
+ COPY requirements.txt .
+ COPY postinstallscript.sh .
+ RUN apt-get update && apt-get install -y sudo $(cat packages.txt)
+ RUN pip install --no-cache-dir -r requirements.txt
+ RUN bash postinstallscript.sh
+
+ # Create a new user and group using the username argument
+ RUN groupadd -r ${USER_NAME} && useradd -r -m -g${USER_NAME} ${USER_NAME}
+ RUN echo "${USER_NAME}:${USER_PASSWD}" | chpasswd
+ RUN usermod -aG sudo ${USER_NAME}
+ USER ${USER_NAME}
+ ENV USER=${USER_NAME}
+ WORKDIR /home/${USER_NAME}/workspace
+
+ # Set the prompt to highlight the username
+ RUN echo "export PS1='\[\033[01;32m\]\u\[\033[00m\]@\[\033[01;34m\]\h\[\033[00m\]:\[\033[01;36m\]\w\[\033[00m\]\$'" >> /home/${USER_NAME}/.bashrc
docker/packages.txt ADDED
@@ -0,0 +1,3 @@
+ wget
+ curl
+ git
docker/ports.txt ADDED
@@ -0,0 +1 @@
+ -p 6006:6006
docker/postinstallscript.sh ADDED
@@ -0,0 +1,3 @@
+ #!/usr/bin/env bash
+ # This script runs while the docker image is being built.
+
docker/requirements.txt ADDED
@@ -0,0 +1,40 @@
+ setuptools>=61.0
+ torch==2.0.1
+ torchvision==0.15.2
+ transformers==4.32.0
+ albumentations==1.4.0
+ av==11.0.0
+ decord==0.6.0
+ einops==0.3.0
+ fastapi==0.110.0
+ accelerate==0.21.0
+ gdown==5.1.0
+ h5py==3.10.0
+ idna==3.6
+ imageio==2.34.0
+ matplotlib==3.7.5
+ numpy==1.24.4
+ omegaconf==2.1.1
+ opencv-python==4.9.0.80
+ opencv-python-headless==4.9.0.80
+ pandas==2.0.3
+ pillow==10.2.0
+ pydub==0.25.1
+ pytorch-lightning==1.4.2
+ pytorchvideo==0.1.5
+ PyYAML==6.0.1
+ regex==2023.12.25
+ requests==2.31.0
+ scikit-learn==1.3.2
+ scipy==1.10.1
+ six==1.16.0
+ tensorboard==2.14.0
+ test-tube==0.7.5
+ timm==0.9.16
+ torchdiffeq==0.2.3
+ torchmetrics==0.5.0
+ tqdm==4.66.2
+ urllib3==2.2.1
+ uvicorn==0.27.1
+ diffusers==0.24.0
+ scikit-video==1.1.11
docker/run_docker.png ADDED
docker/setup_env.sh ADDED
@@ -0,0 +1,11 @@
+ # Docker tag for the newly built image
+ TAG=open_sora_plan:dev
+
+ # Base docker image tag used by docker build
+ BASE_TAG=nvcr.io/nvidia/pytorch:23.05-py3
+
+ # User name used in the docker container
+ USER_NAME=developer
+
+ # User password used in the docker container
+ USER_PASSWD=666666
docs/CausalVideoVAE.md ADDED
@@ -0,0 +1,36 @@
+ # CausalVideoVAE Report
+
+ ## Examples
+
+ ### Image Reconstruction
+
+ Reconstruction at **1536×1024**.
+
+ <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/1684c3ec-245d-4a60-865c-b8946d788eb9" width="45%"/> <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/46ef714e-3e5b-492c-aec4-3793cb2260b5" width="45%"/>
+
+ ### Video Reconstruction
+
+ We reconstruct two videos at **720×1280**. Since GitHub cannot host large videos, we put them here: [1](https://streamable.com/gqojal), [2](https://streamable.com/6nu3j8).
+
+ https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/c100bb02-2420-48a3-9d7b-4608a41f14aa
+
+ https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/8aa8f587-d9f1-4e8b-8a82-d3bf9ba91d68
+
+ ## Model Structure
+
+ ![image](https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/e3c8b35d-a217-4d96-b2e9-5c248a2859c8)
+
+ The Causal Video VAE architecture inherits from the [Stable-Diffusion Image VAE](https://github.com/CompVis/stable-diffusion/tree/main). To ensure that the pretrained weights of the Image VAE can be seamlessly applied to the Video VAE, the model structure has been designed as follows:
+
+ **1. CausalConv3D**: Converting Conv2D to CausalConv3D enables joint training on image and video data. CausalConv3D treats the first frame specially, since it does not have access to subsequent frames. For more details, please refer to https://github.com/PKU-YuanGroup/Open-Sora-Plan/pull/145
+
+ **2. Initialization**: There are two common [methods](https://github.com/hassony2/inflated_convnets_pytorch/blob/master/src/inflate.py#L5) for expanding a Conv2D into a Conv3D: average initialization and center initialization. We instead employ a specific initialization method (tail initialization), which ensures that, without any training, the model can directly reconstruct images, and even videos. A minimal sketch of both ideas follows.
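+
+ The sketch below is only an illustrative outline of these two ideas (causal temporal padding plus tail initialization); it is not the exact implementation used in this repository, and the module and helper names are placeholders.
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class CausalConv3d(nn.Module):
+     """Illustrative Conv3d that only sees the current and past frames."""
+     def __init__(self, in_ch, out_ch, kernel_size=(3, 3, 3)):
+         super().__init__()
+         kt, kh, kw = kernel_size
+         self.time_pad = kt - 1  # pad only the past side of the time axis
+         self.conv = nn.Conv3d(in_ch, out_ch, kernel_size, padding=(0, kh // 2, kw // 2))
+
+     def forward(self, x):  # x: (B, C, T, H, W)
+         # Replicate the first frame to fill the causal padding, so frame t never sees frames > t.
+         front = x[:, :, :1].repeat(1, 1, self.time_pad, 1, 1)
+         return self.conv(torch.cat([front, x], dim=2))
+
+ def tail_init_from_conv2d(conv3d: nn.Conv3d, conv2d: nn.Conv2d) -> None:
+     """Copy the 2D kernel into the last temporal slice and zero the rest (tail initialization)."""
+     with torch.no_grad():
+         conv3d.weight.zero_()
+         conv3d.weight[:, :, -1] = conv2d.weight  # the tail slice acts on the current frame
+         if conv2d.bias is not None and conv3d.bias is not None:
+             conv3d.bias.copy_(conv2d.bias)
+ ```
+
+ With left-only temporal padding, the last temporal tap of the kernel corresponds to the current frame, so a tail-initialized CausalConv3D reproduces the pretrained 2D convolution frame by frame before any video training.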
+
+ ## Training Details
+
+ <img width="833" alt="image" src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/9ffb6dc4-23f6-4274-a066-bbebc7522a14">
+
+ We present the loss curves for the two initialization methods at 17×256×256. The yellow curve corresponds to tail initialization, while the blue curve corresponds to center initialization. As the plot shows, tail initialization performs better on the loss curve. In addition, **we found that center initialization leads to error accumulation**, causing collapse over longer durations.
docs/Contribution_Guidelines.md ADDED
@@ -0,0 +1,87 @@
+ # Contributing to the Open-Sora Plan Community
+
+ The Open-Sora Plan open-source community is a collaborative initiative driven by the community, emphasizing a commitment to being free and void of exploitation. It is organized spontaneously by community members, and we invite you to contribute to the Open-Sora Plan open-source community and help elevate it to new heights!
+
+ ## Submitting a Pull Request (PR)
+
+ As a contributor, before submitting your request, kindly follow these guidelines:
+
+ 1. Start by checking the [Open-Sora Plan GitHub](https://github.com/PKU-YuanGroup/Open-Sora-Plan/pulls) to see if there are any open or closed pull requests related to your intended submission. Avoid duplicating existing work.
+
+ 2. [Fork](https://github.com/PKU-YuanGroup/Open-Sora-Plan/fork) the [open-sora plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan) repository and clone your forked repository to your local machine.
+
+    ```bash
+    git clone [your-forked-repository-url]
+    ```
+
+ 3. Add the original Open-Sora Plan repository as a remote to sync with the latest updates:
+
+    ```bash
+    git remote add upstream https://github.com/PKU-YuanGroup/Open-Sora-Plan
+    ```
+
+ 4. Sync the code from the main repository to your local machine, and then push it back to your forked remote repository.
+
+    ```
+    # Pull the latest code from the upstream branch
+    git fetch upstream
+
+    # Switch to the main branch
+    git checkout main
+
+    # Merge the updates from the upstream branch into main, synchronizing the local main branch with the upstream
+    git merge upstream/main
+
+    # Additionally, sync the local main branch to the remote branch of your forked repository
+    git push origin main
+    ```
+
+    > Note: Sync the code from the main repository before each submission.
+
+ 5. Create a branch in your forked repository for your changes, ensuring the branch name is meaningful.
+
+    ```bash
+    git checkout -b my-docs-branch main
+    ```
+
+ 6. While making modifications and committing changes, adhere to our [Commit Message Format](#Commit-Message-Format).
+
+    ```bash
+    git commit -m "[docs]: xxxx"
+    ```
+
+ 7. Push your changes to your GitHub repository.
+
+    ```bash
+    git push origin my-docs-branch
+    ```
+
+ 8. Submit a pull request to `Open-Sora-Plan:main` on the GitHub repository page.
+
+ ## Commit Message Format
+
+ Commit messages must include both `<type>` and `<summary>` sections.
+
+ ```bash
+ [<type>]: <summary>
+   │          │
+   │          └─⫸ Briefly describe your changes, without ending with a period.
+   │
+   └─⫸ Commit Type: |docs|feat|fix|refactor|
+ ```
+
+ ### Type
+
+ * **docs**: Modify or add documents.
+ * **feat**: Introduce a new feature.
+ * **fix**: Fix a bug.
+ * **refactor**: Restructure code, excluding new features or bug fixes.
+
+ ### Summary
+
+ Describe modifications in English, without ending with a period.
+
+ > e.g., git commit -m "[docs]: add a contributing.md file"
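+
+ If you want to sanity-check a message locally, a small optional sketch (our own illustration, not part of the repository's tooling) could look like this:
+
+ ```python
+ import re
+
+ # "[<type>]: <summary>" with a valid type and no trailing period, as described above.
+ COMMIT_RE = re.compile(r"^\[(docs|feat|fix|refactor)\]: .+[^.]$")
+
+ def is_valid_commit_message(msg: str) -> bool:
+     return COMMIT_RE.match(msg) is not None
+
+ assert is_valid_commit_message("[docs]: add a contributing.md file")
+ assert not is_valid_commit_message("[feature]: add stuff.")
+ ```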
+
+ This guideline is borrowed from [minisora](https://github.com/mini-sora/minisora). We sincerely appreciate the MiniSora authors for their awesome templates.
docs/Data.md ADDED
@@ -0,0 +1,35 @@
+
+ **We need more datasets**; please refer to the [open-sora-Dataset](https://github.com/shaodong233/open-sora-Dataset) for details.
+
+ ## Sky
+
+ This is an unconditional dataset. [Link](https://drive.google.com/open?id=1xWLiU-MBGN7MrsFHQm4_yXmfHBsMbJQo)
+
+ ```
+ sky_timelapse
+ ├── readme
+ ├── sky_test
+ ├── sky_train
+ ├── test_videofolder.py
+ └── video_folder.py
+ ```
+
+ ## UCF101
+
+ We test the code with the UCF-101 dataset. To obtain UCF-101, you can download the necessary files from [here](https://www.crcv.ucf.edu/data/UCF101.php). The code assumes a `UCF-101` directory with the following structure (a small indexing sketch follows the block):
+ ```
+ UCF-101/
+     ApplyEyeMakeup/
+         v1.avi
+         ...
+     ...
+     YoYo/
+         v1.avi
+         ...
+ ```
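+
+ As an illustration of how this layout can be consumed, here is a minimal indexing sketch (the helper below is ours, not part of the repository):
+
+ ```python
+ import os
+
+ def index_ucf101(root="UCF-101"):
+     """Collect (class_name, video_path) pairs from the directory layout shown above."""
+     samples = []
+     for class_name in sorted(os.listdir(root)):
+         class_dir = os.path.join(root, class_name)
+         if not os.path.isdir(class_dir):
+             continue
+         for fname in sorted(os.listdir(class_dir)):
+             if fname.endswith(".avi"):
+                 samples.append((class_name, os.path.join(class_dir, fname)))
+     return samples
+
+ # e.g. samples[0] -> ("ApplyEyeMakeup", "UCF-101/ApplyEyeMakeup/v1.avi")
+ ```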
+
+ ## Offline feature extraction
+ Coming soon...
docs/EVAL.md ADDED
@@ -0,0 +1,110 @@
+ # Evaluate the generated video quality
+
+ You can easily calculate the following video quality metrics, all of which support batch-wise processing.
+ - **CLIP-SCORE**: uses the pretrained CLIP model to measure the cosine similarity between two modalities.
+ - **FVD**: Fréchet Video Distance
+ - **SSIM**: structural similarity index measure
+ - **LPIPS**: learned perceptual image patch similarity
+ - **PSNR**: peak signal-to-noise ratio
+
+ # Requirement
+ ## Environment
+ - Install PyTorch (torch>=1.7.1)
+ - Install CLIP
+   ```
+   pip install git+https://github.com/openai/CLIP.git
+   ```
+ - Install clip-score from PyPI
+   ```
+   pip install clip-score
+   ```
+ - Other packages
+   ```
+   pip install lpips
+   pip install scipy (scipy==1.7.3/1.9.3, if you use 1.11.3, **you will calculate a WRONG FVD VALUE!!!**)
+   pip install numpy
+   pip install pillow
+   pip install torchvision>=0.8.2
+   pip install ftfy
+   pip install regex
+   pip install tqdm
+   ```
+ ## Pretrained model
+ - FVD
+   Before you calculate FVD, you should first download the FVD pretrained model. You can manually download either of the following and put it into the FVD folder.
+   - `i3d_torchscript.pt` from [here](https://www.dropbox.com/s/ge9e5ujwgetktms/i3d_torchscript.pt)
+   - `i3d_pretrained_400.pt` from [here](https://onedrive.live.com/download?cid=78EEF3EB6AE7DBCB&resid=78EEF3EB6AE7DBCB%21199&authkey=AApKdFHPXzWLNyI)
+
+ ## Other Notices
+ 1. Make sure the pixel values of the videos are in [0, 1] (a small PSNR sketch illustrating this convention follows these notices).
+ 2. We average SSIM when images have 3 channels; SSIM is the only metric that is extremely sensitive to gray being compared to black-and-white.
+ 3. Because the I3D model downsamples in the time dimension, `frames_num` should be greater than 10 when calculating FVD, so the FVD calculation begins from the 10th frame, as in the example above.
+ 4. For grayscale videos, we replicate them to 3 channels.
+ 5. Data input specifications for clip_score:
+ > - Image files: All images should be stored in a single directory. The image files can be in either .png or .jpg format.
+ >
+ > - Text files: All text data should be contained in plain text files in a separate directory. These text files should have the extension .txt.
+ >
+ > Note: The number of files in the image directory should be exactly equal to the number of files in the text directory. Additionally, the files in the image directory and text directory should be paired by file name. For instance, if there is a cat.png in the image directory, there should be a corresponding cat.txt in the text directory.
+ >
+ > Directory Structure Example:
+ > ```
+ > ├── path/to/image
+ > │   ├── cat.png
+ > │   ├── dog.png
+ > │   └── bird.jpg
+ > └── path/to/text
+ >     ├── cat.txt
+ >     ├── dog.txt
+ >     └── bird.txt
+ > ```
+
+ 6. Data input specifications for fvd, psnr, ssim, lpips:
+
+ > Directory Structure Example:
+ > ```
+ > ├── path/to/generated_image
+ > │   ├── cat.mp4
+ > │   ├── dog.mp4
+ > │   └── bird.mp4
+ > └── path/to/real_image
+ >     ├── cat.mp4
+ >     ├── dog.mp4
+ >     └── bird.mp4
+ > ```
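+
+ As a reference for the [0, 1] convention in notice 1, here is a minimal PSNR sketch over aligned clips. It is illustrative only; the actual metric is computed by the `cal_psnr.sh` script below.
+
+ ```python
+ import torch
+
+ def psnr(fake: torch.Tensor, real: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
+     """Per-sample PSNR for videos whose pixel values are already in [0, 1]."""
+     mse = torch.mean((fake - real) ** 2, dim=tuple(range(1, fake.dim())))
+     return 10 * torch.log10(1.0 / (mse + eps))  # MAX_I = 1.0 because inputs are in [0, 1]
+
+ # fake = torch.rand(4, 3, 16, 256, 256)  # (B, C, T, H, W)
+ # real = torch.rand(4, 3, 16, 256, 256)
+ # print(psnr(fake, real))
+ ```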
+
+ # Usage
+
+ ```
+ # change the file paths and set frame_num, resolution, etc.
+
+ # clip_score cross modality
+ cd opensora/eval
+ bash script/cal_clip_score.sh
+
+ # fvd
+ cd opensora/eval
+ bash script/cal_fvd.sh
+
+ # psnr
+ cd opensora/eval
+ bash script/cal_psnr.sh
+
+ # ssim
+ cd opensora/eval
+ bash script/cal_ssim.sh
+
+ # lpips
+ cd opensora/eval
+ bash script/cal_lpips.sh
+ ```
+
+ # Acknowledgement
+ The evaluation codebase refers to [clip-score](https://github.com/Taited/clip-score) and [common_metrics](https://github.com/JunyaoHu/common_metrics_on_video_quality).
docs/VQVAE.md ADDED
@@ -0,0 +1,57 @@
+ # VQVAE Documentation
+
+ # Introduction
+
+ The Vector Quantized Variational AutoEncoder (VQ-VAE) is a type of autoencoder that uses a discrete latent representation. It is particularly useful for tasks that require discrete latent variables, such as text-to-speech and video generation.
+
+ # Usage
+
+ ## Initialization
+
+ To initialize a VQVAE model, you can use the `VideoGPTVQVAE` class. This class is part of the `opensora.models.ae` module.
+
+ ```python
+ from opensora.models.ae import VideoGPTVQVAE
+
+ vqvae = VideoGPTVQVAE()
+ ```
+
+ ### Training
+
+ To train the VQVAE model, you can use the `train_videogpt.sh` script. This script trains the model using the parameters specified in the script.
+
+ ```bash
+ bash scripts/videogpt/train_videogpt.sh
+ ```
+
+ ### Loading Pretrained Models
+
+ You can load a pretrained model using the `download_and_load_model` method. This method downloads the checkpoint file and loads the model.
+
+ ```python
+ vqvae = VideoGPTVQVAE.download_and_load_model("bair_stride4x2x2")
+ ```
+
+ Alternatively, you can load a model from a checkpoint using the `load_from_checkpoint` method.
+
+ ```python
+ vqvae = VQVAEModel.load_from_checkpoint("results/VQVAE/checkpoint-1000")
+ ```
+
+ ### Encoding and Decoding
+
+ You can encode a video using the `encode` method. This method returns the encodings and embeddings of the video.
+
+ ```python
+ encodings, embeddings = vqvae.encode(x_vae, include_embeddings=True)
+ ```
+
+ You can reconstruct a video from its encodings using the `decode` method.
+
+ ```python
+ video_recon = vqvae.decode(encodings)
+ ```
+
+ ## Testing
+
+ You can test the VQVAE model by reconstructing a video. The `examples/rec_video.py` script provides an example of how to do this.
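+
+ Putting the pieces above together, a minimal end-to-end round trip might look like the sketch below. The input tensor here is a random placeholder in an assumed (B, C, T, H, W) layout, for illustration only; see `examples/rec_video.py` for the real preprocessing pipeline.
+
+ ```python
+ import torch
+ from opensora.models.ae import VideoGPTVQVAE
+
+ # Load a pretrained VQVAE by name, as documented above.
+ vqvae = VideoGPTVQVAE.download_and_load_model("bair_stride4x2x2")
+ vqvae.eval()
+
+ # Dummy clip (assumed shape, for illustration only).
+ x_vae = torch.randn(1, 3, 16, 64, 64)
+
+ with torch.no_grad():
+     encodings, embeddings = vqvae.encode(x_vae, include_embeddings=True)
+     video_recon = vqvae.decode(encodings)
+
+ print(video_recon.shape)
+ ```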
examples/get_latents_std.py ADDED
@@ -0,0 +1,38 @@
+ import torch
+ from torch.utils.data import DataLoader, Subset
+ import sys
+ sys.path.append(".")
+ from opensora.models.ae.videobase import CausalVAEModel, CausalVAEDataset
+
+ num_workers = 4
+ batch_size = 12
+
+ torch.manual_seed(0)
+ torch.set_grad_enabled(False)
+
+ device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+
+ pretrained_model_name_or_path = 'results/causalvae/checkpoint-26000'
+ data_path = '/remote-home1/dataset/UCF-101'
+ video_num_frames = 17
+ resolution = 128
+ sample_rate = 10
+
+ vae = CausalVAEModel.load_from_checkpoint(pretrained_model_name_or_path)
+ vae.to(device)
+
+ dataset = CausalVAEDataset(data_path, sequence_length=video_num_frames, resolution=resolution, sample_rate=sample_rate)
+ subset_indices = list(range(1000))
+ subset_dataset = Subset(dataset, subset_indices)
+ loader = DataLoader(subset_dataset, batch_size=8, pin_memory=True)
+
+ all_latents = []
+ for video_data in loader:
+     video_data = video_data['video'].to(device)
+     latents = vae.encode(video_data).sample()
+     all_latents.append(latents.cpu())  # collect the latents (not the raw video) to estimate their std
+
+ all_latents_tensor = torch.cat(all_latents)
+ std = all_latents_tensor.std().item()
+ normalizer = 1 / std
+ print(f'{normalizer = }')
examples/prompt_list_0.txt ADDED
@@ -0,0 +1,16 @@
+ A quiet beach at dawn, the waves gently lapping at the shore and the sky painted in pastel hues.
+ A quiet beach at dawn, the waves softly lapping at the shore, pink and orange hues painting the sky, offering a moment of solitude and reflection.
+ The majestic beauty of a waterfall cascading down a cliff into a serene lake.
+ Sunset over the sea.
+ a cat wearing sunglasses and working as a lifeguard at pool.
+ Slow pan upward of blazing oak fire in an indoor fireplace.
+ Yellow and black tropical fish dart through the sea.
+ a serene winter scene in a forest. The forest is blanketed in a thick layer of snow, which has settled on the branches of the trees, creating a canopy of white. The trees, a mix of evergreens and deciduous, stand tall and silent, their forms partially obscured by the snow. The ground is a uniform white, with no visible tracks or signs of human activity. The sun is low in the sky, casting a warm glow that contrasts with the cool tones of the snow. The light filters through the trees, creating a soft, diffused illumination that highlights the texture of the snow and the contours of the trees. The overall style of the scene is naturalistic, with a focus on the tranquility and beauty of the winter landscape.
+ a dynamic interaction between the ocean and a large rock. The rock, with its rough texture and jagged edges, is partially submerged in the water, suggesting it is a natural feature of the coastline. The water around the rock is in motion, with white foam and waves crashing against the rock, indicating the force of the ocean's movement. The background is a vast expanse of the ocean, with small ripples and waves, suggesting a moderate sea state. The overall style of the scene is a realistic depiction of a natural landscape, with a focus on the interplay between the rock and the water.
+ A serene waterfall cascading down moss-covered rocks, its soothing sound creating a harmonious symphony with nature.
+ A soaring drone footage captures the majestic beauty of a coastal cliff, its red and yellow stratified rock faces rich in color and against the vibrant turquoise of the sea. Seabirds can be seen taking flight around the cliff's precipices. As the drone slowly moves from different angles, the changing sunlight casts shifting shadows that highlight the rugged textures of the cliff and the surrounding calm sea. The water gently laps at the rock base and the greenery that clings to the top of the cliff, and the scene gives a sense of peaceful isolation at the fringes of the ocean. The video captures the essence of pristine natural beauty untouched by human structures.
+ The video captures the majestic beauty of a waterfall cascading down a cliff into a serene lake. The waterfall, with its powerful flow, is the central focus of the video. The surrounding landscape is lush and green, with trees and foliage adding to the natural beauty of the scene. The camera angle provides a bird's eye view of the waterfall, allowing viewers to appreciate the full height and grandeur of the waterfall. The video is a stunning representation of nature's power and beauty.
+ A vibrant scene of a snowy mountain landscape. The sky is filled with a multitude of colorful hot air balloons, each floating at different heights, creating a dynamic and lively atmosphere. The balloons are scattered across the sky, some closer to the viewer, others further away, adding depth to the scene. Below, the mountainous terrain is blanketed in a thick layer of snow, with a few patches of bare earth visible here and there. The snow-covered mountains provide a stark contrast to the colorful balloons, enhancing the visual appeal of the scene.
+ A serene underwater scene featuring a sea turtle swimming through a coral reef. The turtle, with its greenish-brown shell, is the main focus of the video, swimming gracefully towards the right side of the frame. The coral reef, teeming with life, is visible in the background, providing a vibrant and colorful backdrop to the turtle's journey. Several small fish, darting around the turtle, add a sense of movement and dynamism to the scene.
+ A snowy forest landscape with a dirt road running through it. The road is flanked by trees covered in snow, and the ground is also covered in snow. The sun is shining, creating a bright and serene atmosphere. The road appears to be empty, and there are no people or animals visible in the video. The style of the video is a natural landscape shot, with a focus on the beauty of the snowy forest and the peacefulness of the road.
+ The dynamic movement of tall, wispy grasses swaying in the wind. The sky above is filled with clouds, creating a dramatic backdrop. The sunlight pierces through the clouds, casting a warm glow on the scene. The grasses are a mix of green and brown, indicating a change in seasons. The overall style of the video is naturalistic, capturing the beauty of the landscape in a realistic manner. The focus is on the grasses and their movement, with the sky serving as a secondary element. The video does not contain any human or animal elements.
examples/rec_image.py ADDED
@@ -0,0 +1,57 @@
+ import sys
+ sys.path.append(".")
+ from PIL import Image
+ import torch
+ from torchvision.transforms import ToTensor, Compose, Resize, Normalize
+ from torch.nn import functional as F
+ from opensora.models.ae.videobase import CausalVAEModel
+ import argparse
+ import numpy as np
+
+ def preprocess(video_data: torch.Tensor, short_size: int = 128) -> torch.Tensor:
+     transform = Compose(
+         [
+             ToTensor(),
+             Normalize((0.5), (0.5)),
+             Resize(size=short_size),
+         ]
+     )
+     outputs = transform(video_data)
+     outputs = outputs.unsqueeze(0).unsqueeze(2)
+     return outputs
+
+ def main(args: argparse.Namespace):
+     image_path = args.image_path
+     resolution = args.resolution
+     device = args.device
+
+     vqvae = CausalVAEModel.load_from_checkpoint(args.ckpt)
+     vqvae.eval()
+     vqvae = vqvae.to(device)
+
+     with torch.no_grad():
+         x_vae = preprocess(Image.open(image_path), resolution)
+         x_vae = x_vae.to(device)
+         latents = vqvae.encode(x_vae)
+         recon = vqvae.decode(latents.sample())
+     x = recon[0, :, 0, :, :]
+     x = x.squeeze()
+     x = x.detach().cpu().numpy()
+     x = np.clip(x, -1, 1)
+     x = (x + 1) / 2
+     x = (255 * x).astype(np.uint8)
+     x = x.transpose(1, 2, 0)
+     image = Image.fromarray(x)
+     image.save(args.rec_path)
+
+
+ if __name__ == '__main__':
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--image-path', type=str, default='')
+     parser.add_argument('--rec-path', type=str, default='')
+     parser.add_argument('--ckpt', type=str, default='')
+     parser.add_argument('--resolution', type=int, default=336)
+     parser.add_argument('--device', type=str, default='cuda')
+
+     args = parser.parse_args()
+     main(args)
examples/rec_imvi_vae.py ADDED
@@ -0,0 +1,159 @@
+ import math
+ import random
+ import argparse
+ from typing import Optional
+
+ import cv2
+ import numpy as np
+ import numpy.typing as npt
+ import torch
+ from PIL import Image
+ from decord import VideoReader, cpu
+ from torch.nn import functional as F
+ from pytorchvideo.transforms import ShortSideScale
+ from torchvision.transforms import Lambda, Compose
+
+ import sys
+ sys.path.append(".")
+ from opensora.dataset.transform import CenterCropVideo, resize
+ from opensora.models.ae.videobase import CausalVAEModel
+
+
+ def array_to_video(image_array: npt.NDArray, fps: float = 30.0, output_file: str = 'output_video.mp4') -> None:
+     height, width, channels = image_array[0].shape
+     fourcc = cv2.VideoWriter_fourcc(*'mp4v')
+     video_writer = cv2.VideoWriter(output_file, fourcc, float(fps), (width, height))
+
+     for image in image_array:
+         image_rgb = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
+         video_writer.write(image_rgb)
+
+     video_writer.release()
+
+ def custom_to_video(x: torch.Tensor, fps: float = 2.0, output_file: str = 'output_video.mp4') -> None:
+     x = x.detach().cpu()
+     x = torch.clamp(x, -1, 1)
+     x = (x + 1) / 2
+     x = x.permute(1, 2, 3, 0).numpy()
+     x = (255 * x).astype(np.uint8)
+     array_to_video(x, fps=fps, output_file=output_file)
+     return
+
+ def read_video(video_path: str, num_frames: int, sample_rate: int) -> torch.Tensor:
+     decord_vr = VideoReader(video_path, ctx=cpu(0))
+     total_frames = len(decord_vr)
+     sample_frames_len = sample_rate * num_frames
+
+     if total_frames > sample_frames_len:
+         s = random.randint(0, total_frames - sample_frames_len - 1)
+         s = 0
+         e = s + sample_frames_len
+         num_frames = num_frames
+     else:
+         s = 0
+         e = total_frames
+         num_frames = int(total_frames / sample_frames_len * num_frames)
+         print(f'sample_frames_len {sample_frames_len}, only can sample {num_frames * sample_rate}', video_path,
+               total_frames)
+
+     frame_id_list = np.linspace(s, e - 1, num_frames, dtype=int)
+     video_data = decord_vr.get_batch(frame_id_list).asnumpy()
+     video_data = torch.from_numpy(video_data)
+     video_data = video_data.permute(3, 0, 1, 2)  # (T, H, W, C) -> (C, T, H, W)
+     return video_data
+
+
+ class ResizeVideo:
+     def __init__(
+         self,
+         size,
+         interpolation_mode="bilinear",
+     ):
+         self.size = size
+         self.interpolation_mode = interpolation_mode
+
+     def __call__(self, clip):
+         _, _, h, w = clip.shape
+         if w < h:
+             new_h = int(math.floor((float(h) / w) * self.size))
+             new_w = self.size
+         else:
+             new_h = self.size
+             new_w = int(math.floor((float(w) / h) * self.size))
+         return torch.nn.functional.interpolate(
+             clip, size=(new_h, new_w), mode=self.interpolation_mode, align_corners=False, antialias=True
+         )
+
+     def __repr__(self) -> str:
+         return f"{self.__class__.__name__}(size={self.size}, interpolation_mode={self.interpolation_mode}"
+
+ def preprocess(video_data: torch.Tensor, short_size: int = 128, crop_size: Optional[int] = None) -> torch.Tensor:
+     transform = Compose(
+         [
+             Lambda(lambda x: ((x / 255.0) * 2 - 1)),
+             ResizeVideo(size=short_size),
+             CenterCropVideo(crop_size) if crop_size is not None else Lambda(lambda x: x),
+         ]
+     )
+
+     video_outputs = transform(video_data)
+     video_outputs = torch.unsqueeze(video_outputs, 0)
+
+     return video_outputs
+
+
+ def main(args: argparse.Namespace):
+     video_path = args.video_path
+     num_frames = args.num_frames
+     resolution = args.resolution
+     crop_size = args.crop_size
+     sample_fps = args.sample_fps
+     sample_rate = args.sample_rate
+     device = args.device
+     vqvae = CausalVAEModel.from_pretrained(args.ckpt)
+     if args.enable_tiling:
+         vqvae.enable_tiling()
+         vqvae.tile_overlap_factor = args.tile_overlap_factor
+     vqvae.eval()
+     vqvae = vqvae.to(device)
+     vqvae = vqvae  # .to(torch.float16)
+
+     with torch.no_grad():
+         x_vae = preprocess(read_video(video_path, num_frames, sample_rate), resolution, crop_size)
+         x_vae = x_vae.to(device)  # b c t h w
+         x_vae = x_vae  # .to(torch.float16)
+         latents = vqvae.encode(x_vae).sample()  # .to(torch.float16)
+         video_recon = vqvae.decode(latents)
+
+     if video_recon.shape[2] == 1:
+         x = video_recon[0, :, 0, :, :]
+         x = x.squeeze()
+         x = x.detach().cpu().numpy()
+         x = np.clip(x, -1, 1)
+         x = (x + 1) / 2
+         x = (255 * x).astype(np.uint8)
+         x = x.transpose(1, 2, 0)
+         image = Image.fromarray(x)
+         image.save(args.rec_path.replace('mp4', 'jpg'))
+     else:
+         custom_to_video(video_recon[0], fps=sample_fps / sample_rate, output_file=args.rec_path)
+
+ if __name__ == '__main__':
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--video-path', type=str, default='')
+     parser.add_argument('--rec-path', type=str, default='')
+     parser.add_argument('--ckpt', type=str, default='results/pretrained')
+     parser.add_argument('--sample-fps', type=int, default=30)
+     parser.add_argument('--resolution', type=int, default=336)
+     parser.add_argument('--crop-size', type=int, default=None)
+     parser.add_argument('--num-frames', type=int, default=100)
+     parser.add_argument('--sample-rate', type=int, default=1)
+     parser.add_argument('--device', type=str, default="cuda")
+     parser.add_argument('--tile_overlap_factor', type=float, default=0.25)
+     parser.add_argument('--enable_tiling', action='store_true')
+
+     args = parser.parse_args()
+     main(args)
examples/rec_video.py ADDED
@@ -0,0 +1,120 @@
+ import random
+ import argparse
+ from typing import Optional
+
+ import cv2
+ import imageio
+ import numpy as np
+ import numpy.typing as npt
+ import torch
+ from decord import VideoReader, cpu
+ from torch.nn import functional as F
+ from pytorchvideo.transforms import ShortSideScale
+ from torchvision.transforms import Lambda, Compose
+ from torchvision.transforms._transforms_video import RandomCropVideo
+
+ import sys
+ sys.path.append(".")
+ from opensora.models.ae import VQVAEModel
+
+
+ def array_to_video(image_array: npt.NDArray, fps: float = 30.0, output_file: str = 'output_video.mp4') -> None:
+     height, width, channels = image_array[0].shape
+     fourcc = cv2.VideoWriter_fourcc(*'mp4v')  # type: ignore
+     video_writer = cv2.VideoWriter(output_file, fourcc, float(fps), (width, height))
+
+     for image in image_array:
+         image_rgb = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
+         video_writer.write(image_rgb)
+
+     video_writer.release()
+
+ def custom_to_video(x: torch.Tensor, fps: float = 2.0, output_file: str = 'output_video.mp4') -> None:
+     x = x.detach().cpu()
+     x = torch.clamp(x, -0.5, 0.5)
+     x = (x + 0.5)
+     x = x.permute(1, 2, 3, 0).numpy()  # (C, T, H, W) -> (T, H, W, C)
+     x = (255 * x).astype(np.uint8)
+     # array_to_video(x, fps=fps, output_file=output_file)
+     imageio.mimwrite(output_file, x, fps=fps, quality=9)
+     return
+
+ def read_video(video_path: str, num_frames: int, sample_rate: int) -> torch.Tensor:
+     decord_vr = VideoReader(video_path, ctx=cpu(0))
+     total_frames = len(decord_vr)
+     sample_frames_len = sample_rate * num_frames
+
+     if total_frames > sample_frames_len:
+         s = random.randint(0, total_frames - sample_frames_len - 1)
+         e = s + sample_frames_len
+         num_frames = num_frames
+     else:
+         s = 0
+         e = total_frames
+         num_frames = int(total_frames / sample_frames_len * num_frames)
+         print(f'sample_frames_len {sample_frames_len}, only can sample {num_frames * sample_rate}', video_path,
+               total_frames)
+
+     frame_id_list = np.linspace(s, e - 1, num_frames, dtype=int)
+     video_data = decord_vr.get_batch(frame_id_list).asnumpy()
+     video_data = torch.from_numpy(video_data)
+     video_data = video_data.permute(3, 0, 1, 2)  # (T, H, W, C) -> (C, T, H, W)
+     return video_data
+
+ def preprocess(video_data: torch.Tensor, short_size: int = 128, crop_size: Optional[int] = None) -> torch.Tensor:
+     transform = Compose(
+         [
+             # UniformTemporalSubsample(num_frames),
+             Lambda(lambda x: ((x / 255.0) - 0.5)),
+             # NormalizeVideo(mean=OPENAI_DATASET_MEAN, std=OPENAI_DATASET_STD),
+             ShortSideScale(size=short_size),
+             RandomCropVideo(size=crop_size) if crop_size is not None else Lambda(lambda x: x),
+             # RandomHorizontalFlipVideo(p=0.5),
+         ]
+     )
+
+     video_outputs = transform(video_data)
+     video_outputs = torch.unsqueeze(video_outputs, 0)
+
+     return video_outputs
+
+
+ def main(args: argparse.Namespace):
+     video_path = args.video_path
+     num_frames = args.num_frames
+     resolution = args.resolution
+     crop_size = args.crop_size
+     sample_fps = args.sample_fps
+     sample_rate = args.sample_rate
+     device = torch.device('cuda')
+     if args.ckpt in ['bair_stride4x2x2', 'ucf101_stride4x4x4', 'kinetics_stride4x4x4', 'kinetics_stride2x4x4']:
+         vqvae = VQVAEModel.download_and_load_model(args.ckpt)
+     else:
+         vqvae = VQVAEModel.load_from_checkpoint(args.ckpt)
+     vqvae.eval()
+     vqvae = vqvae.to(device)
+
+     with torch.no_grad():
+         x_vae = preprocess(read_video(video_path, num_frames, sample_rate), resolution, crop_size)
+         x_vae = x_vae.to(device)
+         encodings, embeddings = vqvae.encode(x_vae, include_embeddings=True)
+         video_recon = vqvae.decode(encodings)
+
+     # custom_to_video(x_vae[0], fps=sample_fps/sample_rate, output_file='origin_input.mp4')
+     custom_to_video(video_recon[0], fps=sample_fps/sample_rate, output_file=args.rec_path)
+
+
+ if __name__ == '__main__':
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--video-path', type=str, default='')
+     parser.add_argument('--rec-path', type=str, default='')
+     parser.add_argument('--ckpt', type=str, default='ucf101_stride4x4x4')
+     parser.add_argument('--sample-fps', type=int, default=30)
+     parser.add_argument('--resolution', type=int, default=336)
+     parser.add_argument('--crop-size', type=int, default=None)
+     parser.add_argument('--num-frames', type=int, default=100)
+     parser.add_argument('--sample-rate', type=int, default=1)
+     args = parser.parse_args()
+     main(args)
examples/rec_video_ae.py ADDED
@@ -0,0 +1,120 @@
+ import random
+ import argparse
+ from typing import Optional
+
+ import cv2
+ import imageio
+ import numpy as np
+ import numpy.typing as npt
+ import torch
+ from decord import VideoReader, cpu
+ from torch.nn import functional as F
+ from pytorchvideo.transforms import ShortSideScale
+ from torchvision.transforms import Lambda, Compose
+ from torchvision.transforms._transforms_video import RandomCropVideo
+
+ import sys
+ sys.path.append(".")
+ from opensora.models.ae import VQVAEModel
+
+
+ def array_to_video(image_array: npt.NDArray, fps: float = 30.0, output_file: str = 'output_video.mp4') -> None:
+     height, width, channels = image_array[0].shape
+     fourcc = cv2.VideoWriter_fourcc(*'mp4v')  # type: ignore
+     video_writer = cv2.VideoWriter(output_file, fourcc, float(fps), (width, height))
+
+     for image in image_array:
+         image_rgb = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
+         video_writer.write(image_rgb)
+
+     video_writer.release()
+
+ def custom_to_video(x: torch.Tensor, fps: float = 2.0, output_file: str = 'output_video.mp4') -> None:
+     x = x.detach().cpu()
+     x = torch.clamp(x, -0.5, 0.5)
+     x = (x + 0.5)
+     x = x.permute(1, 2, 3, 0).numpy()  # (C, T, H, W) -> (T, H, W, C)
+     x = (255 * x).astype(np.uint8)
+     # array_to_video(x, fps=fps, output_file=output_file)
+     imageio.mimwrite(output_file, x, fps=fps, quality=9)
+     return
+
+ def read_video(video_path: str, num_frames: int, sample_rate: int) -> torch.Tensor:
+     decord_vr = VideoReader(video_path, ctx=cpu(0))
+     total_frames = len(decord_vr)
+     sample_frames_len = sample_rate * num_frames
+
+     if total_frames > sample_frames_len:
+         s = random.randint(0, total_frames - sample_frames_len - 1)
+         e = s + sample_frames_len
+         num_frames = num_frames
+     else:
+         s = 0
+         e = total_frames
+         num_frames = int(total_frames / sample_frames_len * num_frames)
+         print(f'sample_frames_len {sample_frames_len}, only can sample {num_frames * sample_rate}', video_path,
+               total_frames)
+
+     frame_id_list = np.linspace(s, e - 1, num_frames, dtype=int)
+     video_data = decord_vr.get_batch(frame_id_list).asnumpy()
+     video_data = torch.from_numpy(video_data)
+     video_data = video_data.permute(3, 0, 1, 2)  # (T, H, W, C) -> (C, T, H, W)
+     return video_data
+
+ def preprocess(video_data: torch.Tensor, short_size: int = 128, crop_size: Optional[int] = None) -> torch.Tensor:
+     transform = Compose(
+         [
+             # UniformTemporalSubsample(num_frames),
+             Lambda(lambda x: ((x / 255.0) - 0.5)),
+             # NormalizeVideo(mean=OPENAI_DATASET_MEAN, std=OPENAI_DATASET_STD),
+             ShortSideScale(size=short_size),
+             RandomCropVideo(size=crop_size) if crop_size is not None else Lambda(lambda x: x),
+             # RandomHorizontalFlipVideo(p=0.5),
+         ]
+     )
+
+     video_outputs = transform(video_data)
+     video_outputs = torch.unsqueeze(video_outputs, 0)
+
+     return video_outputs
+
+
+ def main(args: argparse.Namespace):
+     video_path = args.video_path
+     num_frames = args.num_frames
+     resolution = args.resolution
+     crop_size = args.crop_size
+     sample_fps = args.sample_fps
+     sample_rate = args.sample_rate
+     device = torch.device('cuda')
+     if args.ckpt in ['bair_stride4x2x2', 'ucf101_stride4x4x4', 'kinetics_stride4x4x4', 'kinetics_stride2x4x4']:
+         vqvae = VQVAEModel.download_and_load_model(args.ckpt)
+     else:
+         vqvae = VQVAEModel.load_from_checkpoint(args.ckpt)
+     vqvae.eval()
+     vqvae = vqvae.to(device)
+
+     with torch.no_grad():
+         x_vae = preprocess(read_video(video_path, num_frames, sample_rate), resolution, crop_size)
+         x_vae = x_vae.to(device)
+         encodings, embeddings = vqvae.encode(x_vae, include_embeddings=True)
+         video_recon = vqvae.decode(encodings)
+
+     # custom_to_video(x_vae[0], fps=sample_fps/sample_rate, output_file='origin_input.mp4')
+     custom_to_video(video_recon[0], fps=sample_fps/sample_rate, output_file=args.rec_path)
+
+
+ if __name__ == '__main__':
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--video-path', type=str, default='')
+     parser.add_argument('--rec-path', type=str, default='')
+     parser.add_argument('--ckpt', type=str, default='ucf101_stride4x4x4')
+     parser.add_argument('--sample-fps', type=int, default=30)
+     parser.add_argument('--resolution', type=int, default=336)
+     parser.add_argument('--crop-size', type=int, default=None)
+     parser.add_argument('--num-frames', type=int, default=100)
+     parser.add_argument('--sample-rate', type=int, default=1)
+     args = parser.parse_args()
+     main(args)
examples/rec_video_vae.py ADDED
@@ -0,0 +1,274 @@
+ import random
+ import argparse
+ import cv2
+ from tqdm import tqdm
+ import numpy as np
+ import numpy.typing as npt
+ import torch
+ from decord import VideoReader, cpu
+ from torch.nn import functional as F
+ from pytorchvideo.transforms import ShortSideScale
+ from torchvision.transforms import Lambda, Compose
+ from torchvision.transforms._transforms_video import CenterCropVideo
+ import sys
+ from torch.utils.data import Dataset, DataLoader, Subset
+ import os
+
+ sys.path.append(".")
+ from opensora.models.ae.videobase import CausalVAEModel
+ import torch.nn as nn
+
+ def array_to_video(
+     image_array: npt.NDArray, fps: float = 30.0, output_file: str = "output_video.mp4"
+ ) -> None:
+     height, width, channels = image_array[0].shape
+     fourcc = cv2.VideoWriter_fourcc(*"mp4v")
+     video_writer = cv2.VideoWriter(output_file, fourcc, float(fps), (width, height))
+
+     for image in image_array:
+         image_rgb = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
+         video_writer.write(image_rgb)
+
+     video_writer.release()
+
+
+ def custom_to_video(
+     x: torch.Tensor, fps: float = 2.0, output_file: str = "output_video.mp4"
+ ) -> None:
+     x = x.detach().cpu()
+     x = torch.clamp(x, -1, 1)
+     x = (x + 1) / 2
+     x = x.permute(1, 2, 3, 0).float().numpy()
+     x = (255 * x).astype(np.uint8)
+     array_to_video(x, fps=fps, output_file=output_file)
+     return
+
+
+ def read_video(video_path: str, num_frames: int, sample_rate: int) -> torch.Tensor:
+     decord_vr = VideoReader(video_path, ctx=cpu(0), num_threads=8)
+     total_frames = len(decord_vr)
+     sample_frames_len = sample_rate * num_frames
+
+     if total_frames > sample_frames_len:
+         s = 0
+         e = s + sample_frames_len
+         num_frames = num_frames
+     else:
+         s = 0
+         e = total_frames
+         num_frames = int(total_frames / sample_frames_len * num_frames)
+         print(
+             f"sample_frames_len {sample_frames_len}, only can sample {num_frames * sample_rate}",
+             video_path,
+             total_frames,
+         )
+
+     frame_id_list = np.linspace(s, e - 1, num_frames, dtype=int)
+     video_data = decord_vr.get_batch(frame_id_list).asnumpy()
+     video_data = torch.from_numpy(video_data)
+     video_data = video_data.permute(3, 0, 1, 2)  # (T, H, W, C) -> (C, T, H, W)
+     return video_data
+
+
+ class RealVideoDataset(Dataset):
+     def __init__(
+         self,
+         real_video_dir,
+         num_frames,
+         sample_rate=1,
+         crop_size=None,
+         resolution=128,
+     ) -> None:
+         super().__init__()
+         self.real_video_files = self._combine_without_prefix(real_video_dir)
+         self.num_frames = num_frames
+         self.sample_rate = sample_rate
+         self.crop_size = crop_size
+         self.short_size = resolution
+
+     def __len__(self):
+         return len(self.real_video_files)
+
+     def __getitem__(self, index):
+         if index >= len(self):
+             raise IndexError
+         real_video_file = self.real_video_files[index]
+         real_video_tensor = self._load_video(real_video_file)
+         video_name = os.path.basename(real_video_file)
+         return {'video': real_video_tensor, 'file_name': video_name}
+
+     def _load_video(self, video_path):
+         num_frames = self.num_frames
+         sample_rate = self.sample_rate
+         decord_vr = VideoReader(video_path, ctx=cpu(0))
+         total_frames = len(decord_vr)
+         sample_frames_len = sample_rate * num_frames
+
+         if total_frames > sample_frames_len:
+             s = 0
+             e = s + sample_frames_len
+             num_frames = num_frames
+         else:
+             s = 0
+             e = total_frames
+             num_frames = int(total_frames / sample_frames_len * num_frames)
+             print(
+                 f"sample_frames_len {sample_frames_len}, only can sample {num_frames * sample_rate}",
+                 video_path,
+                 total_frames,
+             )
+
+         frame_id_list = np.linspace(s, e - 1, num_frames, dtype=int)
+         video_data = decord_vr.get_batch(frame_id_list).asnumpy()
+         video_data = torch.from_numpy(video_data)
+         video_data = video_data.permute(3, 0, 1, 2)
+         return _preprocess(
+             video_data, short_size=self.short_size, crop_size=self.crop_size
+         )
+
+     def _combine_without_prefix(self, folder_path, prefix="."):
+         folder = []
+         for name in os.listdir(folder_path):
+             if name[0] == prefix:
+                 continue
+             folder.append(os.path.join(folder_path, name))
+         folder.sort()
+         return folder
+
+
+ def resize(x, resolution):
+     height, width = x.shape[-2:]
+     aspect_ratio = width / height
+     if width <= height:
+         new_width = resolution
+         new_height = int(resolution / aspect_ratio)
+     else:
+         new_height = resolution
+         new_width = int(resolution * aspect_ratio)
+     resized_x = F.interpolate(x, size=(new_height, new_width), mode='bilinear', align_corners=True, antialias=True)
+     return resized_x
+
+
+ def _preprocess(video_data, short_size=128, crop_size=None):
+     transform = Compose(
+         [
+             Lambda(lambda x: ((x / 255.0) * 2 - 1)),
+             Lambda(lambda x: resize(x, short_size)),
+             (
+                 CenterCropVideo(crop_size=crop_size)
+                 if crop_size is not None
+                 else Lambda(lambda x: x)
+             ),
+         ]
+     )
+     video_outputs = transform(video_data)
+     video_outputs = _format_video_shape(video_outputs)
+     return video_outputs
+
+
+ def _format_video_shape(video, time_compress=4, spatial_compress=8):
+     time = video.shape[1]
+     height = video.shape[2]
+     width = video.shape[3]
+     new_time = (
+         (time - (time - 1) % time_compress)
+         if (time - 1) % time_compress != 0
+         else time
+     )
+     new_height = (
+         (height - (height) % spatial_compress)
+         if height % spatial_compress != 0
+         else height
+     )
+     new_width = (
+         (width - (width) % spatial_compress) if width % spatial_compress != 0 else width
+     )
+     return video[:, :new_time, :new_height, :new_width]
+
+
+ @torch.no_grad()
+ def main(args: argparse.Namespace):
+     real_video_dir = args.real_video_dir
+     generated_video_dir = args.generated_video_dir
+     ckpt = args.ckpt
+     sample_rate = args.sample_rate
+     resolution = args.resolution
+     crop_size = args.crop_size
+     num_frames = args.num_frames
+     sample_rate = args.sample_rate
+     device = args.device
+     sample_fps = args.sample_fps
+     batch_size = args.batch_size
+     num_workers = args.num_workers
+     subset_size = args.subset_size
+
+     if not os.path.exists(args.generated_video_dir):
+         os.makedirs(args.generated_video_dir, exist_ok=True)
+
+     data_type = torch.bfloat16
+
+     # ---- Load Model ----
+     device = args.device
+     vqvae = CausalVAEModel.from_pretrained(args.ckpt)
+     vqvae = vqvae.to(device).to(data_type)
+     if args.enable_tiling:
+         vqvae.enable_tiling()
+         vqvae.tile_overlap_factor = args.tile_overlap_factor
+     # ---- Load Model ----
+
+     # ---- Prepare Dataset ----
+     dataset = RealVideoDataset(
+         real_video_dir=real_video_dir,
+         num_frames=num_frames,
+         sample_rate=sample_rate,
+         crop_size=crop_size,
+         resolution=resolution,
+     )
+
+     if subset_size:
+         indices = range(subset_size)
+         dataset = Subset(dataset, indices=indices)
+
+     dataloader = DataLoader(
+         dataset, batch_size=batch_size, pin_memory=True, num_workers=num_workers
+     )
+     # ---- Prepare Dataset
+
+     # ---- Inference ----
+     for batch in tqdm(dataloader):
+         x, file_names = batch['video'], batch['file_name']
+         x = x.to(device=device, dtype=data_type)  # b c t h w
+         latents = vqvae.encode(x).sample().to(data_type)
+         video_recon = vqvae.decode(latents)
+         for idx, video in enumerate(video_recon):
+             output_path = os.path.join(generated_video_dir, file_names[idx])
+             if args.output_origin:
+                 os.makedirs(os.path.join(generated_video_dir, "origin/"), exist_ok=True)
+                 origin_output_path = os.path.join(generated_video_dir, "origin/", file_names[idx])
+                 custom_to_video(
+                     x[idx], fps=sample_fps / sample_rate, output_file=origin_output_path
+                 )
+             custom_to_video(
+                 video, fps=sample_fps / sample_rate, output_file=output_path
+             )
+     # ---- Inference ----
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--real_video_dir", type=str, default="")
+     parser.add_argument("--generated_video_dir", type=str, default="")
+     parser.add_argument("--ckpt", type=str, default="")
+     parser.add_argument("--sample_fps", type=int, default=30)
+     parser.add_argument("--resolution", type=int, default=336)
+     parser.add_argument("--crop_size", type=int, default=None)
+     parser.add_argument("--num_frames", type=int, default=17)
+     parser.add_argument("--sample_rate", type=int, default=1)
+     parser.add_argument("--batch_size", type=int, default=1)
+     parser.add_argument("--num_workers", type=int, default=8)
+     parser.add_argument("--subset_size", type=int, default=None)
+     parser.add_argument("--tile_overlap_factor", type=float, default=0.25)
+     parser.add_argument('--enable_tiling', action='store_true')
+     parser.add_argument('--output_origin', action='store_true')
+     parser.add_argument("--device", type=str, default="cuda")
271
+
272
+ args = parser.parse_args()
273
+ main(args)
274
+
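For orientation, here is a minimal sketch of exercising the RealVideoDataset class above on its own (run with the definitions in this script in scope; the directory path and sizes are placeholders, not values shipped with this commit):

    from torch.utils.data import DataLoader

    dataset = RealVideoDataset(
        real_video_dir="/path/to/real_videos",  # placeholder: a folder of .mp4 clips
        num_frames=17,
        sample_rate=1,
        crop_size=256,
        resolution=336,
    )
    loader = DataLoader(dataset, batch_size=1, num_workers=8, pin_memory=True)
    batch = next(iter(loader))
    # batch['video']: (B, C, T, H, W) float tensor in [-1, 1], trimmed by _format_video_shape
    # batch['file_name']: list of file names, reused above to name the reconstructions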
opensora/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ #
opensora/dataset/__init__.py ADDED
@@ -0,0 +1,99 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from torchvision.transforms import Compose
2
+ from transformers import AutoTokenizer
3
+
4
+ from .feature_datasets import T2V_Feature_dataset, T2V_T5_Feature_dataset
5
+ from torchvision import transforms
6
+ from torchvision.transforms import Lambda
7
+
8
+ from .landscope import Landscope
9
+ from .t2v_datasets import T2V_dataset
10
+ from .transform import ToTensorVideo, TemporalRandomCrop, RandomHorizontalFlipVideo, CenterCropResizeVideo
11
+ from .ucf101 import UCF101
12
+ from .sky_datasets import Sky
13
+
14
+ ae_norm = {
15
+ 'CausalVAEModel_4x8x8': Lambda(lambda x: 2. * x - 1.),
16
+ 'CausalVQVAEModel_4x4x4': Lambda(lambda x: x - 0.5),
17
+ 'CausalVQVAEModel_4x8x8': Lambda(lambda x: x - 0.5),
18
+ 'VQVAEModel_4x4x4': Lambda(lambda x: x - 0.5),
19
+ 'VQVAEModel_4x8x8': Lambda(lambda x: x - 0.5),
20
+ "bair_stride4x2x2": Lambda(lambda x: x - 0.5),
21
+ "ucf101_stride4x4x4": Lambda(lambda x: x - 0.5),
22
+ "kinetics_stride4x4x4": Lambda(lambda x: x - 0.5),
23
+ "kinetics_stride2x4x4": Lambda(lambda x: x - 0.5),
24
+ 'stabilityai/sd-vae-ft-mse': transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5], inplace=True),
25
+ 'stabilityai/sd-vae-ft-ema': transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5], inplace=True),
26
+ 'vqgan_imagenet_f16_1024': Lambda(lambda x: 2. * x - 1.),
27
+ 'vqgan_imagenet_f16_16384': Lambda(lambda x: 2. * x - 1.),
28
+ 'vqgan_gumbel_f8': Lambda(lambda x: 2. * x - 1.),
29
+
30
+ }
31
+ ae_denorm = {
32
+ 'CausalVAEModel_4x8x8': lambda x: (x + 1.) / 2.,
33
+ 'CausalVQVAEModel_4x4x4': lambda x: x + 0.5,
34
+ 'CausalVQVAEModel_4x8x8': lambda x: x + 0.5,
35
+ 'VQVAEModel_4x4x4': lambda x: x + 0.5,
36
+ 'VQVAEModel_4x8x8': lambda x: x + 0.5,
37
+ "bair_stride4x2x2": lambda x: x + 0.5,
38
+ "ucf101_stride4x4x4": lambda x: x + 0.5,
39
+ "kinetics_stride4x4x4": lambda x: x + 0.5,
40
+ "kinetics_stride2x4x4": lambda x: x + 0.5,
41
+ 'stabilityai/sd-vae-ft-mse': lambda x: 0.5 * x + 0.5,
42
+ 'stabilityai/sd-vae-ft-ema': lambda x: 0.5 * x + 0.5,
43
+ 'vqgan_imagenet_f16_1024': lambda x: (x + 1.) / 2.,
44
+ 'vqgan_imagenet_f16_16384': lambda x: (x + 1.) / 2.,
45
+ 'vqgan_gumbel_f8': lambda x: (x + 1.) / 2.,
46
+ }
47
+
48
+ def getdataset(args):
49
+ temporal_sample = TemporalRandomCrop(args.num_frames * args.sample_rate) # 16 x
50
+ norm_fun = ae_norm[args.ae]
51
+ if args.dataset == 'ucf101':
52
+ transform = Compose(
53
+ [
54
+ ToTensorVideo(), # TCHW
55
+ CenterCropResizeVideo(size=args.max_image_size),
56
+ RandomHorizontalFlipVideo(p=0.5),
57
+ norm_fun,
58
+ ]
59
+ )
60
+ return UCF101(args, transform=transform, temporal_sample=temporal_sample)
61
+ if args.dataset == 'landscope':
62
+ transform = Compose(
63
+ [
64
+ ToTensorVideo(), # TCHW
65
+ CenterCropResizeVideo(size=args.max_image_size),
66
+ RandomHorizontalFlipVideo(p=0.5),
67
+ norm_fun,
68
+ ]
69
+ )
70
+ return Landscope(args, transform=transform, temporal_sample=temporal_sample)
71
+ elif args.dataset == 'sky':
72
+ transform = transforms.Compose([
73
+ ToTensorVideo(),
74
+ CenterCropResizeVideo(args.max_image_size),
75
+ RandomHorizontalFlipVideo(p=0.5),
76
+ norm_fun
77
+ ])
78
+ return Sky(args, transform=transform, temporal_sample=temporal_sample)
79
+ elif args.dataset == 't2v':
80
+ transform = transforms.Compose([
81
+ ToTensorVideo(),
82
+ CenterCropResizeVideo(args.max_image_size),
83
+ RandomHorizontalFlipVideo(p=0.5),
84
+ norm_fun
85
+ ])
86
+ tokenizer = AutoTokenizer.from_pretrained(args.text_encoder_name, cache_dir='./cache_dir')
87
+ return T2V_dataset(args, transform=transform, temporal_sample=temporal_sample, tokenizer=tokenizer)
88
+ elif args.dataset == 't2v_feature':
89
+ return T2V_Feature_dataset(args, temporal_sample)
90
+ elif args.dataset == 't2v_t5_feature':
91
+ transform = transforms.Compose([
92
+ ToTensorVideo(),
93
+ CenterCropResizeVideo(args.max_image_size),
94
+ RandomHorizontalFlipVideo(p=0.5),
95
+ norm_fun
96
+ ])
97
+ return T2V_T5_Feature_dataset(args, transform, temporal_sample)
98
+ else:
99
+ raise NotImplementedError(args.dataset)
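As a rough usage sketch of getdataset (argument values are illustrative; the real training scripts pass many more options through args):

    from argparse import Namespace
    from opensora.dataset import getdataset

    args = Namespace(
        dataset='ucf101',
        ae='CausalVAEModel_4x8x8',     # picks the Lambda(2. * x - 1.) normalization above
        num_frames=17,
        sample_rate=1,
        max_image_size=256,
        data_path='/path/to/UCF-101',  # placeholder: class folders of .avi clips
    )
    train_set = getdataset(args)
    video, label = train_set[0]        # video: (C, 17, 256, 256) float tensor in [-1, 1]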
opensora/dataset/extract_feature_dataset.py ADDED
@@ -0,0 +1,64 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ from glob import glob
3
+
4
+ import numpy as np
5
+ import torch
6
+ import torchvision
7
+ from PIL import Image
8
+ from torch.utils.data import Dataset
9
+
10
+ from opensora.utils.dataset_utils import DecordInit, is_image_file
11
+
12
+
13
+ class ExtractVideo2Feature(Dataset):
14
+ def __init__(self, args, transform):
15
+ self.data_path = args.data_path
16
+ self.transform = transform
17
+ self.v_decoder = DecordInit()
18
+ self.samples = list(glob(f'{self.data_path}'))
19
+
20
+ def __len__(self):
21
+ return len(self.samples)
22
+
23
+ def __getitem__(self, idx):
24
+ video_path = self.samples[idx]
25
+ video = self.decord_read(video_path)
26
+ video = self.transform(video) # T C H W -> T C H W
27
+ return video, video_path
28
+
29
+ def tv_read(self, path):
30
+ vframes, aframes, info = torchvision.io.read_video(filename=path, pts_unit='sec', output_format='TCHW')
31
+ total_frames = len(vframes)
32
+ frame_indice = list(range(total_frames))
33
+ video = vframes[frame_indice]
34
+ return video
35
+
36
+ def decord_read(self, path):
37
+ decord_vr = self.v_decoder(path)
38
+ total_frames = len(decord_vr)
39
+ frame_indice = list(range(total_frames))
40
+ video_data = decord_vr.get_batch(frame_indice).asnumpy()
41
+ video_data = torch.from_numpy(video_data)
42
+ video_data = video_data.permute(0, 3, 1, 2) # (T, H, W, C) -> (T C H W)
43
+ return video_data
44
+
45
+
46
+
47
+ class ExtractImage2Feature(Dataset):
48
+ def __init__(self, args, transform):
49
+ self.data_path = args.data_path
50
+ self.transform = transform
51
+ self.data_all = list(glob(f'{self.data_path}'))
52
+
53
+ def __len__(self):
54
+ return len(self.data_all)
55
+
56
+ def __getitem__(self, index):
57
+ path = self.data_all[index]
58
+ video_frame = torch.as_tensor(np.array(Image.open(path), dtype=np.uint8, copy=True)).unsqueeze(0)
59
+ video_frame = video_frame.permute(0, 3, 1, 2)
60
+ video_frame = self.transform(video_frame) # T C H W
61
+ # video_frame = video_frame.transpose(0, 1) # T C H W -> C T H W
62
+
63
+ return video_frame, path
64
+
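A small sketch of driving the video variant above; note that data_path is handed to glob verbatim, so it is expected to be a glob pattern rather than a plain directory (the pattern and the identity transform below are placeholders for whatever the extraction script actually builds):

    from argparse import Namespace
    from torchvision.transforms import Lambda
    from opensora.dataset.extract_feature_dataset import ExtractVideo2Feature

    args = Namespace(data_path='/path/to/videos/*.mp4')  # glob pattern, placeholder
    dataset = ExtractVideo2Feature(args, transform=Lambda(lambda x: x))
    video, video_path = dataset[0]  # video: (T, C, H, W) uint8 tensor holding every frame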
opensora/dataset/feature_datasets.py ADDED
@@ -0,0 +1,213 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import json
2
+ import os
3
+ import torch
4
+ import random
5
+ import torch.utils.data as data
6
+
7
+ import numpy as np
8
+ from glob import glob
9
+ from PIL import Image
10
+ from torch.utils.data import Dataset
11
+ from tqdm import tqdm
12
+
13
+ from opensora.dataset.transform import center_crop, RandomCropVideo
14
+ from opensora.utils.dataset_utils import DecordInit
15
+
16
+
17
+ class T2V_Feature_dataset(Dataset):
18
+ def __init__(self, args, temporal_sample):
19
+
20
+ self.video_folder = args.video_folder
21
+ self.num_frames = args.video_length
22
+ self.temporal_sample = temporal_sample
23
+
24
+ print('Building dataset...')
25
+ if os.path.exists('samples_430k.json'):
26
+ with open('samples_430k.json', 'r') as f:
27
+ self.samples = json.load(f)
28
+ else:
29
+ self.samples = self._make_dataset()
30
+ with open('samples_430k.json', 'w') as f:
31
+ json.dump(self.samples, f, indent=2)
32
+
33
+ self.use_image_num = args.use_image_num
34
+ self.use_img_from_vid = args.use_img_from_vid
35
+ if self.use_image_num != 0 and not self.use_img_from_vid:
36
+ self.img_cap_list = self.get_img_cap_list()
37
+
38
+ def _make_dataset(self):
39
+ all_mp4 = list(glob(os.path.join(self.video_folder, '**', '*.mp4'), recursive=True))
40
+ # all_mp4 = all_mp4[:1000]
41
+ samples = []
42
+ for i in tqdm(all_mp4):
43
+ video_id = os.path.basename(i).split('.')[0]
44
+ ae = os.path.split(i)[0].replace('data_split_tt', 'lb_causalvideovae444_feature')
45
+ ae = os.path.join(ae, f'{video_id}_causalvideovae444.npy')
46
+ if not os.path.exists(ae):
47
+ continue
48
+
49
+ t5 = os.path.split(i)[0].replace('data_split_tt', 'lb_t5_feature')
50
+ cond_list = []
51
+ cond_llava = os.path.join(t5, f'{video_id}_t5_llava_fea.npy')
52
+ mask_llava = os.path.join(t5, f'{video_id}_t5_llava_mask.npy')
53
+ if os.path.exists(cond_llava) and os.path.exists(mask_llava):
54
+ llava = dict(cond=cond_llava, mask=mask_llava)
55
+ cond_list.append(llava)
56
+ cond_sharegpt4v = os.path.join(t5, f'{video_id}_t5_sharegpt4v_fea.npy')
57
+ mask_sharegpt4v = os.path.join(t5, f'{video_id}_t5_sharegpt4v_mask.npy')
58
+ if os.path.exists(cond_sharegpt4v) and os.path.exists(mask_sharegpt4v):
59
+ sharegpt4v = dict(cond=cond_sharegpt4v, mask=mask_sharegpt4v)
60
+ cond_list.append(sharegpt4v)
61
+ if len(cond_list) > 0:
62
+ sample = dict(ae=ae, t5=cond_list)
63
+ samples.append(sample)
64
+ return samples
65
+
66
+ def __len__(self):
67
+ return len(self.samples)
68
+
69
+ def __getitem__(self, idx):
70
+ # try:
71
+ sample = self.samples[idx]
72
+ ae, t5 = sample['ae'], sample['t5']
73
+ t5 = random.choice(t5)
74
+ video_origin = np.load(ae)[0] # C T H W
75
+ _, total_frames, _, _ = video_origin.shape
76
+ # Sampling video frames
77
+ start_frame_ind, end_frame_ind = self.temporal_sample(total_frames)
78
+ assert end_frame_ind - start_frame_ind >= self.num_frames
79
+ select_video_idx = np.linspace(start_frame_ind, end_frame_ind - 1, num=self.num_frames, dtype=int) # start, stop, num=50
80
+ # print('select_video_idx', total_frames, select_video_idx)
81
+ video = video_origin[:, select_video_idx] # C num_frames H W
82
+ video = torch.from_numpy(video)
83
+
84
+ cond = torch.from_numpy(np.load(t5['cond']))[0] # L
85
+ cond_mask = torch.from_numpy(np.load(t5['mask']))[0] # L D
86
+
87
+ if self.use_image_num != 0 and self.use_img_from_vid:
88
+ select_image_idx = np.random.randint(0, total_frames, self.use_image_num)
89
+ # print('select_image_idx', total_frames, self.use_image_num, select_image_idx)
90
+ images = video_origin[:, select_image_idx] # c, num_img, h, w
91
+ images = torch.from_numpy(images)
92
+ video = torch.cat([video, images], dim=1) # c, num_frame+num_img, h, w
93
+ cond = torch.stack([cond] * (1+self.use_image_num)) # 1+self.use_image_num, l
94
+ cond_mask = torch.stack([cond_mask] * (1+self.use_image_num)) # 1+self.use_image_num, l
95
+ elif self.use_image_num != 0 and not self.use_img_from_vid:
96
+ images, captions = self.img_cap_list[idx]
97
+ raise NotImplementedError
98
+ else:
99
+ pass
100
+
101
+ return video, cond, cond_mask
102
+ # except Exception as e:
103
+ # print(f'Error with {e}, {sample}')
104
+ # return self.__getitem__(random.randint(0, self.__len__() - 1))
105
+
106
+ def get_img_cap_list(self):
107
+ raise NotImplementedError
108
+
109
+
110
+
111
+
112
+ class T2V_T5_Feature_dataset(Dataset):
113
+ def __init__(self, args, transform, temporal_sample):
114
+
115
+ self.video_folder = args.video_folder
116
+ self.num_frames = args.num_frames
117
+ self.transform = transform
118
+ self.temporal_sample = temporal_sample
119
+ self.v_decoder = DecordInit()
120
+
121
+ print('Building dataset...')
122
+ if os.path.exists('samples_430k.json'):
123
+ with open('samples_430k.json', 'r') as f:
124
+ self.samples = json.load(f)
125
+ self.samples = [dict(ae=i['ae'].replace('lb_causalvideovae444_feature', 'data_split_1024').replace('_causalvideovae444.npy', '.mp4'), t5=i['t5']) for i in self.samples]
126
+ else:
127
+ self.samples = self._make_dataset()
128
+ with open('samples_430k.json', 'w') as f:
129
+ json.dump(self.samples, f, indent=2)
130
+
131
+ self.use_image_num = args.use_image_num
132
+ self.use_img_from_vid = args.use_img_from_vid
133
+ if self.use_image_num != 0 and not self.use_img_from_vid:
134
+ self.img_cap_list = self.get_img_cap_list()
135
+
136
+ def _make_dataset(self):
137
+ all_mp4 = list(glob(os.path.join(self.video_folder, '**', '*.mp4'), recursive=True))
138
+ # all_mp4 = all_mp4[:1000]
139
+ samples = []
140
+ for i in tqdm(all_mp4):
141
+ video_id = os.path.basename(i).split('.')[0]
142
+ # ae = os.path.split(i)[0].replace('data_split', 'lb_causalvideovae444_feature')
143
+ # ae = os.path.join(ae, f'{video_id}_causalvideovae444.npy')
144
+ ae = i
145
+ if not os.path.exists(ae):
146
+ continue
147
+
148
+ t5 = os.path.split(i)[0].replace('data_split_1024', 'lb_t5_feature')
149
+ cond_list = []
150
+ cond_llava = os.path.join(t5, f'{video_id}_t5_llava_fea.npy')
151
+ mask_llava = os.path.join(t5, f'{video_id}_t5_llava_mask.npy')
152
+ if os.path.exists(cond_llava) and os.path.exists(mask_llava):
153
+ llava = dict(cond=cond_llava, mask=mask_llava)
154
+ cond_list.append(llava)
155
+ cond_sharegpt4v = os.path.join(t5, f'{video_id}_t5_sharegpt4v_fea.npy')
156
+ mask_sharegpt4v = os.path.join(t5, f'{video_id}_t5_sharegpt4v_mask.npy')
157
+ if os.path.exists(cond_sharegpt4v) and os.path.exists(mask_sharegpt4v):
158
+ sharegpt4v = dict(cond=cond_sharegpt4v, mask=mask_sharegpt4v)
159
+ cond_list.append(sharegpt4v)
160
+ if len(cond_list) > 0:
161
+ sample = dict(ae=ae, t5=cond_list)
162
+ samples.append(sample)
163
+ return samples
164
+
165
+ def __len__(self):
166
+ return len(self.samples)
167
+
168
+ def __getitem__(self, idx):
169
+ try:
170
+ sample = self.samples[idx]
171
+ ae, t5 = sample['ae'], sample['t5']
172
+ t5 = random.choice(t5)
173
+
174
+ video = self.decord_read(ae)
175
+ video = self.transform(video) # T C H W -> T C H W
176
+ video = video.transpose(0, 1) # T C H W -> C T H W
177
+ total_frames = video.shape[1]
178
+ cond = torch.from_numpy(np.load(t5['cond']))[0] # L
179
+ cond_mask = torch.from_numpy(np.load(t5['mask']))[0] # L D
180
+
181
+ if self.use_image_num != 0 and self.use_img_from_vid:
182
+ select_image_idx = np.random.randint(0, total_frames, self.use_image_num)
183
+ # print('select_image_idx', total_frames, self.use_image_num, select_image_idx)
184
+ images = video.numpy()[:, select_image_idx] # c, num_img, h, w
185
+ images = torch.from_numpy(images)
186
+ video = torch.cat([video, images], dim=1) # c, num_frame+num_img, h, w
187
+ cond = torch.stack([cond] * (1+self.use_image_num)) # 1+self.use_image_num, l
188
+ cond_mask = torch.stack([cond_mask] * (1+self.use_image_num)) # 1+self.use_image_num, l
189
+ elif self.use_image_num != 0 and not self.use_img_from_vid:
190
+ images, captions = self.img_cap_list[idx]
191
+ raise NotImplementedError
192
+ else:
193
+ pass
194
+
195
+ return video, cond, cond_mask
196
+ except Exception as e:
197
+ print(f'Error with {e}, {sample}')
198
+ return self.__getitem__(random.randint(0, self.__len__() - 1))
199
+
200
+ def decord_read(self, path):
201
+ decord_vr = self.v_decoder(path)
202
+ total_frames = len(decord_vr)
203
+ # Sampling video frames
204
+ start_frame_ind, end_frame_ind = self.temporal_sample(total_frames)
205
+ # assert end_frame_ind - start_frame_ind >= self.num_frames
206
+ frame_indice = np.linspace(start_frame_ind, end_frame_ind - 1, self.num_frames, dtype=int)
207
+ video_data = decord_vr.get_batch(frame_indice).asnumpy()
208
+ video_data = torch.from_numpy(video_data)
209
+ video_data = video_data.permute(0, 3, 1, 2) # (T, H, W, C) -> (T C H W)
210
+ return video_data
211
+
212
+ def get_img_cap_list(self):
213
+ raise NotImplementedError
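For orientation, a hedged sketch of constructing the T5-feature dataset above, together with the on-disk pairing that _make_dataset looks for (paths are illustrative; only the hard-coded data_split_1024 / lb_t5_feature substrings and the *_t5_llava_fea.npy / *_t5_llava_mask.npy style suffixes matter). Note that the sample list is cached to samples_430k.json in the working directory on first use:

    # Expected layout (illustrative):
    #   /data/data_split_1024/part0/clip_0001.mp4
    #   /data/lb_t5_feature/part0/clip_0001_t5_llava_fea.npy    # T5 text features
    #   /data/lb_t5_feature/part0/clip_0001_t5_llava_mask.npy   # matching attention mask
    from argparse import Namespace
    from torchvision import transforms
    from opensora.dataset.feature_datasets import T2V_T5_Feature_dataset
    from opensora.dataset.transform import ToTensorVideo, CenterCropResizeVideo, TemporalRandomCrop

    args = Namespace(video_folder='/data/data_split_1024', num_frames=17,
                     use_image_num=0, use_img_from_vid=False)
    transform = transforms.Compose([ToTensorVideo(), CenterCropResizeVideo(256)])
    dataset = T2V_T5_Feature_dataset(args, transform, TemporalRandomCrop(17))
    video, cond, cond_mask = dataset[0]  # (C, 17, 256, 256) video, T5 features, attention mask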
opensora/dataset/landscope.py ADDED
@@ -0,0 +1,90 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import math
2
+ import os
3
+ from glob import glob
4
+
5
+ import decord
6
+ import numpy as np
7
+ import torch
8
+ import torchvision
9
+ from decord import VideoReader, cpu
10
+ from torch.utils.data import Dataset
11
+ from torchvision.transforms import Compose, Lambda, ToTensor
12
+ from torchvision.transforms._transforms_video import NormalizeVideo, RandomCropVideo, RandomHorizontalFlipVideo
13
+ from pytorchvideo.transforms import ApplyTransformToKey, ShortSideScale, UniformTemporalSubsample
14
+ from torch.nn import functional as F
15
+ import random
16
+
17
+ from opensora.utils.dataset_utils import DecordInit
18
+
19
+
20
+ class Landscope(Dataset):
21
+ def __init__(self, args, transform, temporal_sample):
22
+ self.data_path = args.data_path
23
+ self.num_frames = args.num_frames
24
+ self.transform = transform
25
+ self.temporal_sample = temporal_sample
26
+ self.v_decoder = DecordInit()
27
+
28
+ self.samples = self._make_dataset()
29
+ self.use_image_num = args.use_image_num
30
+ self.use_img_from_vid = args.use_img_from_vid
31
+ if self.use_image_num != 0 and not self.use_img_from_vid:
32
+ self.img_cap_list = self.get_img_cap_list()
33
+
34
+
35
+ def _make_dataset(self):
36
+ paths = list(glob(os.path.join(self.data_path, '**', '*.mp4'), recursive=True))
37
+
38
+ return paths
39
+
40
+ def __len__(self):
41
+ return len(self.samples)
42
+
43
+ def __getitem__(self, idx):
44
+ video_path = self.samples[idx]
45
+ try:
46
+ video = self.tv_read(video_path)
47
+ video = self.transform(video) # T C H W -> T C H W
48
+ video = video.transpose(0, 1) # T C H W -> C T H W
49
+ if self.use_image_num != 0 and self.use_img_from_vid:
50
+ select_image_idx = np.linspace(0, self.num_frames - 1, self.use_image_num, dtype=int)
51
+ assert self.num_frames >= self.use_image_num
52
+ images = video[:, select_image_idx] # c, num_img, h, w
53
+ video = torch.cat([video, images], dim=1) # c, num_frame+num_img, h, w
54
+ elif self.use_image_num != 0 and not self.use_img_from_vid:
55
+ images, captions = self.img_cap_list[idx]
56
+ raise NotImplementedError
57
+ else:
58
+ pass
59
+ return video, 1
60
+ except Exception as e:
61
+ print(f'Error with {e}, {video_path}')
62
+ return self.__getitem__(random.randint(0, self.__len__()-1))
63
+
64
+ def tv_read(self, path):
65
+ vframes, aframes, info = torchvision.io.read_video(filename=path, pts_unit='sec', output_format='TCHW')
66
+ total_frames = len(vframes)
67
+
68
+ # Sampling video frames
69
+ start_frame_ind, end_frame_ind = self.temporal_sample(total_frames)
70
+ # assert end_frame_ind - start_frame_ind >= self.num_frames
71
+ frame_indice = np.linspace(start_frame_ind, end_frame_ind - 1, self.num_frames, dtype=int)
72
+ video = vframes[frame_indice] # (T, C, H, W)
73
+
74
+ return video
75
+
76
+ def decord_read(self, path):
77
+ decord_vr = self.v_decoder(path)
78
+ total_frames = len(decord_vr)
79
+ # Sampling video frames
80
+ start_frame_ind, end_frame_ind = self.temporal_sample(total_frames)
81
+ # assert end_frame_ind - start_frame_ind >= self.num_frames
82
+ frame_indice = np.linspace(start_frame_ind, end_frame_ind - 1, self.num_frames, dtype=int)
83
+
84
+ video_data = decord_vr.get_batch(frame_indice).asnumpy()
85
+ video_data = torch.from_numpy(video_data)
86
+ video_data = video_data.permute(0, 3, 1, 2) # (T, H, W, C) -> (T C H W)
87
+ return video_data
88
+
89
+ def get_img_cap_list(self):
90
+ raise NotImplementedError
opensora/dataset/sky_datasets.py ADDED
@@ -0,0 +1,128 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import torch
3
+ import random
4
+ import torch.utils.data as data
5
+
6
+ import numpy as np
7
+
8
+ from PIL import Image
9
+
10
+ from opensora.utils.dataset_utils import is_image_file
11
+
12
+
13
+ class Sky(data.Dataset):
14
+ def __init__(self, args, transform, temporal_sample=None, train=True):
15
+
16
+ self.args = args
17
+ self.data_path = args.data_path
18
+ self.transform = transform
19
+ self.temporal_sample = temporal_sample
20
+ self.num_frames = self.args.num_frames
21
+ self.sample_rate = self.args.sample_rate
22
+ self.data_all = self.load_video_frames(self.data_path)
23
+ self.use_image_num = args.use_image_num
24
+ self.use_img_from_vid = args.use_img_from_vid
25
+ if self.use_image_num != 0 and not self.use_img_from_vid:
26
+ self.img_cap_list = self.get_img_cap_list()
27
+
28
+ def __getitem__(self, index):
29
+
30
+ vframes = self.data_all[index]
31
+ total_frames = len(vframes)
32
+
33
+ # Sampling video frames
34
+ start_frame_ind, end_frame_ind = self.temporal_sample(total_frames)
35
+ assert end_frame_ind - start_frame_ind >= self.num_frames
36
+ frame_indice = np.linspace(start_frame_ind, end_frame_ind-1, num=self.num_frames, dtype=int) # start, stop, num=50
37
+
38
+ select_video_frames = vframes[frame_indice[0]: frame_indice[-1]+1: self.sample_rate]
39
+
40
+ video_frames = []
41
+ for path in select_video_frames:
42
+ video_frame = torch.as_tensor(np.array(Image.open(path), dtype=np.uint8, copy=True)).unsqueeze(0)
43
+ video_frames.append(video_frame)
44
+ video_clip = torch.cat(video_frames, dim=0).permute(0, 3, 1, 2)
45
+ video_clip = self.transform(video_clip)
46
+ video_clip = video_clip.transpose(0, 1) # T C H W -> C T H W
47
+
48
+ if self.use_image_num != 0 and self.use_img_from_vid:
49
+ select_image_idx = np.linspace(0, self.num_frames - 1, self.use_image_num, dtype=int)
50
+ assert self.num_frames >= self.use_image_num
51
+ images = video_clip[:, select_image_idx] # c, num_img, h, w
52
+ video_clip = torch.cat([video_clip, images], dim=1) # c, num_frame+num_img, h, w
53
+ elif self.use_image_num != 0 and not self.use_img_from_vid:
54
+ images, captions = self.img_cap_list[index]
55
+ raise NotImplementedError
56
+ else:
57
+ pass
58
+
59
+ return video_clip, 1
60
+
61
+ def __len__(self):
62
+ return self.video_num
63
+
64
+ def load_video_frames(self, dataroot):
65
+ data_all = []
66
+ frame_list = os.walk(dataroot)
67
+ for _, meta in enumerate(frame_list):
68
+ root = meta[0]
69
+ try:
70
+ frames = [i for i in meta[2] if is_image_file(i)]
71
+ frames = sorted(frames, key=lambda item: int(item.split('.')[0].split('_')[-1]))
72
+ except:
73
+ pass
74
+ # print(meta[0]) # root
75
+ # print(meta[2]) # files
76
+ frames = [os.path.join(root, item) for item in frames if is_image_file(item)]
77
+ if len(frames) > max(0, self.num_frames * self.sample_rate): # need all > (16 * frame-interval) videos
78
+ # if len(frames) >= max(0, self.target_video_len): # need all > 16 frames videos
79
+ data_all.append(frames)
80
+ self.video_num = len(data_all)
81
+ return data_all
82
+
83
+ def get_img_cap_list(self):
84
+ raise NotImplementedError
85
+
86
+ if __name__ == '__main__':
87
+
88
+ import argparse
89
+ import torchvision
90
+ import video_transforms
91
+ import torch.utils.data as data
92
+
93
+ from torchvision import transforms
94
+ from torchvision.utils import save_image
95
+
96
+
97
+ parser = argparse.ArgumentParser()
98
+ parser.add_argument("--num_frames", type=int, default=16)
99
+ parser.add_argument("--frame_interval", type=int, default=4)
100
+ parser.add_argument("--data-path", type=str, default="/path/to/datasets/sky_timelapse/sky_train/")
101
+ config = parser.parse_args()
102
+
103
+
104
+ target_video_len = config.num_frames
105
+
106
+ temporal_sample = video_transforms.TemporalRandomCrop(target_video_len * config.frame_interval)
107
+ trans = transforms.Compose([
108
+ video_transforms.ToTensorVideo(),
109
+ # video_transforms.CenterCropVideo(256),
110
+ video_transforms.CenterCropResizeVideo(256),
111
+ # video_transforms.RandomHorizontalFlipVideo(),
112
+ transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5], inplace=True)
113
+ ])
114
+
115
+ taichi_dataset = Sky(config, transform=trans, temporal_sample=temporal_sample)
116
+ print(len(taichi_dataset))
117
+ taichi_dataloader = data.DataLoader(dataset=taichi_dataset, batch_size=1, shuffle=False, num_workers=1)
118
+
119
+ for i, video_data in enumerate(taichi_dataloader):
120
+ print(video_data[0].shape)  # Sky.__getitem__ returns (video_clip, label), not a dict
121
+
122
+ # print(video_data.dtype)
123
+ # for i in range(target_video_len):
124
+ # save_image(video_data[0][i], os.path.join('./test_data', '%04d.png' % i), normalize=True, value_range=(-1, 1))
125
+
126
+ # video_ = ((video_data[0] * 0.5 + 0.5) * 255).add_(0.5).clamp_(0, 255).to(dtype=torch.uint8).cpu().permute(0, 2, 3, 1)
127
+ # torchvision.io.write_video('./test_data' + 'test.mp4', video_, fps=8)
128
+ # exit()
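The Sky loader above walks a tree of pre-extracted frames rather than decoding videos; a sketch of the layout it assumes (folder names are placeholders, but each frame filename must end in an integer index so the sort in load_video_frames works):

    # /path/to/sky_timelapse/sky_train/
    #   clip_0001/ frame_00001.jpg, frame_00002.jpg, ...
    #   clip_0002/ ...
    # Only folders holding more than num_frames * sample_rate frames are kept as samples.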
opensora/dataset/t2v_datasets.py ADDED
@@ -0,0 +1,111 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import json
2
+ import os, io, csv, math, random
3
+ import numpy as np
4
+ import torchvision
5
+ from einops import rearrange
6
+ from decord import VideoReader
7
+
8
+ import torch
9
+ import torchvision.transforms as transforms
10
+ from torch.utils.data.dataset import Dataset
11
+ from tqdm import tqdm
12
+
13
+ from opensora.utils.dataset_utils import DecordInit
14
+ from opensora.utils.utils import text_preprocessing
15
+
16
+
17
+
18
+ class T2V_dataset(Dataset):
19
+ def __init__(self, args, transform, temporal_sample, tokenizer):
20
+
21
+ # with open(args.data_path, 'r') as csvfile:
22
+ # self.samples = list(csv.DictReader(csvfile))
23
+ self.video_folder = args.video_folder
24
+ self.num_frames = args.num_frames
25
+ self.transform = transform
26
+ self.temporal_sample = temporal_sample
27
+ self.tokenizer = tokenizer
28
+ self.model_max_length = args.model_max_length
29
+ self.v_decoder = DecordInit()
30
+
31
+ with open(args.data_path, 'r') as f:
32
+ self.samples = json.load(f)
33
+ self.use_image_num = args.use_image_num
34
+ self.use_img_from_vid = args.use_img_from_vid
35
+ if self.use_image_num != 0 and not self.use_img_from_vid:
36
+ self.img_cap_list = self.get_img_cap_list()
37
+
38
+ def __len__(self):
39
+ return len(self.samples)
40
+
41
+ def __getitem__(self, idx):
42
+ try:
43
+ # video = torch.randn(3, 16, 128, 128)
44
+ # input_ids = torch.ones(1, 120).to(torch.long).squeeze(0)
45
+ # cond_mask = torch.cat([torch.ones(1, 60).to(torch.long), torch.ones(1, 60).to(torch.long)], dim=1).squeeze(0)
46
+ # return video, input_ids, cond_mask
47
+ video_path = self.samples[idx]['path']
48
+ video = self.decord_read(video_path)
49
+ video = self.transform(video) # T C H W -> T C H W
50
+ video = video.transpose(0, 1) # T C H W -> C T H W
51
+ text = self.samples[idx]['cap'][0]
52
+
53
+ text = text_preprocessing(text)
54
+ text_tokens_and_mask = self.tokenizer(
55
+ text,
56
+ max_length=self.model_max_length,
57
+ padding='max_length',
58
+ truncation=True,
59
+ return_attention_mask=True,
60
+ add_special_tokens=True,
61
+ return_tensors='pt'
62
+ )
63
+ input_ids = text_tokens_and_mask['input_ids'].squeeze(0)
64
+ cond_mask = text_tokens_and_mask['attention_mask'].squeeze(0)
65
+
66
+ if self.use_image_num != 0 and self.use_img_from_vid:
67
+ select_image_idx = np.linspace(0, self.num_frames-1, self.use_image_num, dtype=int)
68
+ assert self.num_frames >= self.use_image_num
69
+ images = video[:, select_image_idx] # c, num_img, h, w
70
+ video = torch.cat([video, images], dim=1) # c, num_frame+num_img, h, w
71
+ input_ids = torch.stack([input_ids] * (1+self.use_image_num)) # 1+self.use_image_num, l
72
+ cond_mask = torch.stack([cond_mask] * (1+self.use_image_num)) # 1+self.use_image_num, l
73
+ elif self.use_image_num != 0 and not self.use_img_from_vid:
74
+ images, captions = self.img_cap_list[idx]
75
+ raise NotImplementedError
76
+ else:
77
+ pass
78
+
79
+ return video, input_ids, cond_mask
80
+ except Exception as e:
81
+ print(f'Error with {e}, {self.samples[idx]}')
82
+ return self.__getitem__(random.randint(0, self.__len__() - 1))
83
+
84
+ def tv_read(self, path):
85
+ vframes, aframes, info = torchvision.io.read_video(filename=path, pts_unit='sec', output_format='TCHW')
86
+ total_frames = len(vframes)
87
+
88
+ # Sampling video frames
89
+ start_frame_ind, end_frame_ind = self.temporal_sample(total_frames)
90
+ # assert end_frame_ind - start_frame_ind >= self.num_frames
91
+ frame_indice = np.linspace(start_frame_ind, end_frame_ind - 1, self.num_frames, dtype=int)
92
+
93
+ video = vframes[frame_indice] # (T, C, H, W)
94
+
95
+ return video
96
+
97
+ def decord_read(self, path):
98
+ decord_vr = self.v_decoder(path)
99
+ total_frames = len(decord_vr)
100
+ # Sampling video frames
101
+ start_frame_ind, end_frame_ind = self.temporal_sample(total_frames)
102
+ # assert end_frame_ind - start_frame_ind >= self.num_frames
103
+ frame_indice = np.linspace(start_frame_ind, end_frame_ind - 1, self.num_frames, dtype=int)
104
+
105
+ video_data = decord_vr.get_batch(frame_indice).asnumpy()
106
+ video_data = torch.from_numpy(video_data)
107
+ video_data = video_data.permute(0, 3, 1, 2) # (T, H, W, C) -> (T C H W)
108
+ return video_data
109
+
110
+ def get_img_cap_list(self):
111
+ raise NotImplementedError
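For reference, the annotation file passed as args.data_path to T2V_dataset is a JSON list whose entries carry at least a video path and a list of captions (field names taken from __getitem__ above; the concrete values are made up). Written as Python for consistency with the other sketches:

    import json

    samples = [
        {
            "path": "/data/videos/clip_0001.mp4",  # placeholder video path
            "cap": ["A serene lake surrounded by snow-capped mountains at sunrise."],
        },
    ]
    with open("anno.json", "w") as f:  # hypothetical file name to pass as data_path
        json.dump(samples, f, indent=2)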
opensora/dataset/transform.py ADDED
@@ -0,0 +1,489 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+ import random
3
+ import numbers
4
+ from torchvision.transforms import RandomCrop, RandomResizedCrop
+ import numpy as np
+ from PIL import Image  # numpy and PIL are required by center_crop_arr below
5
+
6
+
7
+ def _is_tensor_video_clip(clip):
8
+ if not torch.is_tensor(clip):
9
+ raise TypeError("clip should be Tensor. Got %s" % type(clip))
10
+
11
+ if not clip.ndimension() == 4:
12
+ raise ValueError("clip should be 4D. Got %dD" % clip.dim())
13
+
14
+ return True
15
+
16
+
17
+ def center_crop_arr(pil_image, image_size):
18
+ """
19
+ Center cropping implementation from ADM.
20
+ https://github.com/openai/guided-diffusion/blob/8fb3ad9197f16bbc40620447b2742e13458d2831/guided_diffusion/image_datasets.py#L126
21
+ """
22
+ while min(*pil_image.size) >= 2 * image_size:
23
+ pil_image = pil_image.resize(
24
+ tuple(x // 2 for x in pil_image.size), resample=Image.BOX
25
+ )
26
+
27
+ scale = image_size / min(*pil_image.size)
28
+ pil_image = pil_image.resize(
29
+ tuple(round(x * scale) for x in pil_image.size), resample=Image.BICUBIC
30
+ )
31
+
32
+ arr = np.array(pil_image)
33
+ crop_y = (arr.shape[0] - image_size) // 2
34
+ crop_x = (arr.shape[1] - image_size) // 2
35
+ return Image.fromarray(arr[crop_y: crop_y + image_size, crop_x: crop_x + image_size])
36
+
37
+
38
+ def crop(clip, i, j, h, w):
39
+ """
40
+ Args:
41
+ clip (torch.tensor): Video clip to be cropped. Size is (T, C, H, W)
42
+ """
43
+ if len(clip.size()) != 4:
44
+ raise ValueError("clip should be a 4D tensor")
45
+ return clip[..., i: i + h, j: j + w]
46
+
47
+
48
+ def resize(clip, target_size, interpolation_mode):
49
+ if len(target_size) != 2:
50
+ raise ValueError(f"target size should be tuple (height, width), instead got {target_size}")
51
+ return torch.nn.functional.interpolate(clip, size=target_size, mode=interpolation_mode, align_corners=True, antialias=True)
52
+
53
+
54
+ def resize_scale(clip, target_size, interpolation_mode):
55
+ if len(target_size) != 2:
56
+ raise ValueError(f"target size should be tuple (height, width), instead got {target_size}")
57
+ H, W = clip.size(-2), clip.size(-1)
58
+ scale_ = target_size[0] / min(H, W)
59
+ return torch.nn.functional.interpolate(clip, scale_factor=scale_, mode=interpolation_mode, align_corners=True, antialias=True)
60
+
61
+
62
+ def resized_crop(clip, i, j, h, w, size, interpolation_mode="bilinear"):
63
+ """
64
+ Do spatial cropping and resizing to the video clip
65
+ Args:
66
+ clip (torch.tensor): Video clip to be cropped. Size is (T, C, H, W)
67
+ i (int): i in (i,j) i.e coordinates of the upper left corner.
68
+ j (int): j in (i,j) i.e coordinates of the upper left corner.
69
+ h (int): Height of the cropped region.
70
+ w (int): Width of the cropped region.
71
+ size (tuple(int, int)): height and width of resized clip
72
+ Returns:
73
+ clip (torch.tensor): Resized and cropped clip. Size is (T, C, H, W)
74
+ """
75
+ if not _is_tensor_video_clip(clip):
76
+ raise ValueError("clip should be a 4D torch.tensor")
77
+ clip = crop(clip, i, j, h, w)
78
+ clip = resize(clip, size, interpolation_mode)
79
+ return clip
80
+
81
+
82
+ def center_crop(clip, crop_size):
83
+ if not _is_tensor_video_clip(clip):
84
+ raise ValueError("clip should be a 4D torch.tensor")
85
+ h, w = clip.size(-2), clip.size(-1)
86
+ th, tw = crop_size
87
+ if h < th or w < tw:
88
+ raise ValueError("height and width must be no smaller than crop_size")
89
+
90
+ i = int(round((h - th) / 2.0))
91
+ j = int(round((w - tw) / 2.0))
92
+ return crop(clip, i, j, th, tw)
93
+
94
+
95
+ def center_crop_using_short_edge(clip):
96
+ if not _is_tensor_video_clip(clip):
97
+ raise ValueError("clip should be a 4D torch.tensor")
98
+ h, w = clip.size(-2), clip.size(-1)
99
+ if h < w:
100
+ th, tw = h, h
101
+ i = 0
102
+ j = int(round((w - tw) / 2.0))
103
+ else:
104
+ th, tw = w, w
105
+ i = int(round((h - th) / 2.0))
106
+ j = 0
107
+ return crop(clip, i, j, th, tw)
108
+
109
+
110
+ def random_shift_crop(clip):
111
+ '''
112
+ Slide along the long edge, with the short edge as crop size
113
+ '''
114
+ if not _is_tensor_video_clip(clip):
115
+ raise ValueError("clip should be a 4D torch.tensor")
116
+ h, w = clip.size(-2), clip.size(-1)
117
+
118
+ if h <= w:
119
+ long_edge = w
120
+ short_edge = h
121
+ else:
122
+ long_edge = h
123
+ short_edge = w
124
+
125
+ th, tw = short_edge, short_edge
126
+
127
+ i = torch.randint(0, h - th + 1, size=(1,)).item()
128
+ j = torch.randint(0, w - tw + 1, size=(1,)).item()
129
+ return crop(clip, i, j, th, tw)
130
+
131
+
132
+ def to_tensor(clip):
133
+ """
134
+ Convert tensor data type from uint8 to float, divide value by 255.0 and
135
+ keep the (T, C, H, W) dimension layout unchanged
136
+ Args:
137
+ clip (torch.tensor, dtype=torch.uint8): Size is (T, C, H, W)
138
+ Return:
139
+ clip (torch.tensor, dtype=torch.float): Size is (T, C, H, W)
140
+ """
141
+ _is_tensor_video_clip(clip)
142
+ if not clip.dtype == torch.uint8:
143
+ raise TypeError("clip tensor should have data type uint8. Got %s" % str(clip.dtype))
144
+ # return clip.float().permute(3, 0, 1, 2) / 255.0
145
+ return clip.float() / 255.0
146
+
147
+
148
+ def normalize(clip, mean, std, inplace=False):
149
+ """
150
+ Args:
151
+ clip (torch.tensor): Video clip to be normalized. Size is (C, T, H, W)
152
+ mean (tuple): pixel RGB mean. Size is (3)
153
+ std (tuple): pixel standard deviation. Size is (3)
154
+ Returns:
155
+ normalized clip (torch.tensor): Size is (C, T, H, W)
156
+ """
157
+ if not _is_tensor_video_clip(clip):
158
+ raise ValueError("clip should be a 4D torch.tensor")
159
+ if not inplace:
160
+ clip = clip.clone()
161
+ mean = torch.as_tensor(mean, dtype=clip.dtype, device=clip.device)
162
+ # print(mean)
163
+ std = torch.as_tensor(std, dtype=clip.dtype, device=clip.device)
164
+ clip.sub_(mean[:, None, None, None]).div_(std[:, None, None, None])
165
+ return clip
166
+
167
+
168
+ def hflip(clip):
169
+ """
170
+ Args:
171
+ clip (torch.tensor): Video clip to be normalized. Size is (T, C, H, W)
172
+ Returns:
173
+ flipped clip (torch.tensor): Size is (T, C, H, W)
174
+ """
175
+ if not _is_tensor_video_clip(clip):
176
+ raise ValueError("clip should be a 4D torch.tensor")
177
+ return clip.flip(-1)
178
+
179
+
180
+ class RandomCropVideo:
181
+ def __init__(self, size):
182
+ if isinstance(size, numbers.Number):
183
+ self.size = (int(size), int(size))
184
+ else:
185
+ self.size = size
186
+
187
+ def __call__(self, clip):
188
+ """
189
+ Args:
190
+ clip (torch.tensor): Video clip to be cropped. Size is (T, C, H, W)
191
+ Returns:
192
+ torch.tensor: randomly cropped video clip.
193
+ size is (T, C, OH, OW)
194
+ """
195
+ i, j, h, w = self.get_params(clip)
196
+ return crop(clip, i, j, h, w)
197
+
198
+ def get_params(self, clip):
199
+ h, w = clip.shape[-2:]
200
+ th, tw = self.size
201
+
202
+ if h < th or w < tw:
203
+ raise ValueError(f"Required crop size {(th, tw)} is larger than input image size {(h, w)}")
204
+
205
+ if w == tw and h == th:
206
+ return 0, 0, h, w
207
+
208
+ i = torch.randint(0, h - th + 1, size=(1,)).item()
209
+ j = torch.randint(0, w - tw + 1, size=(1,)).item()
210
+
211
+ return i, j, th, tw
212
+
213
+ def __repr__(self) -> str:
214
+ return f"{self.__class__.__name__}(size={self.size})"
215
+
216
+
217
+ class CenterCropResizeVideo:
218
+ '''
219
+ First center-crop the video using the short side as the crop size,
220
+ then resize it to the specified size.
221
+ '''
222
+
223
+ def __init__(
224
+ self,
225
+ size,
226
+ interpolation_mode="bilinear",
227
+ ):
228
+ if isinstance(size, tuple):
229
+ if len(size) != 2:
230
+ raise ValueError(f"size should be tuple (height, width), instead got {size}")
231
+ self.size = size
232
+ else:
233
+ self.size = (size, size)
234
+
235
+ self.interpolation_mode = interpolation_mode
236
+
237
+ def __call__(self, clip):
238
+ """
239
+ Args:
240
+ clip (torch.tensor): Video clip to be cropped. Size is (T, C, H, W)
241
+ Returns:
242
+ torch.tensor: scale resized / center cropped video clip.
243
+ size is (T, C, crop_size, crop_size)
244
+ """
245
+ clip_center_crop = center_crop_using_short_edge(clip)
246
+ clip_center_crop_resize = resize(clip_center_crop, target_size=self.size,
247
+ interpolation_mode=self.interpolation_mode)
248
+ return clip_center_crop_resize
249
+
250
+ def __repr__(self) -> str:
251
+ return f"{self.__class__.__name__}(size={self.size}, interpolation_mode={self.interpolation_mode}"
252
+
253
+
254
+ class UCFCenterCropVideo:
255
+ '''
256
+ First scale the clip so that its short edge matches the specified size,
257
+ then center-crop it to that size.
258
+ '''
259
+
260
+ def __init__(
261
+ self,
262
+ size,
263
+ interpolation_mode="bilinear",
264
+ ):
265
+ if isinstance(size, tuple):
266
+ if len(size) != 2:
267
+ raise ValueError(f"size should be tuple (height, width), instead got {size}")
268
+ self.size = size
269
+ else:
270
+ self.size = (size, size)
271
+
272
+ self.interpolation_mode = interpolation_mode
273
+
274
+ def __call__(self, clip):
275
+ """
276
+ Args:
277
+ clip (torch.tensor): Video clip to be cropped. Size is (T, C, H, W)
278
+ Returns:
279
+ torch.tensor: scale resized / center cropped video clip.
280
+ size is (T, C, crop_size, crop_size)
281
+ """
282
+ clip_resize = resize_scale(clip=clip, target_size=self.size, interpolation_mode=self.interpolation_mode)
283
+ clip_center_crop = center_crop(clip_resize, self.size)
284
+ return clip_center_crop
285
+
286
+ def __repr__(self) -> str:
287
+ return f"{self.__class__.__name__}(size={self.size}, interpolation_mode={self.interpolation_mode}"
288
+
289
+
290
+ class KineticsRandomCropResizeVideo:
291
+ '''
292
+ Slide along the long edge, with the short edge as crop size, then resize to the desired size.
293
+ '''
294
+
295
+ def __init__(
296
+ self,
297
+ size,
298
+ interpolation_mode="bilinear",
299
+ ):
300
+ if isinstance(size, tuple):
301
+ if len(size) != 2:
302
+ raise ValueError(f"size should be tuple (height, width), instead got {size}")
303
+ self.size = size
304
+ else:
305
+ self.size = (size, size)
306
+
307
+ self.interpolation_mode = interpolation_mode
308
+
309
+ def __call__(self, clip):
310
+ clip_random_crop = random_shift_crop(clip)
311
+ clip_resize = resize(clip_random_crop, self.size, self.interpolation_mode)
312
+ return clip_resize
313
+
314
+
315
+ class CenterCropVideo:
316
+ def __init__(
317
+ self,
318
+ size,
319
+ interpolation_mode="bilinear",
320
+ ):
321
+ if isinstance(size, tuple):
322
+ if len(size) != 2:
323
+ raise ValueError(f"size should be tuple (height, width), instead got {size}")
324
+ self.size = size
325
+ else:
326
+ self.size = (size, size)
327
+
328
+ self.interpolation_mode = interpolation_mode
329
+
330
+ def __call__(self, clip):
331
+ """
332
+ Args:
333
+ clip (torch.tensor): Video clip to be cropped. Size is (T, C, H, W)
334
+ Returns:
335
+ torch.tensor: center cropped video clip.
336
+ size is (T, C, crop_size, crop_size)
337
+ """
338
+ clip_center_crop = center_crop(clip, self.size)
339
+ return clip_center_crop
340
+
341
+ def __repr__(self) -> str:
342
+ return f"{self.__class__.__name__}(size={self.size}, interpolation_mode={self.interpolation_mode}"
343
+
344
+
345
+ class NormalizeVideo:
346
+ """
347
+ Normalize the video clip by mean subtraction and division by standard deviation
348
+ Args:
349
+ mean (3-tuple): pixel RGB mean
350
+ std (3-tuple): pixel RGB standard deviation
351
+ inplace (boolean): whether to do in-place normalization
352
+ """
353
+
354
+ def __init__(self, mean, std, inplace=False):
355
+ self.mean = mean
356
+ self.std = std
357
+ self.inplace = inplace
358
+
359
+ def __call__(self, clip):
360
+ """
361
+ Args:
362
+ clip (torch.tensor): video clip to be normalized. Size is (C, T, H, W)
363
+ """
364
+ return normalize(clip, self.mean, self.std, self.inplace)
365
+
366
+ def __repr__(self) -> str:
367
+ return f"{self.__class__.__name__}(mean={self.mean}, std={self.std}, inplace={self.inplace})"
368
+
369
+
370
+ class ToTensorVideo:
371
+ """
372
+ Convert tensor data type from uint8 to float, divide value by 255.0 and
373
+ keep the (T, C, H, W) dimension layout unchanged
374
+ """
375
+
376
+ def __init__(self):
377
+ pass
378
+
379
+ def __call__(self, clip):
380
+ """
381
+ Args:
382
+ clip (torch.tensor, dtype=torch.uint8): Size is (T, C, H, W)
383
+ Return:
384
+ clip (torch.tensor, dtype=torch.float): Size is (T, C, H, W)
385
+ """
386
+ return to_tensor(clip)
387
+
388
+ def __repr__(self) -> str:
389
+ return self.__class__.__name__
390
+
391
+
392
+ class RandomHorizontalFlipVideo:
393
+ """
394
+ Flip the video clip along the horizontal direction with a given probability
395
+ Args:
396
+ p (float): probability of the clip being flipped. Default value is 0.5
397
+ """
398
+
399
+ def __init__(self, p=0.5):
400
+ self.p = p
401
+
402
+ def __call__(self, clip):
403
+ """
404
+ Args:
405
+ clip (torch.tensor): Size is (T, C, H, W)
406
+ Return:
407
+ clip (torch.tensor): Size is (T, C, H, W)
408
+ """
409
+ if random.random() < self.p:
410
+ clip = hflip(clip)
411
+ return clip
412
+
413
+ def __repr__(self) -> str:
414
+ return f"{self.__class__.__name__}(p={self.p})"
415
+
416
+
417
+ # ------------------------------------------------------------
418
+ # --------------------- Sampling ---------------------------
419
+ # ------------------------------------------------------------
420
+ class TemporalRandomCrop(object):
421
+ """Temporally crop the given frame indices at a random location.
422
+
423
+ Args:
424
+ size (int): Desired number of frames to be seen by the model.
425
+ """
426
+
427
+ def __init__(self, size):
428
+ self.size = size
429
+
430
+ def __call__(self, total_frames):
431
+ rand_end = max(0, total_frames - self.size - 1)
432
+ begin_index = random.randint(0, rand_end)
433
+ end_index = min(begin_index + self.size, total_frames)
434
+ return begin_index, end_index
435
+
436
+
437
+ if __name__ == '__main__':
438
+ from torchvision import transforms
439
+ import torchvision.io as io
440
+ import numpy as np
441
+ from torchvision.utils import save_image
442
+ import os
443
+
444
+ vframes, aframes, info = io.read_video(
445
+ filename='./v_Archery_g01_c03.avi',
446
+ pts_unit='sec',
447
+ output_format='TCHW'
448
+ )
449
+
450
+ trans = transforms.Compose([
451
+ ToTensorVideo(),
452
+ RandomHorizontalFlipVideo(),
453
+ UCFCenterCropVideo(512),
454
+ # NormalizeVideo(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5], inplace=True),
455
+ transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5], inplace=True)
456
+ ])
457
+
458
+ target_video_len = 32
459
+ frame_interval = 1
460
+ total_frames = len(vframes)
461
+ print(total_frames)
462
+
463
+ temporal_sample = TemporalRandomCrop(target_video_len * frame_interval)
464
+
465
+ # Sampling video frames
466
+ start_frame_ind, end_frame_ind = temporal_sample(total_frames)
467
+ # print(start_frame_ind)
468
+ # print(end_frame_ind)
469
+ assert end_frame_ind - start_frame_ind >= target_video_len
470
+ frame_indice = np.linspace(start_frame_ind, end_frame_ind - 1, target_video_len, dtype=int)
471
+ print(frame_indice)
472
+
473
+ select_vframes = vframes[frame_indice]
474
+ print(select_vframes.shape)
475
+ print(select_vframes.dtype)
476
+
477
+ select_vframes_trans = trans(select_vframes)
478
+ print(select_vframes_trans.shape)
479
+ print(select_vframes_trans.dtype)
480
+
481
+ select_vframes_trans_int = ((select_vframes_trans * 0.5 + 0.5) * 255).to(dtype=torch.uint8)
482
+ print(select_vframes_trans_int.dtype)
483
+ print(select_vframes_trans_int.permute(0, 2, 3, 1).shape)
484
+
485
+ io.write_video('./test.avi', select_vframes_trans_int.permute(0, 2, 3, 1), fps=8)
486
+
487
+ for i in range(target_video_len):
488
+ save_image(select_vframes_trans[i], os.path.join('./test000', '%04d.png' % i), normalize=True,
489
+ value_range=(-1, 1))
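As a quick worked example of the sampling helper defined above: TemporalRandomCrop only chooses a window of frame indices, and the caller subsamples inside that window, mirroring how the datasets in this commit use it (frame counts below are arbitrary):

    import numpy as np
    from opensora.dataset.transform import TemporalRandomCrop

    temporal_sample = TemporalRandomCrop(16 * 3)  # 16 frames at frame interval 3
    start, end = temporal_sample(300)             # e.g. (121, 169): a 48-frame window
    frame_indices = np.linspace(start, end - 1, 16, dtype=int)  # 16 evenly spaced indices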
opensora/dataset/ucf101.py ADDED
@@ -0,0 +1,80 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import math
2
+ import os
3
+
4
+ import decord
5
+ import numpy as np
6
+ import torch
7
+ import torchvision
8
+ from decord import VideoReader, cpu
9
+ from torch.utils.data import Dataset
10
+ from torchvision.transforms import Compose, Lambda, ToTensor
11
+ from torchvision.transforms._transforms_video import NormalizeVideo, RandomCropVideo, RandomHorizontalFlipVideo
12
+ from pytorchvideo.transforms import ApplyTransformToKey, ShortSideScale, UniformTemporalSubsample
13
+ from torch.nn import functional as F
14
+ import random
15
+
16
+ from opensora.utils.dataset_utils import DecordInit
17
+
18
+
19
+ class UCF101(Dataset):
20
+ def __init__(self, args, transform, temporal_sample):
21
+ self.data_path = args.data_path
22
+ self.num_frames = args.num_frames
23
+ self.transform = transform
24
+ self.temporal_sample = temporal_sample
25
+ self.v_decoder = DecordInit()
26
+
27
+ self.classes = sorted(os.listdir(self.data_path))
28
+ self.class_to_idx = {cls_name: idx for idx, cls_name in enumerate(self.classes)}
29
+ self.samples = self._make_dataset()
30
+
31
+
32
+ def _make_dataset(self):
33
+ dataset = []
34
+ for class_name in self.classes:
35
+ class_path = os.path.join(self.data_path, class_name)
36
+ for fname in os.listdir(class_path):
37
+ if fname.endswith('.avi'):
38
+ item = (os.path.join(class_path, fname), self.class_to_idx[class_name])
39
+ dataset.append(item)
40
+ return dataset
41
+
42
+ def __len__(self):
43
+ return len(self.samples)
44
+
45
+ def __getitem__(self, idx):
46
+ video_path, label = self.samples[idx]
47
+ try:
48
+ video = self.tv_read(video_path)
49
+ video = self.transform(video) # T C H W -> T C H W
50
+ video = video.transpose(0, 1) # T C H W -> C T H W
51
+ return video, label
52
+ except Exception as e:
53
+ print(f'Error with {e}, {video_path}')
54
+ return self.__getitem__(random.randint(0, self.__len__()-1))
55
+
56
+ def tv_read(self, path):
57
+ vframes, aframes, info = torchvision.io.read_video(filename=path, pts_unit='sec', output_format='TCHW')
58
+ total_frames = len(vframes)
59
+
60
+ # Sampling video frames
61
+ start_frame_ind, end_frame_ind = self.temporal_sample(total_frames)
62
+ # assert end_frame_ind - start_frame_ind >= self.num_frames
63
+ frame_indice = np.linspace(start_frame_ind, end_frame_ind - 1, self.num_frames, dtype=int)
64
+ video = vframes[frame_indice] # (T, C, H, W)
65
+
66
+ return video
67
+
68
+ def decord_read(self, path):
69
+ decord_vr = self.v_decoder(path)
70
+ total_frames = len(decord_vr)
71
+ # Sampling video frames
72
+ start_frame_ind, end_frame_ind = self.temporal_sample(total_frames)
73
+ # assert end_frame_ind - start_frame_ind >= self.num_frames
74
+ frame_indice = np.linspace(start_frame_ind, end_frame_ind - 1, self.num_frames, dtype=int)
75
+
76
+ video_data = decord_vr.get_batch(frame_indice).asnumpy()
77
+ video_data = torch.from_numpy(video_data)
78
+ video_data = video_data.permute(0, 3, 1, 2) # (T, H, W, C) -> (T C H W)
79
+ return video_data
80
+
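The UCF101 loader above assumes the standard UCF-101 layout, one sub-folder per action class containing the .avi clips; roughly what _make_dataset ends up producing (paths are illustrative):

    # /path/to/UCF-101/
    #   ApplyEyeMakeup/ v_ApplyEyeMakeup_g01_c01.avi, ...
    #   Archery/        v_Archery_g01_c03.avi, ...
    # samples == [('/path/to/UCF-101/Archery/v_Archery_g01_c03.avi', class_to_idx['Archery']), ...]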
opensora/eval/cal_flolpips.py ADDED
@@ -0,0 +1,83 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import numpy as np
2
+ import torch
3
+ from tqdm import tqdm
4
+ import math
5
+ from einops import rearrange
6
+ import sys
7
+ sys.path.append(".")
8
+ from opensora.eval.flolpips.pwcnet import Network as PWCNet
9
+ from opensora.eval.flolpips.flolpips import FloLPIPS
10
+
11
+ loss_fn = FloLPIPS(net='alex', version='0.1').eval().requires_grad_(False)
12
+ flownet = PWCNet().eval().requires_grad_(False)
13
+
14
+ def trans(x):
15
+ return x
16
+
17
+
18
+ def calculate_flolpips(videos1, videos2, device):
19
+ global loss_fn, flownet
20
+
21
+ print("calculate_flowlpips...")
22
+ loss_fn = loss_fn.to(device)
23
+ flownet = flownet.to(device)
24
+
25
+ if videos1.shape != videos2.shape:
26
+ print("Warning: the shape of videos are not equal.")
27
+ min_frames = min(videos1.shape[1], videos2.shape[1])
28
+ videos1 = videos1[:, :min_frames]
29
+ videos2 = videos2[:, :min_frames]
30
+
31
+ videos1 = trans(videos1)
32
+ videos2 = trans(videos2)
33
+
34
+ flolpips_results = []
35
+ for video_num in tqdm(range(videos1.shape[0])):
36
+ video1 = videos1[video_num].to(device)
37
+ video2 = videos2[video_num].to(device)
38
+ frames_rec = video1[:-1]
39
+ frames_rec_next = video1[1:]
40
+ frames_gt = video2[:-1]
41
+ frames_gt_next = video2[1:]
42
+ t, c, h, w = frames_gt.shape
43
+ flow_gt = flownet(frames_gt, frames_gt_next)
44
+ flow_dis = flownet(frames_rec, frames_rec_next)
45
+ flow_diff = flow_gt - flow_dis
46
+ flolpips = loss_fn.forward(frames_gt, frames_rec, flow_diff, normalize=True)
47
+ flolpips_results.append(flolpips.cpu().numpy().tolist())
48
+
49
+ flolpips_results = np.array(flolpips_results) # [batch_size, num_frames]
50
+ flolpips = {}
51
+ flolpips_std = {}
52
+
53
+ for clip_timestamp in range(flolpips_results.shape[1]):
54
+ flolpips[clip_timestamp] = np.mean(flolpips_results[:,clip_timestamp], axis=-1)
55
+ flolpips_std[clip_timestamp] = np.std(flolpips_results[:,clip_timestamp], axis=-1)
56
+
57
+ result = {
58
+ "value": flolpips,
59
+ "value_std": flolpips_std,
60
+ "video_setting": video1.shape,
61
+ "video_setting_name": "time, channel, heigth, width",
62
+ "result": flolpips_results,
63
+ "details": flolpips_results.tolist()
64
+ }
65
+
66
+ return result
67
+
68
+ # test code / using example
69
+
70
+ def main():
71
+ NUMBER_OF_VIDEOS = 8
72
+ VIDEO_LENGTH = 50
73
+ CHANNEL = 3
74
+ SIZE = 64
75
+ videos1 = torch.zeros(NUMBER_OF_VIDEOS, VIDEO_LENGTH, CHANNEL, SIZE, SIZE, requires_grad=False)
76
+ videos2 = torch.zeros(NUMBER_OF_VIDEOS, VIDEO_LENGTH, CHANNEL, SIZE, SIZE, requires_grad=False)
77
+
78
+ import json
79
+ result = calculate_flolpips(videos1, videos2, "cuda:0")
80
+ print(json.dumps(result, indent=4))
81
+
82
+ if __name__ == "__main__":
83
+ main()
opensora/eval/cal_fvd.py ADDED
@@ -0,0 +1,85 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import numpy as np
2
+ import torch
3
+ from tqdm import tqdm
4
+
5
+ def trans(x):
6
+ # if greyscale images add channel
7
+ if x.shape[-3] == 1:
8
+ x = x.repeat(1, 1, 3, 1, 1)
9
+
10
+ # permute BTCHW -> BCTHW
11
+ x = x.permute(0, 2, 1, 3, 4)
12
+
13
+ return x
14
+
15
+ def calculate_fvd(videos1, videos2, device, method='styleganv'):
16
+
17
+ if method == 'styleganv':
18
+ from fvd.styleganv.fvd import get_fvd_feats, frechet_distance, load_i3d_pretrained
19
+ elif method == 'videogpt':
20
+ from fvd.videogpt.fvd import load_i3d_pretrained
21
+ from fvd.videogpt.fvd import get_fvd_logits as get_fvd_feats
22
+ from fvd.videogpt.fvd import frechet_distance
23
+
24
+ print("calculate_fvd...")
25
+
26
+ # videos [batch_size, timestamps, channel, h, w]
27
+
28
+ assert videos1.shape == videos2.shape
29
+
30
+ i3d = load_i3d_pretrained(device=device)
31
+ fvd_results = []
32
+
33
+ # support grayscale input, if grayscale -> channel*3
34
+ # BTCHW -> BCTHW
35
+ # videos -> [batch_size, channel, timestamps, h, w]
36
+
37
+ videos1 = trans(videos1)
38
+ videos2 = trans(videos2)
39
+
40
+ fvd_results = {}
41
+
42
+ # to calculate FVD, each clip must contain at least 10 frames
43
+ for clip_timestamp in tqdm(range(10, videos1.shape[-3]+1)):
44
+
45
+ # get a video clip
46
+ # videos_clip [batch_size, channel, timestamps[:clip], h, w]
47
+ videos_clip1 = videos1[:, :, : clip_timestamp]
48
+ videos_clip2 = videos2[:, :, : clip_timestamp]
49
+
50
+ # get FVD features
51
+ feats1 = get_fvd_feats(videos_clip1, i3d=i3d, device=device)
52
+ feats2 = get_fvd_feats(videos_clip2, i3d=i3d, device=device)
53
+
54
+ # calculate FVD when timestamps[:clip]
55
+ fvd_results[clip_timestamp] = frechet_distance(feats1, feats2)
56
+
57
+ result = {
58
+ "value": fvd_results,
59
+ "video_setting": videos1.shape,
60
+ "video_setting_name": "batch_size, channel, time, heigth, width",
61
+ }
62
+
63
+ return result
64
+
65
+ # test code / using example
66
+
67
+ def main():
68
+ NUMBER_OF_VIDEOS = 8
69
+ VIDEO_LENGTH = 50
70
+ CHANNEL = 3
71
+ SIZE = 64
72
+ videos1 = torch.zeros(NUMBER_OF_VIDEOS, VIDEO_LENGTH, CHANNEL, SIZE, SIZE, requires_grad=False)
73
+ videos2 = torch.ones(NUMBER_OF_VIDEOS, VIDEO_LENGTH, CHANNEL, SIZE, SIZE, requires_grad=False)
74
+ device = torch.device("cuda")
75
+ # device = torch.device("cpu")
76
+
77
+ import json
78
+ result = calculate_fvd(videos1, videos2, device, method='videogpt')
79
+ print(json.dumps(result, indent=4))
80
+
81
+ result = calculate_fvd(videos1, videos2, device, method='styleganv')
82
+ print(json.dumps(result, indent=4))
83
+
84
+ if __name__ == "__main__":
85
+ main()
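
The frechet_distance helper imported above is provided by the bundled styleganv / videogpt packages rather than defined here. For reference, the quantity it computes is the Fréchet distance between Gaussian fits of the two I3D feature sets; a sketch with numpy and scipy (assumptions: scipy is available and feats1 / feats2 are [N, D] feature arrays; the bundled implementations may differ in numerical details):

    import numpy as np
    from scipy.linalg import sqrtm

    def frechet_distance_reference(feats1, feats2):
        mu1, mu2 = feats1.mean(axis=0), feats2.mean(axis=0)
        sigma1 = np.cov(feats1, rowvar=False)
        sigma2 = np.cov(feats2, rowvar=False)
        covmean = sqrtm(sigma1 @ sigma2)
        if np.iscomplexobj(covmean):      # numerical noise can leave tiny imaginary parts
            covmean = covmean.real
        diff = mu1 - mu2
        return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
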
opensora/eval/cal_lpips.py ADDED
@@ -0,0 +1,97 @@
1
+ import numpy as np
2
+ import torch
3
+ from tqdm import tqdm
4
+ import math
5
+
6
+ import torch
7
+ import lpips
8
+
9
+ spatial = True # Return a spatial map of perceptual distance.
10
+
11
+ # Linearly calibrated models (LPIPS)
12
+ loss_fn = lpips.LPIPS(net='alex', spatial=spatial) # Can also set net = 'squeeze' or 'vgg'
13
+ # loss_fn = lpips.LPIPS(net='alex', spatial=spatial, lpips=False) # Can also set net = 'squeeze' or 'vgg'
14
+
15
+ def trans(x):
16
+ # if greyscale images add channel
17
+ if x.shape[-3] == 1:
18
+ x = x.repeat(1, 1, 3, 1, 1)
19
+
20
+ # value range [0, 1] -> [-1, 1]
21
+ x = x * 2 - 1
22
+
23
+ return x
24
+
25
+ def calculate_lpips(videos1, videos2, device):
26
+ # image should be RGB, IMPORTANT: normalized to [-1,1]
27
+ print("calculate_lpips...")
28
+
29
+ assert videos1.shape == videos2.shape
30
+
31
+ # videos [batch_size, timestamps, channel, h, w]
32
+
33
+ # support grayscale input, if grayscale -> channel*3
34
+ # value range [0, 1] -> [-1, 1]
35
+ videos1 = trans(videos1)
36
+ videos2 = trans(videos2)
37
+
38
+ lpips_results = []
39
+
40
+ for video_num in tqdm(range(videos1.shape[0])):
41
+ # get a video
42
+ # video [timestamps, channel, h, w]
43
+ video1 = videos1[video_num]
44
+ video2 = videos2[video_num]
45
+
46
+ lpips_results_of_a_video = []
47
+ for clip_timestamp in range(len(video1)):
48
+ # get a img
49
+ # img [timestamps[x], channel, h, w]
50
+ # img [channel, h, w] tensor
51
+
52
+ img1 = video1[clip_timestamp].unsqueeze(0).to(device)
53
+ img2 = video2[clip_timestamp].unsqueeze(0).to(device)
54
+
55
+ loss_fn.to(device)
56
+
57
+ # calculate lpips of a video
58
+ lpips_results_of_a_video.append(loss_fn.forward(img1, img2).mean().detach().cpu().tolist())
59
+ lpips_results.append(lpips_results_of_a_video)
60
+
61
+ lpips_results = np.array(lpips_results)
62
+
63
+ lpips = {}
64
+ lpips_std = {}
65
+
66
+ for clip_timestamp in range(len(video1)):
67
+ lpips[clip_timestamp] = np.mean(lpips_results[:,clip_timestamp])
68
+ lpips_std[clip_timestamp] = np.std(lpips_results[:,clip_timestamp])
69
+
70
+
71
+ result = {
72
+ "value": lpips,
73
+ "value_std": lpips_std,
74
+ "video_setting": video1.shape,
75
+ "video_setting_name": "time, channel, heigth, width",
76
+ }
77
+
78
+ return result
79
+
80
+ # test code / using example
81
+
82
+ def main():
83
+ NUMBER_OF_VIDEOS = 8
84
+ VIDEO_LENGTH = 50
85
+ CHANNEL = 3
86
+ SIZE = 64
87
+ videos1 = torch.zeros(NUMBER_OF_VIDEOS, VIDEO_LENGTH, CHANNEL, SIZE, SIZE, requires_grad=False)
88
+ videos2 = torch.ones(NUMBER_OF_VIDEOS, VIDEO_LENGTH, CHANNEL, SIZE, SIZE, requires_grad=False)
89
+ device = torch.device("cuda")
90
+ # device = torch.device("cpu")
91
+
92
+ import json
93
+ result = calculate_lpips(videos1, videos2, device)
94
+ print(json.dumps(result, indent=4))
95
+
96
+ if __name__ == "__main__":
97
+ main()
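
Because the module-level loss_fn is built with spatial=True, each call returns a per-pixel distance map rather than a scalar, which is why the loop above takes .mean() of the output. A standalone sketch of that behaviour with the lpips package (tensor sizes are illustrative):

    import lpips
    import torch

    loss_fn_demo = lpips.LPIPS(net='alex', spatial=True)
    img1 = torch.rand(1, 3, 64, 64) * 2 - 1      # inputs must already be in [-1, 1]
    img2 = torch.rand(1, 3, 64, 64) * 2 - 1
    dist_map = loss_fn_demo(img1, img2)          # spatial=True -> per-pixel map [1, 1, H, W]
    print(dist_map.shape, dist_map.mean().item())  # .mean() gives the scalar used above
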
opensora/eval/cal_psnr.py ADDED
@@ -0,0 +1,84 @@
1
+ import numpy as np
2
+ import torch
3
+ from tqdm import tqdm
4
+ import math
5
+
6
+ def img_psnr(img1, img2):
7
+ # [0,1]
8
+ # compute mse
9
+ # mse = np.mean((img1-img2)**2)
10
+ mse = np.mean((img1 / 1.0 - img2 / 1.0) ** 2)
11
+ # compute psnr
12
+ if mse < 1e-10:
13
+ return 100
14
+ psnr = 20 * math.log10(1 / math.sqrt(mse))
15
+ return psnr
16
+
17
+ def trans(x):
18
+ return x
19
+
20
+ def calculate_psnr(videos1, videos2):
21
+ print("calculate_psnr...")
22
+
23
+ # videos [batch_size, timestamps, channel, h, w]
24
+
25
+ assert videos1.shape == videos2.shape
26
+
27
+ videos1 = trans(videos1)
28
+ videos2 = trans(videos2)
29
+
30
+ psnr_results = []
31
+
32
+ for video_num in tqdm(range(videos1.shape[0])):
33
+ # get a video
34
+ # video [timestamps, channel, h, w]
35
+ video1 = videos1[video_num]
36
+ video2 = videos2[video_num]
37
+
38
+ psnr_results_of_a_video = []
39
+ for clip_timestamp in range(len(video1)):
40
+ # get a img
41
+ # img [timestamps[x], channel, h, w]
42
+ # img [channel, h, w] numpy
43
+
44
+ img1 = video1[clip_timestamp].numpy()
45
+ img2 = video2[clip_timestamp].numpy()
46
+
47
+ # calculate psnr of a video
48
+ psnr_results_of_a_video.append(img_psnr(img1, img2))
49
+
50
+ psnr_results.append(psnr_results_of_a_video)
51
+
52
+ psnr_results = np.array(psnr_results) # [batch_size, num_frames]
53
+ psnr = {}
54
+ psnr_std = {}
55
+
56
+ for clip_timestamp in range(len(video1)):
57
+ psnr[clip_timestamp] = np.mean(psnr_results[:,clip_timestamp])
58
+ psnr_std[clip_timestamp] = np.std(psnr_results[:,clip_timestamp])
59
+
60
+ result = {
61
+ "value": psnr,
62
+ "value_std": psnr_std,
63
+ "video_setting": video1.shape,
64
+ "video_setting_name": "time, channel, heigth, width",
65
+ }
66
+
67
+ return result
68
+
69
+ # test code / using example
70
+
71
+ def main():
72
+ NUMBER_OF_VIDEOS = 8
73
+ VIDEO_LENGTH = 50
74
+ CHANNEL = 3
75
+ SIZE = 64
76
+ videos1 = torch.zeros(NUMBER_OF_VIDEOS, VIDEO_LENGTH, CHANNEL, SIZE, SIZE, requires_grad=False)
77
+ videos2 = torch.zeros(NUMBER_OF_VIDEOS, VIDEO_LENGTH, CHANNEL, SIZE, SIZE, requires_grad=False)
78
+
79
+ import json
80
+ result = calculate_psnr(videos1, videos2)
81
+ print(json.dumps(result, indent=4))
82
+
83
+ if __name__ == "__main__":
84
+ main()
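
As a quick sanity check of the formula above (the peak value is fixed at 1.0, so inputs are expected in [0, 1]): an MSE of 0.01 corresponds to 20 * log10(1 / sqrt(0.01)) ≈ 20 dB.

    import numpy as np

    img_a = np.zeros((3, 8, 8))
    img_b = np.full((3, 8, 8), 0.1)   # constant error of 0.1 -> MSE = 0.01
    print(img_psnr(img_a, img_b))     # expected: ~20.0
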
opensora/eval/cal_ssim.py ADDED
@@ -0,0 +1,113 @@
1
+ import numpy as np
2
+ import torch
3
+ from tqdm import tqdm
4
+ import cv2
5
+
6
+ def ssim(img1, img2):
7
+ C1 = 0.01 ** 2
8
+ C2 = 0.03 ** 2
9
+ img1 = img1.astype(np.float64)
10
+ img2 = img2.astype(np.float64)
11
+ kernel = cv2.getGaussianKernel(11, 1.5)
12
+ window = np.outer(kernel, kernel.transpose())
13
+ mu1 = cv2.filter2D(img1, -1, window)[5:-5, 5:-5] # valid
14
+ mu2 = cv2.filter2D(img2, -1, window)[5:-5, 5:-5]
15
+ mu1_sq = mu1 ** 2
16
+ mu2_sq = mu2 ** 2
17
+ mu1_mu2 = mu1 * mu2
18
+ sigma1_sq = cv2.filter2D(img1 ** 2, -1, window)[5:-5, 5:-5] - mu1_sq
19
+ sigma2_sq = cv2.filter2D(img2 ** 2, -1, window)[5:-5, 5:-5] - mu2_sq
20
+ sigma12 = cv2.filter2D(img1 * img2, -1, window)[5:-5, 5:-5] - mu1_mu2
21
+ ssim_map = ((2 * mu1_mu2 + C1) * (2 * sigma12 + C2)) / ((mu1_sq + mu2_sq + C1) *
22
+ (sigma1_sq + sigma2_sq + C2))
23
+ return ssim_map.mean()
24
+
25
+
26
+ def calculate_ssim_function(img1, img2):
27
+ # [0,1]
28
+ # ssim is the only metric extremely sensitive to gray being compared to b/w
29
+ if not img1.shape == img2.shape:
30
+ raise ValueError('Input images must have the same dimensions.')
31
+ if img1.ndim == 2:
32
+ return ssim(img1, img2)
33
+ elif img1.ndim == 3:
34
+ if img1.shape[0] == 3:
35
+ ssims = []
36
+ for i in range(3):
37
+ ssims.append(ssim(img1[i], img2[i]))
38
+ return np.array(ssims).mean()
39
+ elif img1.shape[0] == 1:
40
+ return ssim(np.squeeze(img1), np.squeeze(img2))
41
+ else:
42
+ raise ValueError('Wrong input image dimensions.')
43
+
44
+ def trans(x):
45
+ return x
46
+
47
+ def calculate_ssim(videos1, videos2):
48
+ print("calculate_ssim...")
49
+
50
+ # videos [batch_size, timestamps, channel, h, w]
51
+
52
+ assert videos1.shape == videos2.shape
53
+
54
+ videos1 = trans(videos1)
55
+ videos2 = trans(videos2)
56
+
57
+ ssim_results = []
58
+
59
+ for video_num in tqdm(range(videos1.shape[0])):
60
+ # get a video
61
+ # video [timestamps, channel, h, w]
62
+ video1 = videos1[video_num]
63
+ video2 = videos2[video_num]
64
+
65
+ ssim_results_of_a_video = []
66
+ for clip_timestamp in range(len(video1)):
67
+ # get a img
68
+ # img [timestamps[x], channel, h, w]
69
+ # img [channel, h, w] numpy
70
+
71
+ img1 = video1[clip_timestamp].numpy()
72
+ img2 = video2[clip_timestamp].numpy()
73
+
74
+ # calculate ssim of a video
75
+ ssim_results_of_a_video.append(calculate_ssim_function(img1, img2))
76
+
77
+ ssim_results.append(ssim_results_of_a_video)
78
+
79
+ ssim_results = np.array(ssim_results)
80
+
81
+ ssim = {}
82
+ ssim_std = {}
83
+
84
+ for clip_timestamp in range(len(video1)):
85
+ ssim[clip_timestamp] = np.mean(ssim_results[:,clip_timestamp])
86
+ ssim_std[clip_timestamp] = np.std(ssim_results[:,clip_timestamp])
87
+
88
+ result = {
89
+ "value": ssim,
90
+ "value_std": ssim_std,
91
+ "video_setting": video1.shape,
92
+ "video_setting_name": "time, channel, heigth, width",
93
+ }
94
+
95
+ return result
96
+
97
+ # test code / using example
98
+
99
+ def main():
100
+ NUMBER_OF_VIDEOS = 8
101
+ VIDEO_LENGTH = 50
102
+ CHANNEL = 3
103
+ SIZE = 64
104
+ videos1 = torch.zeros(NUMBER_OF_VIDEOS, VIDEO_LENGTH, CHANNEL, SIZE, SIZE, requires_grad=False)
105
+ videos2 = torch.zeros(NUMBER_OF_VIDEOS, VIDEO_LENGTH, CHANNEL, SIZE, SIZE, requires_grad=False)
106
+ device = torch.device("cuda")
107
+
108
+ import json
109
+ result = calculate_ssim(videos1, videos2)
110
+ print(json.dumps(result, indent=4))
111
+
112
+ if __name__ == "__main__":
113
+ main()
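
The constants C1 = 0.01^2 and C2 = 0.03^2 follow the standard SSIM definition (K1*L)^2 and (K2*L)^2 with dynamic range L = 1, so inputs are expected in [0, 1]. A small self-check sketch (array shapes are illustrative):

    import numpy as np

    img = np.random.rand(3, 64, 64)
    print(calculate_ssim_function(img, img))        # identical images -> 1.0
    print(calculate_ssim_function(img, 1.0 - img))  # heavily distorted -> well below 1.0
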
opensora/eval/eval_clip_score.py ADDED
@@ -0,0 +1,225 @@
1
+ """Calculates the CLIP Scores
2
+
3
+ The CLIP model is a contrastively learned language-image model. There is
4
+ an image encoder and a text encoder. It is believed that the CLIP model could
5
+ measure the similarity of cross modalities. Please find more information from
6
+ https://github.com/openai/CLIP.
7
+
8
+ The CLIP Score measures the Cosine Similarity between two embedded features.
9
+ This repository utilizes the pretrained CLIP Model to calculate
10
+ the mean average of cosine similarities.
11
+
12
+ See --help to see further details.
13
+
14
+ Code adapted from https://github.com/mseitzer/pytorch-fid and https://github.com/openai/CLIP.
15
+
16
+ Copyright 2023 The Hong Kong Polytechnic University
17
+
18
+ Licensed under the Apache License, Version 2.0 (the "License");
19
+ you may not use this file except in compliance with the License.
20
+ You may obtain a copy of the License at
21
+
22
+ http://www.apache.org/licenses/LICENSE-2.0
23
+
24
+ Unless required by applicable law or agreed to in writing, software
25
+ distributed under the License is distributed on an "AS IS" BASIS,
26
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
27
+ See the License for the specific language governing permissions and
28
+ limitations under the License.
29
+ """
30
+ import os
31
+ import os.path as osp
32
+ from argparse import ArgumentDefaultsHelpFormatter, ArgumentParser
33
+
34
+ import clip
35
+ import torch
36
+ from PIL import Image
37
+ from torch.utils.data import Dataset, DataLoader
38
+
39
+ try:
40
+ from tqdm import tqdm
41
+ except ImportError:
42
+ # If tqdm is not available, provide a mock version of it
43
+ def tqdm(x):
44
+ return x
45
+
46
+
47
+ IMAGE_EXTENSIONS = {'bmp', 'jpg', 'jpeg', 'pgm', 'png', 'ppm',
48
+ 'tif', 'tiff', 'webp'}
49
+
50
+ TEXT_EXTENSIONS = {'txt'}
51
+
52
+
53
+ class DummyDataset(Dataset):
54
+
55
+ FLAGS = ['img', 'txt']
56
+ def __init__(self, real_path, generated_path,
57
+ real_flag: str = 'img',
58
+ generated_flag: str = 'img',
59
+ transform = None,
60
+ tokenizer = None) -> None:
61
+ super().__init__()
62
+ assert real_flag in self.FLAGS and generated_flag in self.FLAGS, \
63
+ 'CLIP Score only support modality of {}. However, get {} and {}'.format(
64
+ self.FLAGS, real_flag, generated_flag
65
+ )
66
+ self.real_folder = self._combine_without_prefix(real_path)
67
+ self.real_flag = real_flag
68
+ self.fake_folder = self._combine_without_prefix(generated_path)
69
+ self.generated_flag = generated_flag
70
+ self.transform = transform
71
+ self.tokenizer = tokenizer
72
+ # assert self._check()
73
+
74
+ def __len__(self):
75
+ return len(self.real_folder)
76
+
77
+ def __getitem__(self, index):
78
+ if index >= len(self):
79
+ raise IndexError
80
+ real_path = self.real_folder[index]
81
+ generated_path = self.fake_folder[index]
82
+ real_data = self._load_modality(real_path, self.real_flag)
83
+ fake_data = self._load_modality(generated_path, self.generated_flag)
84
+
85
+ sample = dict(real=real_data, fake=fake_data)
86
+ return sample
87
+
88
+ def _load_modality(self, path, modality):
89
+ if modality == 'img':
90
+ data = self._load_img(path)
91
+ elif modality == 'txt':
92
+ data = self._load_txt(path)
93
+ else:
94
+ raise TypeError("Got unexpected modality: {}".format(modality))
95
+ return data
96
+
97
+ def _load_img(self, path):
98
+ img = Image.open(path)
99
+ if self.transform is not None:
100
+ img = self.transform(img)
101
+ return img
102
+
103
+ def _load_txt(self, path):
104
+ with open(path, 'r') as fp:
105
+ data = fp.read()
106
+ fp.close()
107
+ if self.tokenizer is not None:
108
+ data = self.tokenizer(data).squeeze()
109
+ return data
110
+
111
+ def _check(self):
112
+ for idx in range(len(self)):
113
+ real_name = self.real_folder[idx].split('.')
114
+ fake_name = self.fake_folder[idx].split('.')
115
+ if fake_name != real_name:
116
+ return False
117
+ return True
118
+
119
+ def _combine_without_prefix(self, folder_path, prefix='.'):
120
+ folder = []
121
+ for name in os.listdir(folder_path):
122
+ if name[0] == prefix:
123
+ continue
124
+ folder.append(osp.join(folder_path, name))
125
+ folder.sort()
126
+ return folder
127
+
128
+
129
+ @torch.no_grad()
130
+ def calculate_clip_score(dataloader, model, real_flag, generated_flag):
131
+ score_acc = 0.
132
+ sample_num = 0.
133
+ logit_scale = model.logit_scale.exp()
134
+ for batch_data in tqdm(dataloader):
135
+ real = batch_data['real']
136
+ real_features = forward_modality(model, real, real_flag)
137
+ fake = batch_data['fake']
138
+ fake_features = forward_modality(model, fake, generated_flag)
139
+
140
+ # normalize features
141
+ real_features = real_features / real_features.norm(dim=1, keepdim=True).to(torch.float32)
142
+ fake_features = fake_features / fake_features.norm(dim=1, keepdim=True).to(torch.float32)
143
+
144
+ # calculate scores
145
+ # score = logit_scale * real_features @ fake_features.t()
146
+ # score_acc += torch.diag(score).sum()
147
+ score = logit_scale * (fake_features * real_features).sum()
148
+ score_acc += score
149
+ sample_num += real.shape[0]
150
+
151
+ return score_acc / sample_num
152
+
153
+
154
+ def forward_modality(model, data, flag):
155
+ device = next(model.parameters()).device
156
+ if flag == 'img':
157
+ features = model.encode_image(data.to(device))
158
+ elif flag == 'txt':
159
+ features = model.encode_text(data.to(device))
160
+ else:
161
+ raise TypeError
162
+ return features
163
+
164
+
165
+ def main():
166
+ parser = ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatter)
167
+ parser.add_argument('--batch-size', type=int, default=50,
168
+ help='Batch size to use')
169
+ parser.add_argument('--clip-model', type=str, default='ViT-B/32',
170
+ help='CLIP model to use')
171
+ parser.add_argument('--num-workers', type=int, default=8,
172
+ help=('Number of processes to use for data loading. '
173
+ 'Defaults to `min(8, num_cpus)`'))
174
+ parser.add_argument('--device', type=str, default=None,
175
+ help='Device to use. Like cuda, cuda:0 or cpu')
176
+ parser.add_argument('--real_flag', type=str, default='img',
177
+ help=('The modality of real path. '
178
+ 'Default to img'))
179
+ parser.add_argument('--generated_flag', type=str, default='txt',
180
+ help=('The modality of generated path. '
181
+ 'Default to txt'))
182
+ parser.add_argument('--real_path', type=str,
183
+ help=('Paths to the real images or '
184
+ 'to .npz statistic files'))
185
+ parser.add_argument('--generated_path', type=str,
186
+ help=('Paths to the generated images or '
187
+ 'to .npz statistic files'))
188
+ args = parser.parse_args()
189
+
190
+ if args.device is None:
191
+ device = torch.device('cuda' if (torch.cuda.is_available()) else 'cpu')
192
+ else:
193
+ device = torch.device(args.device)
194
+
195
+ if args.num_workers is None:
196
+ try:
197
+ num_cpus = len(os.sched_getaffinity(0))
198
+ except AttributeError:
199
+ # os.sched_getaffinity is not available under Windows, use
200
+ # os.cpu_count instead (which may not return the *available* number
201
+ # of CPUs).
202
+ num_cpus = os.cpu_count()
203
+
204
+ num_workers = min(num_cpus, 8) if num_cpus is not None else 0
205
+ else:
206
+ num_workers = args.num_workers
207
+
208
+ print('Loading CLIP model: {}'.format(args.clip_model))
209
+ model, preprocess = clip.load(args.clip_model, device=device)
210
+
211
+ dataset = DummyDataset(args.real_path, args.generated_path,
212
+ args.real_flag, args.generated_flag,
213
+ transform=preprocess, tokenizer=clip.tokenize)
214
+ dataloader = DataLoader(dataset, args.batch_size,
215
+ num_workers=num_workers, pin_memory=True)
216
+
217
+ print('Calculating CLIP Score:')
218
+ clip_score = calculate_clip_score(dataloader, model,
219
+ args.real_flag, args.generated_flag)
220
+ clip_score = clip_score.cpu().item()
221
+ print('CLIP Score: ', clip_score)
222
+
223
+
224
+ if __name__ == '__main__':
225
+ main()
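
For reference, this script is driven from the command line, e.g. python opensora/eval/eval_clip_score.py --real_path <image dir> --generated_path <caption dir> --real_flag img --generated_flag txt (paths are placeholders). The quantity accumulated in calculate_clip_score is the temperature-scaled cosine similarity between normalized CLIP embeddings; spelled out for a single image/caption pair (the file name and caption below are hypothetical):

    import clip
    import torch
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    image = preprocess(Image.open("sample.png")).unsqueeze(0).to(device)
    text = clip.tokenize(["a photo of a cat"]).to(device)

    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        img_feat = img_feat / img_feat.norm(dim=1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=1, keepdim=True)
        # cosine similarity scaled by the model's learned temperature, as in the loop above
        score = model.logit_scale.exp() * (img_feat * txt_feat).sum()
    print(score.item())
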
opensora/eval/eval_common_metric.py ADDED
@@ -0,0 +1,224 @@
1
+ """Calculates the CLIP Scores
2
+
3
+ The CLIP model is a contrasitively learned language-image model. There is
4
+ an image encoder and a text encoder. It is believed that the CLIP model could
5
+ measure the similarity of cross modalities. Please find more information from
6
+ https://github.com/openai/CLIP.
7
+
8
+ The CLIP Score measures the Cosine Similarity between two embedded features.
9
+ This repository utilizes the pretrained CLIP Model to calculate
10
+ the mean average of cosine similarities.
11
+
12
+ See --help to see further details.
13
+
14
+ Code apapted from https://github.com/mseitzer/pytorch-fid and https://github.com/openai/CLIP.
15
+
16
+ Copyright 2023 The Hong Kong Polytechnic University
17
+
18
+ Licensed under the Apache License, Version 2.0 (the "License");
19
+ you may not use this file except in compliance with the License.
20
+ You may obtain a copy of the License at
21
+
22
+ http://www.apache.org/licenses/LICENSE-2.0
23
+
24
+ Unless required by applicable law or agreed to in writing, software
25
+ distributed under the License is distributed on an "AS IS" BASIS,
26
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
27
+ See the License for the specific language governing permissions and
28
+ limitations under the License.
29
+ """
30
+
31
+ import os
32
+ import os.path as osp
33
+ from argparse import ArgumentDefaultsHelpFormatter, ArgumentParser
34
+ import numpy as np
35
+ import torch
36
+ from torch.utils.data import Dataset, DataLoader, Subset
37
+ from decord import VideoReader, cpu
38
+ import random
39
+ from pytorchvideo.transforms import ShortSideScale
40
+ from torchvision.io import read_video
41
+ from torchvision.transforms import Lambda, Compose
42
+ from torchvision.transforms._transforms_video import CenterCropVideo
43
+ import sys
44
+ sys.path.append(".")
45
+ from opensora.eval.cal_lpips import calculate_lpips
46
+ from opensora.eval.cal_fvd import calculate_fvd
47
+ from opensora.eval.cal_psnr import calculate_psnr
48
+ from opensora.eval.cal_flolpips import calculate_flolpips
49
+ from opensora.eval.cal_ssim import calculate_ssim
50
+
51
+ try:
52
+ from tqdm import tqdm
53
+ except ImportError:
54
+ # If tqdm is not available, provide a mock version of it
55
+ def tqdm(x):
56
+ return x
57
+
58
+ class VideoDataset(Dataset):
59
+ def __init__(self,
60
+ real_video_dir,
61
+ generated_video_dir,
62
+ num_frames,
63
+ sample_rate = 1,
64
+ crop_size=None,
65
+ resolution=128,
66
+ ) -> None:
67
+ super().__init__()
68
+ self.real_video_files = self._combine_without_prefix(real_video_dir)
69
+ self.generated_video_files = self._combine_without_prefix(generated_video_dir)
70
+ self.num_frames = num_frames
71
+ self.sample_rate = sample_rate
72
+ self.crop_size = crop_size
73
+ self.short_size = resolution
74
+
75
+
76
+ def __len__(self):
77
+ return len(self.real_video_files)
78
+
79
+ def __getitem__(self, index):
80
+ if index >= len(self):
81
+ raise IndexError
82
+ real_video_file = self.real_video_files[index]
83
+ generated_video_file = self.generated_video_files[index]
84
+ print(real_video_file, generated_video_file)
85
+ real_video_tensor = self._load_video(real_video_file)
86
+ generated_video_tensor = self._load_video(generated_video_file)
87
+ return {'real': real_video_tensor, 'generated':generated_video_tensor }
88
+
89
+
90
+ def _load_video(self, video_path):
91
+ num_frames = self.num_frames
92
+ sample_rate = self.sample_rate
93
+ decord_vr = VideoReader(video_path, ctx=cpu(0))
94
+ total_frames = len(decord_vr)
95
+ sample_frames_len = sample_rate * num_frames
96
+
97
+ if total_frames >= sample_frames_len:
98
+ s = 0
99
+ e = s + sample_frames_len
100
+ num_frames = num_frames
101
+ else:
102
+ s = 0
103
+ e = total_frames
104
+ num_frames = int(total_frames / sample_frames_len * num_frames)
105
+ print(f'sample_frames_len {sample_frames_len}, can only sample {num_frames * sample_rate}', video_path,
106
+ total_frames)
107
+
108
+
109
+ frame_id_list = np.linspace(s, e - 1, num_frames, dtype=int)
110
+ video_data = decord_vr.get_batch(frame_id_list).asnumpy()
111
+ video_data = torch.from_numpy(video_data)
112
+ video_data = video_data.permute(0, 3, 1, 2) # (T, H, W, C) -> (T, C, H, W)
113
+ return _preprocess(video_data, short_size=self.short_size, crop_size = self.crop_size)
114
+
115
+
116
+ def _combine_without_prefix(self, folder_path, prefix='.'):
117
+ folder = []
118
+ os.makedirs(folder_path, exist_ok=True)
119
+ for name in os.listdir(folder_path):
120
+ if name[0] == prefix:
121
+ continue
122
+ if osp.isfile(osp.join(folder_path, name)):
123
+ folder.append(osp.join(folder_path, name))
124
+ folder.sort()
125
+ return folder
126
+
127
+ def _preprocess(video_data, short_size=128, crop_size=None):
128
+ transform = Compose(
129
+ [
130
+ Lambda(lambda x: x / 255.0),
131
+ ShortSideScale(size=short_size),
132
+ CenterCropVideo(crop_size=crop_size),
133
+ ]
134
+ )
135
+ video_outputs = transform(video_data)
136
+ # video_outputs = torch.unsqueeze(video_outputs, 0) # (bz,c,t,h,w)
137
+ return video_outputs
138
+
139
+
140
+ def calculate_common_metric(args, dataloader, device):
141
+
142
+ score_list = []
143
+ for batch_data in tqdm(dataloader): # {'real': real_video_tensor, 'generated':generated_video_tensor }
144
+ real_videos = batch_data['real']
145
+ generated_videos = batch_data['generated']
146
+ assert real_videos.shape[2] == generated_videos.shape[2]
147
+ if args.metric == 'fvd':
148
+ tmp_list = list(calculate_fvd(real_videos, generated_videos, args.device, method=args.fvd_method)['value'].values())
149
+ elif args.metric == 'ssim':
150
+ tmp_list = list(calculate_ssim(real_videos, generated_videos)['value'].values())
151
+ elif args.metric == 'psnr':
152
+ tmp_list = list(calculate_psnr(real_videos, generated_videos)['value'].values())
153
+ elif args.metric == 'flolpips':
154
+ result = calculate_flolpips(real_videos, generated_videos, args.device)
155
+ tmp_list = list(result['value'].values())
156
+ else:
157
+ tmp_list = list(calculate_lpips(real_videos, generated_videos, args.device)['value'].values())
158
+ score_list += tmp_list
159
+ return np.mean(score_list)
160
+
161
+ def main():
162
+ parser = ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatter)
163
+ parser.add_argument('--batch_size', type=int, default=2,
164
+ help='Batch size to use')
165
+ parser.add_argument('--real_video_dir', type=str,
166
+ help=('the path of real videos'))
167
+ parser.add_argument('--generated_video_dir', type=str,
168
+ help=('the path of generated videos'))
169
+ parser.add_argument('--device', type=str, default=None,
170
+ help='Device to use. Like cuda, cuda:0 or cpu')
171
+ parser.add_argument('--num_workers', type=int, default=8,
172
+ help=('Number of processes to use for data loading. '
173
+ 'Defaults to `min(8, num_cpus)`'))
174
+ parser.add_argument('--sample_fps', type=int, default=30)
175
+ parser.add_argument('--resolution', type=int, default=336)
176
+ parser.add_argument('--crop_size', type=int, default=None)
177
+ parser.add_argument('--num_frames', type=int, default=100)
178
+ parser.add_argument('--sample_rate', type=int, default=1)
179
+ parser.add_argument('--subset_size', type=int, default=None)
180
+ parser.add_argument("--metric", type=str, default="fvd",choices=['fvd','psnr','ssim','lpips', 'flolpips'])
181
+ parser.add_argument("--fvd_method", type=str, default='styleganv',choices=['styleganv','videogpt'])
182
+
183
+
184
+ args = parser.parse_args()
185
+
186
+ if args.device is None:
187
+ device = torch.device('cuda' if (torch.cuda.is_available()) else 'cpu')
188
+ else:
189
+ device = torch.device(args.device)
190
+
191
+ if args.num_workers is None:
192
+ try:
193
+ num_cpus = len(os.sched_getaffinity(0))
194
+ except AttributeError:
195
+ # os.sched_getaffinity is not available under Windows, use
196
+ # os.cpu_count instead (which may not return the *available* number
197
+ # of CPUs).
198
+ num_cpus = os.cpu_count()
199
+
200
+ num_workers = min(num_cpus, 8) if num_cpus is not None else 0
201
+ else:
202
+ num_workers = args.num_workers
203
+
204
+
205
+ dataset = VideoDataset(args.real_video_dir,
206
+ args.generated_video_dir,
207
+ num_frames = args.num_frames,
208
+ sample_rate = args.sample_rate,
209
+ crop_size=args.crop_size,
210
+ resolution=args.resolution)
211
+
212
+ if args.subset_size:
213
+ indices = range(args.subset_size)
214
+ dataset = Subset(dataset, indices=indices)
215
+
216
+ dataloader = DataLoader(dataset, args.batch_size,
217
+ num_workers=num_workers, pin_memory=True)
218
+
219
+
220
+ metric_score = calculate_common_metric(args, dataloader,device)
221
+ print('metric: ', args.metric, " ",metric_score)
222
+
223
+ if __name__ == '__main__':
224
+ main()
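
The same pipeline can also be driven programmatically; a sketch using the classes above (directory names, frame counts, and crop sizes are hypothetical, and crop_size must be set because the preprocessing always center-crops):

    from argparse import Namespace
    from torch.utils.data import DataLoader

    args = Namespace(metric="psnr", device="cuda:0", fvd_method="styleganv")
    dataset = VideoDataset("real_videos/", "generated_videos/",
                           num_frames=17, sample_rate=1, crop_size=256, resolution=256)
    loader = DataLoader(dataset, batch_size=2, num_workers=4, pin_memory=True)
    print("psnr:", calculate_common_metric(args, loader, "cuda:0"))
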
opensora/eval/flolpips/correlation/correlation.py ADDED
@@ -0,0 +1,397 @@
1
+ #!/usr/bin/env python
2
+
3
+ import torch
4
+
5
+ import cupy
6
+ import re
7
+
8
+ kernel_Correlation_rearrange = '''
9
+ extern "C" __global__ void kernel_Correlation_rearrange(
10
+ const int n,
11
+ const float* input,
12
+ float* output
13
+ ) {
14
+ int intIndex = (blockIdx.x * blockDim.x) + threadIdx.x;
15
+
16
+ if (intIndex >= n) {
17
+ return;
18
+ }
19
+
20
+ int intSample = blockIdx.z;
21
+ int intChannel = blockIdx.y;
22
+
23
+ float fltValue = input[(((intSample * SIZE_1(input)) + intChannel) * SIZE_2(input) * SIZE_3(input)) + intIndex];
24
+
25
+ __syncthreads();
26
+
27
+ int intPaddedY = (intIndex / SIZE_3(input)) + 4;
28
+ int intPaddedX = (intIndex % SIZE_3(input)) + 4;
29
+ int intRearrange = ((SIZE_3(input) + 8) * intPaddedY) + intPaddedX;
30
+
31
+ output[(((intSample * SIZE_1(output) * SIZE_2(output)) + intRearrange) * SIZE_1(input)) + intChannel] = fltValue;
32
+ }
33
+ '''
34
+
35
+ kernel_Correlation_updateOutput = '''
36
+ extern "C" __global__ void kernel_Correlation_updateOutput(
37
+ const int n,
38
+ const float* rbot0,
39
+ const float* rbot1,
40
+ float* top
41
+ ) {
42
+ extern __shared__ char patch_data_char[];
43
+
44
+ float *patch_data = (float *)patch_data_char;
45
+
46
+ // First (upper left) position of kernel upper-left corner in current center position of neighborhood in image 1
47
+ int x1 = blockIdx.x + 4;
48
+ int y1 = blockIdx.y + 4;
49
+ int item = blockIdx.z;
50
+ int ch_off = threadIdx.x;
51
+
52
+ // Load 3D patch into shared shared memory
53
+ for (int j = 0; j < 1; j++) { // HEIGHT
54
+ for (int i = 0; i < 1; i++) { // WIDTH
55
+ int ji_off = (j + i) * SIZE_3(rbot0);
56
+ for (int ch = ch_off; ch < SIZE_3(rbot0); ch += 32) { // CHANNELS
57
+ int idx1 = ((item * SIZE_1(rbot0) + y1+j) * SIZE_2(rbot0) + x1+i) * SIZE_3(rbot0) + ch;
58
+ int idxPatchData = ji_off + ch;
59
+ patch_data[idxPatchData] = rbot0[idx1];
60
+ }
61
+ }
62
+ }
63
+
64
+ __syncthreads();
65
+
66
+ __shared__ float sum[32];
67
+
68
+ // Compute correlation
69
+ for (int top_channel = 0; top_channel < SIZE_1(top); top_channel++) {
70
+ sum[ch_off] = 0;
71
+
72
+ int s2o = top_channel % 9 - 4;
73
+ int s2p = top_channel / 9 - 4;
74
+
75
+ for (int j = 0; j < 1; j++) { // HEIGHT
76
+ for (int i = 0; i < 1; i++) { // WIDTH
77
+ int ji_off = (j + i) * SIZE_3(rbot0);
78
+ for (int ch = ch_off; ch < SIZE_3(rbot0); ch += 32) { // CHANNELS
79
+ int x2 = x1 + s2o;
80
+ int y2 = y1 + s2p;
81
+
82
+ int idxPatchData = ji_off + ch;
83
+ int idx2 = ((item * SIZE_1(rbot0) + y2+j) * SIZE_2(rbot0) + x2+i) * SIZE_3(rbot0) + ch;
84
+
85
+ sum[ch_off] += patch_data[idxPatchData] * rbot1[idx2];
86
+ }
87
+ }
88
+ }
89
+
90
+ __syncthreads();
91
+
92
+ if (ch_off == 0) {
93
+ float total_sum = 0;
94
+ for (int idx = 0; idx < 32; idx++) {
95
+ total_sum += sum[idx];
96
+ }
97
+ const int sumelems = SIZE_3(rbot0);
98
+ const int index = ((top_channel*SIZE_2(top) + blockIdx.y)*SIZE_3(top))+blockIdx.x;
99
+ top[index + item*SIZE_1(top)*SIZE_2(top)*SIZE_3(top)] = total_sum / (float)sumelems;
100
+ }
101
+ }
102
+ }
103
+ '''
104
+
105
+ kernel_Correlation_updateGradFirst = '''
106
+ #define ROUND_OFF 50000
107
+
108
+ extern "C" __global__ void kernel_Correlation_updateGradFirst(
109
+ const int n,
110
+ const int intSample,
111
+ const float* rbot0,
112
+ const float* rbot1,
113
+ const float* gradOutput,
114
+ float* gradFirst,
115
+ float* gradSecond
116
+ ) { for (int intIndex = (blockIdx.x * blockDim.x) + threadIdx.x; intIndex < n; intIndex += blockDim.x * gridDim.x) {
117
+ int n = intIndex % SIZE_1(gradFirst); // channels
118
+ int l = (intIndex / SIZE_1(gradFirst)) % SIZE_3(gradFirst) + 4; // w-pos
119
+ int m = (intIndex / SIZE_1(gradFirst) / SIZE_3(gradFirst)) % SIZE_2(gradFirst) + 4; // h-pos
120
+
121
+ // round_off is a trick to enable integer division with ceil, even for negative numbers
122
+ // We use a large offset, for the inner part not to become negative.
123
+ const int round_off = ROUND_OFF;
124
+ const int round_off_s1 = round_off;
125
+
126
+ // We add round_off before_s1 the int division and subtract round_off after it, to ensure the formula matches ceil behavior:
127
+ int xmin = (l - 4 + round_off_s1 - 1) + 1 - round_off; // ceil (l - 4)
128
+ int ymin = (m - 4 + round_off_s1 - 1) + 1 - round_off; // ceil (l - 4)
129
+
130
+ // Same here:
131
+ int xmax = (l - 4 + round_off_s1) - round_off; // floor (l - 4)
132
+ int ymax = (m - 4 + round_off_s1) - round_off; // floor (m - 4)
133
+
134
+ float sum = 0;
135
+ if (xmax>=0 && ymax>=0 && (xmin<=SIZE_3(gradOutput)-1) && (ymin<=SIZE_2(gradOutput)-1)) {
136
+ xmin = max(0,xmin);
137
+ xmax = min(SIZE_3(gradOutput)-1,xmax);
138
+
139
+ ymin = max(0,ymin);
140
+ ymax = min(SIZE_2(gradOutput)-1,ymax);
141
+
142
+ for (int p = -4; p <= 4; p++) {
143
+ for (int o = -4; o <= 4; o++) {
144
+ // Get rbot1 data:
145
+ int s2o = o;
146
+ int s2p = p;
147
+ int idxbot1 = ((intSample * SIZE_1(rbot0) + (m+s2p)) * SIZE_2(rbot0) + (l+s2o)) * SIZE_3(rbot0) + n;
148
+ float bot1tmp = rbot1[idxbot1]; // rbot1[l+s2o,m+s2p,n]
149
+
150
+ // Index offset for gradOutput in following loops:
151
+ int op = (p+4) * 9 + (o+4); // index[o,p]
152
+ int idxopoffset = (intSample * SIZE_1(gradOutput) + op);
153
+
154
+ for (int y = ymin; y <= ymax; y++) {
155
+ for (int x = xmin; x <= xmax; x++) {
156
+ int idxgradOutput = (idxopoffset * SIZE_2(gradOutput) + y) * SIZE_3(gradOutput) + x; // gradOutput[x,y,o,p]
157
+ sum += gradOutput[idxgradOutput] * bot1tmp;
158
+ }
159
+ }
160
+ }
161
+ }
162
+ }
163
+ const int sumelems = SIZE_1(gradFirst);
164
+ const int bot0index = ((n * SIZE_2(gradFirst)) + (m-4)) * SIZE_3(gradFirst) + (l-4);
165
+ gradFirst[bot0index + intSample*SIZE_1(gradFirst)*SIZE_2(gradFirst)*SIZE_3(gradFirst)] = sum / (float)sumelems;
166
+ } }
167
+ '''
168
+
169
+ kernel_Correlation_updateGradSecond = '''
170
+ #define ROUND_OFF 50000
171
+
172
+ extern "C" __global__ void kernel_Correlation_updateGradSecond(
173
+ const int n,
174
+ const int intSample,
175
+ const float* rbot0,
176
+ const float* rbot1,
177
+ const float* gradOutput,
178
+ float* gradFirst,
179
+ float* gradSecond
180
+ ) { for (int intIndex = (blockIdx.x * blockDim.x) + threadIdx.x; intIndex < n; intIndex += blockDim.x * gridDim.x) {
181
+ int n = intIndex % SIZE_1(gradSecond); // channels
182
+ int l = (intIndex / SIZE_1(gradSecond)) % SIZE_3(gradSecond) + 4; // w-pos
183
+ int m = (intIndex / SIZE_1(gradSecond) / SIZE_3(gradSecond)) % SIZE_2(gradSecond) + 4; // h-pos
184
+
185
+ // round_off is a trick to enable integer division with ceil, even for negative numbers
186
+ // We use a large offset, for the inner part not to become negative.
187
+ const int round_off = ROUND_OFF;
188
+ const int round_off_s1 = round_off;
189
+
190
+ float sum = 0;
191
+ for (int p = -4; p <= 4; p++) {
192
+ for (int o = -4; o <= 4; o++) {
193
+ int s2o = o;
194
+ int s2p = p;
195
+
196
+ //Get X,Y ranges and clamp
197
+ // We add round_off before_s1 the int division and subtract round_off after it, to ensure the formula matches ceil behavior:
198
+ int xmin = (l - 4 - s2o + round_off_s1 - 1) + 1 - round_off; // ceil (l - 4 - s2o)
199
+ int ymin = (m - 4 - s2p + round_off_s1 - 1) + 1 - round_off; // ceil (l - 4 - s2o)
200
+
201
+ // Same here:
202
+ int xmax = (l - 4 - s2o + round_off_s1) - round_off; // floor (l - 4 - s2o)
203
+ int ymax = (m - 4 - s2p + round_off_s1) - round_off; // floor (m - 4 - s2p)
204
+
205
+ if (xmax>=0 && ymax>=0 && (xmin<=SIZE_3(gradOutput)-1) && (ymin<=SIZE_2(gradOutput)-1)) {
206
+ xmin = max(0,xmin);
207
+ xmax = min(SIZE_3(gradOutput)-1,xmax);
208
+
209
+ ymin = max(0,ymin);
210
+ ymax = min(SIZE_2(gradOutput)-1,ymax);
211
+
212
+ // Get rbot0 data:
213
+ int idxbot0 = ((intSample * SIZE_1(rbot0) + (m-s2p)) * SIZE_2(rbot0) + (l-s2o)) * SIZE_3(rbot0) + n;
214
+ float bot0tmp = rbot0[idxbot0]; // rbot1[l+s2o,m+s2p,n]
215
+
216
+ // Index offset for gradOutput in following loops:
217
+ int op = (p+4) * 9 + (o+4); // index[o,p]
218
+ int idxopoffset = (intSample * SIZE_1(gradOutput) + op);
219
+
220
+ for (int y = ymin; y <= ymax; y++) {
221
+ for (int x = xmin; x <= xmax; x++) {
222
+ int idxgradOutput = (idxopoffset * SIZE_2(gradOutput) + y) * SIZE_3(gradOutput) + x; // gradOutput[x,y,o,p]
223
+ sum += gradOutput[idxgradOutput] * bot0tmp;
224
+ }
225
+ }
226
+ }
227
+ }
228
+ }
229
+ const int sumelems = SIZE_1(gradSecond);
230
+ const int bot1index = ((n * SIZE_2(gradSecond)) + (m-4)) * SIZE_3(gradSecond) + (l-4);
231
+ gradSecond[bot1index + intSample*SIZE_1(gradSecond)*SIZE_2(gradSecond)*SIZE_3(gradSecond)] = sum / (float)sumelems;
232
+ } }
233
+ '''
234
+
235
+ def cupy_kernel(strFunction, objVariables):
236
+ strKernel = globals()[strFunction]
237
+
238
+ while True:
239
+ objMatch = re.search('(SIZE_)([0-4])(\()([^\)]*)(\))', strKernel)
240
+
241
+ if objMatch is None:
242
+ break
243
+ # end
244
+
245
+ intArg = int(objMatch.group(2))
246
+
247
+ strTensor = objMatch.group(4)
248
+ intSizes = objVariables[strTensor].size()
249
+
250
+ strKernel = strKernel.replace(objMatch.group(), str(intSizes[intArg]))
251
+ # end
252
+
253
+ while True:
254
+ objMatch = re.search('(VALUE_)([0-4])(\()([^\)]+)(\))', strKernel)
255
+
256
+ if objMatch is None:
257
+ break
258
+ # end
259
+
260
+ intArgs = int(objMatch.group(2))
261
+ strArgs = objMatch.group(4).split(',')
262
+
263
+ strTensor = strArgs[0]
264
+ intStrides = objVariables[strTensor].stride()
265
+ strIndex = [ '((' + strArgs[intArg + 1].replace('{', '(').replace('}', ')').strip() + ')*' + str(intStrides[intArg]) + ')' for intArg in range(intArgs) ]
266
+
267
+ strKernel = strKernel.replace(objMatch.group(0), strTensor + '[' + str.join('+', strIndex) + ']')
268
+ # end
269
+
270
+ return strKernel
271
+ # end
272
+
273
+ @cupy.memoize(for_each_device=True)
274
+ def cupy_launch(strFunction, strKernel):
275
+ return cupy.RawKernel(strKernel, strFunction)
276
+ # end
277
+
278
+ class _FunctionCorrelation(torch.autograd.Function):
279
+ @staticmethod
280
+ def forward(self, first, second):
281
+ rbot0 = first.new_zeros([ first.shape[0], first.shape[2] + 8, first.shape[3] + 8, first.shape[1] ])
282
+ rbot1 = first.new_zeros([ first.shape[0], first.shape[2] + 8, first.shape[3] + 8, first.shape[1] ])
283
+
284
+ self.save_for_backward(first, second, rbot0, rbot1)
285
+
286
+ first = first.contiguous(); assert(first.is_cuda == True)
287
+ second = second.contiguous(); assert(second.is_cuda == True)
288
+
289
+ output = first.new_zeros([ first.shape[0], 81, first.shape[2], first.shape[3] ])
290
+
291
+ if first.is_cuda == True:
292
+ n = first.shape[2] * first.shape[3]
293
+ cupy_launch('kernel_Correlation_rearrange', cupy_kernel('kernel_Correlation_rearrange', {
294
+ 'input': first,
295
+ 'output': rbot0
296
+ }))(
297
+ grid=tuple([ int((n + 16 - 1) / 16), first.shape[1], first.shape[0] ]),
298
+ block=tuple([ 16, 1, 1 ]),
299
+ args=[ n, first.data_ptr(), rbot0.data_ptr() ]
300
+ )
301
+
302
+ n = second.shape[2] * second.shape[3]
303
+ cupy_launch('kernel_Correlation_rearrange', cupy_kernel('kernel_Correlation_rearrange', {
304
+ 'input': second,
305
+ 'output': rbot1
306
+ }))(
307
+ grid=tuple([ int((n + 16 - 1) / 16), second.shape[1], second.shape[0] ]),
308
+ block=tuple([ 16, 1, 1 ]),
309
+ args=[ n, second.data_ptr(), rbot1.data_ptr() ]
310
+ )
311
+
312
+ n = output.shape[1] * output.shape[2] * output.shape[3]
313
+ cupy_launch('kernel_Correlation_updateOutput', cupy_kernel('kernel_Correlation_updateOutput', {
314
+ 'rbot0': rbot0,
315
+ 'rbot1': rbot1,
316
+ 'top': output
317
+ }))(
318
+ grid=tuple([ output.shape[3], output.shape[2], output.shape[0] ]),
319
+ block=tuple([ 32, 1, 1 ]),
320
+ shared_mem=first.shape[1] * 4,
321
+ args=[ n, rbot0.data_ptr(), rbot1.data_ptr(), output.data_ptr() ]
322
+ )
323
+
324
+ elif first.is_cuda == False:
325
+ raise NotImplementedError()
326
+
327
+ # end
328
+
329
+ return output
330
+ # end
331
+
332
+ @staticmethod
333
+ def backward(self, gradOutput):
334
+ first, second, rbot0, rbot1 = self.saved_tensors
335
+
336
+ gradOutput = gradOutput.contiguous(); assert(gradOutput.is_cuda == True)
337
+
338
+ gradFirst = first.new_zeros([ first.shape[0], first.shape[1], first.shape[2], first.shape[3] ]) if self.needs_input_grad[0] == True else None
339
+ gradSecond = first.new_zeros([ first.shape[0], first.shape[1], first.shape[2], first.shape[3] ]) if self.needs_input_grad[1] == True else None
340
+
341
+ if first.is_cuda == True:
342
+ if gradFirst is not None:
343
+ for intSample in range(first.shape[0]):
344
+ n = first.shape[1] * first.shape[2] * first.shape[3]
345
+ cupy_launch('kernel_Correlation_updateGradFirst', cupy_kernel('kernel_Correlation_updateGradFirst', {
346
+ 'rbot0': rbot0,
347
+ 'rbot1': rbot1,
348
+ 'gradOutput': gradOutput,
349
+ 'gradFirst': gradFirst,
350
+ 'gradSecond': None
351
+ }))(
352
+ grid=tuple([ int((n + 512 - 1) / 512), 1, 1 ]),
353
+ block=tuple([ 512, 1, 1 ]),
354
+ args=[ n, intSample, rbot0.data_ptr(), rbot1.data_ptr(), gradOutput.data_ptr(), gradFirst.data_ptr(), None ]
355
+ )
356
+ # end
357
+ # end
358
+
359
+ if gradSecond is not None:
360
+ for intSample in range(first.shape[0]):
361
+ n = first.shape[1] * first.shape[2] * first.shape[3]
362
+ cupy_launch('kernel_Correlation_updateGradSecond', cupy_kernel('kernel_Correlation_updateGradSecond', {
363
+ 'rbot0': rbot0,
364
+ 'rbot1': rbot1,
365
+ 'gradOutput': gradOutput,
366
+ 'gradFirst': None,
367
+ 'gradSecond': gradSecond
368
+ }))(
369
+ grid=tuple([ int((n + 512 - 1) / 512), 1, 1 ]),
370
+ block=tuple([ 512, 1, 1 ]),
371
+ args=[ n, intSample, rbot0.data_ptr(), rbot1.data_ptr(), gradOutput.data_ptr(), None, gradSecond.data_ptr() ]
372
+ )
373
+ # end
374
+ # end
375
+
376
+ elif first.is_cuda == False:
377
+ raise NotImplementedError()
378
+
379
+ # end
380
+
381
+ return gradFirst, gradSecond
382
+ # end
383
+ # end
384
+
385
+ def FunctionCorrelation(tenFirst, tenSecond):
386
+ return _FunctionCorrelation.apply(tenFirst, tenSecond)
387
+ # end
388
+
389
+ class ModuleCorrelation(torch.nn.Module):
390
+ def __init__(self):
391
+ super(ModuleCorrelation, self).__init__()
392
+ # end
393
+
394
+ def forward(self, tenFirst, tenSecond):
395
+ return _FunctionCorrelation.apply(tenFirst, tenSecond)
396
+ # end
397
+ # end
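
The CUDA kernels above build an 81-channel cost volume: for every pixel, the feature vector of the first input is correlated with the second input at each displacement in a 9x9 window (zero padding outside the image), averaged over channels. A naive PyTorch reference, written only to make the layout explicit (it assumes it matches the kernel semantics and would be far slower than the kernel):

    import torch
    import torch.nn.functional as F

    def naive_correlation(first, second, md=4):
        # first, second: [B, C, H, W]; returns [B, (2*md+1)**2, H, W]
        b, c, h, w = first.shape
        second_pad = F.pad(second, (md, md, md, md))   # zero padding, as in the kernel
        vols = []
        for dy in range(-md, md + 1):                  # channel index = (dy+4)*9 + (dx+4)
            for dx in range(-md, md + 1):
                shifted = second_pad[:, :, md + dy: md + dy + h, md + dx: md + dx + w]
                vols.append((first * shifted).mean(dim=1, keepdim=True))
        return torch.cat(vols, dim=1)
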
opensora/eval/flolpips/flolpips.py ADDED
@@ -0,0 +1,308 @@
1
+
2
+ from __future__ import absolute_import
3
+ import os
4
+ import numpy as np
5
+ import torch
6
+ import torch.nn as nn
7
+ from torch.autograd import Variable
8
+ from .pretrained_networks import vgg16, alexnet, squeezenet
9
+ import torch.nn
10
+ import torch.nn.functional as F
11
+ import torchvision.transforms.functional as TF
12
+ import cv2
13
+
14
+ from .pwcnet import Network as PWCNet
15
+ from .utils import *
16
+
17
+ def spatial_average(in_tens, keepdim=True):
18
+ return in_tens.mean([2,3],keepdim=keepdim)
19
+
20
+ def mw_spatial_average(in_tens, flow, keepdim=True):
21
+ _,_,h,w = in_tens.shape
22
+ flow = F.interpolate(flow, (h,w), align_corners=False, mode='bilinear')
23
+ flow_mag = torch.sqrt(flow[:,0:1]**2 + flow[:,1:2]**2)
24
+ flow_mag = flow_mag / torch.sum(flow_mag, dim=[1,2,3], keepdim=True)
25
+ return torch.sum(in_tens*flow_mag, dim=[2,3],keepdim=keepdim)
26
+
27
+
28
+ def mtw_spatial_average(in_tens, flow, texture, keepdim=True):
29
+ _,_,h,w = in_tens.shape
30
+ flow = F.interpolate(flow, (h,w), align_corners=False, mode='bilinear')
31
+ texture = F.interpolate(texture, (h,w), align_corners=False, mode='bilinear')
32
+ flow_mag = torch.sqrt(flow[:,0:1]**2 + flow[:,1:2]**2)
33
+ flow_mag = (flow_mag - flow_mag.min()) / (flow_mag.max() - flow_mag.min()) + 1e-6
34
+ texture = (texture - texture.min()) / (texture.max() - texture.min()) + 1e-6
35
+ weight = flow_mag / texture
36
+ weight /= torch.sum(weight)
37
+ return torch.sum(in_tens*weight, dim=[2,3],keepdim=keepdim)
38
+
39
+
40
+
41
+ def m2w_spatial_average(in_tens, flow, keepdim=True):
42
+ _,_,h,w = in_tens.shape
43
+ flow = F.interpolate(flow, (h,w), align_corners=False, mode='bilinear')
44
+ flow_mag = flow[:,0:1]**2 + flow[:,1:2]**2 # B,1,H,W
45
+ flow_mag = flow_mag / torch.sum(flow_mag)
46
+ return torch.sum(in_tens*flow_mag, dim=[2,3],keepdim=keepdim)
47
+
48
+ def upsample(in_tens, out_HW=(64,64)): # assumes scale factor is same for H and W
49
+ in_H, in_W = in_tens.shape[2], in_tens.shape[3]
50
+ return nn.Upsample(size=out_HW, mode='bilinear', align_corners=False)(in_tens)
51
+
52
+ # Learned perceptual metric
53
+ class LPIPS(nn.Module):
54
+ def __init__(self, pretrained=True, net='alex', version='0.1', lpips=True, spatial=False,
55
+ pnet_rand=False, pnet_tune=False, use_dropout=True, model_path=None, eval_mode=True, verbose=False):
56
+ # lpips - [True] means with linear calibration on top of base network
57
+ # pretrained - [True] means load linear weights
58
+
59
+ super(LPIPS, self).__init__()
60
+ if(verbose):
61
+ print('Setting up [%s] perceptual loss: trunk [%s], v[%s], spatial [%s]'%
62
+ ('LPIPS' if lpips else 'baseline', net, version, 'on' if spatial else 'off'))
63
+
64
+ self.pnet_type = net
65
+ self.pnet_tune = pnet_tune
66
+ self.pnet_rand = pnet_rand
67
+ self.spatial = spatial
68
+ self.lpips = lpips # false means baseline of just averaging all layers
69
+ self.version = version
70
+ self.scaling_layer = ScalingLayer()
71
+
72
+ if(self.pnet_type in ['vgg','vgg16']):
73
+ net_type = vgg16
74
+ self.chns = [64,128,256,512,512]
75
+ elif(self.pnet_type=='alex'):
76
+ net_type = alexnet
77
+ self.chns = [64,192,384,256,256]
78
+ elif(self.pnet_type=='squeeze'):
79
+ net_type = squeezenet
80
+ self.chns = [64,128,256,384,384,512,512]
81
+ self.L = len(self.chns)
82
+
83
+ self.net = net_type(pretrained=not self.pnet_rand, requires_grad=self.pnet_tune)
84
+
85
+ if(lpips):
86
+ self.lin0 = NetLinLayer(self.chns[0], use_dropout=use_dropout)
87
+ self.lin1 = NetLinLayer(self.chns[1], use_dropout=use_dropout)
88
+ self.lin2 = NetLinLayer(self.chns[2], use_dropout=use_dropout)
89
+ self.lin3 = NetLinLayer(self.chns[3], use_dropout=use_dropout)
90
+ self.lin4 = NetLinLayer(self.chns[4], use_dropout=use_dropout)
91
+ self.lins = [self.lin0,self.lin1,self.lin2,self.lin3,self.lin4]
92
+ if(self.pnet_type=='squeeze'): # 7 layers for squeezenet
93
+ self.lin5 = NetLinLayer(self.chns[5], use_dropout=use_dropout)
94
+ self.lin6 = NetLinLayer(self.chns[6], use_dropout=use_dropout)
95
+ self.lins+=[self.lin5,self.lin6]
96
+ self.lins = nn.ModuleList(self.lins)
97
+
98
+ if(pretrained):
99
+ if(model_path is None):
100
+ import inspect
101
+ import os
102
+ model_path = os.path.abspath(os.path.join(inspect.getfile(self.__init__), '..', 'weights/v%s/%s.pth'%(version,net)))
103
+
104
+ if(verbose):
105
+ print('Loading model from: %s'%model_path)
106
+ self.load_state_dict(torch.load(model_path, map_location='cpu'), strict=False)
107
+
108
+ if(eval_mode):
109
+ self.eval()
110
+
111
+ def forward(self, in0, in1, retPerLayer=False, normalize=False):
112
+ if normalize: # turn on this flag if input is [0,1] so it can be adjusted to [-1, +1]
113
+ in0 = 2 * in0 - 1
114
+ in1 = 2 * in1 - 1
115
+
116
+ # v0.0 - original release had a bug, where input was not scaled
117
+ in0_input, in1_input = (self.scaling_layer(in0), self.scaling_layer(in1)) if self.version=='0.1' else (in0, in1)
118
+ outs0, outs1 = self.net.forward(in0_input), self.net.forward(in1_input)
119
+ feats0, feats1, diffs = {}, {}, {}
120
+
121
+ for kk in range(self.L):
122
+ feats0[kk], feats1[kk] = normalize_tensor(outs0[kk]), normalize_tensor(outs1[kk])
123
+ diffs[kk] = (feats0[kk]-feats1[kk])**2
124
+
125
+ if(self.lpips):
126
+ if(self.spatial):
127
+ res = [upsample(self.lins[kk](diffs[kk]), out_HW=in0.shape[2:]) for kk in range(self.L)]
128
+ else:
129
+ res = [spatial_average(self.lins[kk](diffs[kk]), keepdim=True) for kk in range(self.L)]
130
+ else:
131
+ if(self.spatial):
132
+ res = [upsample(diffs[kk].sum(dim=1,keepdim=True), out_HW=in0.shape[2:]) for kk in range(self.L)]
133
+ else:
134
+ res = [spatial_average(diffs[kk].sum(dim=1,keepdim=True), keepdim=True) for kk in range(self.L)]
135
+
136
+ # val = res[0]
137
+ # for l in range(1,self.L):
138
+ # val += res[l]
139
+ # print(val)
140
+
141
+ # a = spatial_average(self.lins[kk](diffs[kk]), keepdim=True)
142
+ # b = torch.max(self.lins[kk](feats0[kk]**2))
143
+ # for kk in range(self.L):
144
+ # a += spatial_average(self.lins[kk](diffs[kk]), keepdim=True)
145
+ # b = torch.max(b,torch.max(self.lins[kk](feats0[kk]**2)))
146
+ # a = a/self.L
147
+ # from IPython import embed
148
+ # embed()
149
+ # return 10*torch.log10(b/a)
150
+
151
+ # if(retPerLayer):
152
+ # return (val, res)
153
+ # else:
154
+ return torch.sum(torch.cat(res, 1), dim=(1,2,3), keepdims=False)
155
+
156
+
157
+ class ScalingLayer(nn.Module):
158
+ def __init__(self):
159
+ super(ScalingLayer, self).__init__()
160
+ self.register_buffer('shift', torch.Tensor([-.030,-.088,-.188])[None,:,None,None])
161
+ self.register_buffer('scale', torch.Tensor([.458,.448,.450])[None,:,None,None])
162
+
163
+ def forward(self, inp):
164
+ return (inp - self.shift) / self.scale
165
+
166
+
167
+ class NetLinLayer(nn.Module):
168
+ ''' A single linear layer which does a 1x1 conv '''
169
+ def __init__(self, chn_in, chn_out=1, use_dropout=False):
170
+ super(NetLinLayer, self).__init__()
171
+
172
+ layers = [nn.Dropout(),] if(use_dropout) else []
173
+ layers += [nn.Conv2d(chn_in, chn_out, 1, stride=1, padding=0, bias=False),]
174
+ self.model = nn.Sequential(*layers)
175
+
176
+ def forward(self, x):
177
+ return self.model(x)
178
+
179
+ class Dist2LogitLayer(nn.Module):
180
+ ''' takes 2 distances, puts through fc layers, spits out value between [0,1] (if use_sigmoid is True) '''
181
+ def __init__(self, chn_mid=32, use_sigmoid=True):
182
+ super(Dist2LogitLayer, self).__init__()
183
+
184
+ layers = [nn.Conv2d(5, chn_mid, 1, stride=1, padding=0, bias=True),]
185
+ layers += [nn.LeakyReLU(0.2,True),]
186
+ layers += [nn.Conv2d(chn_mid, chn_mid, 1, stride=1, padding=0, bias=True),]
187
+ layers += [nn.LeakyReLU(0.2,True),]
188
+ layers += [nn.Conv2d(chn_mid, 1, 1, stride=1, padding=0, bias=True),]
189
+ if(use_sigmoid):
190
+ layers += [nn.Sigmoid(),]
191
+ self.model = nn.Sequential(*layers)
192
+
193
+ def forward(self,d0,d1,eps=0.1):
194
+ return self.model.forward(torch.cat((d0,d1,d0-d1,d0/(d1+eps),d1/(d0+eps)),dim=1))
195
+
196
+ class BCERankingLoss(nn.Module):
197
+ def __init__(self, chn_mid=32):
198
+ super(BCERankingLoss, self).__init__()
199
+ self.net = Dist2LogitLayer(chn_mid=chn_mid)
200
+ # self.parameters = list(self.net.parameters())
201
+ self.loss = torch.nn.BCELoss()
202
+
203
+ def forward(self, d0, d1, judge):
204
+ per = (judge+1.)/2.
205
+ self.logit = self.net.forward(d0,d1)
206
+ return self.loss(self.logit, per)
207
+
208
+ # L2, DSSIM metrics
209
+ class FakeNet(nn.Module):
210
+ def __init__(self, use_gpu=True, colorspace='Lab'):
211
+ super(FakeNet, self).__init__()
212
+ self.use_gpu = use_gpu
213
+ self.colorspace = colorspace
214
+
215
+ class L2(FakeNet):
216
+ def forward(self, in0, in1, retPerLayer=None):
217
+ assert(in0.size()[0]==1) # currently only supports batchSize 1
218
+
219
+ if(self.colorspace=='RGB'):
220
+ (N,C,X,Y) = in0.size()
221
+ value = torch.mean(torch.mean(torch.mean((in0-in1)**2,dim=1).view(N,1,X,Y),dim=2).view(N,1,1,Y),dim=3).view(N)
222
+ return value
223
+ elif(self.colorspace=='Lab'):
224
+ value = l2(tensor2np(tensor2tensorlab(in0.data,to_norm=False)),
225
+ tensor2np(tensor2tensorlab(in1.data,to_norm=False)), range=100.).astype('float')
226
+ ret_var = Variable( torch.Tensor((value,) ) )
227
+ if(self.use_gpu):
228
+ ret_var = ret_var.cuda()
229
+ return ret_var
230
+
231
+ class DSSIM(FakeNet):
232
+
233
+ def forward(self, in0, in1, retPerLayer=None):
234
+ assert(in0.size()[0]==1) # currently only supports batchSize 1
235
+
236
+ if(self.colorspace=='RGB'):
237
+ value = dssim(1.*tensor2im(in0.data), 1.*tensor2im(in1.data), range=255.).astype('float')
238
+ elif(self.colorspace=='Lab'):
239
+ value = dssim(tensor2np(tensor2tensorlab(in0.data,to_norm=False)),
240
+ tensor2np(tensor2tensorlab(in1.data,to_norm=False)), range=100.).astype('float')
241
+ ret_var = Variable( torch.Tensor((value,) ) )
242
+ if(self.use_gpu):
243
+ ret_var = ret_var.cuda()
244
+ return ret_var
245
+
246
+ def print_network(net):
247
+ num_params = 0
248
+ for param in net.parameters():
249
+ num_params += param.numel()
250
+ print('Network',net)
251
+ print('Total number of parameters: %d' % num_params)
252
+
253
+
254
+ class FloLPIPS(LPIPS):
255
+ def __init__(self, pretrained=True, net='alex', version='0.1', lpips=True, spatial=False, pnet_rand=False, pnet_tune=False, use_dropout=True, model_path=None, eval_mode=True, verbose=False):
256
+ super(FloLPIPS, self).__init__(pretrained, net, version, lpips, spatial, pnet_rand, pnet_tune, use_dropout, model_path, eval_mode, verbose)
257
+
258
+ def forward(self, in0, in1, flow, retPerLayer=False, normalize=False):
259
+ if normalize: # turn on this flag if input is [0,1] so it can be adjusted to [-1, +1]
260
+ in0 = 2 * in0 - 1
261
+ in1 = 2 * in1 - 1
262
+
263
+ in0_input, in1_input = (self.scaling_layer(in0), self.scaling_layer(in1)) if self.version=='0.1' else (in0, in1)
264
+ outs0, outs1 = self.net.forward(in0_input), self.net.forward(in1_input)
265
+ feats0, feats1, diffs = {}, {}, {}
266
+
267
+ for kk in range(self.L):
268
+ feats0[kk], feats1[kk] = normalize_tensor(outs0[kk]), normalize_tensor(outs1[kk])
269
+ diffs[kk] = (feats0[kk]-feats1[kk])**2
270
+
271
+ res = [mw_spatial_average(self.lins[kk](diffs[kk]), flow, keepdim=True) for kk in range(self.L)]
272
+
273
+ return torch.sum(torch.cat(res, 1), dim=(1,2,3), keepdims=False)
274
+
275
+
276
+
277
+
278
+
279
+ class Flolpips(nn.Module):
280
+ def __init__(self):
281
+ super(Flolpips, self).__init__()
282
+ self.loss_fn = FloLPIPS(net='alex',version='0.1')
283
+ self.flownet = PWCNet()
284
+
285
+ @torch.no_grad()
286
+ def forward(self, I0, I1, frame_dis, frame_ref):
287
+ """
288
+ args:
289
+ I0: first frame of the triplet, shape: [B, C, H, W]
290
+ I1: third frame of the triplet, shape: [B, C, H, W]
291
+ frame_dis: prediction of the intermediate frame, shape: [B, C, H, W]
292
+ frame_ref: ground-truth of the intermediate frame, shape: [B, C, H, W]
293
+ """
294
+ assert I0.size() == I1.size() == frame_dis.size() == frame_ref.size(), \
295
+ "the four input tensors should have the same size"
296
+
297
+ flow_ref = self.flownet(frame_ref, I0)
298
+ flow_dis = self.flownet(frame_dis, I0)
299
+ flow_diff = flow_ref - flow_dis
300
+ flolpips_wrt_I0 = self.loss_fn.forward(frame_ref, frame_dis, flow_diff, normalize=True)
301
+
302
+ flow_ref = self.flownet(frame_ref, I1)
303
+ flow_dis = self.flownet(frame_dis, I1)
304
+ flow_diff = flow_ref - flow_dis
305
+ flolpips_wrt_I1 = self.loss_fn.forward(frame_ref, frame_dis, flow_diff, normalize=True)
306
+
307
+ flolpips = (flolpips_wrt_I0 + flolpips_wrt_I1) / 2
308
+ return flolpips
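
A minimal usage sketch for the metric defined above, assuming the module is importable as opensora.eval.flolpips.flolpips, that a CUDA device is available (the PWC-Net helpers move their sampling grids to the GPU), and using random tensors as placeholder frames; the PWC-Net weights are fetched from the URL hard-coded in pwcnet.py on construction.

import torch
from opensora.eval.flolpips.flolpips import Flolpips

metric = Flolpips().cuda().eval()
# placeholder triplet: first frame, third frame, then the predicted and reference middle frames
I0 = torch.rand(1, 3, 256, 448, device='cuda')
I1 = torch.rand(1, 3, 256, 448, device='cuda')
frame_dis = torch.rand(1, 3, 256, 448, device='cuda')
frame_ref = torch.rand(1, 3, 256, 448, device='cuda')
score = metric(I0, I1, frame_dis, frame_ref)  # shape [B]; lower means closer to the reference
print(score.item())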
opensora/eval/flolpips/pretrained_networks.py ADDED
@@ -0,0 +1,180 @@
1
+ from collections import namedtuple
2
+ import torch
3
+ from torchvision import models as tv
4
+
5
+ class squeezenet(torch.nn.Module):
6
+ def __init__(self, requires_grad=False, pretrained=True):
7
+ super(squeezenet, self).__init__()
8
+ pretrained_features = tv.squeezenet1_1(pretrained=pretrained).features
9
+ self.slice1 = torch.nn.Sequential()
10
+ self.slice2 = torch.nn.Sequential()
11
+ self.slice3 = torch.nn.Sequential()
12
+ self.slice4 = torch.nn.Sequential()
13
+ self.slice5 = torch.nn.Sequential()
14
+ self.slice6 = torch.nn.Sequential()
15
+ self.slice7 = torch.nn.Sequential()
16
+ self.N_slices = 7
17
+ for x in range(2):
18
+ self.slice1.add_module(str(x), pretrained_features[x])
19
+ for x in range(2,5):
20
+ self.slice2.add_module(str(x), pretrained_features[x])
21
+ for x in range(5, 8):
22
+ self.slice3.add_module(str(x), pretrained_features[x])
23
+ for x in range(8, 10):
24
+ self.slice4.add_module(str(x), pretrained_features[x])
25
+ for x in range(10, 11):
26
+ self.slice5.add_module(str(x), pretrained_features[x])
27
+ for x in range(11, 12):
28
+ self.slice6.add_module(str(x), pretrained_features[x])
29
+ for x in range(12, 13):
30
+ self.slice7.add_module(str(x), pretrained_features[x])
31
+ if not requires_grad:
32
+ for param in self.parameters():
33
+ param.requires_grad = False
34
+
35
+ def forward(self, X):
36
+ h = self.slice1(X)
37
+ h_relu1 = h
38
+ h = self.slice2(h)
39
+ h_relu2 = h
40
+ h = self.slice3(h)
41
+ h_relu3 = h
42
+ h = self.slice4(h)
43
+ h_relu4 = h
44
+ h = self.slice5(h)
45
+ h_relu5 = h
46
+ h = self.slice6(h)
47
+ h_relu6 = h
48
+ h = self.slice7(h)
49
+ h_relu7 = h
50
+ vgg_outputs = namedtuple("SqueezeOutputs", ['relu1','relu2','relu3','relu4','relu5','relu6','relu7'])
51
+ out = vgg_outputs(h_relu1,h_relu2,h_relu3,h_relu4,h_relu5,h_relu6,h_relu7)
52
+
53
+ return out
54
+
55
+
56
+ class alexnet(torch.nn.Module):
57
+ def __init__(self, requires_grad=False, pretrained=True):
58
+ super(alexnet, self).__init__()
59
+ alexnet_pretrained_features = tv.alexnet(pretrained=pretrained).features
60
+ self.slice1 = torch.nn.Sequential()
61
+ self.slice2 = torch.nn.Sequential()
62
+ self.slice3 = torch.nn.Sequential()
63
+ self.slice4 = torch.nn.Sequential()
64
+ self.slice5 = torch.nn.Sequential()
65
+ self.N_slices = 5
66
+ for x in range(2):
67
+ self.slice1.add_module(str(x), alexnet_pretrained_features[x])
68
+ for x in range(2, 5):
69
+ self.slice2.add_module(str(x), alexnet_pretrained_features[x])
70
+ for x in range(5, 8):
71
+ self.slice3.add_module(str(x), alexnet_pretrained_features[x])
72
+ for x in range(8, 10):
73
+ self.slice4.add_module(str(x), alexnet_pretrained_features[x])
74
+ for x in range(10, 12):
75
+ self.slice5.add_module(str(x), alexnet_pretrained_features[x])
76
+ if not requires_grad:
77
+ for param in self.parameters():
78
+ param.requires_grad = False
79
+
80
+ def forward(self, X):
81
+ h = self.slice1(X)
82
+ h_relu1 = h
83
+ h = self.slice2(h)
84
+ h_relu2 = h
85
+ h = self.slice3(h)
86
+ h_relu3 = h
87
+ h = self.slice4(h)
88
+ h_relu4 = h
89
+ h = self.slice5(h)
90
+ h_relu5 = h
91
+ alexnet_outputs = namedtuple("AlexnetOutputs", ['relu1', 'relu2', 'relu3', 'relu4', 'relu5'])
92
+ out = alexnet_outputs(h_relu1, h_relu2, h_relu3, h_relu4, h_relu5)
93
+
94
+ return out
95
+
96
+ class vgg16(torch.nn.Module):
97
+ def __init__(self, requires_grad=False, pretrained=True):
98
+ super(vgg16, self).__init__()
99
+ vgg_pretrained_features = tv.vgg16(pretrained=pretrained).features
100
+ self.slice1 = torch.nn.Sequential()
101
+ self.slice2 = torch.nn.Sequential()
102
+ self.slice3 = torch.nn.Sequential()
103
+ self.slice4 = torch.nn.Sequential()
104
+ self.slice5 = torch.nn.Sequential()
105
+ self.N_slices = 5
106
+ for x in range(4):
107
+ self.slice1.add_module(str(x), vgg_pretrained_features[x])
108
+ for x in range(4, 9):
109
+ self.slice2.add_module(str(x), vgg_pretrained_features[x])
110
+ for x in range(9, 16):
111
+ self.slice3.add_module(str(x), vgg_pretrained_features[x])
112
+ for x in range(16, 23):
113
+ self.slice4.add_module(str(x), vgg_pretrained_features[x])
114
+ for x in range(23, 30):
115
+ self.slice5.add_module(str(x), vgg_pretrained_features[x])
116
+ if not requires_grad:
117
+ for param in self.parameters():
118
+ param.requires_grad = False
119
+
120
+ def forward(self, X):
121
+ h = self.slice1(X)
122
+ h_relu1_2 = h
123
+ h = self.slice2(h)
124
+ h_relu2_2 = h
125
+ h = self.slice3(h)
126
+ h_relu3_3 = h
127
+ h = self.slice4(h)
128
+ h_relu4_3 = h
129
+ h = self.slice5(h)
130
+ h_relu5_3 = h
131
+ vgg_outputs = namedtuple("VggOutputs", ['relu1_2', 'relu2_2', 'relu3_3', 'relu4_3', 'relu5_3'])
132
+ out = vgg_outputs(h_relu1_2, h_relu2_2, h_relu3_3, h_relu4_3, h_relu5_3)
133
+
134
+ return out
135
+
136
+
137
+
138
+ class resnet(torch.nn.Module):
139
+ def __init__(self, requires_grad=False, pretrained=True, num=18):
140
+ super(resnet, self).__init__()
141
+ if(num==18):
142
+ self.net = tv.resnet18(pretrained=pretrained)
143
+ elif(num==34):
144
+ self.net = tv.resnet34(pretrained=pretrained)
145
+ elif(num==50):
146
+ self.net = tv.resnet50(pretrained=pretrained)
147
+ elif(num==101):
148
+ self.net = tv.resnet101(pretrained=pretrained)
149
+ elif(num==152):
150
+ self.net = tv.resnet152(pretrained=pretrained)
151
+ self.N_slices = 5
152
+
153
+ self.conv1 = self.net.conv1
154
+ self.bn1 = self.net.bn1
155
+ self.relu = self.net.relu
156
+ self.maxpool = self.net.maxpool
157
+ self.layer1 = self.net.layer1
158
+ self.layer2 = self.net.layer2
159
+ self.layer3 = self.net.layer3
160
+ self.layer4 = self.net.layer4
161
+
162
+ def forward(self, X):
163
+ h = self.conv1(X)
164
+ h = self.bn1(h)
165
+ h = self.relu(h)
166
+ h_relu1 = h
167
+ h = self.maxpool(h)
168
+ h = self.layer1(h)
169
+ h_conv2 = h
170
+ h = self.layer2(h)
171
+ h_conv3 = h
172
+ h = self.layer3(h)
173
+ h_conv4 = h
174
+ h = self.layer4(h)
175
+ h_conv5 = h
176
+
177
+ outputs = namedtuple("Outputs", ['relu1','conv2','conv3','conv4','conv5'])
178
+ out = outputs(h_relu1, h_conv2, h_conv3, h_conv4, h_conv5)
179
+
180
+ return out
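
The wrappers above expose torchvision backbones as frozen feature pyramids: forward() returns a namedtuple of intermediate ReLU activations that LPIPS-style metrics compare layer by layer. A short sketch with the alexnet wrapper, assuming torchvision weights can be downloaded and using an arbitrary 224x224 input:

import torch
from opensora.eval.flolpips.pretrained_networks import alexnet

net = alexnet(requires_grad=False, pretrained=True).eval()
x = torch.rand(1, 3, 224, 224)
feats = net(x)  # namedtuple with fields relu1 ... relu5
print([tuple(f.shape) for f in feats])  # five feature maps at decreasing spatial resolution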
opensora/eval/flolpips/pwcnet.py ADDED
@@ -0,0 +1,344 @@
1
+ #!/usr/bin/env python
2
+
3
+ import torch
4
+
5
+ import getopt
6
+ import math
7
+ import numpy
8
+ import os
9
+ import PIL
10
+ import PIL.Image
11
+ import sys
12
+
13
+ # try:
14
+ from .correlation import correlation # the custom cost volume layer
15
+ # except:
16
+ # sys.path.insert(0, './correlation'); import correlation # you should consider upgrading python
17
+ # end
18
+
19
+ ##########################################################
20
+
21
+ # assert(int(str('').join(torch.__version__.split('.')[0:2])) >= 13) # requires at least pytorch version 1.3.0
22
+
23
+ # torch.set_grad_enabled(False) # make sure to not compute gradients for computational performance
24
+
25
+ # torch.backends.cudnn.enabled = True # make sure to use cudnn for computational performance
26
+
27
+ # ##########################################################
28
+
29
+ # arguments_strModel = 'default' # 'default', or 'chairs-things'
30
+ # arguments_strFirst = './images/first.png'
31
+ # arguments_strSecond = './images/second.png'
32
+ # arguments_strOut = './out.flo'
33
+
34
+ # for strOption, strArgument in getopt.getopt(sys.argv[1:], '', [ strParameter[2:] + '=' for strParameter in sys.argv[1::2] ])[0]:
35
+ # if strOption == '--model' and strArgument != '': arguments_strModel = strArgument # which model to use
36
+ # if strOption == '--first' and strArgument != '': arguments_strFirst = strArgument # path to the first frame
37
+ # if strOption == '--second' and strArgument != '': arguments_strSecond = strArgument # path to the second frame
38
+ # if strOption == '--out' and strArgument != '': arguments_strOut = strArgument # path to where the output should be stored
39
+ # end
40
+
41
+ ##########################################################
42
+
43
+
44
+
45
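+ # Warps tenInput towards the reference frame along tenFlow: a normalized sampling grid
+ # is offset by the (rescaled) flow and sampled with grid_sample; an appended all-ones
+ # channel is thresholded afterwards so samples taken from outside the image become zero.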
+ def backwarp(tenInput, tenFlow):
46
+ backwarp_tenGrid = {}
47
+ backwarp_tenPartial = {}
48
+ if str(tenFlow.shape) not in backwarp_tenGrid:
49
+ tenHor = torch.linspace(-1.0 + (1.0 / tenFlow.shape[3]), 1.0 - (1.0 / tenFlow.shape[3]), tenFlow.shape[3]).view(1, 1, 1, -1).expand(-1, -1, tenFlow.shape[2], -1)
50
+ tenVer = torch.linspace(-1.0 + (1.0 / tenFlow.shape[2]), 1.0 - (1.0 / tenFlow.shape[2]), tenFlow.shape[2]).view(1, 1, -1, 1).expand(-1, -1, -1, tenFlow.shape[3])
51
+
52
+ backwarp_tenGrid[str(tenFlow.shape)] = torch.cat([ tenHor, tenVer ], 1).cuda()
53
+ # end
54
+
55
+ if str(tenFlow.shape) not in backwarp_tenPartial:
56
+ backwarp_tenPartial[str(tenFlow.shape)] = tenFlow.new_ones([ tenFlow.shape[0], 1, tenFlow.shape[2], tenFlow.shape[3] ])
57
+ # end
58
+
59
+ tenFlow = torch.cat([ tenFlow[:, 0:1, :, :] / ((tenInput.shape[3] - 1.0) / 2.0), tenFlow[:, 1:2, :, :] / ((tenInput.shape[2] - 1.0) / 2.0) ], 1)
60
+ tenInput = torch.cat([ tenInput, backwarp_tenPartial[str(tenFlow.shape)] ], 1)
61
+
62
+ tenOutput = torch.nn.functional.grid_sample(input=tenInput, grid=(backwarp_tenGrid[str(tenFlow.shape)] + tenFlow).permute(0, 2, 3, 1), mode='bilinear', padding_mode='zeros', align_corners=False)
63
+
64
+ tenMask = tenOutput[:, -1:, :, :]; tenMask[tenMask > 0.999] = 1.0; tenMask[tenMask < 1.0] = 0.0
65
+
66
+ return tenOutput[:, :-1, :, :] * tenMask
67
+ # end
68
+
69
+ ##########################################################
70
+
71
+ class Network(torch.nn.Module):
72
+ def __init__(self):
73
+ super(Network, self).__init__()
74
+
75
+ class Extractor(torch.nn.Module):
76
+ def __init__(self):
77
+ super(Extractor, self).__init__()
78
+
79
+ self.netOne = torch.nn.Sequential(
80
+ torch.nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=2, padding=1),
81
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
82
+ torch.nn.Conv2d(in_channels=16, out_channels=16, kernel_size=3, stride=1, padding=1),
83
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
84
+ torch.nn.Conv2d(in_channels=16, out_channels=16, kernel_size=3, stride=1, padding=1),
85
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1)
86
+ )
87
+
88
+ self.netTwo = torch.nn.Sequential(
89
+ torch.nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=2, padding=1),
90
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
91
+ torch.nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3, stride=1, padding=1),
92
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
93
+ torch.nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3, stride=1, padding=1),
94
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1)
95
+ )
96
+
97
+ self.netThr = torch.nn.Sequential(
98
+ torch.nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=2, padding=1),
99
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
100
+ torch.nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1, padding=1),
101
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
102
+ torch.nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1, padding=1),
103
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1)
104
+ )
105
+
106
+ self.netFou = torch.nn.Sequential(
107
+ torch.nn.Conv2d(in_channels=64, out_channels=96, kernel_size=3, stride=2, padding=1),
108
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
109
+ torch.nn.Conv2d(in_channels=96, out_channels=96, kernel_size=3, stride=1, padding=1),
110
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
111
+ torch.nn.Conv2d(in_channels=96, out_channels=96, kernel_size=3, stride=1, padding=1),
112
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1)
113
+ )
114
+
115
+ self.netFiv = torch.nn.Sequential(
116
+ torch.nn.Conv2d(in_channels=96, out_channels=128, kernel_size=3, stride=2, padding=1),
117
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
118
+ torch.nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, padding=1),
119
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
120
+ torch.nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, padding=1),
121
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1)
122
+ )
123
+
124
+ self.netSix = torch.nn.Sequential(
125
+ torch.nn.Conv2d(in_channels=128, out_channels=196, kernel_size=3, stride=2, padding=1),
126
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
127
+ torch.nn.Conv2d(in_channels=196, out_channels=196, kernel_size=3, stride=1, padding=1),
128
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
129
+ torch.nn.Conv2d(in_channels=196, out_channels=196, kernel_size=3, stride=1, padding=1),
130
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1)
131
+ )
132
+ # end
133
+
134
+ def forward(self, tenInput):
135
+ tenOne = self.netOne(tenInput)
136
+ tenTwo = self.netTwo(tenOne)
137
+ tenThr = self.netThr(tenTwo)
138
+ tenFou = self.netFou(tenThr)
139
+ tenFiv = self.netFiv(tenFou)
140
+ tenSix = self.netSix(tenFiv)
141
+
142
+ return [ tenOne, tenTwo, tenThr, tenFou, tenFiv, tenSix ]
143
+ # end
144
+ # end
145
+
146
+ class Decoder(torch.nn.Module):
147
+ def __init__(self, intLevel):
148
+ super(Decoder, self).__init__()
149
+
150
+ intPrevious = [ None, None, 81 + 32 + 2 + 2, 81 + 64 + 2 + 2, 81 + 96 + 2 + 2, 81 + 128 + 2 + 2, 81, None ][intLevel + 1]
151
+ intCurrent = [ None, None, 81 + 32 + 2 + 2, 81 + 64 + 2 + 2, 81 + 96 + 2 + 2, 81 + 128 + 2 + 2, 81, None ][intLevel + 0]
152
+
153
+ if intLevel < 6: self.netUpflow = torch.nn.ConvTranspose2d(in_channels=2, out_channels=2, kernel_size=4, stride=2, padding=1)
154
+ if intLevel < 6: self.netUpfeat = torch.nn.ConvTranspose2d(in_channels=intPrevious + 128 + 128 + 96 + 64 + 32, out_channels=2, kernel_size=4, stride=2, padding=1)
155
+ if intLevel < 6: self.fltBackwarp = [ None, None, None, 5.0, 2.5, 1.25, 0.625, None ][intLevel + 1]
156
+
157
+ self.netOne = torch.nn.Sequential(
158
+ torch.nn.Conv2d(in_channels=intCurrent, out_channels=128, kernel_size=3, stride=1, padding=1),
159
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1)
160
+ )
161
+
162
+ self.netTwo = torch.nn.Sequential(
163
+ torch.nn.Conv2d(in_channels=intCurrent + 128, out_channels=128, kernel_size=3, stride=1, padding=1),
164
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1)
165
+ )
166
+
167
+ self.netThr = torch.nn.Sequential(
168
+ torch.nn.Conv2d(in_channels=intCurrent + 128 + 128, out_channels=96, kernel_size=3, stride=1, padding=1),
169
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1)
170
+ )
171
+
172
+ self.netFou = torch.nn.Sequential(
173
+ torch.nn.Conv2d(in_channels=intCurrent + 128 + 128 + 96, out_channels=64, kernel_size=3, stride=1, padding=1),
174
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1)
175
+ )
176
+
177
+ self.netFiv = torch.nn.Sequential(
178
+ torch.nn.Conv2d(in_channels=intCurrent + 128 + 128 + 96 + 64, out_channels=32, kernel_size=3, stride=1, padding=1),
179
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1)
180
+ )
181
+
182
+ self.netSix = torch.nn.Sequential(
183
+ torch.nn.Conv2d(in_channels=intCurrent + 128 + 128 + 96 + 64 + 32, out_channels=2, kernel_size=3, stride=1, padding=1)
184
+ )
185
+ # end
186
+
187
+ def forward(self, tenFirst, tenSecond, objPrevious):
188
+ tenFlow = None
189
+ tenFeat = None
190
+
191
+ if objPrevious is None:
192
+ tenFlow = None
193
+ tenFeat = None
194
+
195
+ tenVolume = torch.nn.functional.leaky_relu(input=correlation.FunctionCorrelation(tenFirst=tenFirst, tenSecond=tenSecond), negative_slope=0.1, inplace=False)
196
+
197
+ tenFeat = torch.cat([ tenVolume ], 1)
198
+
199
+ elif objPrevious is not None:
200
+ tenFlow = self.netUpflow(objPrevious['tenFlow'])
201
+ tenFeat = self.netUpfeat(objPrevious['tenFeat'])
202
+
203
+ tenVolume = torch.nn.functional.leaky_relu(input=correlation.FunctionCorrelation(tenFirst=tenFirst, tenSecond=backwarp(tenInput=tenSecond, tenFlow=tenFlow * self.fltBackwarp)), negative_slope=0.1, inplace=False)
204
+
205
+ tenFeat = torch.cat([ tenVolume, tenFirst, tenFlow, tenFeat ], 1)
206
+
207
+ # end
208
+
209
+ tenFeat = torch.cat([ self.netOne(tenFeat), tenFeat ], 1)
210
+ tenFeat = torch.cat([ self.netTwo(tenFeat), tenFeat ], 1)
211
+ tenFeat = torch.cat([ self.netThr(tenFeat), tenFeat ], 1)
212
+ tenFeat = torch.cat([ self.netFou(tenFeat), tenFeat ], 1)
213
+ tenFeat = torch.cat([ self.netFiv(tenFeat), tenFeat ], 1)
214
+
215
+ tenFlow = self.netSix(tenFeat)
216
+
217
+ return {
218
+ 'tenFlow': tenFlow,
219
+ 'tenFeat': tenFeat
220
+ }
221
+ # end
222
+ # end
223
+
224
+ class Refiner(torch.nn.Module):
225
+ def __init__(self):
226
+ super(Refiner, self).__init__()
227
+
228
+ self.netMain = torch.nn.Sequential(
229
+ torch.nn.Conv2d(in_channels=81 + 32 + 2 + 2 + 128 + 128 + 96 + 64 + 32, out_channels=128, kernel_size=3, stride=1, padding=1, dilation=1),
230
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
231
+ torch.nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, padding=2, dilation=2),
232
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
233
+ torch.nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, padding=4, dilation=4),
234
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
235
+ torch.nn.Conv2d(in_channels=128, out_channels=96, kernel_size=3, stride=1, padding=8, dilation=8),
236
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
237
+ torch.nn.Conv2d(in_channels=96, out_channels=64, kernel_size=3, stride=1, padding=16, dilation=16),
238
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
239
+ torch.nn.Conv2d(in_channels=64, out_channels=32, kernel_size=3, stride=1, padding=1, dilation=1),
240
+ torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
241
+ torch.nn.Conv2d(in_channels=32, out_channels=2, kernel_size=3, stride=1, padding=1, dilation=1)
242
+ )
243
+ # end
244
+
245
+ def forward(self, tenInput):
246
+ return self.netMain(tenInput)
247
+ # end
248
+ # end
249
+
250
+ self.netExtractor = Extractor()
251
+
252
+ self.netTwo = Decoder(2)
253
+ self.netThr = Decoder(3)
254
+ self.netFou = Decoder(4)
255
+ self.netFiv = Decoder(5)
256
+ self.netSix = Decoder(6)
257
+
258
+ self.netRefiner = Refiner()
259
+
260
+ self.load_state_dict({ strKey.replace('module', 'net'): tenWeight for strKey, tenWeight in torch.hub.load_state_dict_from_url(url='http://content.sniklaus.com/github/pytorch-pwc/network-' + 'default' + '.pytorch').items() })
261
+ # end
262
+
263
+ def forward(self, tenFirst, tenSecond):
264
+ intWidth = tenFirst.shape[3]
265
+ intHeight = tenFirst.shape[2]
266
+
267
+ intPreprocessedWidth = int(math.floor(math.ceil(intWidth / 64.0) * 64.0))
268
+ intPreprocessedHeight = int(math.floor(math.ceil(intHeight / 64.0) * 64.0))
269
+
270
+ tenPreprocessedFirst = torch.nn.functional.interpolate(input=tenFirst, size=(intPreprocessedHeight, intPreprocessedWidth), mode='bilinear', align_corners=False)
271
+ tenPreprocessedSecond = torch.nn.functional.interpolate(input=tenSecond, size=(intPreprocessedHeight, intPreprocessedWidth), mode='bilinear', align_corners=False)
272
+
273
+ tenFirst = self.netExtractor(tenPreprocessedFirst)
274
+ tenSecond = self.netExtractor(tenPreprocessedSecond)
275
+
276
+
277
+ objEstimate = self.netSix(tenFirst[-1], tenSecond[-1], None)
278
+ objEstimate = self.netFiv(tenFirst[-2], tenSecond[-2], objEstimate)
279
+ objEstimate = self.netFou(tenFirst[-3], tenSecond[-3], objEstimate)
280
+ objEstimate = self.netThr(tenFirst[-4], tenSecond[-4], objEstimate)
281
+ objEstimate = self.netTwo(tenFirst[-5], tenSecond[-5], objEstimate)
282
+
283
+ tenFlow = objEstimate['tenFlow'] + self.netRefiner(objEstimate['tenFeat'])
284
+ tenFlow = 20.0 * torch.nn.functional.interpolate(input=tenFlow, size=(intHeight, intWidth), mode='bilinear', align_corners=False)
285
+ tenFlow[:, 0, :, :] *= float(intWidth) / float(intPreprocessedWidth)
286
+ tenFlow[:, 1, :, :] *= float(intHeight) / float(intPreprocessedHeight)
287
+
288
+ return tenFlow
289
+ # end
290
+ # end
291
+
292
+ netNetwork = None
293
+
294
+ ##########################################################
295
+
296
+ def estimate(tenFirst, tenSecond):
297
+ global netNetwork
298
+
299
+ if netNetwork is None:
300
+ netNetwork = Network().cuda().eval()
301
+ # end
302
+
303
+ assert(tenFirst.shape[1] == tenSecond.shape[1])
304
+ assert(tenFirst.shape[2] == tenSecond.shape[2])
305
+
306
+ intWidth = tenFirst.shape[2]
307
+ intHeight = tenFirst.shape[1]
308
+
309
+ assert(intWidth == 1024) # remember that there is no guarantee for correctness, comment this line out if you acknowledge this and want to continue
310
+ assert(intHeight == 436) # remember that there is no guarantee for correctness, comment this line out if you acknowledge this and want to continue
311
+
312
+ tenPreprocessedFirst = tenFirst.cuda().view(1, 3, intHeight, intWidth)
313
+ tenPreprocessedSecond = tenSecond.cuda().view(1, 3, intHeight, intWidth)
314
+
315
+ intPreprocessedWidth = int(math.floor(math.ceil(intWidth / 64.0) * 64.0))
316
+ intPreprocessedHeight = int(math.floor(math.ceil(intHeight / 64.0) * 64.0))
317
+
318
+ tenPreprocessedFirst = torch.nn.functional.interpolate(input=tenPreprocessedFirst, size=(intPreprocessedHeight, intPreprocessedWidth), mode='bilinear', align_corners=False)
319
+ tenPreprocessedSecond = torch.nn.functional.interpolate(input=tenPreprocessedSecond, size=(intPreprocessedHeight, intPreprocessedWidth), mode='bilinear', align_corners=False)
320
+
321
+ tenFlow = 20.0 * torch.nn.functional.interpolate(input=netNetwork(tenPreprocessedFirst, tenPreprocessedSecond), size=(intHeight, intWidth), mode='bilinear', align_corners=False)
322
+
323
+ tenFlow[:, 0, :, :] *= float(intWidth) / float(intPreprocessedWidth)
324
+ tenFlow[:, 1, :, :] *= float(intHeight) / float(intPreprocessedHeight)
325
+
326
+ return tenFlow[0, :, :, :].cpu()
327
+ # end
328
+
329
+ ##########################################################
330
+
331
+ # if __name__ == '__main__':
332
+ # tenFirst = torch.FloatTensor(numpy.ascontiguousarray(numpy.array(PIL.Image.open(arguments_strFirst))[:, :, ::-1].transpose(2, 0, 1).astype(numpy.float32) * (1.0 / 255.0)))
333
+ # tenSecond = torch.FloatTensor(numpy.ascontiguousarray(numpy.array(PIL.Image.open(arguments_strSecond))[:, :, ::-1].transpose(2, 0, 1).astype(numpy.float32) * (1.0 / 255.0)))
334
+
335
+ # tenOutput = estimate(tenFirst, tenSecond)
336
+
337
+ # objOutput = open(arguments_strOut, 'wb')
338
+
339
+ # numpy.array([ 80, 73, 69, 72 ], numpy.uint8).tofile(objOutput)
340
+ # numpy.array([ tenOutput.shape[2], tenOutput.shape[1] ], numpy.int32).tofile(objOutput)
341
+ # numpy.array(tenOutput.numpy().transpose(1, 2, 0), numpy.float32).tofile(objOutput)
342
+
343
+ # objOutput.close()
344
+ # end
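
The Network module above can also be used on its own; its forward() resizes inputs to multiples of 64 and rescales the returned flow, so arbitrary frame sizes are accepted. A minimal sketch, assuming a CUDA device, the bundled correlation package, and placeholder 256x448 frames (weights are downloaded in __init__):

import torch
from opensora.eval.flolpips.pwcnet import Network

flownet = Network().cuda().eval()
first = torch.rand(2, 3, 256, 448, device='cuda')   # B, C, H, W in [0, 1]
second = torch.rand(2, 3, 256, 448, device='cuda')
with torch.no_grad():
    flow = flownet(first, second)  # [B, 2, H, W]: per-pixel (dx, dy) displacement in pixels
print(flow.shape)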
opensora/eval/flolpips/utils.py ADDED
@@ -0,0 +1,95 @@
1
+ import numpy as np
2
+ import cv2
3
+ import torch
4
+
5
+
6
+ def normalize_tensor(in_feat,eps=1e-10):
7
+ norm_factor = torch.sqrt(torch.sum(in_feat**2,dim=1,keepdim=True))
8
+ return in_feat/(norm_factor+eps)
9
+
10
+ def l2(p0, p1, range=255.):
11
+ return .5*np.mean((p0 / range - p1 / range)**2)
12
+
13
+ def dssim(p0, p1, range=255.):
14
+ try:
+ from skimage.metrics import structural_similarity as compare_ssim  # scikit-image >= 0.16
+ except ImportError:  # older scikit-image releases
+ from skimage.measure import compare_ssim
15
+ return (1 - compare_ssim(p0, p1, data_range=range, multichannel=True)) / 2.
16
+
17
+ def tensor2im(image_tensor, imtype=np.uint8, cent=1., factor=255./2.):
18
+ image_numpy = image_tensor[0].cpu().float().numpy()
19
+ image_numpy = (np.transpose(image_numpy, (1, 2, 0)) + cent) * factor
20
+ return image_numpy.astype(imtype)
21
+
22
+ def tensor2np(tensor_obj):
23
+ # convert a 1xCxHxW tensor into an HxWxC numpy array
24
+ return tensor_obj[0].cpu().float().numpy().transpose((1,2,0))
25
+
26
+ def np2tensor(np_obj):
27
+ # change dimenion of np array into tensor array
28
+ return torch.Tensor(np_obj[:, :, :, np.newaxis].transpose((3, 2, 0, 1)))
29
+
30
+ def tensor2tensorlab(image_tensor,to_norm=True,mc_only=False):
31
+ # image tensor to lab tensor
32
+ from skimage import color
33
+
34
+ img = tensor2im(image_tensor)
35
+ img_lab = color.rgb2lab(img)
36
+ if(mc_only):
37
+ img_lab[:,:,0] = img_lab[:,:,0]-50
38
+ if(to_norm and not mc_only):
39
+ img_lab[:,:,0] = img_lab[:,:,0]-50
40
+ img_lab = img_lab/100.
41
+
42
+ return np2tensor(img_lab)
43
+
44
+ def read_frame_yuv2rgb(stream, width, height, iFrame, bit_depth, pix_fmt='420'):
45
+ if pix_fmt == '420':
46
+ multiplier = 1
47
+ uv_factor = 2
48
+ elif pix_fmt == '444':
49
+ multiplier = 2
50
+ uv_factor = 1
51
+ else:
52
+ print('Pixel format {} is not supported'.format(pix_fmt))
53
+ return
54
+
55
+ if bit_depth == 8:
56
+ datatype = np.uint8
57
+ stream.seek(int(iFrame*1.5*width*height*multiplier))  # seek() needs an int byte offset (1.5 bytes/pixel for 8-bit 4:2:0)
58
+ Y = np.fromfile(stream, dtype=datatype, count=width*height).reshape((height, width))
59
+
60
+ # read the U and V chroma planes
61
+ U = np.fromfile(stream, dtype=datatype, count=(width//uv_factor)*(height//uv_factor)).\
62
+ reshape((height//uv_factor, width//uv_factor))
63
+ V = np.fromfile(stream, dtype=datatype, count=(width//uv_factor)*(height//uv_factor)).\
64
+ reshape((height//uv_factor, width//uv_factor))
65
+
66
+ else:
67
+ datatype = np.uint16
68
+ stream.seek(iFrame*3*width*height*multiplier)
69
+ Y = np.fromfile(stream, dtype=datatype, count=width*height).reshape((height, width))
70
+
71
+ U = np.fromfile(stream, dtype=datatype, count=(width//uv_factor)*(height//uv_factor)).\
72
+ reshape((height//uv_factor, width//uv_factor))
73
+ V = np.fromfile(stream, dtype=datatype, count=(width//uv_factor)*(height//uv_factor)).\
74
+ reshape((height//uv_factor, width//uv_factor))
75
+
76
+ if pix_fmt == '420':
77
+ yuv = np.empty((height*3//2, width), dtype=datatype)
78
+ yuv[0:height,:] = Y
79
+
80
+ yuv[height:height+height//4,:] = U.reshape(-1, width)
81
+ yuv[height+height//4:,:] = V.reshape(-1, width)
82
+
83
+ if bit_depth != 8:
84
+ yuv = (yuv/(2**bit_depth-1)*255).astype(np.uint8)
85
+
86
+ #convert to rgb
87
+ rgb = cv2.cvtColor(yuv, cv2.COLOR_YUV2RGB_I420)
88
+
89
+ else:
90
+ yvu = np.stack([Y,V,U],axis=2)
91
+ if bit_depth != 8:
92
+ yvu = (yvu/(2**bit_depth-1)*255).astype(np.uint8)
93
+ rgb = cv2.cvtColor(yvu, cv2.COLOR_YCrCb2RGB)
94
+
95
+ return rgb
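
A small sketch of read_frame_yuv2rgb on a synthetic 8-bit 4:2:0 frame; the temporary file path and the random plane data are placeholders, and a real call would point at a raw .yuv bitstream. For 8-bit 4:2:0 each frame occupies width*height*3/2 bytes, which is exactly what the seek arithmetic above relies on.

import numpy as np
from opensora.eval.flolpips.utils import read_frame_yuv2rgb

width, height = 64, 48
# one frame of stacked Y (48x64), U (24x32) and V (24x32) planes = 4608 bytes
planes = np.random.randint(0, 256, size=(height * 3 // 2, width), dtype=np.uint8)
with open('/tmp/example.yuv', 'wb') as f:
    planes.tofile(f)
with open('/tmp/example.yuv', 'rb') as f:
    rgb = read_frame_yuv2rgb(f, width, height, iFrame=0, bit_depth=8, pix_fmt='420')
print(rgb.shape)  # (48, 64, 3)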
opensora/eval/fvd/styleganv/fvd.py ADDED
@@ -0,0 +1,90 @@
1
+ import torch
2
+ import os
3
+ import math
4
+ import torch.nn.functional as F
5
+
6
+ # https://github.com/universome/fvd-comparison
7
+
8
+
9
+ def load_i3d_pretrained(device=torch.device('cpu')):
10
+ i3D_WEIGHTS_URL = "https://www.dropbox.com/s/ge9e5ujwgetktms/i3d_torchscript.pt"
11
+ filepath = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'i3d_torchscript.pt')
12
+ print(filepath)
13
+ if not os.path.exists(filepath):
14
+ print(f"Downloading I3D weights from {i3D_WEIGHTS_URL}; you can also download the file manually.")
15
+ os.system(f"wget {i3D_WEIGHTS_URL} -O {filepath}")
16
+ i3d = torch.jit.load(filepath).eval().to(device)
17
+ i3d = torch.nn.DataParallel(i3d)
18
+ return i3d
19
+
20
+
21
+ def get_feats(videos, detector, device, bs=10):
22
+ # videos : torch.tensor BCTHW [0, 1]
23
+ detector_kwargs = dict(rescale=False, resize=False, return_features=True) # Return raw features before the softmax layer.
24
+ feats = np.empty((0, 400))
25
+ with torch.no_grad():
26
+ for i in range((len(videos)-1)//bs + 1):
27
+ feats = np.vstack([feats, detector(torch.stack([preprocess_single(video) for video in videos[i*bs:(i+1)*bs]]).to(device), **detector_kwargs).detach().cpu().numpy()])
28
+ return feats
29
+
30
+
31
+ def get_fvd_feats(videos, i3d, device, bs=10):
32
+ # videos in [0, 1] as torch tensor BCTHW
33
+ # videos = [preprocess_single(video) for video in videos]
34
+ embeddings = get_feats(videos, i3d, device, bs)
35
+ return embeddings
36
+
37
+
38
+ def preprocess_single(video, resolution=224, sequence_length=None):
39
+ # video: CTHW, [0, 1]
40
+ c, t, h, w = video.shape
41
+
42
+ # temporal crop
43
+ if sequence_length is not None:
44
+ assert sequence_length <= t
45
+ video = video[:, :sequence_length]
46
+
47
+ # scale shorter side to resolution
48
+ scale = resolution / min(h, w)
49
+ if h < w:
50
+ target_size = (resolution, math.ceil(w * scale))
51
+ else:
52
+ target_size = (math.ceil(h * scale), resolution)
53
+ video = F.interpolate(video, size=target_size, mode='bilinear', align_corners=False)
54
+
55
+ # center crop
56
+ c, t, h, w = video.shape
57
+ w_start = (w - resolution) // 2
58
+ h_start = (h - resolution) // 2
59
+ video = video[:, :, h_start:h_start + resolution, w_start:w_start + resolution]
60
+
61
+ # [0, 1] -> [-1, 1]
62
+ video = (video - 0.5) * 2
63
+
64
+ return video.contiguous()
65
+
66
+
67
+ """
68
+ Copy-pasted from https://github.com/cvpr2022-stylegan-v/stylegan-v/blob/main/src/metrics/frechet_video_distance.py
69
+ """
70
+ from typing import Tuple
71
+ from scipy.linalg import sqrtm
72
+ import numpy as np
73
+
74
+
75
+ def compute_stats(feats: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
76
+ mu = feats.mean(axis=0) # [d]
77
+ sigma = np.cov(feats, rowvar=False) # [d, d]
78
+ return mu, sigma
79
+
80
+
81
+ def frechet_distance(feats_fake: np.ndarray, feats_real: np.ndarray) -> float:
82
+ mu_gen, sigma_gen = compute_stats(feats_fake)
83
+ mu_real, sigma_real = compute_stats(feats_real)
84
+ m = np.square(mu_gen - mu_real).sum()
85
+ if feats_fake.shape[0]>1:
86
+ s, _ = sqrtm(np.dot(sigma_gen, sigma_real), disp=False) # pylint: disable=no-member
87
+ fid = np.real(m + np.trace(sigma_gen + sigma_real - s * 2))
88
+ else:
89
+ fid = np.real(m)
90
+ return float(fid)
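
Putting the helpers above together, as a sketch with placeholder batch size, clip length and device (a reliable FVD estimate needs far more than eight clips per side):

import torch
from opensora.eval.fvd.styleganv.fvd import load_i3d_pretrained, get_fvd_feats, frechet_distance

device = torch.device('cuda')
i3d = load_i3d_pretrained(device)
real = torch.rand(8, 3, 16, 224, 224)  # B, C, T, H, W in [0, 1]
fake = torch.rand(8, 3, 16, 224, 224)
feats_real = get_fvd_feats(real, i3d, device, bs=4)
feats_fake = get_fvd_feats(fake, i3d, device, bs=4)
print(frechet_distance(feats_fake, feats_real))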
opensora/eval/fvd/styleganv/i3d_torchscript.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bec6519f66ea534e953026b4ae2c65553c17bf105611c746d904657e5860a5e2
3
+ size 51235320