xiangzai committed on
Commit 342a08e · verified · 1 Parent(s): 5484dca

Add files using upload-large-folder tool
REG/evaluations/README.md ADDED
@@ -0,0 +1,72 @@
# Evaluations

To compare different generative models, we use FID, sFID, Precision, Recall, and Inception Score. These metrics can all be calculated using batches of samples, which we store in `.npz` (numpy) files.

# Download batches

We provide pre-computed sample batches for the reference datasets, our diffusion models, and several baselines we compare against. These are all stored in `.npz` format.

Reference dataset batches contain pre-computed statistics over the whole dataset, as well as 10,000 images for computing Precision and Recall. All other batches contain 50,000 images which can be used to compute statistics and Precision/Recall.
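A sample batch is just a standard numpy archive. A minimal sketch of writing one in the layout `evaluator.py` expects (images stored under the `arr_0` key as uint8 NHWC arrays in [0, 255]; the filename and batch size here are illustrative):

```python
import numpy as np

# Hypothetical example: 8 random 64x64 RGB "samples" in the expected layout.
samples = np.random.randint(0, 256, size=(8, 64, 64, 3), dtype=np.uint8)
np.savez("my_samples.npz", arr_0=samples)

# Round trip: evaluator.py reads the batch back from the same "arr_0" key.
batch = np.load("my_samples.npz")["arr_0"]
```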
Here are links to download all of the sample and reference batches:

 * LSUN
   * LSUN bedroom: [reference batch](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/lsun/bedroom/VIRTUAL_lsun_bedroom256.npz)
     * [ADM (dropout)](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/lsun/bedroom/admnet_dropout_lsun_bedroom.npz)
     * [DDPM](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/lsun/bedroom/ddpm_lsun_bedroom.npz)
     * [IDDPM](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/lsun/bedroom/iddpm_lsun_bedroom.npz)
     * [StyleGAN](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/lsun/bedroom/stylegan_lsun_bedroom.npz)
   * LSUN cat: [reference batch](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/lsun/cat/VIRTUAL_lsun_cat256.npz)
     * [ADM (dropout)](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/lsun/cat/admnet_dropout_lsun_cat.npz)
     * [StyleGAN2](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/lsun/cat/stylegan2_lsun_cat.npz)
   * LSUN horse: [reference batch](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/lsun/horse/VIRTUAL_lsun_horse256.npz)
     * [ADM (dropout)](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/lsun/horse/admnet_dropout_lsun_horse.npz)
     * [ADM](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/lsun/horse/admnet_lsun_horse.npz)

 * ImageNet
   * ImageNet 64x64: [reference batch](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/64/VIRTUAL_imagenet64_labeled.npz)
     * [ADM](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/64/admnet_imagenet64.npz)
     * [IDDPM](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/64/iddpm_imagenet64.npz)
     * [BigGAN](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/64/biggan_deep_imagenet64.npz)
   * ImageNet 128x128: [reference batch](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/128/VIRTUAL_imagenet128_labeled.npz)
     * [ADM](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/128/admnet_imagenet128.npz)
     * [ADM-G](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/128/admnet_guided_imagenet128.npz)
     * [ADM-G, 25 steps](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/128/admnet_guided_25step_imagenet128.npz)
     * [BigGAN-deep (trunc=1.0)](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/128/biggan_deep_trunc1_imagenet128.npz)
   * ImageNet 256x256: [reference batch](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/VIRTUAL_imagenet256_labeled.npz)
     * [ADM](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/admnet_imagenet256.npz)
     * [ADM-G](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/admnet_guided_imagenet256.npz)
     * [ADM-G, 25 steps](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/admnet_guided_25step_imagenet256.npz)
     * [ADM-G + ADM-U](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/admnet_guided_upsampled_imagenet256.npz)
     * [ADM-U](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/admnet_upsampled_imagenet256.npz)
     * [BigGAN-deep (trunc=1.0)](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/biggan_deep_trunc1_imagenet256.npz)
   * ImageNet 512x512: [reference batch](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/512/VIRTUAL_imagenet512.npz)
     * [ADM](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/512/admnet_imagenet512.npz)
     * [ADM-G](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/512/admnet_guided_imagenet512.npz)
     * [ADM-G, 25 steps](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/512/admnet_guided_25step_imagenet512.npz)
     * [ADM-G + ADM-U](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/512/admnet_guided_upsampled_imagenet512.npz)
     * [ADM-U](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/512/admnet_upsampled_imagenet512.npz)
     * [BigGAN-deep (trunc=1.0)](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/512/biggan_deep_trunc1_imagenet512.npz)

# Run evaluations

First, generate or download a batch of samples and download the corresponding reference batch for the given dataset. For this example, we'll use ImageNet 256x256, so the reference batch is `VIRTUAL_imagenet256_labeled.npz` and we can use the sample batch `admnet_guided_upsampled_imagenet256.npz`.

Next, run the `evaluator.py` script. The requirements of this script can be found in [requirements.txt](requirements.txt). Pass two arguments to the script: the reference batch and the sample batch. The script will download the InceptionV3 model used for evaluations into the current working directory (if it is not already present). This file is roughly 100MB.

The output of the script will look something like this, where the first `...` is a bunch of verbose TensorFlow logging:

```
$ python evaluator.py VIRTUAL_imagenet256_labeled.npz admnet_guided_upsampled_imagenet256.npz
...
computing reference batch activations...
computing/reading reference batch statistics...
computing sample batch activations...
computing/reading sample batch statistics...
Computing evaluations...
Inception Score: 215.8370361328125
FID: 3.9425574129223264
sFID: 6.140433703346162
Precision: 0.8265
Recall: 0.5309
```
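The FID and sFID values above are Fréchet distances between Gaussians fit to Inception features. As a sanity check, the formula can be sketched in a few lines of numpy; this mirrors `FIDStatistics.frechet_distance` in `evaluator.py`, and the toy `mu`/`sigma` values below are illustrative only:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """d^2 = ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2*sqrt(sigma1 @ sigma2))."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1.dot(sigma2), disp=False)
    if np.iscomplexobj(covmean):
        # Numerical error can introduce a tiny imaginary component.
        covmean = covmean.real
    return diff.dot(diff) + np.trace(sigma1) + np.trace(sigma2) - 2 * np.trace(covmean)

# Identical Gaussians are at distance zero; shifting the mean by 1 in each of
# 4 dimensions adds ||diff||^2 = 4 while the trace terms cancel.
mu = np.zeros(4)
sigma = np.eye(4)
d_same = frechet_distance(mu, sigma, mu, sigma)
d_shift = frechet_distance(mu + 1.0, sigma, mu, sigma)
```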
REG/evaluations/evaluator.py ADDED
@@ -0,0 +1,679 @@
import argparse
import io
import os
import random
import warnings
import zipfile
from abc import ABC, abstractmethod
from contextlib import contextmanager
from functools import partial
from multiprocessing import cpu_count
from multiprocessing.pool import ThreadPool
from typing import Iterable, Optional, Tuple

import numpy as np
import requests
import tensorflow.compat.v1 as tf
from scipy import linalg
from tqdm.auto import tqdm

INCEPTION_V3_URL = "https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/classify_image_graph_def.pb"
INCEPTION_V3_PATH = "classify_image_graph_def.pb"

FID_POOL_NAME = "pool_3:0"
FID_SPATIAL_NAME = "mixed_6/conv:0"

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--ref_batch", help="path to reference batch npz file")
    parser.add_argument("--sample_batch", help="path to sample batch npz file")
    parser.add_argument("--save_path", help="directory to write the results file to")
    parser.add_argument("--cfg_cond", default=1, type=int)
    parser.add_argument("--step", default=1, type=int)
    parser.add_argument("--cfg", default=1.0, type=float)
    parser.add_argument("--cls_cfg", default=1.0, type=float)
    parser.add_argument("--gh", default=1.0, type=float)
    parser.add_argument("--num_steps", default=250, type=int)
    args = parser.parse_args()

    if not os.path.exists(args.save_path):
        os.mkdir(args.save_path)

    config = tf.ConfigProto(
        allow_soft_placement=True  # allows DecodeJpeg to run on CPU in Inception graph
    )
    config.gpu_options.allow_growth = True
    evaluator = Evaluator(tf.Session(config=config))

    print("warming up TensorFlow...")
    # This will cause TF to print a bunch of verbose stuff now rather
    # than after the next print(), to help prevent confusion.
    evaluator.warmup()

    print("computing reference batch activations...")
    ref_acts = evaluator.read_activations(args.ref_batch)
    print("computing/reading reference batch statistics...")
    ref_stats, ref_stats_spatial = evaluator.read_statistics(args.ref_batch, ref_acts)

    print("computing sample batch activations...")
    sample_acts = evaluator.read_activations(args.sample_batch)
    print("computing/reading sample batch statistics...")
    sample_stats, sample_stats_spatial = evaluator.read_statistics(args.sample_batch, sample_acts)

    print("Computing evaluations...")
    inception_score = evaluator.compute_inception_score(sample_acts[0])
    fid = sample_stats.frechet_distance(ref_stats)
    sfid = sample_stats_spatial.frechet_distance(ref_stats_spatial)
    prec, recall = evaluator.compute_prec_recall(ref_acts[0], sample_acts[0])

    print("Inception Score:", inception_score)
    print("FID:", fid)
    print("sFID:", sfid)
    print("Precision:", prec)
    print("Recall:", recall)

    suffix = "cfg_cond_true.txt" if args.cfg_cond else "cfg_cond_false.txt"
    file_path = os.path.join(
        args.save_path,
        f"{args.num_steps}{args.step}{args.cfg}{args.gh}{args.cls_cfg}{suffix}",
    )
    with open(file_path, "w") as file:
        file.write("Inception Score: {}\n".format(inception_score))
        file.write("FID: {}\n".format(fid))
        file.write("sFID: {}\n".format(sfid))
        file.write("Precision: {}\n".format(prec))
        file.write("Recall: {}\n".format(recall))

class InvalidFIDException(Exception):
    pass


class FIDStatistics:
    def __init__(self, mu: np.ndarray, sigma: np.ndarray):
        self.mu = mu
        self.sigma = sigma

    def frechet_distance(self, other, eps=1e-6):
        """
        Compute the Frechet distance between two sets of statistics.
        """
        # https://github.com/bioinf-jku/TTUR/blob/73ab375cdf952a12686d9aa7978567771084da42/fid.py#L132
        mu1, sigma1 = self.mu, self.sigma
        mu2, sigma2 = other.mu, other.sigma

        mu1 = np.atleast_1d(mu1)
        mu2 = np.atleast_1d(mu2)

        sigma1 = np.atleast_2d(sigma1)
        sigma2 = np.atleast_2d(sigma2)

        assert (
            mu1.shape == mu2.shape
        ), f"Training and test mean vectors have different lengths: {mu1.shape}, {mu2.shape}"
        assert (
            sigma1.shape == sigma2.shape
        ), f"Training and test covariances have different dimensions: {sigma1.shape}, {sigma2.shape}"

        diff = mu1 - mu2

        # product might be almost singular
        covmean, _ = linalg.sqrtm(sigma1.dot(sigma2), disp=False)
        if not np.isfinite(covmean).all():
            msg = (
                "fid calculation produces singular product; adding %s to diagonal of cov estimates"
                % eps
            )
            warnings.warn(msg)
            offset = np.eye(sigma1.shape[0]) * eps
            covmean = linalg.sqrtm((sigma1 + offset).dot(sigma2 + offset))

        # numerical error might give slight imaginary component
        if np.iscomplexobj(covmean):
            if not np.allclose(np.diagonal(covmean).imag, 0, atol=1e-3):
                m = np.max(np.abs(covmean.imag))
                raise ValueError("Imaginary component {}".format(m))
            covmean = covmean.real

        tr_covmean = np.trace(covmean)

        return diff.dot(diff) + np.trace(sigma1) + np.trace(sigma2) - 2 * tr_covmean

class Evaluator:
    def __init__(
        self,
        session,
        batch_size=64,
        softmax_batch_size=512,
    ):
        self.sess = session
        self.batch_size = batch_size
        self.softmax_batch_size = softmax_batch_size
        self.manifold_estimator = ManifoldEstimator(session)
        with self.sess.graph.as_default():
            self.image_input = tf.placeholder(tf.float32, shape=[None, None, None, 3])
            self.softmax_input = tf.placeholder(tf.float32, shape=[None, 2048])
            self.pool_features, self.spatial_features = _create_feature_graph(self.image_input)
            self.softmax = _create_softmax_graph(self.softmax_input)

    def warmup(self):
        self.compute_activations(np.zeros([1, 8, 64, 64, 3]))

    def read_activations(self, npz_path: str) -> Tuple[np.ndarray, np.ndarray]:
        with open_npz_array(npz_path, "arr_0") as reader:
            return self.compute_activations(reader.read_batches(self.batch_size))

    def compute_activations(self, batches: Iterable[np.ndarray]) -> Tuple[np.ndarray, np.ndarray]:
        """
        Compute image features for downstream evals.

        :param batches: an iterator over NHWC numpy arrays in [0, 255].
        :return: a tuple of numpy arrays of shape [N x X], where X is a feature
                 dimension. The tuple is (pool_3, spatial).
        """
        preds = []
        spatial_preds = []
        for batch in tqdm(batches):
            batch = batch.astype(np.float32)
            pred, spatial_pred = self.sess.run(
                [self.pool_features, self.spatial_features], {self.image_input: batch}
            )
            preds.append(pred.reshape([pred.shape[0], -1]))
            spatial_preds.append(spatial_pred.reshape([spatial_pred.shape[0], -1]))
        return (
            np.concatenate(preds, axis=0),
            np.concatenate(spatial_preds, axis=0),
        )

    def read_statistics(
        self, npz_path: str, activations: Tuple[np.ndarray, np.ndarray]
    ) -> Tuple[FIDStatistics, FIDStatistics]:
        obj = np.load(npz_path)
        if "mu" in list(obj.keys()):
            return FIDStatistics(obj["mu"], obj["sigma"]), FIDStatistics(
                obj["mu_s"], obj["sigma_s"]
            )
        return tuple(self.compute_statistics(x) for x in activations)

    def compute_statistics(self, activations: np.ndarray) -> FIDStatistics:
        mu = np.mean(activations, axis=0)
        sigma = np.cov(activations, rowvar=False)
        return FIDStatistics(mu, sigma)

    def compute_inception_score(self, activations: np.ndarray, split_size: int = 5000) -> float:
        softmax_out = []
        for i in range(0, len(activations), self.softmax_batch_size):
            acts = activations[i : i + self.softmax_batch_size]
            softmax_out.append(self.sess.run(self.softmax, feed_dict={self.softmax_input: acts}))
        preds = np.concatenate(softmax_out, axis=0)
        # https://github.com/openai/improved-gan/blob/4f5d1ec5c16a7eceb206f42bfc652693601e1d5c/inception_score/model.py#L46
        scores = []
        for i in range(0, len(preds), split_size):
            part = preds[i : i + split_size]
            kl = part * (np.log(part) - np.log(np.expand_dims(np.mean(part, 0), 0)))
            kl = np.mean(np.sum(kl, 1))
            scores.append(np.exp(kl))
        return float(np.mean(scores))

    def compute_prec_recall(
        self, activations_ref: np.ndarray, activations_sample: np.ndarray
    ) -> Tuple[float, float]:
        radii_1 = self.manifold_estimator.manifold_radii(activations_ref)
        radii_2 = self.manifold_estimator.manifold_radii(activations_sample)
        pr = self.manifold_estimator.evaluate_pr(
            activations_ref, radii_1, activations_sample, radii_2
        )
        return (float(pr[0][0]), float(pr[1][0]))

class ManifoldEstimator:
    """
    A helper for comparing manifolds of feature vectors.

    Adapted from https://github.com/kynkaat/improved-precision-and-recall-metric/blob/f60f25e5ad933a79135c783fcda53de30f42c9b9/precision_recall.py#L57
    """

    def __init__(
        self,
        session,
        row_batch_size=10000,
        col_batch_size=10000,
        nhood_sizes=(3,),
        clamp_to_percentile=None,
        eps=1e-5,
    ):
        """
        Estimate the manifold of given feature vectors.

        :param session: the TensorFlow session.
        :param row_batch_size: row batch size to compute pairwise distances
                               (parameter to trade off between memory usage and performance).
        :param col_batch_size: column batch size to compute pairwise distances.
        :param nhood_sizes: number of neighbors used to estimate the manifold.
        :param clamp_to_percentile: prune hyperspheres that have radius larger than
                                    the given percentile.
        :param eps: small number for numerical stability.
        """
        self.distance_block = DistanceBlock(session)
        self.row_batch_size = row_batch_size
        self.col_batch_size = col_batch_size
        self.nhood_sizes = nhood_sizes
        self.num_nhoods = len(nhood_sizes)
        self.clamp_to_percentile = clamp_to_percentile
        self.eps = eps

    def warmup(self):
        feats, radii = (
            np.zeros([1, 2048], dtype=np.float32),
            np.zeros([1, 1], dtype=np.float32),
        )
        self.evaluate_pr(feats, radii, feats, radii)

    def manifold_radii(self, features: np.ndarray) -> np.ndarray:
        num_images = len(features)

        # Estimate manifold of features by calculating distances to k-NN of each sample.
        radii = np.zeros([num_images, self.num_nhoods], dtype=np.float32)
        distance_batch = np.zeros([self.row_batch_size, num_images], dtype=np.float32)
        seq = np.arange(max(self.nhood_sizes) + 1, dtype=np.int32)

        for begin1 in range(0, num_images, self.row_batch_size):
            end1 = min(begin1 + self.row_batch_size, num_images)
            row_batch = features[begin1:end1]

            for begin2 in range(0, num_images, self.col_batch_size):
                end2 = min(begin2 + self.col_batch_size, num_images)
                col_batch = features[begin2:end2]

                # Compute distances between batches.
                distance_batch[
                    0 : end1 - begin1, begin2:end2
                ] = self.distance_block.pairwise_distances(row_batch, col_batch)

            # Find the k-nearest neighbor from the current batch.
            radii[begin1:end1, :] = np.concatenate(
                [
                    x[:, self.nhood_sizes]
                    for x in _numpy_partition(distance_batch[0 : end1 - begin1, :], seq, axis=1)
                ],
                axis=0,
            )

        if self.clamp_to_percentile is not None:
            max_distances = np.percentile(radii, self.clamp_to_percentile, axis=0)
            radii[radii > max_distances] = 0
        return radii

    def evaluate(self, features: np.ndarray, radii: np.ndarray, eval_features: np.ndarray):
        """
        Evaluate whether new feature vectors lie on the estimated manifold.
        """
        num_eval_images = eval_features.shape[0]
        num_ref_images = radii.shape[0]
        distance_batch = np.zeros([self.row_batch_size, num_ref_images], dtype=np.float32)
        batch_predictions = np.zeros([num_eval_images, self.num_nhoods], dtype=np.int32)
        max_realism_score = np.zeros([num_eval_images], dtype=np.float32)
        nearest_indices = np.zeros([num_eval_images], dtype=np.int32)

        for begin1 in range(0, num_eval_images, self.row_batch_size):
            end1 = min(begin1 + self.row_batch_size, num_eval_images)
            feature_batch = eval_features[begin1:end1]

            for begin2 in range(0, num_ref_images, self.col_batch_size):
                end2 = min(begin2 + self.col_batch_size, num_ref_images)
                ref_batch = features[begin2:end2]

                distance_batch[
                    0 : end1 - begin1, begin2:end2
                ] = self.distance_block.pairwise_distances(feature_batch, ref_batch)

            # From the minibatch of new feature vectors, determine if they are in the estimated manifold.
            # If a feature vector is inside a hypersphere of some reference sample, then
            # the new sample lies on the estimated manifold.
            # The radii of the hyperspheres are determined from distances of neighborhood size k.
            samples_in_manifold = distance_batch[0 : end1 - begin1, :, None] <= radii
            batch_predictions[begin1:end1] = np.any(samples_in_manifold, axis=1).astype(np.int32)

            max_realism_score[begin1:end1] = np.max(
                radii[:, 0] / (distance_batch[0 : end1 - begin1, :] + self.eps), axis=1
            )
            nearest_indices[begin1:end1] = np.argmin(distance_batch[0 : end1 - begin1, :], axis=1)

        return {
            "fraction": float(np.mean(batch_predictions)),
            "batch_predictions": batch_predictions,
            "max_realism_score": max_realism_score,
            "nearest_indices": nearest_indices,
        }

    def evaluate_pr(
        self,
        features_1: np.ndarray,
        radii_1: np.ndarray,
        features_2: np.ndarray,
        radii_2: np.ndarray,
    ) -> Tuple[np.ndarray, np.ndarray]:
        """
        Evaluate precision and recall efficiently.

        :param features_1: [N1 x D] feature vectors for reference batch.
        :param radii_1: [N1 x K1] radii for reference vectors.
        :param features_2: [N2 x D] feature vectors for the other batch.
        :param radii_2: [N2 x K2] radii for other vectors.
        :return: a tuple of arrays for (precision, recall):
                 - precision: an np.ndarray of length K1
                 - recall: an np.ndarray of length K2
        """
        features_1_status = np.zeros([len(features_1), radii_2.shape[1]], dtype=np.bool_)
        features_2_status = np.zeros([len(features_2), radii_1.shape[1]], dtype=np.bool_)
        for begin_1 in range(0, len(features_1), self.row_batch_size):
            end_1 = begin_1 + self.row_batch_size
            batch_1 = features_1[begin_1:end_1]
            for begin_2 in range(0, len(features_2), self.col_batch_size):
                end_2 = begin_2 + self.col_batch_size
                batch_2 = features_2[begin_2:end_2]
                batch_1_in, batch_2_in = self.distance_block.less_thans(
                    batch_1, radii_1[begin_1:end_1], batch_2, radii_2[begin_2:end_2]
                )
                features_1_status[begin_1:end_1] |= batch_1_in
                features_2_status[begin_2:end_2] |= batch_2_in
        return (
            np.mean(features_2_status.astype(np.float64), axis=0),
            np.mean(features_1_status.astype(np.float64), axis=0),
        )

class DistanceBlock:
    """
    Calculate pairwise distances between vectors.

    Adapted from https://github.com/kynkaat/improved-precision-and-recall-metric/blob/f60f25e5ad933a79135c783fcda53de30f42c9b9/precision_recall.py#L34
    """

    def __init__(self, session):
        self.session = session

        # Initialize TF graph to calculate pairwise distances.
        with session.graph.as_default():
            self._features_batch1 = tf.placeholder(tf.float32, shape=[None, None])
            self._features_batch2 = tf.placeholder(tf.float32, shape=[None, None])
            distance_block_16 = _batch_pairwise_distances(
                tf.cast(self._features_batch1, tf.float16),
                tf.cast(self._features_batch2, tf.float16),
            )
            self.distance_block = tf.cond(
                tf.reduce_all(tf.math.is_finite(distance_block_16)),
                lambda: tf.cast(distance_block_16, tf.float32),
                lambda: _batch_pairwise_distances(self._features_batch1, self._features_batch2),
            )

            # Extra logic for less thans.
            self._radii1 = tf.placeholder(tf.float32, shape=[None, None])
            self._radii2 = tf.placeholder(tf.float32, shape=[None, None])
            dist32 = tf.cast(self.distance_block, tf.float32)[..., None]
            self._batch_1_in = tf.math.reduce_any(dist32 <= self._radii2, axis=1)
            self._batch_2_in = tf.math.reduce_any(dist32 <= self._radii1[:, None], axis=0)

    def pairwise_distances(self, U, V):
        """
        Evaluate pairwise distances between two batches of feature vectors.
        """
        return self.session.run(
            self.distance_block,
            feed_dict={self._features_batch1: U, self._features_batch2: V},
        )

    def less_thans(self, batch_1, radii_1, batch_2, radii_2):
        return self.session.run(
            [self._batch_1_in, self._batch_2_in],
            feed_dict={
                self._features_batch1: batch_1,
                self._features_batch2: batch_2,
                self._radii1: radii_1,
                self._radii2: radii_2,
            },
        )


def _batch_pairwise_distances(U, V):
    """
    Compute pairwise distances between two batches of feature vectors.
    """
    with tf.variable_scope("pairwise_dist_block"):
        # Squared norms of each row in U and V.
        norm_u = tf.reduce_sum(tf.square(U), 1)
        norm_v = tf.reduce_sum(tf.square(V), 1)

        # norm_u as a column vector and norm_v as a row vector.
        norm_u = tf.reshape(norm_u, [-1, 1])
        norm_v = tf.reshape(norm_v, [1, -1])

        # Pairwise squared Euclidean distances.
        D = tf.maximum(norm_u - 2 * tf.matmul(U, V, False, True) + norm_v, 0.0)

        return D

class NpzArrayReader(ABC):
    @abstractmethod
    def read_batch(self, batch_size: int) -> Optional[np.ndarray]:
        pass

    @abstractmethod
    def remaining(self) -> int:
        pass

    def read_batches(self, batch_size: int) -> Iterable[np.ndarray]:
        def gen_fn():
            while True:
                batch = self.read_batch(batch_size)
                if batch is None:
                    break
                yield batch

        rem = self.remaining()
        num_batches = rem // batch_size + int(rem % batch_size != 0)
        return BatchIterator(gen_fn, num_batches)


class BatchIterator:
    def __init__(self, gen_fn, length):
        self.gen_fn = gen_fn
        self.length = length

    def __len__(self):
        return self.length

    def __iter__(self):
        return self.gen_fn()


class StreamingNpzArrayReader(NpzArrayReader):
    def __init__(self, arr_f, shape, dtype):
        self.arr_f = arr_f
        self.shape = shape
        self.dtype = dtype
        self.idx = 0

    def read_batch(self, batch_size: int) -> Optional[np.ndarray]:
        if self.idx >= self.shape[0]:
            return None

        bs = min(batch_size, self.shape[0] - self.idx)
        self.idx += bs

        if self.dtype.itemsize == 0:
            return np.ndarray([bs, *self.shape[1:]], dtype=self.dtype)

        read_count = bs * np.prod(self.shape[1:])
        read_size = int(read_count * self.dtype.itemsize)
        data = _read_bytes(self.arr_f, read_size, "array data")
        return np.frombuffer(data, dtype=self.dtype).reshape([bs, *self.shape[1:]])

    def remaining(self) -> int:
        return max(0, self.shape[0] - self.idx)


class MemoryNpzArrayReader(NpzArrayReader):
    def __init__(self, arr):
        self.arr = arr
        self.idx = 0

    @classmethod
    def load(cls, path: str, arr_name: str):
        with open(path, "rb") as f:
            arr = np.load(f)[arr_name]
        return cls(arr)

    def read_batch(self, batch_size: int) -> Optional[np.ndarray]:
        if self.idx >= self.arr.shape[0]:
            return None

        res = self.arr[self.idx : self.idx + batch_size]
        self.idx += batch_size
        return res

    def remaining(self) -> int:
        return max(0, self.arr.shape[0] - self.idx)


@contextmanager
def open_npz_array(path: str, arr_name: str) -> NpzArrayReader:
    with _open_npy_file(path, arr_name) as arr_f:
        version = np.lib.format.read_magic(arr_f)
        if version == (1, 0):
            header = np.lib.format.read_array_header_1_0(arr_f)
        elif version == (2, 0):
            header = np.lib.format.read_array_header_2_0(arr_f)
        else:
            yield MemoryNpzArrayReader.load(path, arr_name)
            return
        shape, fortran, dtype = header
        if fortran or dtype.hasobject:
            yield MemoryNpzArrayReader.load(path, arr_name)
        else:
            yield StreamingNpzArrayReader(arr_f, shape, dtype)

560
+ def _read_bytes(fp, size, error_template="ran out of data"):
561
+ """
562
+ Copied from: https://github.com/numpy/numpy/blob/fb215c76967739268de71aa4bda55dd1b062bc2e/numpy/lib/format.py#L788-L886
563
+
564
+ Read from file-like object until size bytes are read.
565
+ Raises ValueError if not EOF is encountered before size bytes are read.
566
+ Non-blocking objects only supported if they derive from io objects.
567
+ Required as e.g. ZipExtFile in python 2.6 can return less data than
568
+ requested.
569
+ """
570
+ data = bytes()
571
+ while True:
572
+ # io files (default in python3) return None or raise on
573
+ # would-block, python2 file will truncate, probably nothing can be
574
+ # done about that. note that regular files can't be non-blocking
575
+ try:
576
+ r = fp.read(size - len(data))
577
+ data += r
578
+ if len(r) == 0 or len(data) == size:
579
+ break
580
+ except io.BlockingIOError:
581
+ pass
582
+ if len(data) != size:
583
+ msg = "EOF: reading %s, expected %d bytes got %d"
584
+ raise ValueError(msg % (error_template, size, len(data)))
585
+ else:
586
+ return data
587
+
588
+
589
+ @contextmanager
590
+ def _open_npy_file(path: str, arr_name: str):
591
+ with open(path, "rb") as f:
592
+ with zipfile.ZipFile(f, "r") as zip_f:
593
+ if f"{arr_name}.npy" not in zip_f.namelist():
594
+ raise ValueError(f"missing {arr_name} in npz file")
595
+ with zip_f.open(f"{arr_name}.npy", "r") as arr_f:
596
+ yield arr_f
597
+
598
+
599
+ def _download_inception_model():
600
+ if os.path.exists(INCEPTION_V3_PATH):
601
+ return
602
+ print("downloading InceptionV3 model...")
603
+ with requests.get(INCEPTION_V3_URL, stream=True) as r:
604
+ r.raise_for_status()
605
+ tmp_path = INCEPTION_V3_PATH + ".tmp"
606
+ with open(tmp_path, "wb") as f:
607
+ for chunk in tqdm(r.iter_content(chunk_size=8192)):
608
+ f.write(chunk)
609
+ os.rename(tmp_path, INCEPTION_V3_PATH)
610
+
611
+
612
+ def _create_feature_graph(input_batch):
613
+ _download_inception_model()
614
+ prefix = f"{random.randrange(2**32)}_{random.randrange(2**32)}"
615
+ with open(INCEPTION_V3_PATH, "rb") as f:
616
+ graph_def = tf.GraphDef()
617
+ graph_def.ParseFromString(f.read())
618
+ pool3, spatial = tf.import_graph_def(
619
+ graph_def,
620
+ input_map={"ExpandDims:0": input_batch},
621
+ return_elements=[FID_POOL_NAME, FID_SPATIAL_NAME],
622
+ name=prefix,
623
+ )
624
+ _update_shapes(pool3)
625
+ spatial = spatial[..., :7]
626
+ return pool3, spatial
627
+
628
+
629
+ def _create_softmax_graph(input_batch):
630
+ _download_inception_model()
631
+ prefix = f"{random.randrange(2**32)}_{random.randrange(2**32)}"
632
+ with open(INCEPTION_V3_PATH, "rb") as f:
633
+ graph_def = tf.GraphDef()
634
+ graph_def.ParseFromString(f.read())
635
+ (matmul,) = tf.import_graph_def(
636
+ graph_def, return_elements=["softmax/logits/MatMul"], name=prefix
637
+ )
638
+ w = matmul.inputs[1]
639
+ logits = tf.matmul(input_batch, w)
640
+ return tf.nn.softmax(logits)
641
+
642
+
643
+ def _update_shapes(pool3):
644
+ # https://github.com/bioinf-jku/TTUR/blob/73ab375cdf952a12686d9aa7978567771084da42/fid.py#L50-L63
645
+ ops = pool3.graph.get_operations()
646
+ for op in ops:
647
+ for o in op.outputs:
648
+ shape = o.get_shape()
649
+ if shape._dims is not None: # pylint: disable=protected-access
650
+ # shape = [s.value for s in shape] TF 1.x
651
+ shape = [s for s in shape] # TF 2.x
652
+ new_shape = []
653
+ for j, s in enumerate(shape):
654
+ if s == 1 and j == 0:
655
+ new_shape.append(None)
656
+ else:
657
+ new_shape.append(s)
658
+ o.__dict__["_shape_val"] = tf.TensorShape(new_shape)
659
+ return pool3
660
+
661
+
662
+ def _numpy_partition(arr, kth, **kwargs):
663
+ num_workers = min(cpu_count(), len(arr))
664
+ chunk_size = len(arr) // num_workers
665
+ extra = len(arr) % num_workers
666
+
667
+ start_idx = 0
668
+ batches = []
669
+ for i in range(num_workers):
670
+ size = chunk_size + (1 if i < extra else 0)
671
+ batches.append(arr[start_idx : start_idx + size])
672
+ start_idx += size
673
+
674
+ with ThreadPool(num_workers) as pool:
675
+ return list(pool.map(partial(np.partition, kth=kth, **kwargs), batches))
676
+
677
+
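The chunking done by `_numpy_partition` can be restated sequentially (`chunked_partition` is a hypothetical name for this sketch); the thread pool in the original only parallelizes the final loop:

```python
import numpy as np

# Split the array into near-equal batches (the first `extra` batches get one
# extra element) and np.partition each batch independently.
def chunked_partition(arr, kth, num_workers):
    num_workers = min(num_workers, len(arr))
    chunk_size, extra = divmod(len(arr), num_workers)
    batches, start = [], 0
    for i in range(num_workers):
        size = chunk_size + (1 if i < extra else 0)
        batches.append(arr[start:start + size])
        start += size
    return [np.partition(b, kth=kth) for b in batches]
```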
678
+ if __name__ == "__main__":
679
+ main()
REG/evaluations/requirements.txt ADDED
@@ -0,0 +1,4 @@
1
+ tensorflow-gpu>=2.0
2
+ scipy
3
+ requests
4
+ tqdm
REG/models/clip_vit.py ADDED
@@ -0,0 +1,426 @@
1
+ from collections import OrderedDict
2
+ from typing import Tuple, Union
3
+
4
+ import numpy as np
5
+ import torch
6
+ import torch.nn.functional as F
7
+ from torch import nn
8
+
9
+ import clip
10
+
11
+
12
+ class Bottleneck(nn.Module):
13
+ expansion = 4
14
+
15
+ def __init__(self, inplanes, planes, stride=1):
16
+ super().__init__()
17
+
18
+ # all conv layers have stride 1. an avgpool is performed after the second convolution when stride > 1
19
+ self.conv1 = nn.Conv2d(inplanes, planes, 1, bias=False)
20
+ self.bn1 = nn.BatchNorm2d(planes)
21
+ self.relu1 = nn.ReLU(inplace=True)
22
+
23
+ self.conv2 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)
24
+ self.bn2 = nn.BatchNorm2d(planes)
25
+ self.relu2 = nn.ReLU(inplace=True)
26
+
27
+ self.avgpool = nn.AvgPool2d(stride) if stride > 1 else nn.Identity()
28
+
29
+ self.conv3 = nn.Conv2d(planes, planes * self.expansion, 1, bias=False)
30
+ self.bn3 = nn.BatchNorm2d(planes * self.expansion)
31
+ self.relu3 = nn.ReLU(inplace=True)
32
+
33
+ self.downsample = None
34
+ self.stride = stride
35
+
36
+ if stride > 1 or inplanes != planes * Bottleneck.expansion:
37
+ # downsampling layer is prepended with an avgpool, and the subsequent convolution has stride 1
38
+ self.downsample = nn.Sequential(OrderedDict([
39
+ ("-1", nn.AvgPool2d(stride)),
40
+ ("0", nn.Conv2d(inplanes, planes * self.expansion, 1, stride=1, bias=False)),
41
+ ("1", nn.BatchNorm2d(planes * self.expansion))
42
+ ]))
43
+
44
+ def forward(self, x: torch.Tensor):
45
+ identity = x
46
+
47
+ out = self.relu1(self.bn1(self.conv1(x)))
48
+ out = self.relu2(self.bn2(self.conv2(out)))
49
+ out = self.avgpool(out)
50
+ out = self.bn3(self.conv3(out))
51
+
52
+ if self.downsample is not None:
53
+ identity = self.downsample(x)
54
+
55
+ out += identity
56
+ out = self.relu3(out)
57
+ return out
58
+
59
+
60
+ class AttentionPool2d(nn.Module):
61
+ def __init__(self, spacial_dim: int, embed_dim: int, num_heads: int, output_dim: int = None):
62
+ super().__init__()
63
+ self.positional_embedding = nn.Parameter(torch.randn(spacial_dim ** 2 + 1, embed_dim) / embed_dim ** 0.5)
64
+ self.k_proj = nn.Linear(embed_dim, embed_dim)
65
+ self.q_proj = nn.Linear(embed_dim, embed_dim)
66
+ self.v_proj = nn.Linear(embed_dim, embed_dim)
67
+ self.c_proj = nn.Linear(embed_dim, output_dim or embed_dim)
68
+ self.num_heads = num_heads
69
+
70
+ def forward(self, x):
71
+ x = x.flatten(start_dim=2).permute(2, 0, 1) # NCHW -> (HW)NC
72
+ x = torch.cat([x.mean(dim=0, keepdim=True), x], dim=0) # (HW+1)NC
73
+ x = x + self.positional_embedding[:, None, :].to(x.dtype) # (HW+1)NC
74
+ x, _ = F.multi_head_attention_forward(
75
+ query=x[:1], key=x, value=x,
76
+ embed_dim_to_check=x.shape[-1],
77
+ num_heads=self.num_heads,
78
+ q_proj_weight=self.q_proj.weight,
79
+ k_proj_weight=self.k_proj.weight,
80
+ v_proj_weight=self.v_proj.weight,
81
+ in_proj_weight=None,
82
+ in_proj_bias=torch.cat([self.q_proj.bias, self.k_proj.bias, self.v_proj.bias]),
83
+ bias_k=None,
84
+ bias_v=None,
85
+ add_zero_attn=False,
86
+ dropout_p=0,
87
+ out_proj_weight=self.c_proj.weight,
88
+ out_proj_bias=self.c_proj.bias,
89
+ use_separate_proj_weight=True,
90
+ training=self.training,
91
+ need_weights=False
92
+ )
93
+ return x.squeeze(0)
94
+
95
+
96
+ class ModifiedResNet(nn.Module):
97
+ """
98
+ A ResNet class that is similar to torchvision's but contains the following changes:
99
+ - There are now 3 "stem" convolutions as opposed to 1, with an average pool instead of a max pool.
100
+ - Performs anti-aliasing strided convolutions, where an avgpool is prepended to convolutions with stride > 1
101
+ - The final pooling layer is a QKV attention instead of an average pool
102
+ """
103
+
104
+ def __init__(self, layers, output_dim, heads, input_resolution=224, width=64):
105
+ super().__init__()
106
+ self.output_dim = output_dim
107
+ self.input_resolution = input_resolution
108
+
109
+ # the 3-layer stem
110
+ self.conv1 = nn.Conv2d(3, width // 2, kernel_size=3, stride=2, padding=1, bias=False)
111
+ self.bn1 = nn.BatchNorm2d(width // 2)
112
+ self.relu1 = nn.ReLU(inplace=True)
113
+ self.conv2 = nn.Conv2d(width // 2, width // 2, kernel_size=3, padding=1, bias=False)
114
+ self.bn2 = nn.BatchNorm2d(width // 2)
115
+ self.relu2 = nn.ReLU(inplace=True)
116
+ self.conv3 = nn.Conv2d(width // 2, width, kernel_size=3, padding=1, bias=False)
117
+ self.bn3 = nn.BatchNorm2d(width)
118
+ self.relu3 = nn.ReLU(inplace=True)
119
+ self.avgpool = nn.AvgPool2d(2)
120
+
121
+ # residual layers
122
+ self._inplanes = width # this is a *mutable* variable used during construction
123
+ self.layer1 = self._make_layer(width, layers[0])
124
+ self.layer2 = self._make_layer(width * 2, layers[1], stride=2)
125
+ self.layer3 = self._make_layer(width * 4, layers[2], stride=2)
126
+ self.layer4 = self._make_layer(width * 8, layers[3], stride=2)
127
+
128
+ embed_dim = width * 32 # the ResNet feature dimension
129
+ self.attnpool = AttentionPool2d(input_resolution // 32, embed_dim, heads, output_dim)
130
+
131
+ def _make_layer(self, planes, blocks, stride=1):
132
+ layers = [Bottleneck(self._inplanes, planes, stride)]
133
+
134
+ self._inplanes = planes * Bottleneck.expansion
135
+ for _ in range(1, blocks):
136
+ layers.append(Bottleneck(self._inplanes, planes))
137
+
138
+ return nn.Sequential(*layers)
139
+
140
+ def forward(self, x):
141
+ def stem(x):
142
+ x = self.relu1(self.bn1(self.conv1(x)))
143
+ x = self.relu2(self.bn2(self.conv2(x)))
144
+ x = self.relu3(self.bn3(self.conv3(x)))
145
+ x = self.avgpool(x)
146
+ return x
147
+
148
+ x = x.type(self.conv1.weight.dtype)
149
+ x = stem(x)
150
+ x = self.layer1(x)
151
+ x = self.layer2(x)
152
+ x = self.layer3(x)
153
+ x = self.layer4(x)
154
+ x = self.attnpool(x)
155
+
156
+ return x
157
+
158
+
159
+ class LayerNorm(nn.LayerNorm):
160
+ """Subclass torch's LayerNorm to handle fp16."""
161
+
162
+ def forward(self, x: torch.Tensor):
163
+ orig_type = x.dtype
164
+ ret = super().forward(x.type(torch.float32))
165
+ return ret.type(orig_type)
166
+
167
+
168
+ class QuickGELU(nn.Module):
169
+ def forward(self, x: torch.Tensor):
170
+ return x * torch.sigmoid(1.702 * x)
171
+
172
+
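QuickGELU is the sigmoid approximation `x * sigmoid(1.702 * x)` to the exact erf-based GELU; a scalar comparison (standard library only, hedged as an illustration) shows how close the two are:

```python
import math

# x * sigmoid(1.702 * x), written out with math.exp
def quick_gelu(x):
    return x / (1.0 + math.exp(-1.702 * x))

# exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))
def exact_gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```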
173
+ class ResidualAttentionBlock(nn.Module):
174
+ def __init__(self, d_model: int, n_head: int, attn_mask: torch.Tensor = None):
175
+ super().__init__()
176
+
177
+ self.attn = nn.MultiheadAttention(d_model, n_head)
178
+ self.ln_1 = LayerNorm(d_model)
179
+ self.mlp = nn.Sequential(OrderedDict([
180
+ ("c_fc", nn.Linear(d_model, d_model * 4)),
181
+ ("gelu", QuickGELU()),
182
+ ("c_proj", nn.Linear(d_model * 4, d_model))
183
+ ]))
184
+ self.ln_2 = LayerNorm(d_model)
185
+ self.attn_mask = attn_mask
186
+
187
+ def attention(self, x: torch.Tensor):
188
+ self.attn_mask = self.attn_mask.to(dtype=x.dtype, device=x.device) if self.attn_mask is not None else None
189
+ return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask)[0]
190
+
191
+ def forward(self, x: torch.Tensor):
192
+ x = x + self.attention(self.ln_1(x))
193
+ x = x + self.mlp(self.ln_2(x))
194
+ return x
195
+
196
+
197
+ class Transformer(nn.Module):
198
+ def __init__(self, width: int, layers: int, heads: int, attn_mask: torch.Tensor = None):
199
+ super().__init__()
200
+ self.width = width
201
+ self.layers = layers
202
+ self.resblocks = nn.Sequential(*[ResidualAttentionBlock(width, heads, attn_mask) for _ in range(layers)])
203
+
204
+ def forward(self, x: torch.Tensor):
205
+ return self.resblocks(x)
206
+
207
+
208
+ class UpdatedVisionTransformer(nn.Module):
209
+ def __init__(self, model):
210
+ super().__init__()
211
+ self.model = model
212
+
213
+ def forward(self, x: torch.Tensor):
214
+ x = self.model.conv1(x) # shape = [*, width, grid, grid]
215
+ x = x.reshape(x.shape[0], x.shape[1], -1) # shape = [*, width, grid ** 2]
216
+ x = x.permute(0, 2, 1) # shape = [*, grid ** 2, width]
217
+ x = torch.cat([self.model.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1) # shape = [*, grid ** 2 + 1, width]
218
+ x = x + self.model.positional_embedding.to(x.dtype)
219
+ x = self.model.ln_pre(x)
220
+
221
+ x = x.permute(1, 0, 2) # NLD -> LND
222
+ x = self.model.transformer(x)
223
+ x = x.permute(1, 0, 2)[:, 1:] # LND -> NLD
224
+
225
+ # x = self.ln_post(x[:, 0, :])
226
+
227
+ # if self.proj is not None:
228
+ # x = x @ self.proj
229
+
230
+ return x
231
+
232
+
233
+ class CLIP(nn.Module):
234
+ def __init__(self,
235
+ embed_dim: int,
236
+ # vision
237
+ image_resolution: int,
238
+ vision_layers: Union[Tuple[int, int, int, int], int],
239
+ vision_width: int,
240
+ vision_patch_size: int,
241
+ # text
242
+ context_length: int,
243
+ vocab_size: int,
244
+ transformer_width: int,
245
+ transformer_heads: int,
246
+ transformer_layers: int
247
+ ):
248
+ super().__init__()
249
+
250
+ self.context_length = context_length
251
+
252
+ if isinstance(vision_layers, (tuple, list)):
253
+ vision_heads = vision_width * 32 // 64
254
+ self.visual = ModifiedResNet(
255
+ layers=vision_layers,
256
+ output_dim=embed_dim,
257
+ heads=vision_heads,
258
+ input_resolution=image_resolution,
259
+ width=vision_width
260
+ )
261
+ else:
262
+ vision_heads = vision_width // 64
263
+ self.visual = UpdatedVisionTransformer(
264
+ input_resolution=image_resolution,
265
+ patch_size=vision_patch_size,
266
+ width=vision_width,
267
+ layers=vision_layers,
268
+ heads=vision_heads,
269
+ output_dim=embed_dim
270
+ )
271
+
272
+ self.transformer = Transformer(
273
+ width=transformer_width,
274
+ layers=transformer_layers,
275
+ heads=transformer_heads,
276
+ attn_mask=self.build_attention_mask()
277
+ )
278
+
279
+ self.vocab_size = vocab_size
280
+ self.token_embedding = nn.Embedding(vocab_size, transformer_width)
281
+ self.positional_embedding = nn.Parameter(torch.empty(self.context_length, transformer_width))
282
+ self.ln_final = LayerNorm(transformer_width)
283
+
284
+ self.text_projection = nn.Parameter(torch.empty(transformer_width, embed_dim))
285
+ self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))
286
+
287
+ self.initialize_parameters()
288
+
289
+ def initialize_parameters(self):
290
+ nn.init.normal_(self.token_embedding.weight, std=0.02)
291
+ nn.init.normal_(self.positional_embedding, std=0.01)
292
+
293
+ if isinstance(self.visual, ModifiedResNet):
294
+ if self.visual.attnpool is not None:
295
+ std = self.visual.attnpool.c_proj.in_features ** -0.5
296
+ nn.init.normal_(self.visual.attnpool.q_proj.weight, std=std)
297
+ nn.init.normal_(self.visual.attnpool.k_proj.weight, std=std)
298
+ nn.init.normal_(self.visual.attnpool.v_proj.weight, std=std)
299
+ nn.init.normal_(self.visual.attnpool.c_proj.weight, std=std)
300
+
301
+ for resnet_block in [self.visual.layer1, self.visual.layer2, self.visual.layer3, self.visual.layer4]:
302
+ for name, param in resnet_block.named_parameters():
303
+ if name.endswith("bn3.weight"):
304
+ nn.init.zeros_(param)
305
+
306
+ proj_std = (self.transformer.width ** -0.5) * ((2 * self.transformer.layers) ** -0.5)
307
+ attn_std = self.transformer.width ** -0.5
308
+ fc_std = (2 * self.transformer.width) ** -0.5
309
+ for block in self.transformer.resblocks:
310
+ nn.init.normal_(block.attn.in_proj_weight, std=attn_std)
311
+ nn.init.normal_(block.attn.out_proj.weight, std=proj_std)
312
+ nn.init.normal_(block.mlp.c_fc.weight, std=fc_std)
313
+ nn.init.normal_(block.mlp.c_proj.weight, std=proj_std)
314
+
315
+ if self.text_projection is not None:
316
+ nn.init.normal_(self.text_projection, std=self.transformer.width ** -0.5)
317
+
318
+ def build_attention_mask(self):
319
+ # lazily create causal attention mask, with full attention between the vision tokens
320
+ # pytorch uses additive attention mask; fill with -inf
321
+ mask = torch.empty(self.context_length, self.context_length)
322
+ mask.fill_(float("-inf"))
323
+ mask.triu_(1) # zero out the lower diagonal
324
+ return mask
325
+
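The same additive causal mask can be sketched in NumPy: `-inf` strictly above the diagonal blocks attention to future tokens, zeros everywhere else, so token `i` can only attend to tokens `<= i`:

```python
import numpy as np

# fill with -inf, then zero out the diagonal and everything below it
n = 4
mask = np.triu(np.full((n, n), float("-inf")), k=1)
```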
326
+ @property
327
+ def dtype(self):
328
+ return self.visual.conv1.weight.dtype
329
+
330
+ def encode_image(self, image):
331
+ return self.visual(image.type(self.dtype))
332
+
333
+ def encode_text(self, text):
334
+ x = self.token_embedding(text).type(self.dtype) # [batch_size, n_ctx, d_model]
335
+
336
+ x = x + self.positional_embedding.type(self.dtype)
337
+ x = x.permute(1, 0, 2) # NLD -> LND
338
+ x = self.transformer(x)
339
+ x = x.permute(1, 0, 2) # LND -> NLD
340
+ x = self.ln_final(x).type(self.dtype)
341
+
342
+ # x.shape = [batch_size, n_ctx, transformer.width]
343
+ # take features from the eot embedding (eot_token is the highest number in each sequence)
344
+ x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection
345
+
346
+ return x
347
+
348
+ def forward(self, image, text):
349
+ image_features = self.encode_image(image)
350
+ text_features = self.encode_text(text)
351
+
352
+ # normalized features
353
+ image_features = image_features / image_features.norm(dim=1, keepdim=True)
354
+ text_features = text_features / text_features.norm(dim=1, keepdim=True)
355
+
356
+ # cosine similarity as logits
357
+ logit_scale = self.logit_scale.exp()
358
+ logits_per_image = logit_scale * image_features @ text_features.t()
359
+ logits_per_text = logits_per_image.t()
360
+
361
+ # shape = [global_batch_size, global_batch_size]
362
+ return logits_per_image, logits_per_text
363
+
364
+
365
+ def convert_weights(model: nn.Module):
366
+ """Convert applicable model parameters to fp16"""
367
+
368
+ def _convert_weights_to_fp16(l):
369
+ if isinstance(l, (nn.Conv1d, nn.Conv2d, nn.Linear)):
370
+ l.weight.data = l.weight.data.half()
371
+ if l.bias is not None:
372
+ l.bias.data = l.bias.data.half()
373
+
374
+ if isinstance(l, nn.MultiheadAttention):
375
+ for attr in [*[f"{s}_proj_weight" for s in ["in", "q", "k", "v"]], "in_proj_bias", "bias_k", "bias_v"]:
376
+ tensor = getattr(l, attr)
377
+ if tensor is not None:
378
+ tensor.data = tensor.data.half()
379
+
380
+ for name in ["text_projection", "proj"]:
381
+ if hasattr(l, name):
382
+ attr = getattr(l, name)
383
+ if attr is not None:
384
+ attr.data = attr.data.half()
385
+
386
+ model.apply(_convert_weights_to_fp16)
387
+
388
+
389
+ def build_model(state_dict: dict):
390
+ vit = "visual.proj" in state_dict
391
+
392
+ if vit:
393
+ vision_width = state_dict["visual.conv1.weight"].shape[0]
394
+ vision_layers = len([k for k in state_dict.keys() if k.startswith("visual.") and k.endswith(".attn.in_proj_weight")])
395
+ vision_patch_size = state_dict["visual.conv1.weight"].shape[-1]
396
+ grid_size = round((state_dict["visual.positional_embedding"].shape[0] - 1) ** 0.5)
397
+ image_resolution = vision_patch_size * grid_size
398
+ else:
399
+ counts: list = [len(set(k.split(".")[2] for k in state_dict if k.startswith(f"visual.layer{b}"))) for b in [1, 2, 3, 4]]
400
+ vision_layers = tuple(counts)
401
+ vision_width = state_dict["visual.layer1.0.conv1.weight"].shape[0]
402
+ output_width = round((state_dict["visual.attnpool.positional_embedding"].shape[0] - 1) ** 0.5)
403
+ vision_patch_size = None
404
+ assert output_width ** 2 + 1 == state_dict["visual.attnpool.positional_embedding"].shape[0]
405
+ image_resolution = output_width * 32
406
+
407
+ embed_dim = state_dict["text_projection"].shape[1]
408
+ context_length = state_dict["positional_embedding"].shape[0]
409
+ vocab_size = state_dict["token_embedding.weight"].shape[0]
410
+ transformer_width = state_dict["ln_final.weight"].shape[0]
411
+ transformer_heads = transformer_width // 64
412
+ transformer_layers = len(set(k.split(".")[2] for k in state_dict if k.startswith("transformer.resblocks")))
413
+
414
+ model = CLIP(
415
+ embed_dim,
416
+ image_resolution, vision_layers, vision_width, vision_patch_size,
417
+ context_length, vocab_size, transformer_width, transformer_heads, transformer_layers
418
+ )
419
+
420
+ for key in ["input_resolution", "context_length", "vocab_size"]:
421
+ if key in state_dict:
422
+ del state_dict[key]
423
+
424
+ convert_weights(model)
425
+ model.load_state_dict(state_dict)
426
+ return model.eval()
REG/models/jepa.py ADDED
@@ -0,0 +1,547 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the license found in the
5
+ # LICENSE file in the root directory of this source tree.
6
+ #
7
+
8
+ import math
9
+ from functools import partial
10
+ import numpy as np
11
+
12
+ import torch
13
+ import torch.nn as nn
14
+
15
+ def _no_grad_trunc_normal_(tensor, mean, std, a, b):
16
+ # Cut & paste from PyTorch official master until it's in a few official releases - RW
17
+ # Method based on https://people.sc.fsu.edu/~jburkardt/presentations/truncated_normal.pdf
18
+ def norm_cdf(x):
19
+ # Computes standard normal cumulative distribution function
20
+ return (1. + math.erf(x / math.sqrt(2.))) / 2.
21
+
22
+ with torch.no_grad():
23
+ # Values are generated by using a truncated uniform distribution and
24
+ # then using the inverse CDF for the normal distribution.
25
+ # Get upper and lower cdf values
26
+ l = norm_cdf((a - mean) / std)
27
+ u = norm_cdf((b - mean) / std)
28
+
29
+ # Uniformly fill tensor with values from [l, u], then translate to
30
+ # [2l-1, 2u-1].
31
+ tensor.uniform_(2 * l - 1, 2 * u - 1)
32
+
33
+ # Use inverse cdf transform for normal distribution to get truncated
34
+ # standard normal
35
+ tensor.erfinv_()
36
+
37
+ # Transform to proper mean, std
38
+ tensor.mul_(std * math.sqrt(2.))
39
+ tensor.add_(mean)
40
+
41
+ # Clamp to ensure it's in the proper range
42
+ tensor.clamp_(min=a, max=b)
43
+ return tensor
44
+
45
+
46
+ def trunc_normal_(tensor, mean=0., std=1., a=-2., b=2.):
47
+ return _no_grad_trunc_normal_(tensor, mean, std, a, b)
48
+
49
+
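The inverse-CDF construction above can be restated with the standard library (a sketch, not the in-place tensor version): draw uniformly between Phi(a) and Phi(b), then map back through the normal inverse CDF so every sample is guaranteed to land in [a, b]:

```python
import random
from statistics import NormalDist

# hypothetical helper name for this sketch
def trunc_normal_samples(n, mean=0.0, std=1.0, a=-2.0, b=2.0, seed=0):
    nd = NormalDist(mu=mean, sigma=std)
    lo, hi = nd.cdf(a), nd.cdf(b)  # CDF values at the truncation bounds
    rng = random.Random(seed)
    # uniform in [Phi(a), Phi(b)] -> inverse CDF -> truncated normal in [a, b]
    return [nd.inv_cdf(rng.uniform(lo, hi)) for _ in range(n)]
```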
50
+ def repeat_interleave_batch(x, B, repeat):
51
+ N = len(x) // B
52
+ x = torch.cat([
53
+ torch.cat([x[i*B:(i+1)*B] for _ in range(repeat)], dim=0)
54
+ for i in range(N)
55
+ ], dim=0)
56
+ return x
57
+
58
+ def apply_masks(x, masks):
59
+ """
60
+ :param x: tensor of shape [B (batch-size), N (num-patches), D (feature-dim)]
61
+ :param masks: list of tensors containing indices of patches in [N] to keep
62
+ """
63
+ all_x = []
64
+ for m in masks:
65
+ mask_keep = m.unsqueeze(-1).repeat(1, 1, x.size(-1))
66
+ all_x += [torch.gather(x, dim=1, index=mask_keep)]
67
+ return torch.cat(all_x, dim=0)
68
+
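The gather performed by `apply_masks` can be restated in NumPy: each mask holds, per batch element, the indices of patches to keep; those rows are gathered and the results stacked along the batch axis (one copy of the batch per mask). `apply_masks_np` is a hypothetical name for this sketch:

```python
import numpy as np

def apply_masks_np(x, masks):
    # x: [B, N, D]; each m: [B, K] indices into the N patch positions
    kept = [np.take_along_axis(x, m[:, :, None], axis=1) for m in masks]
    return np.concatenate(kept, axis=0)
```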
69
+ def get_2d_sincos_pos_embed(embed_dim, grid_size, cls_token=False):
70
+ """
71
+ grid_size: int of the grid height and width
72
+ return:
73
+ pos_embed: [grid_size*grid_size, embed_dim] or [1+grid_size*grid_size, embed_dim] (w/ or w/o cls_token)
74
+ """
75
+ grid_h = np.arange(grid_size, dtype=float)
76
+ grid_w = np.arange(grid_size, dtype=float)
77
+ grid = np.meshgrid(grid_w, grid_h) # here w goes first
78
+ grid = np.stack(grid, axis=0)
79
+
80
+ grid = grid.reshape([2, 1, grid_size, grid_size])
81
+ pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)
82
+ if cls_token:
83
+ pos_embed = np.concatenate([np.zeros([1, embed_dim]), pos_embed], axis=0)
84
+ return pos_embed
85
+
86
+
87
+ def get_2d_sincos_pos_embed_from_grid(embed_dim, grid):
88
+ assert embed_dim % 2 == 0
89
+
90
+ # use half of dimensions to encode grid_h
91
+ emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0]) # (H*W, D/2)
92
+ emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[1]) # (H*W, D/2)
93
+
94
+ emb = np.concatenate([emb_h, emb_w], axis=1) # (H*W, D)
95
+ return emb
96
+
97
+
98
+ def get_1d_sincos_pos_embed(embed_dim, grid_size, cls_token=False):
99
+ """
100
+ grid_size: int of the grid length
101
+ return:
102
+ pos_embed: [grid_size, embed_dim] or [1+grid_size, embed_dim] (w/ or w/o cls_token)
103
+ """
104
+ grid = np.arange(grid_size, dtype=float)
105
+ pos_embed = get_1d_sincos_pos_embed_from_grid(embed_dim, grid)
106
+ if cls_token:
107
+ pos_embed = np.concatenate([np.zeros([1, embed_dim]), pos_embed], axis=0)
108
+ return pos_embed
109
+
110
+
111
+ def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
112
+ """
113
+ embed_dim: output dimension for each position
114
+ pos: a list of positions to be encoded: size (M,)
115
+ out: (M, D)
116
+ """
117
+ assert embed_dim % 2 == 0
118
+ omega = np.arange(embed_dim // 2, dtype=float)
119
+ omega /= embed_dim / 2.
120
+ omega = 1. / 10000**omega # (D/2,)
121
+
122
+ pos = pos.reshape(-1) # (M,)
123
+ out = np.einsum('m,d->md', pos, omega) # (M, D/2), outer product
124
+
125
+ emb_sin = np.sin(out) # (M, D/2)
126
+ emb_cos = np.cos(out) # (M, D/2)
127
+
128
+ emb = np.concatenate([emb_sin, emb_cos], axis=1) # (M, D)
129
+ return emb
130
+
131
+
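A self-contained restatement of the 1-d sin/cos table above: the D/2 frequencies are `omega_d = 1 / 10000**(2d/D)`, and each contributes a sin and a cos channel, giving an (M, D) embedding:

```python
import numpy as np

# hypothetical name for this condensed sketch of the function above
def sincos_1d(embed_dim, pos):
    omega = np.arange(embed_dim // 2, dtype=float) / (embed_dim / 2.0)
    omega = 1.0 / 10000 ** omega                       # (D/2,)
    out = np.outer(np.asarray(pos, dtype=float).reshape(-1), omega)  # (M, D/2)
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)       # (M, D)
```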
132
+ def drop_path(x, drop_prob: float = 0., training: bool = False):
133
+ # stochastic depth: drop whole samples with probability drop_prob
134
+ return x
135
+ keep_prob = 1 - drop_prob
136
+ shape = (x.shape[0],) + (1,) * (x.ndim - 1) # work with diff dim tensors, not just 2D ConvNets
137
+ random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
138
+ random_tensor.floor_() # binarize
139
+ output = x.div(keep_prob) * random_tensor
140
+ return output
141
+
142
+
143
+ class DropPath(nn.Module):
144
+ """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
145
+ """
146
+ def __init__(self, drop_prob=None):
147
+ super(DropPath, self).__init__()
148
+ self.drop_prob = drop_prob
149
+
150
+ def forward(self, x):
151
+ return drop_path(x, self.drop_prob, self.training)
152
+
153
+
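The stochastic-depth math above can be restated in NumPy (`drop_path_np` is a hypothetical name): keep each sample with probability 1 - p, rescale survivors by 1/(1 - p) so the expected output equals the input, and act as the identity at eval time or when p == 0:

```python
import numpy as np

def drop_path_np(x, drop_prob=0.0, training=False, rng=None):
    if drop_prob == 0.0 or not training:
        return x
    if rng is None:
        rng = np.random.default_rng(0)
    keep_prob = 1.0 - drop_prob
    # one 0/1 decision per sample, broadcast over the remaining dims
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    mask = np.floor(keep_prob + rng.random(shape))
    return x / keep_prob * mask
```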
154
+ class MLP(nn.Module):
155
+ def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
156
+ super().__init__()
157
+ out_features = out_features or in_features
158
+ hidden_features = hidden_features or in_features
159
+ self.fc1 = nn.Linear(in_features, hidden_features)
160
+ self.act = act_layer()
161
+ self.fc2 = nn.Linear(hidden_features, out_features)
162
+ self.drop = nn.Dropout(drop)
163
+
164
+ def forward(self, x):
165
+ x = self.fc1(x)
166
+ x = self.act(x)
167
+ x = self.drop(x)
168
+ x = self.fc2(x)
169
+ x = self.drop(x)
170
+ return x
171
+
172
+
173
+ class Attention(nn.Module):
174
+ def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
175
+ super().__init__()
176
+ self.num_heads = num_heads
177
+ head_dim = dim // num_heads
178
+ self.scale = qk_scale or head_dim ** -0.5
179
+
180
+ self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
181
+ self.attn_drop = nn.Dropout(attn_drop)
182
+ self.proj = nn.Linear(dim, dim)
183
+ self.proj_drop = nn.Dropout(proj_drop)
184
+
185
+ def forward(self, x):
186
+ B, N, C = x.shape
187
+ qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
188
+ q, k, v = qkv[0], qkv[1], qkv[2]
189
+
190
+ attn = (q @ k.transpose(-2, -1)) * self.scale
191
+ attn = attn.softmax(dim=-1)
192
+ attn = self.attn_drop(attn)
193
+
194
+ x = (attn @ v).transpose(1, 2).reshape(B, N, C)
195
+ x = self.proj(x)
196
+ x = self.proj_drop(x)
197
+ return x, attn
198
+
199
+
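A single-head restatement of the attention math above, `softmax(q k^T / sqrt(d)) v`, returning both the output and the weights just as `Attention.forward` does (`attention_np` is a hypothetical name):

```python
import numpy as np

def attention_np(q, k, v):
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # numerically stable softmax over the key axis
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v, w
```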
200
+ class Block(nn.Module):
201
+ def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop=0., attn_drop=0.,
202
+ drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm):
203
+ super().__init__()
204
+ self.norm1 = norm_layer(dim)
205
+ self.attn = Attention(
206
+ dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop)
207
+ self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
208
+ self.norm2 = norm_layer(dim)
209
+ mlp_hidden_dim = int(dim * mlp_ratio)
210
+ self.mlp = MLP(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
211
+
212
+ def forward(self, x, return_attention=False):
213
+ y, attn = self.attn(self.norm1(x))
214
+ if return_attention:
215
+ return attn
216
+ x = x + self.drop_path(y)
217
+ x = x + self.drop_path(self.mlp(self.norm2(x)))
218
+ return x
219
+
220
+
221
+ class PatchEmbed(nn.Module):
222
+ """ Image to Patch Embedding
223
+ """
224
+ def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
225
+ super().__init__()
226
+ num_patches = (img_size // patch_size) * (img_size // patch_size)
227
+ self.img_size = img_size
228
+ self.patch_size = patch_size
229
+ self.num_patches = num_patches
230
+
231
+ self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
232
+
233
+ def forward(self, x):
234
+ B, C, H, W = x.shape
235
+ x = self.proj(x).flatten(2).transpose(1, 2)
236
+ return x
237
+
238
+
239
+ class ConvEmbed(nn.Module):
240
+ """
241
+ 3x3 Convolution stems for ViT following ViTC models
242
+ """
243
+
244
+ def __init__(self, channels, strides, img_size=224, in_chans=3, batch_norm=True):
245
+ super().__init__()
246
+ # Build the stems
247
+ stem = []
248
+ channels = [in_chans] + channels
249
+ for i in range(len(channels) - 2):
250
+ stem += [nn.Conv2d(channels[i], channels[i+1], kernel_size=3,
251
+ stride=strides[i], padding=1, bias=(not batch_norm))]
252
+ if batch_norm:
253
+ stem += [nn.BatchNorm2d(channels[i+1])]
254
+ stem += [nn.ReLU(inplace=True)]
255
+ stem += [nn.Conv2d(channels[-2], channels[-1], kernel_size=1, stride=strides[-1])]
256
+ self.stem = nn.Sequential(*stem)
257
+
258
+ # Compute the number of patches
259
+ stride_prod = int(np.prod(strides))
260
+ self.num_patches = (img_size[0] // stride_prod)**2
261
+
262
+ def forward(self, x):
263
+ p = self.stem(x)
264
+ return p.flatten(2).transpose(1, 2)
265
+
266
+
267
+ class VisionTransformerPredictor(nn.Module):
268
+ """ Vision Transformer """
269
+ def __init__(
270
+ self,
271
+ num_patches,
272
+ embed_dim=768,
273
+ predictor_embed_dim=384,
+         depth=6,
+         num_heads=12,
+         mlp_ratio=4.0,
+         qkv_bias=True,
+         qk_scale=None,
+         drop_rate=0.0,
+         attn_drop_rate=0.0,
+         drop_path_rate=0.0,
+         norm_layer=nn.LayerNorm,
+         init_std=0.02,
+         **kwargs
+     ):
+         super().__init__()
+         self.predictor_embed = nn.Linear(embed_dim, predictor_embed_dim, bias=True)
+         self.mask_token = nn.Parameter(torch.zeros(1, 1, predictor_embed_dim))
+         dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]  # stochastic depth decay rule
+         # --
+         self.predictor_pos_embed = nn.Parameter(torch.zeros(1, num_patches, predictor_embed_dim),
+                                                 requires_grad=False)
+         predictor_pos_embed = get_2d_sincos_pos_embed(self.predictor_pos_embed.shape[-1],
+                                                       int(num_patches**.5),
+                                                       cls_token=False)
+         self.predictor_pos_embed.data.copy_(torch.from_numpy(predictor_pos_embed).float().unsqueeze(0))
+         # --
+         self.predictor_blocks = nn.ModuleList([
+             Block(
+                 dim=predictor_embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale,
+                 drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=norm_layer)
+             for i in range(depth)])
+         self.predictor_norm = norm_layer(predictor_embed_dim)
+         self.predictor_proj = nn.Linear(predictor_embed_dim, embed_dim, bias=True)
+         # ------
+         self.init_std = init_std
+         trunc_normal_(self.mask_token, std=self.init_std)
+         self.apply(self._init_weights)
+         self.fix_init_weight()
+
+     def fix_init_weight(self):
+         def rescale(param, layer_id):
+             param.div_(math.sqrt(2.0 * layer_id))
+
+         for layer_id, layer in enumerate(self.predictor_blocks):
+             rescale(layer.attn.proj.weight.data, layer_id + 1)
+             rescale(layer.mlp.fc2.weight.data, layer_id + 1)
+
+     def _init_weights(self, m):
+         if isinstance(m, nn.Linear):
+             trunc_normal_(m.weight, std=self.init_std)
+             if isinstance(m, nn.Linear) and m.bias is not None:
+                 nn.init.constant_(m.bias, 0)
+         elif isinstance(m, nn.LayerNorm):
+             nn.init.constant_(m.bias, 0)
+             nn.init.constant_(m.weight, 1.0)
+         elif isinstance(m, nn.Conv2d):
+             trunc_normal_(m.weight, std=self.init_std)
+             if m.bias is not None:
+                 nn.init.constant_(m.bias, 0)
+
+     def forward(self, x, masks_x, masks):
+         assert (masks is not None) and (masks_x is not None), 'Cannot run predictor without mask indices'
+
+         if not isinstance(masks_x, list):
+             masks_x = [masks_x]
+
+         if not isinstance(masks, list):
+             masks = [masks]
+
+         # -- batch size
+         B = len(x) // len(masks_x)
+
+         # -- map from encoder-dim to predictor-dim
+         x = self.predictor_embed(x)
+
+         # -- add positional embedding to x tokens
+         x_pos_embed = self.predictor_pos_embed.repeat(B, 1, 1)
+         x += apply_masks(x_pos_embed, masks_x)
+
+         _, N_ctxt, D = x.shape
+
+         # -- concat mask tokens to x
+         pos_embs = self.predictor_pos_embed.repeat(B, 1, 1)
+         pos_embs = apply_masks(pos_embs, masks)
+         pos_embs = repeat_interleave_batch(pos_embs, B, repeat=len(masks_x))
+         # --
+         pred_tokens = self.mask_token.repeat(pos_embs.size(0), pos_embs.size(1), 1)
+         # --
+         pred_tokens += pos_embs
+         x = x.repeat(len(masks), 1, 1)
+         x = torch.cat([x, pred_tokens], dim=1)
+
+         # -- fwd prop
+         for blk in self.predictor_blocks:
+             x = blk(x)
+         x = self.predictor_norm(x)
+
+         # -- return preds for mask tokens
+         x = x[:, N_ctxt:]
+         x = self.predictor_proj(x)
+
+         return x
+
+
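The `dpr` list built in the constructor above is the "stochastic depth decay rule": `torch.linspace(0, drop_path_rate, depth)` gives block `i` a drop-path probability that grows linearly from 0 (first block) to `drop_path_rate` (last block). A minimal torch-free sketch of that schedule (the helper name is ours, not the repo's):

```python
def drop_path_schedule(depth, drop_path_rate):
    """Per-block drop-path rates, matching torch.linspace(0, rate, depth)."""
    if depth == 1:
        return [0.0]
    return [drop_path_rate * i / (depth - 1) for i in range(depth)]

dpr = drop_path_schedule(6, 0.1)  # predictor default: depth=6
```

The first block is never dropped, and deeper blocks are regularized progressively harder.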
+ class VisionTransformer(nn.Module):
+     """ Vision Transformer """
+     def __init__(
+         self,
+         img_size=[224],
+         patch_size=16,
+         in_chans=3,
+         embed_dim=768,
+         predictor_embed_dim=384,
+         depth=12,
+         predictor_depth=12,
+         num_heads=12,
+         mlp_ratio=4.0,
+         qkv_bias=True,
+         qk_scale=None,
+         drop_rate=0.0,
+         attn_drop_rate=0.0,
+         drop_path_rate=0.0,
+         norm_layer=nn.LayerNorm,
+         init_std=0.02,
+         **kwargs
+     ):
+         super().__init__()
+         self.num_features = self.embed_dim = embed_dim
+         self.num_heads = num_heads
+         # --
+         self.patch_embed = PatchEmbed(
+             img_size=img_size[0],
+             patch_size=patch_size,
+             in_chans=in_chans,
+             embed_dim=embed_dim)
+         num_patches = self.patch_embed.num_patches
+         # --
+         self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim), requires_grad=False)
+         pos_embed = get_2d_sincos_pos_embed(self.pos_embed.shape[-1],
+                                             int(self.patch_embed.num_patches**.5),
+                                             cls_token=False)
+         self.pos_embed.data.copy_(torch.from_numpy(pos_embed).float().unsqueeze(0))
+         # --
+         dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]  # stochastic depth decay rule
+         self.blocks = nn.ModuleList([
+             Block(
+                 dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale,
+                 drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=norm_layer)
+             for i in range(depth)])
+         self.norm = norm_layer(embed_dim)
+         # ------
+         self.init_std = init_std
+         self.apply(self._init_weights)
+         self.fix_init_weight()
+
+     def fix_init_weight(self):
+         def rescale(param, layer_id):
+             param.div_(math.sqrt(2.0 * layer_id))
+
+         for layer_id, layer in enumerate(self.blocks):
+             rescale(layer.attn.proj.weight.data, layer_id + 1)
+             rescale(layer.mlp.fc2.weight.data, layer_id + 1)
+
+     def _init_weights(self, m):
+         if isinstance(m, nn.Linear):
+             trunc_normal_(m.weight, std=self.init_std)
+             if isinstance(m, nn.Linear) and m.bias is not None:
+                 nn.init.constant_(m.bias, 0)
+         elif isinstance(m, nn.LayerNorm):
+             nn.init.constant_(m.bias, 0)
+             nn.init.constant_(m.weight, 1.0)
+         elif isinstance(m, nn.Conv2d):
+             trunc_normal_(m.weight, std=self.init_std)
+             if m.bias is not None:
+                 nn.init.constant_(m.bias, 0)
+
+     def forward(self, x, masks=None):
+         if masks is not None:
+             if not isinstance(masks, list):
+                 masks = [masks]
+
+         # -- patchify x
+         x = self.patch_embed(x)
+         B, N, D = x.shape
+
+         # -- add positional embedding to x
+         pos_embed = self.interpolate_pos_encoding(x, self.pos_embed)
+         x = x + pos_embed
+
+         # -- mask x
+         if masks is not None:
+             x = apply_masks(x, masks)
+
+         # -- fwd prop
+         for i, blk in enumerate(self.blocks):
+             x = blk(x)
+
+         if self.norm is not None:
+             x = self.norm(x)
+
+         return x
+
+     def interpolate_pos_encoding(self, x, pos_embed):
+         npatch = x.shape[1] - 1
+         N = pos_embed.shape[1] - 1
+         if npatch == N:
+             return pos_embed
+         class_emb = pos_embed[:, 0]
+         pos_embed = pos_embed[:, 1:]
+         dim = x.shape[-1]
+         pos_embed = nn.functional.interpolate(
+             pos_embed.reshape(1, int(math.sqrt(N)), int(math.sqrt(N)), dim).permute(0, 3, 1, 2),
+             scale_factor=math.sqrt(npatch / N),
+             mode='bicubic',
+         )
+         pos_embed = pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)
+         return torch.cat((class_emb.unsqueeze(0), pos_embed), dim=1)
+
+
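Both classes above call `apply_masks(...)`, a helper defined elsewhere in the repo (not in this chunk). A plausible NumPy sketch of what it does, purely as an assumption for readability: for each mask of per-sample keep-indices with shape `(B, K)`, gather those `K` tokens out of `x` of shape `(B, N, D)`, then concatenate the gathered batches along the batch axis.

```python
import numpy as np

def apply_masks_np(x, masks):
    """Hypothetical NumPy analogue of the repo's apply_masks helper."""
    out = []
    for m in masks:                                      # m: (B, K) integer indices
        idx = m[:, :, None].repeat(x.shape[-1], axis=2)  # broadcast to (B, K, D)
        out.append(np.take_along_axis(x, idx, axis=1))   # keep K tokens per sample
    return np.concatenate(out, axis=0)                   # (len(masks) * B, K, D)

x = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)  # (B=2, N=4, D=3)
masks = [np.array([[0, 2], [1, 3]])]                         # keep 2 tokens per sample
kept = apply_masks_np(x, masks)
```

Under this reading, the predictor's `N_ctxt` is simply the number of kept context tokens per sample.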
+ def vit_predictor(**kwargs):
+     model = VisionTransformerPredictor(
+         mlp_ratio=4, qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6),
+         **kwargs)
+     return model
+
+
+ def vit_tiny(patch_size=16, **kwargs):
+     model = VisionTransformer(
+         patch_size=patch_size, embed_dim=192, depth=12, num_heads=3, mlp_ratio=4,
+         qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+     return model
+
+
+ def vit_small(patch_size=16, **kwargs):
+     model = VisionTransformer(
+         patch_size=patch_size, embed_dim=384, depth=12, num_heads=6, mlp_ratio=4,
+         qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+     return model
+
+
+ def vit_base(patch_size=16, **kwargs):
+     model = VisionTransformer(
+         patch_size=patch_size, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4,
+         qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+     return model
+
+
+ def vit_large(patch_size=16, **kwargs):
+     model = VisionTransformer(
+         patch_size=patch_size, embed_dim=1024, depth=24, num_heads=16, mlp_ratio=4,
+         qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+     return model
+
+
+ def vit_huge(patch_size=16, **kwargs):
+     model = VisionTransformer(
+         patch_size=patch_size, embed_dim=1280, depth=32, num_heads=16, mlp_ratio=4,
+         qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+     return model
+
+
+ def vit_giant(patch_size=16, **kwargs):
+     model = VisionTransformer(
+         patch_size=patch_size, embed_dim=1408, depth=40, num_heads=16, mlp_ratio=48/11,
+         qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+     return model
+
+
+ VIT_EMBED_DIMS = {
+     'vit_tiny': 192,
+     'vit_small': 384,
+     'vit_base': 768,
+     'vit_large': 1024,
+     'vit_huge': 1280,
+     'vit_giant': 1408,
+ }
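Each factory above fixes an `embed_dim`, mirrored by the `VIT_EMBED_DIMS` table, and `PatchEmbed` turns an `img_size` x `img_size` input into `(img_size / patch_size)^2` tokens of that width. Quick arithmetic for the defaults (the dict literal is copied from the file; the helper is ours):

```python
VIT_EMBED_DIMS = {
    'vit_tiny': 192,
    'vit_small': 384,
    'vit_base': 768,
    'vit_large': 1024,
    'vit_huge': 1280,
    'vit_giant': 1408,
}

def num_patches(img_size, patch_size):
    """Token count produced by a square patch embedding."""
    assert img_size % patch_size == 0
    return (img_size // patch_size) ** 2

tokens = num_patches(224, 16)  # default img_size=[224], patch_size=16
```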
REG/models/mae_vit.py ADDED
@@ -0,0 +1,71 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+
+ # This source code is licensed under the license found in the
+ # LICENSE file in the root directory of this source tree.
+ # --------------------------------------------------------
+ # References:
+ # timm: https://github.com/rwightman/pytorch-image-models/tree/master/timm
+ # DeiT: https://github.com/facebookresearch/deit
+ # --------------------------------------------------------
+
+ from functools import partial
+
+ import torch
+ import torch.nn as nn
+
+ import timm.models.vision_transformer
+
+
+ class VisionTransformer(timm.models.vision_transformer.VisionTransformer):
+     """ Vision Transformer with support for global average pooling
+     """
+     def __init__(self, global_pool=False, **kwargs):
+         super(VisionTransformer, self).__init__(**kwargs)
+
+         self.global_pool = global_pool
+         if self.global_pool:
+             norm_layer = kwargs['norm_layer']
+             embed_dim = kwargs['embed_dim']
+             self.fc_norm = norm_layer(embed_dim)
+
+             del self.norm  # remove the original norm
+
+     def forward_features(self, x):
+         B = x.shape[0]
+         x = self.patch_embed(x)
+
+         cls_tokens = self.cls_token.expand(B, -1, -1)  # stole cls_tokens impl from Phil Wang, thanks
+         x = torch.cat((cls_tokens, x), dim=1)
+         x = x + self.pos_embed
+         x = self.pos_drop(x)
+
+         for blk in self.blocks:
+             x = blk(x)
+
+         x = x[:, 1:, :]  # .mean(dim=1)  # global pool without cls token
+
+         return x
+
+
+ def vit_base_patch16(**kwargs):
+     model = VisionTransformer(
+         num_classes=0,
+         patch_size=16, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4, qkv_bias=True,
+         norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+     return model
+
+
+ def vit_large_patch16(**kwargs):
+     model = VisionTransformer(
+         num_classes=0,
+         patch_size=16, embed_dim=1024, depth=24, num_heads=16, mlp_ratio=4, qkv_bias=True,
+         norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+     return model
+
+
+ def vit_huge_patch14(**kwargs):
+     model = VisionTransformer(
+         patch_size=14, embed_dim=1280, depth=32, num_heads=16, mlp_ratio=4, qkv_bias=True,
+         norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+     return model
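`forward_features` above returns all patch tokens after slicing off the leading cls token; the commented-out `.mean(dim=1)` is the global-average-pool variant that would collapse them to one vector. A NumPy sketch of the two output shapes (the helper name is ours):

```python
import numpy as np

def drop_cls_and_pool(x, pool=False):
    """x: (B, 1 + N, D) token sequence with a leading cls token."""
    patches = x[:, 1:, :]                       # drop the cls token -> (B, N, D)
    return patches.mean(axis=1) if pool else patches

x = np.random.rand(2, 197, 768)                 # ViT-B/16 at 224px: 196 patches + cls
```

Keeping the per-patch tokens (the un-pooled branch the file actually takes) is what lets a downstream model consume a spatial feature map rather than a single embedding.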
REG/models/mocov3_vit.py ADDED
@@ -0,0 +1,207 @@
+ # Copyright (c) Facebook, Inc. and its affiliates.
+ # All rights reserved.
+
+ # This source code is licensed under the license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ import math
+ import torch
+ import torch.nn as nn
+ from functools import partial, reduce
+ from operator import mul
+
+ from timm.layers.helpers import to_2tuple
+ from timm.models.vision_transformer import VisionTransformer, _cfg
+ from timm.models.vision_transformer import PatchEmbed
+
+ __all__ = [
+     'vit_small',
+     'vit_base',
+     'vit_large',
+     'vit_conv_small',
+     'vit_conv_base',
+ ]
+
+
+ def patchify_avg(input_tensor, patch_size):
+     # Ensure input tensor is 4D: (batch_size, channels, height, width)
+     if input_tensor.dim() != 4:
+         raise ValueError("Input tensor must be 4D (batch_size, channels, height, width)")
+
+     # Get input tensor dimensions
+     batch_size, channels, height, width = input_tensor.shape
+
+     # Ensure patch_size is valid
+     patch_height, patch_width = patch_size, patch_size
+     if height % patch_height != 0 or width % patch_width != 0:
+         raise ValueError("Input tensor dimensions must be divisible by patch_size")
+
+     # Use unfold to create patches
+     patches = input_tensor.unfold(2, patch_height, patch_height).unfold(3, patch_width, patch_width)
+
+     # Reshape patches to desired format: (batch_size, num_patches, channels)
+     patches = patches.contiguous().view(
+         batch_size, channels, -1, patch_height, patch_width
+     ).mean(dim=-1).mean(dim=-1)
+     patches = patches.permute(0, 2, 1).contiguous()
+
+     return patches
+
+
+
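`patchify_avg` above averages each non-overlapping `patch_size` x `patch_size` patch down to one value per channel, yielding `(B, num_patches, C)`. A NumPy re-implementation of the same reduction, as a torch-free sanity check:

```python
import numpy as np

def patchify_avg_np(x, patch_size):
    """NumPy analogue of patchify_avg: mean over each square patch."""
    b, c, h, w = x.shape
    assert h % patch_size == 0 and w % patch_size == 0
    x = x.reshape(b, c, h // patch_size, patch_size, w // patch_size, patch_size)
    x = x.mean(axis=(3, 5))                           # average within each patch
    return x.reshape(b, c, -1).transpose(0, 2, 1)     # (B, num_patches, C)

x = np.arange(16, dtype=np.float64).reshape(1, 1, 4, 4)
patches = patchify_avg_np(x, 2)
```

Patches are enumerated row-major over the grid, matching the `unfold`/`view` ordering in the torch version.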
+ class VisionTransformerMoCo(VisionTransformer):
+     def __init__(self, stop_grad_conv1=False, **kwargs):
+         super().__init__(**kwargs)
+         # Use fixed 2D sin-cos position embedding
+         self.build_2d_sincos_position_embedding()
+
+         # weight initialization
+         for name, m in self.named_modules():
+             if isinstance(m, nn.Linear):
+                 if 'qkv' in name:
+                     # treat the weights of Q, K, V separately
+                     val = math.sqrt(6. / float(m.weight.shape[0] // 3 + m.weight.shape[1]))
+                     nn.init.uniform_(m.weight, -val, val)
+                 else:
+                     nn.init.xavier_uniform_(m.weight)
+                 nn.init.zeros_(m.bias)
+         nn.init.normal_(self.cls_token, std=1e-6)
+
+         if isinstance(self.patch_embed, PatchEmbed):
+             # xavier_uniform initialization
+             val = math.sqrt(6. / float(3 * reduce(mul, self.patch_embed.patch_size, 1) + self.embed_dim))
+             nn.init.uniform_(self.patch_embed.proj.weight, -val, val)
+             nn.init.zeros_(self.patch_embed.proj.bias)
+
+             if stop_grad_conv1:
+                 self.patch_embed.proj.weight.requires_grad = False
+                 self.patch_embed.proj.bias.requires_grad = False
+
+     def build_2d_sincos_position_embedding(self, temperature=10000.):
+         h = self.patch_embed.img_size[0] // self.patch_embed.patch_size[0]
+         w = self.patch_embed.img_size[1] // self.patch_embed.patch_size[1]
+         grid_w = torch.arange(w, dtype=torch.float32)
+         grid_h = torch.arange(h, dtype=torch.float32)
+         grid_w, grid_h = torch.meshgrid(grid_w, grid_h)
+         assert self.embed_dim % 4 == 0, 'Embed dimension must be divisible by 4 for 2D sin-cos position embedding'
+         pos_dim = self.embed_dim // 4
+         omega = torch.arange(pos_dim, dtype=torch.float32) / pos_dim
+         omega = 1. / (temperature**omega)
+         out_w = torch.einsum('m,d->md', [grid_w.flatten(), omega])
+         out_h = torch.einsum('m,d->md', [grid_h.flatten(), omega])
+         pos_emb = torch.cat([torch.sin(out_w), torch.cos(out_w), torch.sin(out_h), torch.cos(out_h)], dim=1)[None, :, :]
+
+         # assert self.num_tokens == 1, 'Assuming one and only one token, [cls]'
+         pe_token = torch.zeros([1, 1, self.embed_dim], dtype=torch.float32)
+         self.pos_embed = nn.Parameter(torch.cat([pe_token, pos_emb], dim=1))
+         self.pos_embed.requires_grad = False
+
+     def forward_diffusion_output(self, x):
+         x = x.reshape(*x.shape[0:2], -1).permute(0, 2, 1)
+         x = self._pos_embed(x)
+         x = self.patch_drop(x)
+         x = self.norm_pre(x)
+         x = self.blocks(x)
+         x = self.norm(x)
+         return x
+
+ class ConvStem(nn.Module):
+     """
+     ConvStem, from Early Convolutions Help Transformers See Better, Xiao et al. https://arxiv.org/abs/2106.14881
+     """
+     def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768, norm_layer=None, flatten=True):
+         super().__init__()
+
+         assert patch_size == 16, 'ConvStem only supports patch size of 16'
+         assert embed_dim % 8 == 0, 'Embed dimension must be divisible by 8 for ConvStem'
+
+         img_size = to_2tuple(img_size)
+         patch_size = to_2tuple(patch_size)
+         self.img_size = img_size
+         self.patch_size = patch_size
+         self.grid_size = (img_size[0] // patch_size[0], img_size[1] // patch_size[1])
+         self.num_patches = self.grid_size[0] * self.grid_size[1]
+         self.flatten = flatten
+
+         # build stem, similar to the design in https://arxiv.org/abs/2106.14881
+         stem = []
+         input_dim, output_dim = 3, embed_dim // 8
+         for l in range(4):
+             stem.append(nn.Conv2d(input_dim, output_dim, kernel_size=3, stride=2, padding=1, bias=False))
+             stem.append(nn.BatchNorm2d(output_dim))
+             stem.append(nn.ReLU(inplace=True))
+             input_dim = output_dim
+             output_dim *= 2
+         stem.append(nn.Conv2d(input_dim, embed_dim, kernel_size=1))
+         self.proj = nn.Sequential(*stem)
+
+         self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()
+
+     def forward(self, x):
+         B, C, H, W = x.shape
+         assert H == self.img_size[0] and W == self.img_size[1], \
+             f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})."
+         x = self.proj(x)
+         if self.flatten:
+             x = x.flatten(2).transpose(1, 2)  # BCHW -> BNC
+         x = self.norm(x)
+         return x
+
+
+ def vit_small(**kwargs):
+     model = VisionTransformerMoCo(
+         img_size=256,
+         patch_size=16, embed_dim=384, depth=12, num_heads=12, mlp_ratio=4, qkv_bias=True,
+         norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+     model.default_cfg = _cfg()
+     return model
+
+ def vit_base(**kwargs):
+     model = VisionTransformerMoCo(
+         img_size=256,
+         patch_size=16, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4, qkv_bias=True,
+         norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+     model.default_cfg = _cfg()
+     return model
+
+ def vit_large(**kwargs):
+     model = VisionTransformerMoCo(
+         img_size=256,
+         patch_size=16, embed_dim=1024, depth=24, num_heads=16, mlp_ratio=4, qkv_bias=True,
+         norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
+     model.default_cfg = _cfg()
+     return model
+
+ def vit_conv_small(**kwargs):
+     # minus one ViT block
+     model = VisionTransformerMoCo(
+         patch_size=16, embed_dim=384, depth=11, num_heads=12, mlp_ratio=4, qkv_bias=True,
+         norm_layer=partial(nn.LayerNorm, eps=1e-6), embed_layer=ConvStem, **kwargs)
+     model.default_cfg = _cfg()
+     return model
+
+ def vit_conv_base(**kwargs):
+     # minus one ViT block
+     model = VisionTransformerMoCo(
+         patch_size=16, embed_dim=768, depth=11, num_heads=12, mlp_ratio=4, qkv_bias=True,
+         norm_layer=partial(nn.LayerNorm, eps=1e-6), embed_layer=ConvStem, **kwargs)
+     model.default_cfg = _cfg()
+     return model
+
+ def build_mlp(num_layers, input_dim, mlp_dim, output_dim, last_bn=True):
+     mlp = []
+     for l in range(num_layers):
+         dim1 = input_dim if l == 0 else mlp_dim
+         dim2 = output_dim if l == num_layers - 1 else mlp_dim
+
+         mlp.append(nn.Linear(dim1, dim2, bias=False))
+
+         if l < num_layers - 1:
+             mlp.append(nn.BatchNorm1d(dim2))
+             mlp.append(nn.ReLU(inplace=True))
+         elif last_bn:
+             # follow SimCLR's design: https://github.com/google-research/simclr/blob/master/model_util.py#L157
+             # for simplicity, we further removed gamma in BN
+             mlp.append(nn.BatchNorm1d(dim2, affine=False))
+
+     return nn.Sequential(*mlp)
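`build_mlp` above stacks `num_layers` linear layers `input_dim -> mlp_dim -> ... -> output_dim`, with BatchNorm+ReLU between hidden layers and, optionally, a final affine-free BatchNorm. A pure-Python sketch of just the `(in, out)` dimension sequence it produces:

```python
def mlp_layer_dims(num_layers, input_dim, mlp_dim, output_dim):
    """(in, out) dims of each Linear in build_mlp, in order."""
    dims = []
    for l in range(num_layers):
        dim1 = input_dim if l == 0 else mlp_dim
        dim2 = output_dim if l == num_layers - 1 else mlp_dim
        dims.append((dim1, dim2))
    return dims

dims = mlp_layer_dims(3, 768, 4096, 256)
```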
REG/models/sit.py ADDED
@@ -0,0 +1,420 @@
+ # This source code is licensed under the license found in the
+ # LICENSE file in the root directory of this source tree.
+ # --------------------------------------------------------
+ # References:
+ # GLIDE: https://github.com/openai/glide-text2im
+ # MAE: https://github.com/facebookresearch/mae/blob/main/models_mae.py
+ # --------------------------------------------------------
+
+ import torch
+ import torch.nn as nn
+ import numpy as np
+ import math
+ from timm.models.vision_transformer import PatchEmbed, Attention, Mlp
+
+
+ def build_mlp(hidden_size, projector_dim, z_dim):
+     return nn.Sequential(
+         nn.Linear(hidden_size, projector_dim),
+         nn.SiLU(),
+         nn.Linear(projector_dim, projector_dim),
+         nn.SiLU(),
+         nn.Linear(projector_dim, z_dim),
+     )
+
+ def modulate(x, shift, scale):
+     return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
+
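`modulate(x, shift, scale)` above applies a per-sample affine transform to every token: `x * (1 + scale) + shift`, with `shift`/`scale` broadcast over the token axis. A NumPy sketch of the same broadcasting:

```python
import numpy as np

def modulate_np(x, shift, scale):
    """x: (N, T, D); shift, scale: (N, D), broadcast over tokens."""
    return x * (1 + scale[:, None, :]) + shift[:, None, :]

x = np.ones((2, 3, 4))
shift = np.full((2, 4), 0.5)
scale = np.full((2, 4), 1.0)
y = modulate_np(x, shift, scale)  # 1 * (1 + 1) + 0.5 everywhere
```

The `1 +` keeps the transform at identity when `scale` is zero, which pairs with the zero-initialized adaLN layers later in the file.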
+ #################################################################################
+ #               Embedding Layers for Timesteps and Class Labels                 #
+ #################################################################################
+ class TimestepEmbedder(nn.Module):
+     """
+     Embeds scalar timesteps into vector representations.
+     """
+     def __init__(self, hidden_size, frequency_embedding_size=256):
+         super().__init__()
+         self.mlp = nn.Sequential(
+             nn.Linear(frequency_embedding_size, hidden_size, bias=True),
+             nn.SiLU(),
+             nn.Linear(hidden_size, hidden_size, bias=True),
+         )
+         self.frequency_embedding_size = frequency_embedding_size
+
+     @staticmethod
+     def positional_embedding(t, dim, max_period=10000):
+         """
+         Create sinusoidal timestep embeddings.
+         :param t: a 1-D Tensor of N indices, one per batch element.
+                   These may be fractional.
+         :param dim: the dimension of the output.
+         :param max_period: controls the minimum frequency of the embeddings.
+         :return: an (N, D) Tensor of positional embeddings.
+         """
+         # https://github.com/openai/glide-text2im/blob/main/glide_text2im/nn.py
+         half = dim // 2
+         freqs = torch.exp(
+             -math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32) / half
+         ).to(device=t.device)
+         args = t[:, None].float() * freqs[None]
+         embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
+         if dim % 2:
+             embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
+         return embedding
+
+     def forward(self, t):
+         self.timestep_embedding = self.positional_embedding
+         t_freq = self.timestep_embedding(t, dim=self.frequency_embedding_size).to(t.dtype)
+         t_emb = self.mlp(t_freq)
+         return t_emb
+
+
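`TimestepEmbedder.positional_embedding` above builds the GLIDE-style sinusoid: frequencies decay geometrically from 1 down to 1/`max_period`, and the embedding concatenates `[cos(t*f), sin(t*f)]`. A NumPy sketch of the same table:

```python
import numpy as np

def timestep_embedding_np(t, dim, max_period=10000):
    """NumPy analogue of the sinusoidal timestep embedding above."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half, dtype=np.float64) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None]
    return np.concatenate([np.cos(args), np.sin(args)], axis=-1)  # (N, dim)

emb = timestep_embedding_np([0.0, 1.0], 8)
```

At `t = 0` the cosine half is all ones and the sine half all zeros, so distinct timesteps map to distinct, smoothly varying vectors.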
+ class LabelEmbedder(nn.Module):
+     """
+     Embeds class labels into vector representations. Also handles label dropout for classifier-free guidance.
+     """
+     def __init__(self, num_classes, hidden_size, dropout_prob):
+         super().__init__()
+         use_cfg_embedding = dropout_prob > 0
+         self.embedding_table = nn.Embedding(num_classes + use_cfg_embedding, hidden_size)
+         self.num_classes = num_classes
+         self.dropout_prob = dropout_prob
+
+     def token_drop(self, labels, force_drop_ids=None):
+         """
+         Drops labels to enable classifier-free guidance.
+         """
+         if force_drop_ids is None:
+             drop_ids = torch.rand(labels.shape[0], device=labels.device) < self.dropout_prob
+         else:
+             drop_ids = force_drop_ids == 1
+         labels = torch.where(drop_ids, self.num_classes, labels)
+         return labels
+
+     def forward(self, labels, train, force_drop_ids=None):
+         use_dropout = self.dropout_prob > 0
+         if (train and use_dropout) or (force_drop_ids is not None):
+             labels = self.token_drop(labels, force_drop_ids)
+         embeddings = self.embedding_table(labels)
+         return embeddings
+
+
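`token_drop` above replaces a random subset of labels with the extra index `num_classes`, which acts as the "null" class the embedding table reserves for classifier-free guidance. A NumPy sketch of the same replacement:

```python
import numpy as np

def token_drop_np(labels, num_classes, dropout_prob, rng):
    """Replace a random subset of labels with the null-class index."""
    drop_ids = rng.random(labels.shape[0]) < dropout_prob
    return np.where(drop_ids, num_classes, labels)

rng = np.random.default_rng(0)
labels = np.array([3, 7, 7, 1])
always = token_drop_np(labels, num_classes=1000, dropout_prob=1.0, rng=rng)
never = token_drop_np(labels, num_classes=1000, dropout_prob=0.0, rng=rng)
```

At sampling time the same null index supplies the unconditional branch of the guidance formula.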
+ #################################################################################
+ #                                 Core SiT Model                                #
+ #################################################################################
+
+ class SiTBlock(nn.Module):
+     """
+     A SiT block with adaptive layer norm zero (adaLN-Zero) conditioning.
+     """
+     def __init__(self, hidden_size, num_heads, mlp_ratio=4.0, **block_kwargs):
+         super().__init__()
+         self.norm1 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+         self.attn = Attention(
+             hidden_size, num_heads=num_heads, qkv_bias=True, qk_norm=block_kwargs["qk_norm"]
+         )
+         if "fused_attn" in block_kwargs.keys():
+             self.attn.fused_attn = block_kwargs["fused_attn"]
+         self.norm2 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+         mlp_hidden_dim = int(hidden_size * mlp_ratio)
+         approx_gelu = lambda: nn.GELU(approximate="tanh")
+         self.mlp = Mlp(
+             in_features=hidden_size, hidden_features=mlp_hidden_dim, act_layer=approx_gelu, drop=0
+         )
+         self.adaLN_modulation = nn.Sequential(
+             nn.SiLU(),
+             nn.Linear(hidden_size, 6 * hidden_size, bias=True)
+         )
+
+     def forward(self, x, c):
+         shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
+             self.adaLN_modulation(c).chunk(6, dim=-1)
+         )
+         x = x + gate_msa.unsqueeze(1) * self.attn(modulate(self.norm1(x), shift_msa, scale_msa))
+         x = x + gate_mlp.unsqueeze(1) * self.mlp(modulate(self.norm2(x), shift_mlp, scale_mlp))
+
+         return x
+
+
+ class FinalLayer(nn.Module):
+     """
+     The final layer of SiT.
+     """
+     def __init__(self, hidden_size, patch_size, out_channels, cls_token_dim):
+         super().__init__()
+         self.norm_final = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+         self.linear = nn.Linear(hidden_size, patch_size * patch_size * out_channels, bias=True)
+         self.linear_cls = nn.Linear(hidden_size, cls_token_dim, bias=True)
+         self.adaLN_modulation = nn.Sequential(
+             nn.SiLU(),
+             nn.Linear(hidden_size, 2 * hidden_size, bias=True)
+         )
+
+     def forward(self, x, c, cls=None):
+         shift, scale = self.adaLN_modulation(c).chunk(2, dim=-1)
+         x = modulate(self.norm_final(x), shift, scale)
+
+         if cls is None:
+             x = self.linear(x)
+             return x, None
+         else:
+             cls_token = self.linear_cls(x[:, 0]).unsqueeze(1)
+             x = self.linear(x[:, 1:])
+             return x, cls_token.squeeze(1)
+
+
+ class SiT(nn.Module):
+     """
+     Diffusion model with a Transformer backbone.
+     """
+     def __init__(
+         self,
+         path_type='edm',
+         input_size=32,
+         patch_size=2,
+         in_channels=4,
+         hidden_size=1152,
+         decoder_hidden_size=768,
+         encoder_depth=8,
+         depth=28,
+         num_heads=16,
+         mlp_ratio=4.0,
+         class_dropout_prob=0.1,
+         num_classes=1000,
+         use_cfg=False,
+         z_dims=[768],
+         projector_dim=2048,
+         cls_token_dim=768,
+         **block_kwargs  # fused_attn
+     ):
+         super().__init__()
+         self.path_type = path_type
+         self.in_channels = in_channels
+         self.out_channels = in_channels
+         self.patch_size = patch_size
+         self.num_heads = num_heads
+         self.use_cfg = use_cfg
+         self.num_classes = num_classes
+         self.z_dims = z_dims
+         self.encoder_depth = encoder_depth
+
+         self.x_embedder = PatchEmbed(
+             input_size, patch_size, in_channels, hidden_size, bias=True
+         )
+         self.t_embedder = TimestepEmbedder(hidden_size)  # timestep embedding type
+         self.y_embedder = LabelEmbedder(num_classes, hidden_size, class_dropout_prob)
+         num_patches = self.x_embedder.num_patches
+         # Will use fixed sin-cos embedding:
+         self.pos_embed = nn.Parameter(torch.zeros(1, num_patches+1, hidden_size), requires_grad=False)
+
+         self.blocks = nn.ModuleList([
+             SiTBlock(hidden_size, num_heads, mlp_ratio=mlp_ratio, **block_kwargs) for _ in range(depth)
+         ])
+         self.projectors = nn.ModuleList([
+             build_mlp(hidden_size, projector_dim, z_dim) for z_dim in z_dims
+         ])
+
+         z_dim = self.z_dims[0]
+         cls_token_dim = z_dim
+         self.final_layer = FinalLayer(decoder_hidden_size, patch_size, self.out_channels, cls_token_dim)
+
+         self.cls_projectors2 = nn.Linear(in_features=cls_token_dim, out_features=hidden_size, bias=True)
+         self.wg_norm = nn.LayerNorm(hidden_size, elementwise_affine=True, eps=1e-6)
+
+         self.initialize_weights()
+
+     def initialize_weights(self):
+         # Initialize transformer layers:
+         def _basic_init(module):
+             if isinstance(module, nn.Linear):
+                 torch.nn.init.xavier_uniform_(module.weight)
+                 if module.bias is not None:
+                     nn.init.constant_(module.bias, 0)
+         self.apply(_basic_init)
+
+         # Initialize (and freeze) pos_embed by sin-cos embedding:
+         pos_embed = get_2d_sincos_pos_embed(
+             self.pos_embed.shape[-1], int(self.x_embedder.num_patches ** 0.5), cls_token=1, extra_tokens=1
+         )
+         self.pos_embed.data.copy_(torch.from_numpy(pos_embed).float().unsqueeze(0))
+
+         # Initialize patch_embed like nn.Linear (instead of nn.Conv2d):
+         w = self.x_embedder.proj.weight.data
+         nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
+         nn.init.constant_(self.x_embedder.proj.bias, 0)
+
+         # Initialize label embedding table:
+         nn.init.normal_(self.y_embedder.embedding_table.weight, std=0.02)
+
+         # Initialize timestep embedding MLP:
+         nn.init.normal_(self.t_embedder.mlp[0].weight, std=0.02)
+         nn.init.normal_(self.t_embedder.mlp[2].weight, std=0.02)
+
+         # Zero-out adaLN modulation layers in SiT blocks:
+         for block in self.blocks:
+             nn.init.constant_(block.adaLN_modulation[-1].weight, 0)
+             nn.init.constant_(block.adaLN_modulation[-1].bias, 0)
+
+         # Zero-out output layers:
+         nn.init.constant_(self.final_layer.adaLN_modulation[-1].weight, 0)
+         nn.init.constant_(self.final_layer.adaLN_modulation[-1].bias, 0)
+         nn.init.constant_(self.final_layer.linear.weight, 0)
+         nn.init.constant_(self.final_layer.linear.bias, 0)
+         nn.init.constant_(self.final_layer.linear_cls.weight, 0)
+         nn.init.constant_(self.final_layer.linear_cls.bias, 0)
+
+     def unpatchify(self, x, patch_size=None):
+         """
+         x: (N, T, patch_size**2 * C)
+         imgs: (N, C, H, W)
+         """
+         c = self.out_channels
+         p = self.x_embedder.patch_size[0] if patch_size is None else patch_size
+         h = w = int(x.shape[1] ** 0.5)
+         assert h * w == x.shape[1]
+
+         x = x.reshape(shape=(x.shape[0], h, w, p, p, c))
+         x = torch.einsum('nhwpqc->nchpwq', x)
+         imgs = x.reshape(shape=(x.shape[0], c, h * p, w * p))
+         return imgs
+
+     def forward(self, x, t, y, return_logvar=False, cls_token=None):
+         """
+         Forward pass of SiT.
+         x: (N, C, H, W) tensor of spatial inputs (images or latent representations of images)
+         t: (N,) tensor of diffusion timesteps
+         y: (N,) tensor of class labels
+         """
+
+         # concatenate the cls_token with the patch tokens
+         x = self.x_embedder(x)  # (N, T, D), where T = H * W / patch_size ** 2
+         if cls_token is not None:
+             cls_token = self.cls_projectors2(cls_token)
+             cls_token = self.wg_norm(cls_token)
+             cls_token = cls_token.unsqueeze(1)  # [b, length, d]
+             x = torch.cat((cls_token, x), dim=1)
+             x = x + self.pos_embed
+         else:
+             exit()
+         N, T, D = x.shape
+
+         # timestep and class embedding
+         t_embed = self.t_embedder(t)  # (N, D)
+         y = self.y_embedder(y, self.training)  # (N, D)
+         c = t_embed + y
+
+         for i, block in enumerate(self.blocks):
+             x = block(x, c)
+             if (i + 1) == self.encoder_depth:
+                 zs = [projector(x.reshape(-1, D)).reshape(N, T, -1) for projector in self.projectors]
+
+         x, cls_token = self.final_layer(x, c, cls=cls_token)
+         x = self.unpatchify(x)
+
+         return x, zs, cls_token
+
+
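`unpatchify` above inverts the patch embedding on the output side, turning `(N, T, p*p*C)` tokens back into `(N, C, H, W)` images via a reshape and an einsum transpose. A NumPy sketch with the matching patchify direction, checked as a round trip (both helper names are ours):

```python
import numpy as np

def patchify_np(imgs, p):
    """(N, C, H, W) -> (N, h*w, p*p*C) non-overlapping patch tokens."""
    n, c, H, W = imgs.shape
    h, w = H // p, W // p
    x = imgs.reshape(n, c, h, p, w, p)
    x = np.einsum('nchpwq->nhwpqc', x)
    return x.reshape(n, h * w, p * p * c)

def unpatchify_np(x, p, c):
    """Inverse of patchify_np, mirroring SiT.unpatchify."""
    n, t, _ = x.shape
    h = w = int(t ** 0.5)
    assert h * w == t
    x = x.reshape(n, h, w, p, p, c)
    x = np.einsum('nhwpqc->nchpwq', x)
    return x.reshape(n, c, h * p, w * p)

imgs = np.random.rand(2, 4, 8, 8)
tokens = patchify_np(imgs, 2)
recon = unpatchify_np(tokens, p=2, c=4)
```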
+ #################################################################################
+ #                   Sine/Cosine Positional Embedding Functions                  #
+ #################################################################################
+ # https://github.com/facebookresearch/mae/blob/main/util/pos_embed.py
+
+ def get_2d_sincos_pos_embed(embed_dim, grid_size, cls_token=False, extra_tokens=0):
+     """
+     grid_size: int of the grid height and width
+     return:
+     pos_embed: [grid_size*grid_size, embed_dim] or [1+grid_size*grid_size, embed_dim] (w/ or w/o cls_token)
+     """
+     grid_h = np.arange(grid_size, dtype=np.float32)
+     grid_w = np.arange(grid_size, dtype=np.float32)
+     grid = np.meshgrid(grid_w, grid_h)  # here w goes first
+     grid = np.stack(grid, axis=0)
+
+     grid = grid.reshape([2, 1, grid_size, grid_size])
+     pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)
+     if cls_token and extra_tokens > 0:
+         pos_embed = np.concatenate([np.zeros([extra_tokens, embed_dim]), pos_embed], axis=0)
+     return pos_embed
+
+
+ def get_2d_sincos_pos_embed_from_grid(embed_dim, grid):
+     assert embed_dim % 2 == 0
+
+     # use half of dimensions to encode grid_h
+     emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0])  # (H*W, D/2)
+     emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[1])  # (H*W, D/2)
+
+     emb = np.concatenate([emb_h, emb_w], axis=1)  # (H*W, D)
+     return emb
+
+
+ def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
+     """
+     embed_dim: output dimension for each position
+     pos: a list of positions to be encoded: size (M,)
+     out: (M, D)
+     """
+     assert embed_dim % 2 == 0
+     omega = np.arange(embed_dim // 2, dtype=np.float64)
+     omega /= embed_dim / 2.
+     omega = 1. / 10000**omega  # (D/2,)
+
+     pos = pos.reshape(-1)  # (M,)
+     out = np.einsum('m,d->md', pos, omega)  # (M, D/2), outer product
+
+     emb_sin = np.sin(out)  # (M, D/2)
+     emb_cos = np.cos(out)  # (M, D/2)
+
+     emb = np.concatenate([emb_sin, emb_cos], axis=1)  # (M, D)
+     return emb
+
+
+ #################################################################################
374
+ # SiT Configs #
375
+ #################################################################################
376
+
377
+ def SiT_XL_2(**kwargs):
378
+ return SiT(depth=28, hidden_size=1152, decoder_hidden_size=1152, patch_size=2, num_heads=16, **kwargs)
379
+
380
+ def SiT_XL_4(**kwargs):
381
+ return SiT(depth=28, hidden_size=1152, decoder_hidden_size=1152, patch_size=4, num_heads=16, **kwargs)
382
+
383
+ def SiT_XL_8(**kwargs):
384
+ return SiT(depth=28, hidden_size=1152, decoder_hidden_size=1152, patch_size=8, num_heads=16, **kwargs)
385
+
386
+ def SiT_L_2(**kwargs):
387
+ return SiT(depth=24, hidden_size=1024, decoder_hidden_size=1024, patch_size=2, num_heads=16, **kwargs)
388
+
389
+ def SiT_L_4(**kwargs):
390
+ return SiT(depth=24, hidden_size=1024, decoder_hidden_size=1024, patch_size=4, num_heads=16, **kwargs)
391
+
392
+ def SiT_L_8(**kwargs):
393
+ return SiT(depth=24, hidden_size=1024, decoder_hidden_size=1024, patch_size=8, num_heads=16, **kwargs)
394
+
395
+ def SiT_B_2(**kwargs):
396
+ return SiT(depth=12, hidden_size=768, decoder_hidden_size=768, patch_size=2, num_heads=12, **kwargs)
397
+
398
+ def SiT_B_4(**kwargs):
399
+ return SiT(depth=12, hidden_size=768, decoder_hidden_size=768, patch_size=4, num_heads=12, **kwargs)
400
+
401
+ def SiT_B_8(**kwargs):
402
+ return SiT(depth=12, hidden_size=768, decoder_hidden_size=768, patch_size=8, num_heads=12, **kwargs)
403
+
404
+ def SiT_S_2(**kwargs):
405
+ return SiT(depth=12, hidden_size=384, patch_size=2, num_heads=6, **kwargs)
406
+
407
+ def SiT_S_4(**kwargs):
408
+ return SiT(depth=12, hidden_size=384, patch_size=4, num_heads=6, **kwargs)
409
+
410
+ def SiT_S_8(**kwargs):
411
+ return SiT(depth=12, hidden_size=384, patch_size=8, num_heads=6, **kwargs)
412
+
413
+
414
+ SiT_models = {
415
+ 'SiT-XL/2': SiT_XL_2, 'SiT-XL/4': SiT_XL_4, 'SiT-XL/8': SiT_XL_8,
416
+ 'SiT-L/2': SiT_L_2, 'SiT-L/4': SiT_L_4, 'SiT-L/8': SiT_L_8,
417
+ 'SiT-B/2': SiT_B_2, 'SiT-B/4': SiT_B_4, 'SiT-B/8': SiT_B_8,
418
+ 'SiT-S/2': SiT_S_2, 'SiT-S/4': SiT_S_4, 'SiT-S/8': SiT_S_8,
419
+ }
420
+
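The MAE-style sine/cosine embedding above can be sanity-checked with a short self-contained NumPy sketch. The helper below compactly re-implements `get_1d_sincos_pos_embed_from_grid`; the 4×4 grid and 16-dim embedding are illustrative choices, not values from the repo:

```python
import numpy as np

def sincos_1d(embed_dim, pos):
    # Compact re-implementation of get_1d_sincos_pos_embed_from_grid:
    # half the channels carry sin, half carry cos, frequencies 1/10000^(2i/d).
    omega = np.arange(embed_dim // 2, dtype=np.float64) / (embed_dim / 2.0)
    omega = 1.0 / 10000 ** omega
    out = np.einsum("m,d->md", pos.reshape(-1), omega)
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)

# 4x4 grid, 16-dim embedding: 8 dims encode the h axis, 8 the w axis.
grid = np.stack(np.meshgrid(np.arange(4, dtype=np.float32),
                            np.arange(4, dtype=np.float32)), axis=0)
grid = grid.reshape([2, 1, 4, 4])
emb = np.concatenate([sincos_1d(8, grid[0]), sincos_1d(8, grid[1])], axis=1)
print(emb.shape)  # → (16, 16): one embedding per grid cell
```

Position 0 maps to sin(0)=0 and cos(0)=1 in every frequency band, which is an easy spot check on the first row.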
back/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2024 Sihyun Yu
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
back/README.md ADDED
@@ -0,0 +1,156 @@
+ <p align="center">
+ <h1 align="center">Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think (NeurIPS 2025 Oral)
+ </h1>
+ <p align="center">
+ <a href='https://github.com/Martinser' style='text-decoration: none' >Ge Wu</a><sup>1</sup>&emsp;
+ <a href='https://github.com/ShenZhang-Shin' style='text-decoration: none' >Shen Zhang</a><sup>3</sup>&emsp;
+ <a href='' style='text-decoration: none' >Ruijing Shi</a><sup>1</sup>&emsp;
+ <a href='https://shgao.site/' style='text-decoration: none' >Shanghua Gao</a><sup>4</sup>&emsp;
+ <a href='https://zhenyuanchenai.github.io/' style='text-decoration: none' >Zhenyuan Chen</a><sup>1</sup>&emsp;
+ <a href='https://scholar.google.com/citations?user=6Z66DAwAAAAJ&hl=en' style='text-decoration: none' >Lei Wang</a><sup>1</sup>&emsp;
+ <a href='https://www.zhihu.com/people/chen-zhao-wei-16-2' style='text-decoration: none' >Zhaowei Chen</a><sup>3</sup>&emsp;
+ <a href='https://gao-hongcheng.github.io/' style='text-decoration: none' >Hongcheng Gao</a><sup>5</sup>&emsp;
+ <a href='https://scholar.google.com/citations?view_op=list_works&hl=zh-CN&hl=zh-CN&user=0xP6bxcAAAAJ' style='text-decoration: none' >Yao Tang</a><sup>3</sup>&emsp;
+ <a href='https://scholar.google.com/citations?user=6CIDtZQAAAAJ&hl=en' style='text-decoration: none' >Jian Yang</a><sup>1</sup>&emsp;
+ <a href='https://mmcheng.net/cmm/' style='text-decoration: none' >Ming-Ming Cheng</a><sup>1,2</sup>&emsp;
+ <a href='https://implus.github.io/' style='text-decoration: none' >Xiang Li</a><sup>1,2*</sup>&emsp;
+ <p align="center">
+ $^{1}$ VCIP, CS, Nankai University, $^{2}$ NKIARI, Shenzhen Futian, $^{3}$ JIIOV Technology,
+ $^{4}$ Harvard University, $^{5}$ University of Chinese Academy of Sciences
+ <p align='center'>
+ <div align="center">
+ <a href='https://arxiv.org/abs/2507.01467v2'><img src='https://img.shields.io/badge/arXiv-2507.01467v2-brown.svg?logo=arxiv&logoColor=white'></a>
+ <a href='https://huggingface.co/Martinser/REG/tree/main'><img src='https://img.shields.io/badge/🤗-Model-blue.svg'></a>
+ <a href='https://zhuanlan.zhihu.com/p/1952346823168595518'><img src='https://img.shields.io/badge/Zhihu-chinese_article-blue.svg?logo=zhihu&logoColor=white'></a>
+ </div>
+ <p align='center'>
+ </p>
+ </p>
+ </p>
+
+
+ ## 🚩 Overview
+
+ ![overview](fig/reg.png)
+
+ REPA and its variants effectively mitigate the training difficulties of diffusion models by incorporating external visual representations from pretrained models: the noisy hidden projections of the denoising network are aligned with clean image representations from a visual foundation model.
+ We argue that this external alignment, which is absent throughout the denoising inference process, falls short of fully harnessing the potential of discriminative representations.
+
+ In this work, we propose a straightforward method called Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising.
+ REG learns to produce coherent image-class pairs directly from pure noise,
+ substantially improving both generation quality and training efficiency.
+ This comes at negligible additional inference cost, **requiring only one additional token for denoising (<0.5\% increase in FLOPs and latency).**
+ The inference process concurrently reconstructs both the image latents and their corresponding global semantics, and the acquired semantic knowledge actively guides and enhances the image generation process.
+
+ On ImageNet $256{\times}256$, SiT-XL/2 + REG demonstrates remarkable convergence acceleration, **achieving $\textbf{63}\times$ and $\textbf{23}\times$ faster training than SiT-XL/2 and SiT-XL/2 + REPA, respectively.**
+ More impressively, SiT-L/2 + REG trained for merely 400K iterations outperforms SiT-XL/2 + REPA trained for 4M iterations ($\textbf{10}\times$ longer).
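The "one additional token" claim can be checked with simple arithmetic, assuming the standard 8× VAE downsampling and patch size 2 used by SiT-XL/2 (a back-of-the-envelope sketch, not code from the repo):

```python
# Back-of-the-envelope token count for SiT-XL/2 at 256x256
# (assumes the standard 8x VAE downsampling; illustrative only).
resolution, vae_down, patch_size = 256, 8, 2
latent_size = resolution // vae_down            # 32x32 latent
num_patches = (latent_size // patch_size) ** 2  # 256 patch tokens
with_cls = num_patches + 1                      # REG adds one class token
print(num_patches, with_cls, f"+{(with_cls / num_patches - 1) * 100:.2f}%")
# → 256 257 +0.39%
```

The linear-cost parts of the network thus grow by well under half a percent, consistent with the stated <0.5% overhead.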
+
+
+ ## 📰 News
+
+ - **[2025.08.05]** We have released the pre-trained weights of REG + SiT-XL/2 at 4M iterations (800 epochs).
+
+
+ ## 📝 Results
+
+ - FID=1.36 on ImageNet $256{\times}256$ by introducing a single class token.
+ - $\textbf{63}\times$ and $\textbf{23}\times$ faster training than SiT-XL/2 and SiT-XL/2 + REPA, respectively.
+
+ <div align="center">
+ <img src="fig/img.png" alt="Results">
+ </div>
+
+
+ ## 📋 Plan
+ - More training steps on ImageNet 256&512 and T2I.
+
+
+ ## 👊 Usage
+
+ ### 1. Environment setup
+
+ ```bash
+ conda create -n reg python=3.10.16 -y
+ conda activate reg
+ pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1
+ pip install -r requirements.txt
+ ```
+
+ ### 2. Dataset
+
+ #### Dataset download
+
+ Currently, we provide experiments for ImageNet. Place the data wherever you like and specify its location via the `--data-dir` argument of the training scripts.
+
+ #### Preprocessing data
+ Please refer to the preprocessing guide. Alternatively, you can directly download our processed data: the ImageNet data ([link](https://huggingface.co/WindATree/ImageNet-256-VAE/tree/main)) and the ImageNet data after the VAE encoder ([link](https://huggingface.co/WindATree/vae-sd/tree/main)).
+
+ ### 3. Training
+ Run train.sh:
+ ```bash
+ bash train.sh
+ ```
+
+ train.sh contains the following content.
+ ```bash
+ # SiT-L/XL use --encoder-depth=8, SiT-B uses 4
+ accelerate launch --multi_gpu --num_processes $NUM_GPUS train.py \
+   --report-to="wandb" \
+   --allow-tf32 \
+   --mixed-precision="fp16" \
+   --seed=0 \
+   --path-type="linear" \
+   --prediction="v" \
+   --weighting="uniform" \
+   --model="SiT-B/2" \
+   --enc-type="dinov2-vit-b" \
+   --proj-coeff=0.5 \
+   --encoder-depth=4 \
+   --output-dir="your_path" \
+   --exp-name="linear-dinov2-b-enc4" \
+   --batch-size=256 \
+   --data-dir="data_path/imagenet_vae" \
+   --cls=0.03
+ ```
+
+ This script automatically creates a folder under `exps` to save logs and checkpoints. You can adjust the following options:
+
+ - `--model`: `[SiT-B/2, SiT-L/2, SiT-XL/2]`
+ - `--enc-type`: `[dinov2-vit-b, clip-vit-L]`
+ - `--proj-coeff`: Any value larger than 0
+ - `--encoder-depth`: Any integer between 1 and the depth of the model
+ - `--output-dir`: Any directory where you want to save checkpoints and logs
+ - `--exp-name`: Any string name (the folder will be created under `output-dir`)
+ - `--cls`: Weight coefficient of the REG loss
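As a rough illustration of how `--proj-coeff` and `--cls` enter the objective, here is a hypothetical sketch of the weighted sum; the actual reduction lives in train.py (not part of this diff), and the function name and signature below are invented for illustration:

```python
# Hypothetical sketch: how the three loss terms might combine.
# The real combination is in train.py; this function is illustrative only.
def total_loss(denoising_loss, proj_loss, denoising_loss_cls,
               proj_coeff=0.5, cls=0.03):
    # --proj-coeff weights the REPA-style projection-alignment term,
    # --cls weights the REG class-token denoising term.
    return denoising_loss + proj_coeff * proj_loss + cls * denoising_loss_cls

print(round(total_loss(1.0, 0.2, 0.5), 3))  # → 1.115
```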
+
+
+ ### 4. Generate images and evaluation
+ You can generate images and obtain the final results with the following script.
+ The REG weights can be found at this [link](https://pan.baidu.com/s/1QX2p3ybh1KfNU7wsp5McWw?pwd=khpp) or on [HF](https://huggingface.co/Martinser/REG/tree/main).
+
+ ```bash
+ bash eval.sh
+ ```
+
+
+ ## Citation
+ If you find our work, this repository, or the pretrained models useful, please consider giving a star and a citation.
+ ```
+ @article{wu2025representation,
+   title={Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think},
+   author={Wu, Ge and Zhang, Shen and Shi, Ruijing and Gao, Shanghua and Chen, Zhenyuan and Wang, Lei and Chen, Zhaowei and Gao, Hongcheng and Tang, Yao and Yang, Jian and others},
+   journal={arXiv preprint arXiv:2507.01467},
+   year={2025}
+ }
+ ```
+
+ ## Contact
+ If you have any questions, please open an issue on this repository, or contact us at gewu.nku@gmail.com or via WeChat (wg1158848).
+
+
+ ## Acknowledgements
+
+ Our code is based on [REPA](https://github.com/sihyun-yu/REPA), along with the [SiT](https://github.com/willisma/SiT), [DINOv2](https://github.com/facebookresearch/dinov2), [ADM](https://github.com/openai/guided-diffusion) and [U-ViT](https://github.com/baofff/U-ViT) repositories. We thank the authors for releasing their code. If you use our model and code, please consider citing these works as well.
+
back/eval.sh ADDED
@@ -0,0 +1,52 @@
+
+ random_number=$((RANDOM % 100 + 1200))
+ NUM_GPUS=8
+ STEP="4000000"
+ SAVE_PATH="your_path/reg_xlarge_dinov2_base_align_8_cls/linear-dinov2-b-enc8"
+ VAE_PATH="your_vae_path/"
+ NUM_STEP=250
+ MODEL_SIZE='XL'
+ CFG_SCALE=2.3
+ CLS_CFG_SCALE=2.3
+ GH=0.85
+
+ export NCCL_P2P_DISABLE=1
+
+ python -m torch.distributed.launch --master_port=$random_number --nproc_per_node=$NUM_GPUS generate.py \
+   --model SiT-XL/2 \
+   --num-fid-samples 50000 \
+   --ckpt ${SAVE_PATH}/checkpoints/${STEP}.pt \
+   --path-type=linear \
+   --encoder-depth=8 \
+   --projector-embed-dims=768 \
+   --per-proc-batch-size=64 \
+   --mode=sde \
+   --num-steps=${NUM_STEP} \
+   --cfg-scale=${CFG_SCALE} \
+   --cls-cfg-scale=${CLS_CFG_SCALE} \
+   --guidance-high=${GH} \
+   --sample-dir ${SAVE_PATH}/checkpoints \
+   --cls=768
+
+
+ python ./evaluations/evaluator.py \
+   --ref_batch your_path/VIRTUAL_imagenet256_labeled.npz \
+   --sample_batch ${SAVE_PATH}/checkpoints/SiT-${MODEL_SIZE}-2-${STEP}-size-256-vae-ema-cfg-${CFG_SCALE}-seed-0-sde-${GH}-${CLS_CFG_SCALE}.npz \
+   --save_path ${SAVE_PATH}/checkpoints \
+   --cfg_cond 1 \
+   --step ${STEP} \
+   --num_steps ${NUM_STEP} \
+   --cfg ${CFG_SCALE} \
+   --cls_cfg ${CLS_CFG_SCALE} \
+   --gh ${GH}
+
back/loss.py ADDED
@@ -0,0 +1,168 @@
+ import torch
+ import numpy as np
+ import torch.nn.functional as F
+
+ try:
+     from scipy.optimize import linear_sum_assignment
+ except ImportError:
+     linear_sum_assignment = None
+
+
+ def ot_pair_noise_to_cls(noise_cls, cls_gt):
+     """
+     Minibatch OT (matching sample_plan_with_scipy in conditional-flow-matching / torchcfm):
+     re-order the noise within the batch under a squared-Euclidean cost so that
+     noise_ot[i] and cls_gt[i] form an approximately optimal-transport pairing.
+     noise_cls, cls_gt: (N, D), or any shape whose trailing dims flatten to D.
+     """
+     n = noise_cls.shape[0]
+     if n <= 1:
+         return noise_cls, cls_gt
+     if linear_sum_assignment is None:
+         return noise_cls, cls_gt
+     x0 = noise_cls.detach().float().reshape(n, -1)
+     x1 = cls_gt.detach().float().reshape(n, -1)
+     M = torch.cdist(x0, x1) ** 2
+     _, j = linear_sum_assignment(M.cpu().numpy())
+     j = torch.as_tensor(j, device=noise_cls.device, dtype=torch.long)
+     return noise_cls[j], cls_gt
+
+
+ def mean_flat(x):
+     """
+     Take the mean over all non-batch dimensions.
+     """
+     return torch.mean(x, dim=list(range(1, len(x.size()))))
+
+
+ def sum_flat(x):
+     """
+     Take the sum over all non-batch dimensions.
+     """
+     return torch.sum(x, dim=list(range(1, len(x.size()))))
+
+
+ class SILoss:
+     def __init__(
+         self,
+         prediction='v',
+         path_type="linear",
+         weighting="uniform",
+         encoders=[],
+         accelerator=None,
+         latents_scale=None,
+         latents_bias=None,
+         t_c=0.5,
+         ot_cls=True,
+     ):
+         self.prediction = prediction
+         self.weighting = weighting
+         self.path_type = path_type
+         self.encoders = encoders
+         self.accelerator = accelerator
+         self.latents_scale = latents_scale
+         self.latents_bias = latents_bias
+         # t follows train.py / JsFlow: t=0 is the clean latent, t=1 is pure noise.
+         # t ∈ (t_c, 1]: the semantic cls evolves from noise to cls_gt along the
+         #   OT-paired path (generative semantic channel);
+         # t ∈ [0, t_c]: cls is held at the ground-truth cls_gt and the target
+         #   velocity is 0 (this channel is no longer interpolated).
+         tc = float(t_c)
+         self.t_c = min(max(tc, 1e-4), 1.0 - 1e-4)
+         self.ot_cls = bool(ot_cls)
+
+     def interpolant(self, t):
+         if self.path_type == "linear":
+             alpha_t = 1 - t
+             sigma_t = t
+             d_alpha_t = -1
+             d_sigma_t = 1
+         elif self.path_type == "cosine":
+             alpha_t = torch.cos(t * np.pi / 2)
+             sigma_t = torch.sin(t * np.pi / 2)
+             d_alpha_t = -np.pi / 2 * torch.sin(t * np.pi / 2)
+             d_sigma_t = np.pi / 2 * torch.cos(t * np.pi / 2)
+         else:
+             raise NotImplementedError()
+
+         return alpha_t, sigma_t, d_alpha_t, d_sigma_t
+
+     def __call__(self, model, images, model_kwargs=None, zs=None, cls_token=None,
+                  time_input=None, noises=None):
+         if model_kwargs is None:
+             model_kwargs = {}
+         # sample timesteps
+         if time_input is None:
+             if self.weighting == "uniform":
+                 time_input = torch.rand((images.shape[0], 1, 1, 1))
+             elif self.weighting == "lognormal":
+                 # sample timestep according to log-normal distribution of sigmas following EDM
+                 rnd_normal = torch.randn((images.shape[0], 1, 1, 1))
+                 sigma = rnd_normal.exp()
+                 if self.path_type == "linear":
+                     time_input = sigma / (1 + sigma)
+                 elif self.path_type == "cosine":
+                     time_input = 2 / np.pi * torch.atan(sigma)
+
+         time_input = time_input.to(device=images.device, dtype=torch.float32)
+         cls_token = cls_token.to(device=images.device, dtype=torch.float32)
+
+         if noises is None:
+             noises = torch.randn_like(images)
+             noises_cls = torch.randn_like(cls_token)
+         else:
+             if isinstance(noises, (tuple, list)) and len(noises) == 2:
+                 noises, noises_cls = noises
+             else:
+                 noises_cls = torch.randn_like(cls_token)
+
+         alpha_t, sigma_t, d_alpha_t, d_sigma_t = self.interpolant(time_input)
+
+         model_input = alpha_t * images + sigma_t * noises
+         if self.prediction == 'v':
+             model_target = d_alpha_t * images + d_sigma_t * noises
+         else:
+             raise NotImplementedError()
+
+         N = images.shape[0]
+         t_flat = time_input.view(-1).float()
+         high_noise_mask = (t_flat > self.t_c).float().view(N, *([1] * (cls_token.dim() - 1)))
+         low_noise_mask = 1.0 - high_noise_mask
+
+         noise_cls_raw = noises_cls
+         if self.ot_cls:
+             noise_cls_paired, cls_gt_paired = ot_pair_noise_to_cls(noise_cls_raw, cls_token)
+         else:
+             noise_cls_paired, cls_gt_paired = noise_cls_raw, cls_token
+
+         tau_shape = (N,) + (1,) * max(0, cls_token.dim() - 1)
+         tau = (time_input.reshape(tau_shape) - self.t_c) / (1.0 - self.t_c + 1e-8)
+         tau = torch.clamp(tau, 0.0, 1.0)
+         alpha_sem = 1.0 - tau
+         sigma_sem = tau
+
+         cls_t_high = alpha_sem * cls_gt_paired + sigma_sem * noise_cls_paired
+         cls_t = high_noise_mask * cls_t_high + low_noise_mask * cls_token
+         cls_t = torch.nan_to_num(cls_t, nan=0.0, posinf=1e4, neginf=-1e4)
+         cls_t = torch.clamp(cls_t, -1e4, 1e4)
+
+         cls_for_model = cls_t * high_noise_mask + cls_t.detach() * low_noise_mask
+
+         inv_scale = 1.0 / (1.0 - self.t_c + 1e-8)
+         v_cls_high = (noise_cls_paired - cls_gt_paired) * inv_scale
+         v_cls_target = high_noise_mask * v_cls_high
+
+         model_output, zs_tilde, cls_output = model(
+             model_input, time_input.flatten(), **model_kwargs, cls_token=cls_for_model
+         )
+
+         # denoising loss
+         denoising_loss = mean_flat((model_output - model_target) ** 2)
+         denoising_loss_cls = mean_flat((cls_output - v_cls_target) ** 2)
+
+         # projection loss
+         proj_loss = 0.
+         bsz = zs[0].shape[0]
+         for i, (z, z_tilde) in enumerate(zip(zs, zs_tilde)):
+             for j, (z_j, z_tilde_j) in enumerate(zip(z, z_tilde)):
+                 z_tilde_j = torch.nn.functional.normalize(z_tilde_j, dim=-1)
+                 z_j = torch.nn.functional.normalize(z_j, dim=-1)
+                 proj_loss += mean_flat(-(z_j * z_tilde_j).sum(dim=-1))
+         proj_loss /= (len(zs) * bsz)
+
+         return denoising_loss, proj_loss, time_input, noises, denoising_loss_cls
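The semantic-channel schedule in `SILoss` can be sanity-checked numerically: for t > t_c, the target velocity `(noise - cls_gt) / (1 - t_c)` should equal the time derivative of the interpolated token. A self-contained NumPy sketch with scalar toy values (not repo code):

```python
import numpy as np

t_c = 0.5                 # matches SILoss's default t_c
cls_gt = np.array([2.0])  # toy ground-truth class token
noise = np.array([-1.0])  # toy OT-paired noise

def cls_t(t):
    # same interpolation as SILoss: tau rescales (t_c, 1] onto (0, 1]
    tau = np.clip((t - t_c) / (1.0 - t_c), 0.0, 1.0)
    return (1.0 - tau) * cls_gt + tau * noise

t, eps = 0.8, 1e-5
numeric = (cls_t(t + eps) - cls_t(t - eps)) / (2 * eps)  # finite difference
analytic = (noise - cls_gt) / (1.0 - t_c)                # v_cls_high in SILoss
print(float(numeric[0]), float(analytic[0]))  # both ≈ -6.0
```

Since the interpolation is linear in t on (t_c, 1], the finite difference matches the analytic velocity up to rounding, which is exactly the consistency the v-prediction target relies on.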
back/requirements.txt ADDED
@@ -0,0 +1,97 @@
+ absl-py==2.2.2
+ accelerate==1.2.1
+ aiohappyeyeballs==2.6.1
+ aiohttp==3.11.16
+ aiosignal==1.3.2
+ astunparse==1.6.3
+ async-timeout==5.0.1
+ attrs==25.3.0
+ certifi==2022.12.7
+ charset-normalizer==2.1.1
+ click==8.1.8
+ datasets==2.20.0
+ diffusers==0.32.1
+ dill==0.3.8
+ docker-pycreds==0.4.0
+ einops==0.8.1
+ filelock==3.13.1
+ flatbuffers==25.2.10
+ frozenlist==1.5.0
+ fsspec==2024.5.0
+ ftfy==6.3.1
+ gast==0.6.0
+ gitdb==4.0.12
+ gitpython==3.1.44
+ google-pasta==0.2.0
+ grpcio==1.71.0
+ h5py==3.13.0
+ huggingface-hub==0.27.1
+ idna==3.4
+ importlib-metadata==8.6.1
+ jinja2==3.1.4
+ joblib==1.4.2
+ keras==3.9.2
+ libclang==18.1.1
+ markdown==3.8
+ markdown-it-py==3.0.0
+ markupsafe==2.1.5
+ mdurl==0.1.2
+ ml-dtypes==0.3.2
+ mpmath==1.3.0
+ multidict==6.4.3
+ multiprocess==0.70.16
+ namex==0.0.8
+ networkx==3.3
+ numpy==1.26.4
+ opt-einsum==3.4.0
+ optree==0.15.0
+ packaging==24.2
+ pandas==2.2.3
+ pillow==11.0.0
+ platformdirs==4.3.7
+ propcache==0.3.1
+ protobuf==4.25.6
+ psutil==7.0.0
+ pyarrow==19.0.1
+ pyarrow-hotfix==0.6
+ pygments==2.19.1
+ python-dateutil==2.9.0.post0
+ pytz==2025.2
+ pyyaml==6.0.2
+ regex==2024.11.6
+ requests==2.32.3
+ rich==14.0.0
+ safetensors==0.5.3
+ scikit-learn==1.5.1
+ scipy==1.15.2
+ sentry-sdk==2.26.1
+ setproctitle==1.3.5
+ six==1.17.0
+ smmap==5.0.2
+ sympy==1.13.1
+ tensorboard==2.16.1
+ tensorboard-data-server==0.7.2
+ tensorflow==2.16.1
+ tensorflow-io-gcs-filesystem==0.37.1
+ termcolor==3.0.1
+ tf-keras==2.16.0
+ threadpoolctl==3.6.0
+ timm==1.0.12
+ tokenizers==0.21.0
+ tqdm==4.67.1
+ transformers==4.47.0
+ triton==2.1.0
+ typing-extensions==4.12.2
+ tzdata==2025.2
+ urllib3==1.26.13
+ wandb==0.17.6
+ wcwidth==0.2.13
+ werkzeug==3.1.3
+ wrapt==1.17.2
+ xformer==1.0.1
+ xformers==0.0.23
+ xxhash==3.5.0
+ yarl==1.20.0
+ zipp==3.21.0
+
back/sample_from_checkpoint.py ADDED
@@ -0,0 +1,596 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ 从 REG/train.py 保存的检查点加载权重,在指定目录生成若干 PNG。
4
+
5
+ 示例:
6
+ python sample_from_checkpoint.py \\
7
+ --ckpt exps/jsflow-experiment/checkpoints/0050000.pt \\
8
+ --out-dir ./samples_gen \\
9
+ --num-images 64 \\
10
+ --batch-size 8
11
+
12
+ # 按训练 t_c 分段分配步数(t=1→t_c 与 t_c→0;--t-c 可省略若检查点含 t_c):
13
+ python sample_from_checkpoint.py ... \\
14
+ --steps-before-tc 150 --steps-after-tc 100 --t-c 0.5
15
+
16
+ # 同一批初始噪声连跑两种 t_c 后段步数(输出到 out-dir 下子目录):
17
+ python sample_from_checkpoint.py ... \\
18
+ --steps-before-tc 150 --steps-after-tc 5 --dual-compare-after
19
+ # 分段时会在 at_tc/(或 at_tc/after_input、at_tc/after_equal_before)额外保存 t≈t_c 的解码图。
20
+
21
+ 检查点需包含 train.py 写入的键:ema(或 model)、args(推荐,用于自动还原结构)。
22
+ 若缺少 args,需通过命令行显式传入 --model、--resolution、--enc-type 等。
23
+ """
24
+
25
+ from __future__ import annotations
26
+
27
+ import argparse
28
+ import os
29
+ import sys
30
+ import types
31
+ import numpy as np
32
+ import torch
33
+ from diffusers.models import AutoencoderKL
34
+ from PIL import Image
35
+ from tqdm import tqdm
36
+
37
+ from models.sit import SiT_models
38
+ from samplers import (
39
+ euler_maruyama_image_noise_before_tc_sampler,
40
+ euler_maruyama_image_noise_sampler,
41
+ euler_maruyama_sampler,
42
+ euler_ode_sampler,
43
+ )
44
+
45
+
46
+ def semantic_dim_from_enc_type(enc_type):
47
+ """与 train.py 一致:按 enc_type 推断语义/class token 维度。"""
48
+ if enc_type is None:
49
+ return 768
50
+ s = str(enc_type).lower()
51
+ if "vit-g" in s or "vitg" in s:
52
+ return 1536
53
+ if "vit-l" in s or "vitl" in s:
54
+ return 1024
55
+ if "vit-s" in s or "vits" in s:
56
+ return 384
57
+ return 768
58
+
59
+
60
+ def load_train_args_from_ckpt(ckpt: dict) -> argparse.Namespace | None:
61
+ a = ckpt.get("args")
62
+ if a is None:
63
+ return None
64
+ if isinstance(a, argparse.Namespace):
65
+ return a
66
+ if isinstance(a, dict):
67
+ return argparse.Namespace(**a)
68
+ if isinstance(a, types.SimpleNamespace):
69
+ return argparse.Namespace(**vars(a))
70
+ return None
71
+
72
+
73
+ def load_vae(device: torch.device):
74
+ """与 train.py 相同策略:优先本地 diffusers 缓存中的 sd-vae-ft-mse。"""
75
+ try:
76
+ from preprocessing import dnnlib
77
+
78
+ cache_dir = dnnlib.make_cache_dir_path("diffusers")
79
+ os.environ.setdefault("HF_HUB_DISABLE_SYMLINKS_WARNING", "1")
80
+ os.environ.setdefault("HF_HUB_DISABLE_PROGRESS_BARS", "1")
81
+ os.environ["HF_HOME"] = cache_dir
82
+ try:
83
+ vae = AutoencoderKL.from_pretrained(
84
+ "stabilityai/sd-vae-ft-mse",
85
+ cache_dir=cache_dir,
86
+ local_files_only=True,
87
+ ).to(device)
88
+ vae.eval()
89
+ print(f"Loaded VAE from local cache: {cache_dir}")
90
+ return vae
91
+ except Exception:
92
+ pass
93
+ candidate_dir = None
94
+ for root_dir in [
95
+ cache_dir,
96
+ os.path.join(os.path.expanduser("~"), ".cache", "dnnlib", "diffusers"),
97
+ os.path.join(os.path.expanduser("~"), ".cache", "diffusers"),
98
+ os.path.join(os.path.expanduser("~"), ".cache", "huggingface", "hub"),
99
+ ]:
100
+ if not os.path.isdir(root_dir):
101
+ continue
102
+ for root, _, files in os.walk(root_dir):
103
+ if "config.json" in files and "sd-vae-ft-mse" in root.replace("\\", "/"):
104
+ candidate_dir = root
105
+ break
106
+ if candidate_dir is not None:
107
+ break
108
+ if candidate_dir is not None:
109
+ vae = AutoencoderKL.from_pretrained(candidate_dir, local_files_only=True).to(device)
110
+ vae.eval()
111
+ print(f"Loaded VAE from {candidate_dir}")
112
+ return vae
113
+ except Exception as e:
114
+ print(f"VAE local cache search failed: {e}", file=sys.stderr)
115
+ try:
116
+ vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to(device)
117
+ vae.eval()
118
+ print("Loaded VAE from Hub: stabilityai/sd-vae-ft-mse")
119
+ return vae
120
+ except Exception as e:
121
+ raise RuntimeError(
122
+ "无法加载 VAE stabilityai/sd-vae-ft-mse,请确认已下载或网络可用。"
123
+ ) from e
124
+
125
+
126
+ def build_model_from_train_args(ta: argparse.Namespace, device: torch.device):
127
+ res = int(getattr(ta, "resolution", 256))
128
+ latent_size = res // 8
129
+ enc_type = getattr(ta, "enc_type", "dinov2-vit-b")
130
+ z_dims = [semantic_dim_from_enc_type(enc_type)]
131
+ block_kwargs = {
132
+ "fused_attn": getattr(ta, "fused_attn", True),
133
+ "qk_norm": getattr(ta, "qk_norm", False),
134
+ }
135
+ cfg_prob = float(getattr(ta, "cfg_prob", 0.1))
136
+ if ta.model not in SiT_models:
137
+ raise ValueError(f"未知 model={ta.model!r},可选:{list(SiT_models.keys())}")
138
+ model = SiT_models[ta.model](
139
+ input_size=latent_size,
140
+ num_classes=int(getattr(ta, "num_classes", 1000)),
141
+ use_cfg=(cfg_prob > 0),
142
+ z_dims=z_dims,
143
+ encoder_depth=int(getattr(ta, "encoder_depth", 8)),
144
+ **block_kwargs,
145
+ ).to(device)
146
+ return model, z_dims[0]
147
+
148
+
149
+ def resolve_tc_schedule(cli, ta):
150
+ """
151
+ 若同时给出 --steps-before-tc 与 --steps-after-tc:在 t_c 处分段(--t-c 缺省则用检查点 args.t_c)。
152
+ 否则使用均匀 --num-steps(与旧版一致)。
153
+ """
154
+ sb = cli.steps_before_tc
155
+ sa = cli.steps_after_tc
156
+ tc = cli.t_c
157
+ if sb is None and sa is None:
158
+ return None, None, None
159
+ if sb is None or sa is None:
160
+ print(
161
+ "使用分段步数时必须同时指定 --steps-before-tc 与 --steps-after-tc。",
162
+ file=sys.stderr,
163
+ )
164
+ sys.exit(1)
165
+ if tc is None:
166
+ tc = getattr(ta, "t_c", None) if ta is not None else None
167
+ if tc is None:
168
+ print(
169
+ "分段采样需要 --t-c,或检查点 args 中含 t_c。",
170
+ file=sys.stderr,
171
+ )
172
+ sys.exit(1)
173
+ return float(tc), int(sb), int(sa)
174
+
175
+
176
+ def parse_cli():
+     p = argparse.ArgumentParser(description="Sample images from a REG checkpoint (optional ODE/EM/EM image-noise)")
+     p.add_argument("--ckpt", type=str, required=True, help="Path to a .pt file saved by train.py")
+     p.add_argument("--out-dir", type=str, required=True, help="Output directory for PNGs (created if missing)")
+     p.add_argument("--num-images", type=int, required=True, help="Total number of images to generate")
+     p.add_argument("--batch-size", type=int, default=16)
+     p.add_argument("--seed", type=int, default=0)
+     p.add_argument(
+         "--weights",
+         type=str,
+         choices=("ema", "model"),
+         default="ema",
+         help="Use the ema or model weights from the checkpoint",
+     )
+     p.add_argument("--device", type=str, default="cuda", help="e.g. cuda or cuda:0")
+     p.add_argument(
+         "--num-steps",
+         type=int,
+         default=50,
+         help="Number of Euler steps on a uniform time grid (used when --steps-before-tc/--steps-after-tc are not given)",
+     )
+     p.add_argument(
+         "--t-c",
+         type=float,
+         default=None,
+         help="Split time: two segments, t∈(t_c,1] and t∈[0,t_c]; defaults to checkpoint args.t_c (requires both segment step counts)",
+     )
+     p.add_argument(
+         "--steps-before-tc",
+         type=int,
+         default=None,
+         help="Steps to integrate from t=1 to t=t_c (paired with --steps-after-tc)",
+     )
+     p.add_argument(
+         "--steps-after-tc",
+         type=int,
+         default=None,
+         help="Steps to integrate from t=t_c to t=0 (via t_floor=0.04)",
+     )
+     p.add_argument("--cfg-scale", type=float, default=1.0)
+     p.add_argument("--cls-cfg-scale", type=float, default=0.0, help="CFG for the cls branch (requires cfg-scale>1 when >0)")
+     p.add_argument("--guidance-low", type=float, default=0.0)
+     p.add_argument("--guidance-high", type=float, default=1.0)
+     p.add_argument(
+         "--path-type",
+         type=str,
+         default=None,
+         choices=["linear", "cosine"],
+         help="Read from checkpoint args by default; can be overridden",
+     )
+     p.add_argument("--legacy", action=argparse.BooleanOptionalAction, default=False)
+     # Fallbacks when the checkpoint has no args
+     p.add_argument("--model", type=str, default=None, help="Required when the checkpoint has no args; must match a SiT_models key, e.g. SiT-XL/2")
+     p.add_argument("--resolution", type=int, default=None, choices=[256, 512])
+     p.add_argument("--num-classes", type=int, default=None)
+     p.add_argument("--encoder-depth", type=int, default=None)
+     p.add_argument("--enc-type", type=str, default=None)
+     p.add_argument("--fused-attn", action=argparse.BooleanOptionalAction, default=None)
+     p.add_argument("--qk-norm", action=argparse.BooleanOptionalAction, default=None)
+     p.add_argument("--cfg-prob", type=float, default=None)
+     p.add_argument(
+         "--sampler",
+         type=str,
+         default="em_image_noise",
+         choices=["ode", "em", "em_image_noise", "em_image_noise_before_tc"],
+         help="Sampler: ode=euler_sampler deterministic drift (linspace 1→0, or piecewise straight to 0 at t_c, no t_floor; grid differs from EM), "
+         "em=standard EM (image + cls noise), em_image_noise=image noise only, "
+         "em_image_noise_before_tc=deterministic image for t<=t_c and deterministic cls throughout",
+     )
+     p.add_argument(
+         "--dual-compare-after",
+         action="store_true",
+         help="Requires piecewise step counts: run the same batch of z/y/cls twice; after_input uses --steps-after-tc, "
+         "after_equal_before sets the after-segment step count equal to --steps-before-tc",
+     )
+     p.add_argument(
+         "--save-fixed-trajectory",
+         action="store_true",
+         help="Save fixed-step sampling trajectories (npy); only enabled for non-em samplers, written to out-dir/trajectory",
+     )
+     return p.parse_args()
+
+
+ def _decode_to_uint8_hwc(latents, latents_bias, latents_scale, vae):
+     imgs = vae.decode((latents - latents_bias) / latents_scale).sample
+     imgs = (imgs + 1) / 2.0
+     imgs = torch.clamp(imgs, 0, 1)
+     return (
+         (imgs * 255.0)
+         .round()
+         .to(torch.uint8)
+         .permute(0, 2, 3, 1)
+         .cpu()
+         .numpy()
+     )
+
+
+ def main():
+     cli = parse_cli()
+     device = torch.device(cli.device if torch.cuda.is_available() else "cpu")
+     if device.type == "cuda":
+         torch.backends.cuda.matmul.allow_tf32 = True
+
+     try:
+         ckpt = torch.load(cli.ckpt, map_location="cpu", weights_only=False)
+     except TypeError:
+         ckpt = torch.load(cli.ckpt, map_location="cpu")
+     ta = load_train_args_from_ckpt(ckpt)
+     if ta is None:
+         if cli.model is None or cli.resolution is None or cli.enc_type is None:
+             print(
+                 "The checkpoint has no args; please specify at least --model --resolution --enc-type "
+                 "(plus --num-classes --encoder-depth as needed)",
+                 file=sys.stderr,
+             )
+             sys.exit(1)
+         ta = argparse.Namespace(
+             model=cli.model,
+             resolution=cli.resolution,
+             num_classes=cli.num_classes if cli.num_classes is not None else 1000,
+             encoder_depth=cli.encoder_depth if cli.encoder_depth is not None else 8,
+             enc_type=cli.enc_type,
+             fused_attn=cli.fused_attn if cli.fused_attn is not None else True,
+             qk_norm=cli.qk_norm if cli.qk_norm is not None else False,
+             cfg_prob=cli.cfg_prob if cli.cfg_prob is not None else 0.1,
+         )
+     else:
+         if cli.model is not None:
+             ta.model = cli.model
+         if cli.resolution is not None:
+             ta.resolution = cli.resolution
+         if cli.num_classes is not None:
+             ta.num_classes = cli.num_classes
+         if cli.encoder_depth is not None:
+             ta.encoder_depth = cli.encoder_depth
+         if cli.enc_type is not None:
+             ta.enc_type = cli.enc_type
+         if cli.fused_attn is not None:
+             ta.fused_attn = cli.fused_attn
+         if cli.qk_norm is not None:
+             ta.qk_norm = cli.qk_norm
+         if cli.cfg_prob is not None:
+             ta.cfg_prob = cli.cfg_prob
+
+     path_type = cli.path_type if cli.path_type is not None else getattr(ta, "path_type", "linear")
+
+     tc_split = resolve_tc_schedule(cli, ta)
+     if cli.dual_compare_after and tc_split[0] is None:
+         print("--dual-compare-after requires --steps-before-tc and --steps-after-tc (piecewise sampling)", file=sys.stderr)
+         sys.exit(1)
+     if tc_split[0] is not None:
+         if cli.dual_compare_after:
+             print(
+                 f"Dual comparison: t_c={tc_split[0]}, before={tc_split[1]}, "
+                 f"after_input={tc_split[2]}, after_equal_before={tc_split[1]}"
+             )
+         else:
+             print(
+                 f"Time grid: t_c={tc_split[0]}, steps (1→t_c)={tc_split[1]}, (t_c→0)={tc_split[2]} "
+                 f"(≈{tc_split[1] + tc_split[2] + 1} model forward passes in total)"
+             )
+     else:
+         print(f"Time grid: uniform num_steps={cli.num_steps}")
+
+     if cli.sampler == "ode":
+         sampler_fn = euler_ode_sampler
+     elif cli.sampler == "em":
+         sampler_fn = euler_maruyama_sampler
+     elif cli.sampler == "em_image_noise_before_tc":
+         sampler_fn = euler_maruyama_image_noise_before_tc_sampler
+     else:
+         sampler_fn = euler_maruyama_image_noise_sampler
+
+     model, cls_dim = build_model_from_train_args(ta, device)
+     wkey = cli.weights
+     if wkey not in ckpt:
+         raise KeyError(f"Checkpoint has no '{wkey}' key; available keys: {list(ckpt.keys())}")
+     state = ckpt[wkey]
+     if cli.legacy:
+         from utils import load_legacy_checkpoints
+
+         state = load_legacy_checkpoints(
+             state_dict=state, encoder_depth=int(getattr(ta, "encoder_depth", 8))
+         )
+     model.load_state_dict(state, strict=True)
+     model.eval()
+
+     vae = load_vae(device)
+     latents_scale = torch.tensor([0.18215] * 4, device=device).view(1, 4, 1, 1)
+     latents_bias = torch.tensor([0.0] * 4, device=device).view(1, 4, 1, 1)
+
+     sampler_args = argparse.Namespace(cls_cfg_scale=float(cli.cls_cfg_scale))
+
+     at_tc_dir = at_tc_a = at_tc_b = None
+     traj_dir = traj_a = traj_b = None
+     if cli.dual_compare_after:
+         out_a = os.path.join(cli.out_dir, "after_input")
+         out_b = os.path.join(cli.out_dir, "after_equal_before")
+         os.makedirs(out_a, exist_ok=True)
+         os.makedirs(out_b, exist_ok=True)
+         if tc_split[0] is not None:
+             at_tc_a = os.path.join(cli.out_dir, "at_tc", "after_input")
+             at_tc_b = os.path.join(cli.out_dir, "at_tc", "after_equal_before")
+             os.makedirs(at_tc_a, exist_ok=True)
+             os.makedirs(at_tc_b, exist_ok=True)
+         if cli.save_fixed_trajectory and cli.sampler != "em":
+             traj_a = os.path.join(cli.out_dir, "trajectory", "after_input")
+             traj_b = os.path.join(cli.out_dir, "trajectory", "after_equal_before")
+             os.makedirs(traj_a, exist_ok=True)
+             os.makedirs(traj_b, exist_ok=True)
+     else:
+         os.makedirs(cli.out_dir, exist_ok=True)
+         if tc_split[0] is not None:
+             at_tc_dir = os.path.join(cli.out_dir, "at_tc")
+             os.makedirs(at_tc_dir, exist_ok=True)
+         if cli.save_fixed_trajectory and cli.sampler != "em":
+             traj_dir = os.path.join(cli.out_dir, "trajectory")
+             os.makedirs(traj_dir, exist_ok=True)
+     latent_size = int(getattr(ta, "resolution", 256)) // 8
+     n_total = int(cli.num_images)
+     b = max(1, int(cli.batch_size))
+
+     torch.manual_seed(cli.seed)
+     if device.type == "cuda":
+         torch.cuda.manual_seed_all(cli.seed)
+
+     written = 0
+     pbar = tqdm(total=n_total, desc="sampling")
+     while written < n_total:
+         cur = min(b, n_total - written)
+         z = torch.randn(cur, model.in_channels, latent_size, latent_size, device=device)
+         y = torch.randint(0, int(ta.num_classes), (cur,), device=device)
+         cls_z = torch.randn(cur, cls_dim, device=device)
+
+         with torch.no_grad():
+             base_kw = dict(
+                 num_steps=cli.num_steps,
+                 cfg_scale=cli.cfg_scale,
+                 guidance_low=cli.guidance_low,
+                 guidance_high=cli.guidance_high,
+                 path_type=path_type,
+                 cls_latents=cls_z,
+                 args=sampler_args,
+             )
+             if cli.dual_compare_after:
+                 tc_v, sb, sa_in = tc_split
+                 # Both full sampling runs consume RNG state; without a reset, the second
+                 # run's 1→t_c noise would differ from the first and z_tc/at_tc would not line up.
+                 # Snapshot the RNG after fixing z/y/cls_z and restore it before the second run,
+                 # so the t_c intermediate state matches (only the after-segment step count differs).
+                 _rng_cpu_dual = torch.random.get_rng_state()
+                 _rng_cuda_dual = (
+                     torch.cuda.get_rng_state_all()
+                     if device.type == "cuda"
+                     else None
+                 )
+                 for _run_i, (subdir, sa, tc_save_dir) in enumerate(
+                     (
+                         (out_a, sa_in, at_tc_a),
+                         (out_b, sb, at_tc_b),
+                     )
+                 ):
+                     if _run_i > 0:
+                         torch.random.set_rng_state(_rng_cpu_dual)
+                         if _rng_cuda_dual is not None:
+                             torch.cuda.set_rng_state_all(_rng_cuda_dual)
+                     em_kw = dict(base_kw)
+                     em_kw["t_c"] = tc_v
+                     em_kw["num_steps_before_tc"] = sb
+                     em_kw["num_steps_after_tc"] = sa
+                     if cli.sampler == "em_image_noise_before_tc":
+                         if cli.save_fixed_trajectory and cli.sampler != "em":
+                             latents, z_tc, cls_tc, cls_t0, traj = sampler_fn(
+                                 model,
+                                 z,
+                                 y,
+                                 **em_kw,
+                                 return_mid_state=True,
+                                 t_mid=float(tc_v),
+                                 return_cls_final=True,
+                                 return_trajectory=True,
+                             )
+                         else:
+                             latents, z_tc, cls_tc, cls_t0 = sampler_fn(
+                                 model,
+                                 z,
+                                 y,
+                                 **em_kw,
+                                 return_mid_state=True,
+                                 t_mid=float(tc_v),
+                                 return_cls_final=True,
+                             )
+                             traj = None
+                     else:
+                         if cli.save_fixed_trajectory and cli.sampler != "em":
+                             latents, z_tc, cls_tc, traj = sampler_fn(
+                                 model,
+                                 z,
+                                 y,
+                                 **em_kw,
+                                 return_mid_state=True,
+                                 t_mid=float(tc_v),
+                                 return_trajectory=True,
+                             )
+                         else:
+                             latents, z_tc, cls_tc = sampler_fn(
+                                 model,
+                                 z,
+                                 y,
+                                 **em_kw,
+                                 return_mid_state=True,
+                                 t_mid=float(tc_v),
+                             )
+                             traj = None
+                         cls_t0 = None
+                     latents = latents.to(torch.float32)
+                     imgs = _decode_to_uint8_hwc(latents, latents_bias, latents_scale, vae)
+                     for i in range(cur):
+                         Image.fromarray(imgs[i]).save(
+                             os.path.join(subdir, f"{written + i:06d}.png")
+                         )
+                     if tc_save_dir is not None and z_tc is not None:
+                         imgs_tc = _decode_to_uint8_hwc(
+                             z_tc.to(torch.float32), latents_bias, latents_scale, vae
+                         )
+                         for i in range(cur):
+                             Image.fromarray(imgs_tc[i]).save(
+                                 os.path.join(tc_save_dir, f"{written + i:06d}.png")
+                             )
+                     if traj is not None:
+                         traj_np = torch.stack(traj, dim=0).to(torch.float32).cpu().numpy()
+                         save_traj_dir = traj_a if subdir == out_a else traj_b
+                         np.save(os.path.join(save_traj_dir, f"{written:06d}_traj.npy"), traj_np)
+             else:
+                 em_kw = dict(base_kw)
+                 if tc_split[0] is not None:
+                     em_kw["t_c"] = tc_split[0]
+                     em_kw["num_steps_before_tc"] = tc_split[1]
+                     em_kw["num_steps_after_tc"] = tc_split[2]
+                     if cli.sampler == "em_image_noise_before_tc":
+                         if cli.save_fixed_trajectory and cli.sampler != "em":
+                             latents, z_tc, cls_tc, cls_t0, traj = sampler_fn(
+                                 model,
+                                 z,
+                                 y,
+                                 **em_kw,
+                                 return_mid_state=True,
+                                 t_mid=float(tc_split[0]),
+                                 return_cls_final=True,
+                                 return_trajectory=True,
+                             )
+                         else:
+                             latents, z_tc, cls_tc, cls_t0 = sampler_fn(
+                                 model,
+                                 z,
+                                 y,
+                                 **em_kw,
+                                 return_mid_state=True,
+                                 t_mid=float(tc_split[0]),
+                                 return_cls_final=True,
+                             )
+                             traj = None
+                     else:
+                         if cli.save_fixed_trajectory and cli.sampler != "em":
+                             latents, z_tc, cls_tc, traj = sampler_fn(
+                                 model,
+                                 z,
+                                 y,
+                                 **em_kw,
+                                 return_mid_state=True,
+                                 t_mid=float(tc_split[0]),
+                                 return_trajectory=True,
+                             )
+                         else:
+                             latents, z_tc, cls_tc = sampler_fn(
+                                 model,
+                                 z,
+                                 y,
+                                 **em_kw,
+                                 return_mid_state=True,
+                                 t_mid=float(tc_split[0]),
+                             )
+                             traj = None
+                         cls_t0 = None
+                     latents = latents.to(torch.float32)
+                     if z_tc is not None and at_tc_dir is not None:
+                         imgs_tc = _decode_to_uint8_hwc(
+                             z_tc.to(torch.float32), latents_bias, latents_scale, vae
+                         )
+                         for i in range(cur):
+                             Image.fromarray(imgs_tc[i]).save(
+                                 os.path.join(at_tc_dir, f"{written + i:06d}.png")
+                             )
+                     if traj is not None and traj_dir is not None:
+                         traj_np = torch.stack(traj, dim=0).to(torch.float32).cpu().numpy()
+                         np.save(os.path.join(traj_dir, f"{written:06d}_traj.npy"), traj_np)
+                 else:
+                     latents = sampler_fn(model, z, y, **em_kw).to(torch.float32)
+                 imgs = _decode_to_uint8_hwc(latents, latents_bias, latents_scale, vae)
+                 for i in range(cur):
+                     Image.fromarray(imgs[i]).save(
+                         os.path.join(cli.out_dir, f"{written + i:06d}.png")
+                     )
+         written += cur
+         pbar.update(cur)
+     pbar.close()
+     if cli.dual_compare_after:
+         msg = (
+             f"Done. Saved {written} images per run under {out_a} and {out_b} "
+             f"(parent: {cli.out_dir})"
+         )
+         if tc_split[0] is not None and at_tc_a is not None:
+             msg += f"; t≈t_c decoded under {at_tc_a} and {at_tc_b}"
+         print(msg)
+     else:
+         msg = f"Done. Saved {written} images under {cli.out_dir}"
+         if tc_split[0] is not None and at_tc_dir is not None:
+             msg += f"; t≈t_c decoded under {at_tc_dir}"
+         print(msg)
+
+
+ if __name__ == "__main__":
+     main()
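The `--t-c`/`--steps-before-tc`/`--steps-after-tc` flags above describe a two-segment Euler grid: t=1 → t_c, then t_c → a floor of 0.04. A minimal sketch of how such a grid could be built (the helper name `make_piecewise_grid` is illustrative; the actual grid construction lives in the sampler functions, which are not part of this diff):

```python
def make_piecewise_grid(t_c, steps_before, steps_after, t_floor=0.04):
    """Descending time grid: t=1 -> t_c in steps_before steps,
    then t_c -> t_floor in steps_after steps (t_c appears once)."""
    before = [1.0 - i * (1.0 - t_c) / steps_before for i in range(steps_before + 1)]
    after = [t_c - i * (t_c - t_floor) / steps_after for i in range(1, steps_after + 1)]
    return before + after

# 50 + 5 intervals -> 51 + 5 = 56 grid points, consistent with the
# "(≈ steps_before + steps_after + 1 model forward passes)" log line above
grid = make_piecewise_grid(t_c=0.75, steps_before=50, steps_after=5)
```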
back/samples.sh ADDED
@@ -0,0 +1,15 @@
+ #!/usr/bin/env bash
+ # For a dual-run step-count comparison use --dual-compare-after (see sample_from_checkpoint.py); outputs go to subdirectories of out-dir.
+
+ CUDA_VISIBLE_DEVICES=1 python sample_from_checkpoint.py \
+     --ckpt /gemini/space/zhaozy/guzhenyu/UAVFlow/UAV_Flow_base/exps/jsflow-experiment/samples/REG/exps/jsflow-experiment-0.75/checkpoints/0500000.pt \
+     --out-dir ./my_samples_test \
+     --num-images 24 \
+     --batch-size 4 \
+     --seed 0 \
+     --t-c 0.75 \
+     --steps-before-tc 50 \
+     --steps-after-tc 5 \
+     --sampler ode \
+     --cfg-scale 1.0 \
+     --dual-compare-after
back/samples_0.5.log ADDED
The diff for this file is too large to render. See raw diff
 
back/samples_ddp.sh ADDED
@@ -0,0 +1,32 @@
+ #!/usr/bin/env bash
+
+ # 4-GPU DDP single-path sampling (no dual-compare, no at_tc intermediate images)
+ CUDA_VISIBLE_DEVICES=0,1,2,3 nohup torchrun \
+     --nnodes=1 \
+     --nproc_per_node=4 \
+     --rdzv_endpoint=localhost:29110 \
+     sample_from_checkpoint_ddp.py \
+     --ckpt /gemini/space/zhaozy/guzhenyu/UAVFlow/UAV_Flow_base/exps/jsflow-experiment/samples/REG/exps/jsflow-experiment-0.75/checkpoints/0600000.pt \
+     --out-dir ./my_samples_600k_new \
+     --num-images 40000 \
+     --batch-size 64 \
+     --seed 0 \
+     --t-c 0.75 \
+     --steps-before-tc 100 \
+     --steps-after-tc 50 \
+     --sampler em_image_noise_before_tc \
+     --cfg-scale 1.0 \
+     > samples_0.75_new.log 2>&1 &
+
+ # nohup python sample_from_checkpoint_ddp.py \
+ #     --ckpt /gemini/space/zhaozy/guzhenyu/UAVFlow/UAV_Flow_base/exps/jsflow-experiment/samples/REG/exps/jsflow-experiment-0.5/checkpoints/0250000.pt \
+ #     --out-dir ./my_samples_5 \
+ #     --num-images 20000 \
+ #     --batch-size 16 \
+ #     --seed 0 \
+ #     --t-c 0.5 \
+ #     --steps-before-tc 100 \
+ #     --steps-after-tc 50 \
+ #     --sampler em_image_noise_before_tc \
+ #     --cfg-scale 1.0 \
+ #     > samples_0.5.log 2>&1 &
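The DDP launch above splits `--num-images 40000` across 4 ranks. One plausible sharding scheme (a sketch only; `sample_from_checkpoint_ddp.py` is not shown in this diff, so its actual partitioning may differ):

```python
def shard_counts(num_images, world_size):
    # Split the total evenly; the first (num_images % world_size) ranks take one extra image.
    base, rem = divmod(num_images, world_size)
    return [base + (1 if r < rem else 0) for r in range(world_size)]

# 40000 images over 4 ranks -> 10000 images per GPU
```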
back/train.py ADDED
@@ -0,0 +1,670 @@
+ import argparse
+ import copy
+ from copy import deepcopy
+ import logging
+ import os
+ from pathlib import Path
+ from collections import OrderedDict
+ import json
+
+ import numpy as np
+ import torch
+ import torch.nn.functional as F
+ import torch.utils.checkpoint
+ from tqdm.auto import tqdm
+ from torch.utils.data import DataLoader
+
+ from accelerate import Accelerator, DistributedDataParallelKwargs
+ from accelerate.logging import get_logger
+ from accelerate.utils import ProjectConfiguration, set_seed
+
+ from models.sit import SiT_models
+ from loss import SILoss
+ from utils import load_encoders
+
+ from dataset import CustomDataset
+ from diffusers.models import AutoencoderKL
+ # import wandb_utils
+ import wandb
+ import math
+ from torchvision.utils import make_grid
+ from timm.data import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD
+ from torchvision.transforms import Normalize
+ from PIL import Image
+
+ logger = get_logger(__name__)
+
+
+ def semantic_dim_from_enc_type(enc_type):
+     """Infer the class-token dimension from an enc_type string such as DINOv2 (matches the preprocessed features)."""
+     if enc_type is None:
+         return 768
+     s = str(enc_type).lower()
+     if "vit-g" in s or "vitg" in s:
+         return 1536
+     if "vit-l" in s or "vitl" in s:
+         return 1024
+     if "vit-s" in s or "vits" in s:
+         return 384
+     return 768
+
+
+ CLIP_DEFAULT_MEAN = (0.48145466, 0.4578275, 0.40821073)
+ CLIP_DEFAULT_STD = (0.26862954, 0.26130258, 0.27577711)
+
+
+ def preprocess_raw_image(x, enc_type):
+     resolution = x.shape[-1]
+     if 'clip' in enc_type:
+         x = x / 255.
+         x = torch.nn.functional.interpolate(x, 224 * (resolution // 256), mode='bicubic')
+         x = Normalize(CLIP_DEFAULT_MEAN, CLIP_DEFAULT_STD)(x)
+     elif 'mocov3' in enc_type or 'mae' in enc_type:
+         x = x / 255.
+         x = Normalize(IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD)(x)
+     elif 'dinov2' in enc_type:
+         x = x / 255.
+         x = Normalize(IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD)(x)
+         x = torch.nn.functional.interpolate(x, 224 * (resolution // 256), mode='bicubic')
+     elif 'dinov1' in enc_type:
+         x = x / 255.
+         x = Normalize(IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD)(x)
+     elif 'jepa' in enc_type:
+         x = x / 255.
+         x = Normalize(IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD)(x)
+         x = torch.nn.functional.interpolate(x, 224 * (resolution // 256), mode='bicubic')
+
+     return x
+
+
+ def array2grid(x):
+     nrow = round(math.sqrt(x.size(0)))
+     x = make_grid(x.clamp(0, 1), nrow=nrow, value_range=(0, 1))
+     x = x.mul(255).add_(0.5).clamp_(0, 255).permute(1, 2, 0).to('cpu', torch.uint8).numpy()
+     return x
+
+
+ @torch.no_grad()
+ def sample_posterior(moments, latents_scale=1., latents_bias=0.):
+     device = moments.device
+
+     mean, std = torch.chunk(moments, 2, dim=1)
+     z = mean + std * torch.randn_like(mean)
+     z = (z * latents_scale + latents_bias)
+     return z
+
+
+ @torch.no_grad()
+ def update_ema(ema_model, model, decay=0.9999):
+     """
+     Step the EMA model towards the current model.
+     """
+     ema_params = OrderedDict(ema_model.named_parameters())
+     model_params = OrderedDict(model.named_parameters())
+
+     for name, param in model_params.items():
+         name = name.replace("module.", "")
+         # TODO: Consider applying only to params that require_grad to avoid small numerical changes of pos_embed
+         ema_params[name].mul_(decay).add_(param.data, alpha=1 - decay)
+
+
+ def create_logger(logging_dir):
+     """
+     Create a logger that writes to a log file and stdout.
+     """
+     logging.basicConfig(
+         level=logging.INFO,
+         format='[\033[34m%(asctime)s\033[0m] %(message)s',
+         datefmt='%Y-%m-%d %H:%M:%S',
+         handlers=[logging.StreamHandler(), logging.FileHandler(f"{logging_dir}/log.txt")]
+     )
+     logger = logging.getLogger(__name__)
+     return logger
+
+
+ def requires_grad(model, flag=True):
+     """
+     Set requires_grad flag for all parameters in a model.
+     """
+     for p in model.parameters():
+         p.requires_grad = flag
+
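`update_ema` above applies the standard exponential-moving-average recurrence in-place via `mul_`/`add_`. The same recurrence on plain floats, as a small sanity check (a sketch independent of torch):

```python
def ema_step(ema, value, decay=0.9999):
    # ema <- decay * ema + (1 - decay) * value, matching update_ema above
    return decay * ema + (1 - decay) * value

x = 0.0
for _ in range(3):
    x = ema_step(x, 1.0, decay=0.5)
# three steps toward 1.0 with decay=0.5: 0.5, 0.75, 0.875
```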
+
+ #################################################################################
+ #                                 Training Loop                                 #
+ #################################################################################
+
+ def main(args):
+     # set accelerator
+     logging_dir = Path(args.output_dir, args.logging_dir)
+     accelerator_project_config = ProjectConfiguration(
+         project_dir=args.output_dir, logging_dir=logging_dir
+     )
+
+     accelerator = Accelerator(
+         gradient_accumulation_steps=args.gradient_accumulation_steps,
+         mixed_precision=args.mixed_precision,
+         log_with=args.report_to,
+         project_config=accelerator_project_config,
+         kwargs_handlers=[DistributedDataParallelKwargs(find_unused_parameters=True)]
+     )
+
+     if accelerator.is_main_process:
+         os.makedirs(args.output_dir, exist_ok=True)  # Make results folder (holds all experiment subfolders)
+         save_dir = os.path.join(args.output_dir, args.exp_name)
+         os.makedirs(save_dir, exist_ok=True)
+         args_dict = vars(args)
+         # Save to a JSON file
+         json_dir = os.path.join(save_dir, "args.json")
+         with open(json_dir, 'w') as f:
+             json.dump(args_dict, f, indent=4)
+         checkpoint_dir = f"{save_dir}/checkpoints"  # Stores saved model checkpoints
+         os.makedirs(checkpoint_dir, exist_ok=True)
+         logger = create_logger(save_dir)
+         logger.info(f"Experiment directory created at {save_dir}")
+     device = accelerator.device
+     if torch.backends.mps.is_available():
+         accelerator.native_amp = False
+     if args.seed is not None:
+         set_seed(args.seed + accelerator.process_index)
+
+     # Create model:
+     assert args.resolution % 8 == 0, "Image size must be divisible by 8 (for the VAE encoder)."
+     latent_size = args.resolution // 8
+
+     train_dataset = CustomDataset(
+         args.data_dir, semantic_features_dir=args.semantic_features_dir
+     )
+     use_preprocessed_semantic = train_dataset.use_preprocessed_semantic
+
+     if use_preprocessed_semantic:
+         encoders, encoder_types, architectures = [], [], []
+         z_dims = [semantic_dim_from_enc_type(args.enc_type)]
+         if accelerator.is_main_process:
+             logger.info(
+                 f"Preprocessed semantic features: skip loading online encoder, z_dims={z_dims}"
+             )
+     elif args.enc_type is not None:
+         encoders, encoder_types, architectures = load_encoders(
+             args.enc_type, device, args.resolution
+         )
+         z_dims = [encoder.embed_dim for encoder in encoders]
+     else:
+         raise NotImplementedError()
+     block_kwargs = {"fused_attn": args.fused_attn, "qk_norm": args.qk_norm}
+     model = SiT_models[args.model](
+         input_size=latent_size,
+         num_classes=args.num_classes,
+         use_cfg=(args.cfg_prob > 0),
+         z_dims=z_dims,
+         encoder_depth=args.encoder_depth,
+         **block_kwargs
+     )
+
+     model = model.to(device)
+     ema = deepcopy(model).to(device)  # Create an EMA of the model for use after training
+     requires_grad(ema, False)
+
+     latents_scale = torch.tensor(
+         [0.18215, 0.18215, 0.18215, 0.18215]
+     ).view(1, 4, 1, 1).to(device)
+     latents_bias = torch.tensor(
+         [0., 0., 0., 0.]
+     ).view(1, 4, 1, 1).to(device)
+
+     # VAE decoder: decodes latents to images during sampling (same as root train.py / preprocessing: sd-vae-ft-mse)
+     try:
+         from preprocessing import dnnlib
+         cache_dir = dnnlib.make_cache_dir_path("diffusers")
+         os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"
+         os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
+         os.environ["HF_HOME"] = cache_dir
+         try:
+             vae = AutoencoderKL.from_pretrained(
+                 "stabilityai/sd-vae-ft-mse",
+                 cache_dir=cache_dir,
+                 local_files_only=True,
+             ).to(device)
+             vae.eval()
+             if accelerator.is_main_process:
+                 logger.info(
+                     "Loaded VAE 'stabilityai/sd-vae-ft-mse' from local diffusers cache "
+                     f"at '{cache_dir}' for intermediate sampling."
+                 )
+         except Exception as e_main:
+             vae = None
+             candidate_dir = None
+             possible_roots = [
+                 cache_dir,
+                 os.path.join(os.path.expanduser("~"), ".cache", "dnnlib", "diffusers"),
+                 os.path.join(os.path.expanduser("~"), ".cache", "diffusers"),
+                 os.path.join(os.path.expanduser("~"), ".cache", "huggingface", "hub"),
+             ]
+             checked_roots = []
+             for root_dir in possible_roots:
+                 if not os.path.isdir(root_dir):
+                     continue
+                 checked_roots.append(root_dir)
+                 for root, dirs, files in os.walk(root_dir):
+                     if "config.json" in files and "sd-vae-ft-mse" in root.replace("\\", "/"):
+                         candidate_dir = root
+                         break
+                 if candidate_dir is not None:
+                     break
+             if candidate_dir is not None:
+                 try:
+                     vae = AutoencoderKL.from_pretrained(
+                         candidate_dir,
+                         local_files_only=True,
+                     ).to(device)
+                     vae.eval()
+                     if accelerator.is_main_process:
+                         logger.info(
+                             "Loaded VAE 'stabilityai/sd-vae-ft-mse' from discovered local path "
+                             f"'{candidate_dir}'. Searched roots: {checked_roots}"
+                         )
+                 except Exception as e_fallback:
+                     if accelerator.is_main_process:
+                         logger.warning(
+                             "Tried to load VAE from discovered local path "
+                             f"'{candidate_dir}' but failed: {e_fallback}"
+                         )
+             if vae is None and accelerator.is_main_process:
+                 logger.warning(
+                     "Could not load VAE 'stabilityai/sd-vae-ft-mse' via repo name or local search. "
+                     f"Last repo-level error: {e_main}"
+                 )
+     except Exception as e:
+         vae = None
+         if accelerator.is_main_process:
+             logger.warning(
+                 f"Failed to initialize VAE loading logic (will skip image decoding): {e}"
+             )
+
+     # create loss function
+     loss_fn = SILoss(
+         prediction=args.prediction,
+         path_type=args.path_type,
+         encoders=encoders,
+         accelerator=accelerator,
+         latents_scale=latents_scale,
+         latents_bias=latents_bias,
+         weighting=args.weighting,
+         t_c=args.t_c,
+         ot_cls=args.ot_cls,
+     )
+     if accelerator.is_main_process:
+         logger.info(f"SiT Parameters: {sum(p.numel() for p in model.parameters()):,}")
+
+     # Setup optimizer (we used default Adam betas=(0.9, 0.999) and a constant learning rate of 1e-4 in our paper):
+     if args.allow_tf32:
+         torch.backends.cuda.matmul.allow_tf32 = True
+         torch.backends.cudnn.allow_tf32 = True
+
+     optimizer = torch.optim.AdamW(
+         model.parameters(),
+         lr=args.learning_rate,
+         betas=(args.adam_beta1, args.adam_beta2),
+         weight_decay=args.adam_weight_decay,
+         eps=args.adam_epsilon,
+     )
+
+     # Setup data (train_dataset was already created above)
+     local_batch_size = int(args.batch_size // accelerator.num_processes)
+     train_dataloader = DataLoader(
+         train_dataset,
+         batch_size=local_batch_size,
+         shuffle=True,
+         num_workers=args.num_workers,
+         pin_memory=True,
+         drop_last=True
+     )
+     if accelerator.is_main_process:
+         logger.info(f"Dataset contains {len(train_dataset):,} images ({args.data_dir})")
+
+     # Prepare models for training:
+     update_ema(ema, model, decay=0)  # Ensure EMA is initialized with synced weights
+     model.train()  # important! This enables embedding dropout for classifier-free guidance
+     ema.eval()  # EMA model should always be in eval mode
+
+     # resume:
+     global_step = 0
+     if args.resume_step > 0:
+         ckpt_name = str(args.resume_step).zfill(7) + '.pt'
+         ckpt = torch.load(
+             f'{os.path.join(args.output_dir, args.exp_name)}/checkpoints/{ckpt_name}',
+             map_location='cpu',
+         )
+         model.load_state_dict(ckpt['model'])
+         ema.load_state_dict(ckpt['ema'])
+         optimizer.load_state_dict(ckpt['opt'])
+         global_step = ckpt['steps']
+
+     model, optimizer, train_dataloader = accelerator.prepare(
+         model, optimizer, train_dataloader
+     )
+
+     if accelerator.is_main_process:
+         tracker_config = vars(copy.deepcopy(args))
+         accelerator.init_trackers(
+             project_name="REG",
+             config=tracker_config,
+             init_kwargs={
+                 "wandb": {"name": f"{args.exp_name}"}
+             },
+         )
+
+     progress_bar = tqdm(
+         range(0, args.max_train_steps),
+         initial=global_step,
+         desc="Steps",
+         # Only show the progress bar once on each machine.
+         disable=not accelerator.is_local_main_process,
+     )
+
+     # Labels to condition the model with (feel free to change):
+     sample_batch_size = 64 // accelerator.num_processes
+     first_batch = next(iter(train_dataloader))
+     if len(first_batch) == 4:
+         gt_raw_images, gt_xs, _, _ = first_batch
+     else:
+         gt_raw_images, gt_xs, _ = first_batch
+     assert gt_raw_images.shape[-1] == args.resolution
+     gt_xs = gt_xs[:sample_batch_size]
+     gt_xs = sample_posterior(
+         gt_xs.to(device), latents_scale=latents_scale, latents_bias=latents_bias
+     )
+     ys = torch.randint(1000, size=(sample_batch_size,), device=device)
+     ys = ys.to(device)
+     # Create sampling noise:
+     n = ys.size(0)
+     xT = torch.randn((n, 4, latent_size, latent_size), device=device)
+
+ for epoch in range(args.epochs):
386
+ model.train()
387
+ for batch in train_dataloader:
388
+ if len(batch) == 4:
389
+ raw_image, x, r_preprocessed, y = batch
390
+ use_sem_file = True
391
+ else:
392
+ raw_image, x, y = batch
393
+ r_preprocessed = None
394
+ use_sem_file = False
395
+
396
+ raw_image = raw_image.to(device)
397
+ x = x.squeeze(dim=1).to(device).float()
398
+ y = y.to(device)
399
+ if args.legacy:
400
+ # In our early experiments, we accidentally apply label dropping twice:
401
+ # once in train.py and once in sit.py.
402
+ # We keep this option for exact reproducibility with previous runs.
403
+ drop_ids = torch.rand(y.shape[0], device=y.device) < args.cfg_prob
404
+ labels = torch.where(drop_ids, args.num_classes, y)
405
+ else:
406
+ labels = y
407
+ with torch.no_grad():
408
+ x = sample_posterior(x, latents_scale=latents_scale, latents_bias=latents_bias)
409
+ zs = []
410
+ if use_sem_file and r_preprocessed is not None:
411
+ cls_token = r_preprocessed.to(device).float()
412
+ if cls_token.dim() == 1:
413
+ cls_token = cls_token.unsqueeze(0)
414
+ while cls_token.dim() > 2:
415
+ cls_token = cls_token.squeeze(1)
416
+ base_m = model.module if hasattr(model, "module") else model
417
+ n_pad = base_m.x_embedder.num_patches
418
+ zs = [
419
+ torch.cat(
420
+ [
421
+ cls_token.unsqueeze(1),
422
+ cls_token.unsqueeze(1).expand(-1, n_pad, -1),
423
+ ],
424
+ dim=1,
425
+ )
426
+ ]
427
+ else:
428
+ with accelerator.autocast():
429
+ for encoder, encoder_type, arch in zip(
430
+ encoders, encoder_types, architectures
431
+ ):
432
+ raw_image_ = preprocess_raw_image(raw_image, encoder_type)
433
+ z = encoder.forward_features(raw_image_)
434
+ if 'dinov2' in encoder_type:
435
+ dense_z = z['x_norm_patchtokens']
436
+ cls_token = z['x_norm_clstoken']
437
+ dense_z = torch.cat([cls_token.unsqueeze(1), dense_z], dim=1)
438
+ else:
439
+ exit()
440
+ zs.append(dense_z)
441
+
442
+         with accelerator.accumulate(model):
+             model_kwargs = dict(y=labels)
+             loss1, proj_loss1, time_input, noises, loss2 = loss_fn(model, x, model_kwargs, zs=zs,
+                                                                    cls_token=cls_token,
+                                                                    time_input=None, noises=None)
+             loss_mean = loss1.mean()
+             loss_mean_cls = loss2.mean() * args.cls
+             proj_loss_mean = proj_loss1.mean() * args.proj_coeff
+             loss = loss_mean + proj_loss_mean + loss_mean_cls
+
+             ## optimization
+             accelerator.backward(loss)
+             if accelerator.sync_gradients:
+                 params_to_clip = model.parameters()
+                 grad_norm = accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
+             optimizer.step()
+             optimizer.zero_grad(set_to_none=True)
+
+         if accelerator.sync_gradients:
+             update_ema(ema, model)  # update the EMA copy of the weights
+
+         if accelerator.sync_gradients:
+             progress_bar.update(1)
+             global_step += 1
+             if global_step % args.checkpointing_steps == 0 and global_step > 0:
+                 if accelerator.is_main_process:
+                     checkpoint = {
+                         "model": model.module.state_dict(),
+                         "ema": ema.state_dict(),
+                         "opt": optimizer.state_dict(),
+                         "args": args,
+                         "steps": global_step,
+                     }
+                     checkpoint_path = f"{checkpoint_dir}/{global_step:07d}.pt"
+                     torch.save(checkpoint, checkpoint_path)
+                     logger.info(f"Saved checkpoint to {checkpoint_path}")
+
+             if global_step == 1 or (global_step % args.sampling_steps == 0 and global_step > 0):
+                 t_mid_vis = float(args.t_c)
+                 tc_tag = f"{t_mid_vis:.4f}".rstrip("0").rstrip(".").replace(".", "_")
+                 logging.info(
+                     f"Generating EMA samples (Euler-Maruyama; t≈{t_mid_vis:g} → t=0)..."
+                 )
+                 ema.eval()
+                 with torch.no_grad():
+                     latent_size = args.resolution // 8
+                     n_samples = min(16, args.batch_size)
+                     base_model = model.module if hasattr(model, "module") else model
+                     cls_dim = base_model.z_dims[0]
+                     # Re-seed with the same value so latent and cls noise stay paired.
+                     shared_seed = torch.randint(0, 2**32, (1,), device=device).item()
+                     torch.manual_seed(shared_seed)
+                     z_init = torch.randn(n_samples, base_model.in_channels, latent_size, latent_size, device=device)
+                     torch.manual_seed(shared_seed)
+                     cls_init = torch.randn(n_samples, cls_dim, device=device)
+                     y_samples = torch.randint(0, args.num_classes, (n_samples,), device=device)
+
+                     from samplers import euler_maruyama_sampler
+                     z_0, z_mid, _ = euler_maruyama_sampler(
+                         ema,
+                         z_init,
+                         y_samples,
+                         num_steps=50,
+                         cfg_scale=1.0,
+                         guidance_low=0.0,
+                         guidance_high=1.0,
+                         path_type=args.path_type,
+                         cls_latents=cls_init,
+                         args=args,
+                         return_mid_state=True,
+                         t_mid=t_mid_vis,
+                     )
+
+                     samples_root = os.path.join(args.output_dir, args.exp_name, "samples")
+                     t0_dir = os.path.join(samples_root, "t0")
+                     t_mid_dir = os.path.join(samples_root, f"t0_{tc_tag}")
+                     os.makedirs(t0_dir, exist_ok=True)
+                     os.makedirs(t_mid_dir, exist_ok=True)
+
+                     if vae is not None:
+                         z_f = z_0.to(dtype=torch.float32)
+                         samples_final = vae.decode((z_f - latents_bias) / latents_scale).sample
+                         samples_final = (samples_final + 1) / 2.0
+                         samples_final = samples_final.clamp(0, 1)
+                         grid_final = array2grid(samples_final)
+                         Image.fromarray(grid_final).save(
+                             os.path.join(t0_dir, f"step_{global_step:07d}_t0.png")
+                         )
+
+                         if z_mid is not None:
+                             z_m = z_mid.to(dtype=torch.float32)
+                             samples_mid = vae.decode((z_m - latents_bias) / latents_scale).sample
+                             samples_mid = (samples_mid + 1) / 2.0
+                             samples_mid = samples_mid.clamp(0, 1)
+                             grid_mid = array2grid(samples_mid)
+                             Image.fromarray(grid_mid).save(
+                                 os.path.join(t_mid_dir, f"step_{global_step:07d}_t0_{tc_tag}.png")
+                             )
+                         else:
+                             logging.warning(
+                                 f"Sampling time grid did not bracket t_mid={t_mid_vis:g}; "
+                                 f"skipping the t0_{tc_tag} image this step."
+                             )
+
+                     del z_init, cls_init, y_samples, z_0
+                     if z_mid is not None:
+                         del z_mid
+                     if vae is not None:
+                         del samples_final, grid_final
+                         if "samples_mid" in locals():
+                             del samples_mid, grid_mid
+                     torch.cuda.empty_cache()
+
+         logs = {
+             "loss_final": accelerator.gather(loss).mean().detach().item(),
+             "loss_mean": accelerator.gather(loss_mean).mean().detach().item(),
+             "proj_loss": accelerator.gather(proj_loss_mean).mean().detach().item(),
+             "loss_mean_cls": accelerator.gather(loss_mean_cls).mean().detach().item(),
+             "grad_norm": accelerator.gather(grad_norm).mean().detach().item(),
+         }
+
+         log_message = ", ".join(f"{key}: {value:.6f}" for key, value in logs.items())
+         logging.info(f"Step: {global_step}, Training Logs: {log_message}")
+
+         progress_bar.set_postfix(**logs)
+         accelerator.log(logs, step=global_step)
+
+         if global_step >= args.max_train_steps:
+             break
+     # end of the dataloader loop; break out of the epoch loop on the same condition
+     if global_step >= args.max_train_steps:
+         break
+
+ model.eval()  # important! This disables randomized embedding dropout
+ # do any sampling/FID calculation/etc. with ema (or model) in eval mode ...
+
+ accelerator.wait_for_everyone()
+ if accelerator.is_main_process:
+     logger.info("Done!")
+ accelerator.end_training()
+
+ def parse_args(input_args=None):
+     parser = argparse.ArgumentParser(description="Training")
+
+     # logging:
+     parser.add_argument("--output-dir", type=str, default="exps")
+     parser.add_argument("--exp-name", type=str, required=True)
+     parser.add_argument("--logging-dir", type=str, default="logs")
+     parser.add_argument("--report-to", type=str, default="wandb")
+     parser.add_argument("--sampling-steps", type=int, default=2000)
+     parser.add_argument("--resume-step", type=int, default=0)
+
+     # model
+     parser.add_argument("--model", type=str)
+     parser.add_argument("--num-classes", type=int, default=1000)
+     parser.add_argument("--encoder-depth", type=int, default=8)
+     parser.add_argument("--fused-attn", action=argparse.BooleanOptionalAction, default=True)
+     parser.add_argument("--qk-norm", action=argparse.BooleanOptionalAction, default=False)
+     parser.add_argument("--ops-head", type=int, default=16)
+
+     # dataset
+     parser.add_argument("--data-dir", type=str, default="../data/imagenet256")
+     parser.add_argument(
+         "--semantic-features-dir",
+         type=str,
+         default=None,
+         help="Directory of pre-extracted features such as the DINOv2 class token "
+              "(must contain dataset.json). When None (the default), "
+              "data-dir/imagenet_256_features/dinov2-vit-b_tmp/gpu0 is used automatically if it exists.",
+     )
+     parser.add_argument("--resolution", type=int, choices=[256, 512], default=256)
+     parser.add_argument("--batch-size", type=int, default=256)
+
+     # precision
+     parser.add_argument("--allow-tf32", action="store_true")
+     parser.add_argument("--mixed-precision", type=str, default="fp16", choices=["no", "fp16", "bf16"])
+
+     # optimization
+     parser.add_argument("--epochs", type=int, default=1400)
+     parser.add_argument("--max-train-steps", type=int, default=1000000)
+     parser.add_argument("--checkpointing-steps", type=int, default=10000)
+     parser.add_argument("--gradient-accumulation-steps", type=int, default=1)
+     parser.add_argument("--learning-rate", type=float, default=1e-4)
+     parser.add_argument("--adam-beta1", type=float, default=0.9, help="The beta1 parameter for the Adam optimizer.")
+     parser.add_argument("--adam-beta2", type=float, default=0.999, help="The beta2 parameter for the Adam optimizer.")
+     parser.add_argument("--adam-weight-decay", type=float, default=0., help="Weight decay to use.")
+     parser.add_argument("--adam-epsilon", type=float, default=1e-08, help="Epsilon value for the Adam optimizer.")
+     parser.add_argument("--max-grad-norm", default=1.0, type=float, help="Max gradient norm.")
+
+     # seed
+     parser.add_argument("--seed", type=int, default=0)
+
+     # cpu
+     parser.add_argument("--num-workers", type=int, default=4)
+
+     # loss
+     parser.add_argument("--path-type", type=str, default="linear", choices=["linear", "cosine"])
+     parser.add_argument("--prediction", type=str, default="v", choices=["v"])  # currently we only support v-prediction
+     parser.add_argument("--cfg-prob", type=float, default=0.1)
+     parser.add_argument("--enc-type", type=str, default='dinov2-vit-b')
+     parser.add_argument("--proj-coeff", type=float, default=0.5)
+     parser.add_argument("--weighting", default="uniform", type=str, help="Loss weighting scheme.")
+     parser.add_argument("--legacy", action=argparse.BooleanOptionalAction, default=False)
+     parser.add_argument("--cls", type=float, default=0.03)
+     parser.add_argument(
+         "--t-c",
+         type=float,
+         default=0.5,
+         help="Semantic boundary time (same t convention as this script: t=1 noise -> t=0 data). "
+              "For t in (t_c, 1], cls is interpolated along OT-paired paths "
+              "(CFM/OT-CFM-style minibatch OT); for t in [0, t_c], cls is fixed to the "
+              "ground-truth encoder cls and the target cls velocity is 0.",
+     )
+     parser.add_argument(
+         "--ot-cls",
+         action=argparse.BooleanOptionalAction,
+         default=True,
+         help="For t > t_c, pair the cls noise with the in-batch cls_gt via minibatch "
+              "optimal transport (requires scipy); when disabled, fall back to pairing "
+              "with independent Gaussian noise.",
+     )
+     if input_args is not None:
+         args = parser.parse_args(input_args)
+     else:
+         args = parser.parse_args()
+
+     return args
+
+ if __name__ == "__main__":
+     args = parse_args()
+
+     main(args)
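The sample-directory tag used in the visualization step above is built from one dense string expression. Pulled out as a standalone helper (the name `tc_tag_of` is illustrative, not part of the diff), it reads:

```python
def tc_tag_of(t_mid: float) -> str:
    # Mirror train.py's tag construction: fixed 4-decimal formatting,
    # strip trailing zeros and a trailing dot, then replace '.' with '_'
    # to get a filesystem-friendly directory suffix.
    return f"{t_mid:.4f}".rstrip("0").rstrip(".").replace(".", "_")

print(tc_tag_of(0.75))  # -> 0_75
```

So `--t-c 0.75` yields a `samples/t0_0_75/` directory, while integer times collapse to a bare tag such as `1`.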
back/train.sh ADDED
@@ -0,0 +1,43 @@
+ #!/usr/bin/env bash
+ # REG/train.py: as in the main repository, the data root and the directory of
+ # pre-extracted cls features can be set independently.
+ # Data layout: VAE latents under ${DATA_DIR}/imagenet_256_vae/;
+ # img-feature-*.npy + dataset.json under ${SEMANTIC_FEATURES_DIR}/ (matching parallel_encode).
+
+ NUM_GPUS=4
+
+ # ------------ Adjust these paths for your machine ------------
+ DATA_DIR="/gemini/space/zhaozy/dataset/Imagenet/imagenet_256"
+ SEMANTIC_FEATURES_DIR="/gemini/space/zhaozy/dataset/Imagenet/imagenet_256/imagenet_256_features/dinov2-vit-b_tmp/gpu0"
+
+ # Background launch example (same style as the main experiment scripts):
+ # nohup bash train.sh > jsflow-experiment.log 2>&1 &
+
+ nohup accelerate launch --multi_gpu --num_processes "${NUM_GPUS}" --mixed_precision bf16 train.py \
+     --report-to wandb \
+     --allow-tf32 \
+     --mixed-precision bf16 \
+     --seed 0 \
+     --path-type linear \
+     --prediction v \
+     --weighting uniform \
+     --model SiT-XL/2 \
+     --enc-type dinov2-vit-b \
+     --encoder-depth 8 \
+     --proj-coeff 0.5 \
+     --output-dir exps \
+     --exp-name jsflow-experiment-0.75 \
+     --batch-size 256 \
+     --data-dir "${DATA_DIR}" \
+     --semantic-features-dir "${SEMANTIC_FEATURES_DIR}" \
+     --learning-rate 0.00005 \
+     --t-c 0.75 \
+     --cls 0.05 \
+     --ot-cls \
+     > jsflow-experiment.log 2>&1 &
+
+ # Notes:
+ # - To extract DINO features online instead of using pre-extracted ones, drop
+ #   --semantic-features-dir and make sure data-dir follows the original REG layout
+ #   (imagenet_256_vae + vae-sd).
+ # - To disable minibatch OT, append --no-ot-cls.
+ # - --weight-ratio / --semantic-reg-coeff / --repa-* from the main repository's train.py
+ #   are not implemented in this REG script; use --proj-coeff for the projection strength
+ #   and --cls for the cls flow-loss weight.
back/utils.py ADDED
@@ -0,0 +1,225 @@
+ import os
+ from torchvision.datasets.utils import download_url
+ import torch
+ import torchvision.models as torchvision_models
+ import timm
+ from models import mocov3_vit
+ import math
+ import warnings
+
+
+ # code from SiT repository
+ pretrained_models = {'last.pt'}
+
+ def download_model(model_name):
+     """
+     Downloads a pre-trained SiT model from the web.
+     """
+     assert model_name in pretrained_models
+     local_path = f'pretrained_models/{model_name}'
+     if not os.path.isfile(local_path):
+         os.makedirs('pretrained_models', exist_ok=True)
+         web_path = 'https://www.dl.dropboxusercontent.com/scl/fi/cxedbs4da5ugjq5wg3zrg/last.pt?rlkey=8otgrdkno0nd89po3dpwngwcc&st=apcc645o&dl=0'
+         download_url(web_path, 'pretrained_models', filename=model_name)
+     model = torch.load(local_path, map_location=lambda storage, loc: storage)
+     return model
+
+ def fix_mocov3_state_dict(state_dict):
+     for k in list(state_dict.keys()):
+         # retain only base_encoder up to before the embedding layer
+         if k.startswith('module.base_encoder'):
+             # fix naming bug in checkpoint
+             new_k = k[len("module.base_encoder."):]
+             if "blocks.13.norm13" in new_k:
+                 new_k = new_k.replace("norm13", "norm1")
+             if "blocks.13.mlp.fc13" in new_k:
+                 new_k = new_k.replace("fc13", "fc1")
+             if "blocks.14.norm14" in new_k:
+                 new_k = new_k.replace("norm14", "norm2")
+             if "blocks.14.mlp.fc14" in new_k:
+                 new_k = new_k.replace("fc14", "fc2")
+             # remove prefix
+             if 'head' not in new_k and new_k.split('.')[0] != 'fc':
+                 state_dict[new_k] = state_dict[k]
+         # delete renamed or unused k
+         del state_dict[k]
+     if 'pos_embed' in state_dict.keys():
+         state_dict['pos_embed'] = timm.layers.pos_embed.resample_abs_pos_embed(
+             state_dict['pos_embed'], [16, 16],
+         )
+     return state_dict
+
+ @torch.no_grad()
+ def load_encoders(enc_type, device, resolution=256):
+     assert (resolution == 256) or (resolution == 512)
+
+     enc_names = enc_type.split(',')
+     encoders, architectures, encoder_types = [], [], []
+     for enc_name in enc_names:
+         encoder_type, architecture, model_config = enc_name.split('-')
+         # Currently, we only support 512x512 experiments with DINOv2 encoders.
+         if resolution == 512:
+             if encoder_type != 'dinov2':
+                 raise NotImplementedError(
+                     "Currently, we only support 512x512 experiments with DINOv2 encoders."
+                 )
+
+         architectures.append(architecture)
+         encoder_types.append(encoder_type)
+         if encoder_type == 'mocov3':
+             if architecture == 'vit':
+                 if model_config == 's':
+                     encoder = mocov3_vit.vit_small()
+                 elif model_config == 'b':
+                     encoder = mocov3_vit.vit_base()
+                 elif model_config == 'l':
+                     encoder = mocov3_vit.vit_large()
+                 ckpt = torch.load(f'./ckpts/mocov3_vit{model_config}.pth')
+                 state_dict = fix_mocov3_state_dict(ckpt['state_dict'])
+                 del encoder.head
+                 encoder.load_state_dict(state_dict, strict=True)
+                 encoder.head = torch.nn.Identity()
+             elif architecture == 'resnet':
+                 raise NotImplementedError()
+
+             encoder = encoder.to(device)
+             encoder.eval()
+
+         elif 'dinov2' in encoder_type:
+             if 'reg' in encoder_type:
+                 try:
+                     encoder = torch.hub.load('your_path/.cache/torch/hub/facebookresearch_dinov2_main',
+                                              f'dinov2_vit{model_config}14_reg', source='local')
+                 except Exception:
+                     encoder = torch.hub.load('facebookresearch/dinov2', f'dinov2_vit{model_config}14_reg')
+             else:
+                 try:
+                     encoder = torch.hub.load('your_path/.cache/torch/hub/facebookresearch_dinov2_main',
+                                              f'dinov2_vit{model_config}14', source='local')
+                 except Exception:
+                     encoder = torch.hub.load('facebookresearch/dinov2', f'dinov2_vit{model_config}14')
+
+             print(f"Now using {enc_name} as the alignment model")
+             del encoder.head
+             patch_resolution = 16 * (resolution // 256)
+             encoder.pos_embed.data = timm.layers.pos_embed.resample_abs_pos_embed(
+                 encoder.pos_embed.data, [patch_resolution, patch_resolution],
+             )
+             encoder.head = torch.nn.Identity()
+             encoder = encoder.to(device)
+             encoder.eval()
+
+         elif 'dinov1' == encoder_type:
+             from models import dinov1
+             encoder = dinov1.vit_base()
+             ckpt = torch.load(f'./ckpts/dinov1_vit{model_config}.pth')
+             if 'pos_embed' in ckpt.keys():
+                 ckpt['pos_embed'] = timm.layers.pos_embed.resample_abs_pos_embed(
+                     ckpt['pos_embed'], [16, 16],
+                 )
+             del encoder.head
+             encoder.head = torch.nn.Identity()
+             encoder.load_state_dict(ckpt, strict=True)
+             encoder = encoder.to(device)
+             encoder.forward_features = encoder.forward
+             encoder.eval()
+
+         elif encoder_type == 'clip':
+             import clip
+             from models.clip_vit import UpdatedVisionTransformer
+             encoder_ = clip.load(f"ViT-{model_config}/14", device='cpu')[0].visual
+             encoder = UpdatedVisionTransformer(encoder_).to(device)
+             encoder.embed_dim = encoder.model.transformer.width
+             encoder.forward_features = encoder.forward
+             encoder.eval()
+
+         elif encoder_type == 'mae':
+             from models.mae_vit import vit_large_patch16
+             kwargs = dict(img_size=256)
+             encoder = vit_large_patch16(**kwargs).to(device)
+             with open(f"ckpts/mae_vit{model_config}.pth", "rb") as f:
+                 state_dict = torch.load(f)
+             if 'pos_embed' in state_dict["model"].keys():
+                 state_dict["model"]['pos_embed'] = timm.layers.pos_embed.resample_abs_pos_embed(
+                     state_dict["model"]['pos_embed'], [16, 16],
+                 )
+             encoder.load_state_dict(state_dict["model"])
+
+             encoder.pos_embed.data = timm.layers.pos_embed.resample_abs_pos_embed(
+                 encoder.pos_embed.data, [16, 16],
+             )
+
+         elif encoder_type == 'jepa':
+             from models.jepa import vit_huge
+             kwargs = dict(img_size=[224, 224], patch_size=14)
+             encoder = vit_huge(**kwargs).to(device)
+             with open(f"ckpts/ijepa_vit{model_config}.pth", "rb") as f:
+                 state_dict = torch.load(f, map_location=device)
+             new_state_dict = dict()
+             for key, value in state_dict['encoder'].items():
+                 new_state_dict[key[7:]] = value
+             encoder.load_state_dict(new_state_dict)
+             encoder.forward_features = encoder.forward
+
+         encoders.append(encoder)
+
+     return encoders, encoder_types, architectures
+
+
+ def _no_grad_trunc_normal_(tensor, mean, std, a, b):
+     # Cut & paste from PyTorch official master until it's in a few official releases - RW
+     # Method based on https://people.sc.fsu.edu/~jburkardt/presentations/truncated_normal.pdf
+     def norm_cdf(x):
+         # Computes standard normal cumulative distribution function
+         return (1. + math.erf(x / math.sqrt(2.))) / 2.
+
+     if (mean < a - 2 * std) or (mean > b + 2 * std):
+         warnings.warn("mean is more than 2 std from [a, b] in nn.init.trunc_normal_. "
+                       "The distribution of values may be incorrect.",
+                       stacklevel=2)
+
+     with torch.no_grad():
+         # Values are generated by using a truncated uniform distribution and
+         # then using the inverse CDF for the normal distribution.
+         # Get upper and lower cdf values
+         l = norm_cdf((a - mean) / std)
+         u = norm_cdf((b - mean) / std)
+
+         # Uniformly fill tensor with values from [l, u], then translate to
+         # [2l-1, 2u-1].
+         tensor.uniform_(2 * l - 1, 2 * u - 1)
+
+         # Use inverse cdf transform for normal distribution to get truncated
+         # standard normal
+         tensor.erfinv_()
+
+         # Transform to proper mean, std
+         tensor.mul_(std * math.sqrt(2.))
+         tensor.add_(mean)
+
+         # Clamp to ensure it's in the proper range
+         tensor.clamp_(min=a, max=b)
+         return tensor
+
+
+ def trunc_normal_(tensor, mean=0., std=1., a=-2., b=2.):
+     return _no_grad_trunc_normal_(tensor, mean, std, a, b)
+
+ def load_legacy_checkpoints(state_dict, encoder_depth):
+     new_state_dict = dict()
+     for key, value in state_dict.items():
+         if 'decoder_blocks' in key:
+             parts = key.split('.')
+             new_idx = int(parts[1]) + encoder_depth
+             parts[0] = 'blocks'
+             parts[1] = str(new_idx)
+             new_key = '.'.join(parts)
+             new_state_dict[new_key] = value
+         else:
+             new_state_dict[key] = value
+     return new_state_dict
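The minibatch-OT pairing that train.py's `--ot-cls` flag toggles lives in the loss function, which is not part of this diff. As a rough sketch of the idea only, assuming a squared-Euclidean cost and scipy's Hungarian solver (the helper name `ot_pair_cls` is hypothetical, not the repository's API):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_pair_cls(cls_noise: np.ndarray, cls_gt: np.ndarray) -> np.ndarray:
    """Reorder Gaussian cls noise so that row i is OT-matched to cls_gt row i."""
    # Squared-Euclidean cost between every (noise, ground-truth) pair.
    diff = cls_noise[:, None, :] - cls_gt[None, :, :]
    cost = (diff ** 2).sum(-1)
    # Hungarian algorithm: minimum-cost one-to-one assignment.
    row_ind, col_ind = linear_sum_assignment(cost)
    # perm[j] = index of the noise row assigned to ground-truth row j.
    perm = np.empty(len(col_ind), dtype=np.int64)
    perm[col_ind] = row_ind
    return cls_noise[perm]
```

After this reordering, interpolating the paired noise toward `cls_gt` for t in (t_c, 1] gives the OT-paired cls paths described in the `--t-c` help text; disabling `--ot-cls` corresponds to skipping the permutation.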