RAE-DiT-S ep14 Diffusers conversion

This is a Diffusers-format conversion of the public RAE Stage-2 ImageNet-256 checkpoint DiTDH-S_ep14, bundled with the public Stage-1 RAE nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08.

It is intended as a lightweight test artifact for the Diffusers RAE-DiT PR: https://github.com/huggingface/diffusers/pull/13231

Source assets

  • Stage-1 RAE: nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08
  • Stage-2 upstream weights: nyu-visionx/RAE-collections, file DiTs/Dinov2/wReg_base/ImageNet256/DiTDH-S_ep14/stage2_model.pt
  • Upstream code/configs: https://github.com/bytetriper/RAE, config configs/stage2/training/ImageNet256/DiTDH-S_DINOv2-B.yaml
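
To compare against the original weights, the upstream Stage-2 checkpoint listed above can be fetched from the Hub. The following is a minimal sketch, assuming nyu-visionx/RAE-collections is a regular model repository and that huggingface_hub is installed; the helper name fetch_stage2_checkpoint is hypothetical and not part of any library.

```python
# Repo id and file path are copied from the "Source assets" list.
UPSTREAM_REPO = "nyu-visionx/RAE-collections"
UPSTREAM_FILE = "DiTs/Dinov2/wReg_base/ImageNet256/DiTDH-S_ep14/stage2_model.pt"


def fetch_stage2_checkpoint(repo_id=UPSTREAM_REPO, filename=UPSTREAM_FILE):
    """Download the upstream checkpoint and return its local cache path.

    Hypothetical helper for illustration only. The import is deferred so
    that defining the function does not require huggingface_hub.
    """
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)


# Usage (needs network access):
# path = fetch_stage2_checkpoint()
```

The layout of the downloaded .pt file is not described here, so check the upstream repository before loading it.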

Usage

Until PR #13231 is merged, install Diffusers from the PR branch first:

pip install git+https://github.com/plugyawn/diffusers.git@rae-dit-training
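
Since RAEDiTPipeline only exists on the PR branch, it can help to confirm that the installed build exposes it before going further. This is a rough sketch using only the standard library; the helper name pipeline_available is hypothetical.

```python
import importlib


def pipeline_available(module_name="diffusers", class_name="RAEDiTPipeline"):
    """Return True if `module_name` can be imported and exposes `class_name`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The package itself is not installed in this environment.
        return False
    return hasattr(module, class_name)


# Usage:
# if not pipeline_available():
#     raise SystemExit("Install Diffusers from the PR branch first.")
```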

Then run:

import torch
from diffusers import RAEDiTPipeline

repo_id = "plugyawn/rae-dit-s-ep14-diffusers"
pipe = RAEDiTPipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16).to("cuda")

generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe(
    class_labels=207,
    num_inference_steps=25,
    guidance_scale=1.0,
    generator=generator,
).images[0]
image.save("rae_dit_class207.png")

class_labels takes ImageNet-1k class ids, which are integers from 0 to 999; the example above uses 207 (golden retriever).

Validation

The conversion was validated against the upstream implementation on an A100. With the initial latent noise, class label, and schedule matched between the two, the maximum absolute error was approximately 1.10e-5 on transformer outputs and approximately 6.46e-5 on a fixed-seed, 25-step decoded sample.
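
For reference, a maximum absolute error of this kind can be computed as sketched below. This is not the validation script that was actually used; it assumes both outputs are available as PyTorch tensors of the same shape, and the function name max_abs_error is chosen here only to mirror the metric name above.

```python
import torch


def max_abs_error(reference, candidate):
    """Return the largest elementwise absolute difference between two tensors."""
    if reference.shape != candidate.shape:
        raise ValueError(
            f"shape mismatch: {tuple(reference.shape)} vs {tuple(candidate.shape)}"
        )
    # Compare in float64 so that bfloat16/float16 inputs are not rounded
    # again while taking the difference.
    diff = reference.detach().double() - candidate.detach().double()
    return diff.abs().max().item()


# Usage with two stand-in tensors:
# upstream_out = torch.randn(1, 4, 8)
# converted_out = upstream_out + 1e-5 * torch.randn_like(upstream_out)
# print(max_abs_error(upstream_out, converted_out))
```

Both models need the same initial noise, class label, and schedule for the number to be meaningful, as noted above.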
