data-archetype/dinac_ae_d2

DINAC-AE-D2 is a close variant of DINAC-AE. It keeps the same patch-16 spatial latent interface, VP diffusion decoder, class-token prediction API, and one-step default reconstruction path, but changes the teacher alignment and encoder capacity:

  • DINO alignment target: DINOv2 ViT-B/14 feature space.
  • Encoder: 8 ViT/DiT-style transformer blocks instead of DINAC-AE's 6.
  • Decoder: unchanged 8-block FCDM decoder.

DINOv2-B is empirically less spatially smooth than DINOv3-B and preserves more high-frequency information. In downstream diffusion experiments, this variant has shown faster early convergence than the original DINAC-AE latent space.

2k PSNR Benchmark

| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | P5 (dB) | P95 (dB) |
|---|---|---|---|---|---|
| dinac_ae_d2 | 35.59 | 4.87 | 35.40 | 27.89 | 43.51 |
| dinac_ae | 35.19 | 4.53 | 35.06 | 28.02 | 42.43 |
| FLUX.2 VAE | 36.28 | 4.53 | 36.07 | 28.89 | 43.63 |

Evaluated on the same 2000 validation images as DINAC-AE. FLUX.2 numbers are reused from the existing DINAC-AE 2k benchmark and were not recomputed for this export.
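
The columns in the table above can be reproduced from per-image PSNR values along these lines. This is a minimal sketch, not the benchmark script: the function names are hypothetical, and the data range of 2.0 (images in [-1, 1]), the population standard deviation, and the linearly interpolated percentiles are all assumptions, since the source does not state which conventions the benchmark used.

```python
import math
import statistics


def psnr(reference, reconstruction, data_range=2.0):
    """PSNR in dB between two equal-length flat sequences of pixel values.

    data_range=2.0 assumes pixels live in [-1, 1]; this is an assumption.
    """
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same number of values")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)


def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def summarize(psnr_values):
    """Mean, std, median, P5, and P95 of a list of per-image PSNR values."""
    ordered = sorted(psnr_values)
    return {
        "mean": statistics.fmean(ordered),
        "std": statistics.pstdev(ordered),  # population std; an assumption
        "median": statistics.median(ordered),
        "p5": percentile(ordered, 5),
        "p95": percentile(ordered, 95),
    }
```

Calling `summarize` on the 2000 per-image values would give one row of the table, provided the conventions above match the ones actually used.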

DINAC-AE-D2 keeps DINAC-AE's reconstruction-focused autoencoder interface while using KL-like variance expansion and DINOv2 alignment to produce a learnable latent space that has shown faster downstream diffusion convergence.

The results viewer shows the 39-image reconstruction set with DINAC-AE-D2 and FLUX.2 VAE reconstructions, RGB differences, and latent PCA. On this 39-image set, DINAC-AE-D2 reaches a mean PSNR of 35.46 dB (min 25.61 dB, max 46.69 dB).

The DINAC-AE technical report describes the training recipe used for this model. DINAC-AE-D2 follows the same autoencoder training setup, with the teacher alignment changed to DINOv2 ViT-B/14 and the encoder depth increased from 6 to 8 blocks.

Encode Throughput

Measured on an NVIDIA GeForce RTX 5090 in bfloat16, averaging repeated batches per resolution.

| Resolution | Batch Size | Model | Encode (ms/batch) | ms/image | Images/s | Peak VRAM (MiB) | Speedup vs FLUX.2 | Peak VRAM Reduction vs FLUX.2 |
|---|---|---|---|---|---|---|---|---|
| 256x256 | 128 | dinac_ae_d2 | 69.56 | 0.543 | 1840.0 | 1606.5 | 4.92x | 87.2% |
| 256x256 | 128 | dinac_ae | 50.25 | 0.393 | 2547.4 | 1569.7 | 6.80x | 87.5% |
| 256x256 | 128 | FLUX.2 VAE | 341.94 | 2.671 | 374.3 | 12533.8 | 1.00x | 0.0% |
| 512x512 | 32 | dinac_ae_d2 | 75.09 | 2.347 | 426.2 | 1606.7 | 4.74x | 87.2% |
| 512x512 | 32 | dinac_ae | 53.09 | 1.659 | 602.7 | 1570.0 | 6.70x | 87.5% |
| 512x512 | 32 | FLUX.2 VAE | 355.64 | 11.114 | 90.0 | 12533.8 | 1.00x | 0.0% |

The DINOv2-aligned encoder is slower than DINAC-AE's DINOv3-aligned encoder because it uses 8 transformer blocks instead of 6. It still encodes roughly 4.7x to 4.9x faster than the FLUX.2 VAE encoder and uses about 87% less peak VRAM.
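
The derived columns of the throughput table follow from the measured batch time and peak VRAM by simple arithmetic. The sketch below uses hypothetical helper names and recomputes the values from the rounded table entries, so the results agree with the published columns only up to rounding.

```python
def per_image_ms(batch_ms, batch_size):
    """Average encode time per image in milliseconds."""
    return batch_ms / batch_size


def images_per_second(batch_ms, batch_size):
    """Encode throughput in images per second."""
    return batch_size / (batch_ms / 1000.0)


def speedup(baseline_batch_ms, model_batch_ms):
    """How many times faster the model is than the baseline at equal batch size."""
    return baseline_batch_ms / model_batch_ms


def vram_reduction_pct(baseline_mib, model_mib):
    """Peak VRAM saved relative to the baseline, in percent."""
    return 100.0 * (1.0 - model_mib / baseline_mib)


# dinac_ae_d2 vs FLUX.2 VAE at 256x256, batch size 128 (values from the table).
d2_ms_per_image = per_image_ms(69.56, 128)
d2_images_per_s = images_per_second(69.56, 128)
d2_speedup = speedup(341.94, 69.56)
d2_vram_saving = vram_reduction_pct(12533.8, 1606.5)
```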

Latent Interface

  • encode() returns DINAC-AE-D2's own whitened latent space.
  • decode() expects that same whitened latent space and dewhitens internally.
  • predict_class() expects the same whitened latent space, dewhitens internally, and predicts a DINOv2-B class-token feature.
  • whiten() and dewhiten() are exposed for explicit control.
  • encode_posterior() returns the raw exported posterior before whitening.
  • DinacAEInferenceConfig.num_steps counts decoder evaluations directly: num_steps=1 means one NFE.
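
To illustrate the relationship between the raw and whitened spaces, here is a toy sketch of a whiten/dewhiten pair. It assumes a per-channel affine transform with fixed mean and standard deviation, which the source does not confirm; the model's actual whiten() and dewhiten() may use a different transform and different statistics. The functions and values below are hypothetical and standalone; they are not the library's API.

```python
def whiten(latent, mean, std):
    """Map raw latent channels to zero-mean, unit-scale values (assumed form)."""
    return [(z - m) / s for z, m, s in zip(latent, mean, std)]


def dewhiten(whitened, mean, std):
    """Invert whiten(): recover raw latent channels from whitened ones."""
    return [w * s + m for w, m, s in zip(whitened, mean, std)]


# Made-up statistics for a 3-channel toy latent.
toy_mean = [0.5, -1.0, 0.0]
toy_std = [2.0, 0.5, 4.0]
raw = [1.0, -2.0, 0.5]

whitened = whiten(raw, toy_mean, toy_std)
restored = dewhiten(whitened, toy_mean, toy_std)
```

Under this assumption, decode() and predict_class() would apply the equivalent of `dewhiten` internally, so callers only handle whitened latents unless they want explicit control.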

The export ships weights in float32. The recommended runtime path is bfloat16 for the main encoder, decoder, and class-token path, with float32 retained for whitening/dewhitening, normalization math, RoPE frequency construction, and VP diffusion schedule helpers.
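
One reason to keep numerically sensitive helpers in float32 is that bfloat16 carries only 8 significant bits. The standalone sketch below emulates bfloat16 rounding with the standard library (round to nearest even, finite inputs only) to show how quickly precision runs out. It is a general illustration of the format, not code from this model, and `to_bfloat16` is a hypothetical helper name.

```python
import struct


def to_bfloat16(x):
    """Round a finite Python float to the nearest bfloat16 value.

    bfloat16 is the top 16 bits of an IEEE float32, so we round the
    float32 bit pattern to nearest-even on the 16 bits being dropped.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Integers up to 256 survive, but neighbouring values above that collide.
exact = to_bfloat16(256.0)      # 256.0
collided = to_bfloat16(257.0)   # 256.0: 257 is not representable
rounded_up = to_bfloat16(259.0) # 260.0
```

Quantities such as position-dependent rotation angles or schedule coefficients can lose distinct values to this kind of rounding, which is consistent with the recommendation to build them in float32.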

Usage

import torch

from dinac_ae import DinacAE, DinacAEInferenceConfig


device = "cuda"
model = DinacAE.from_pretrained(
    "data-archetype/dinac_ae_d2",
    device=device,
    dtype=torch.bfloat16,
)

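# Placeholder input; replace with a real image tensor.
# Shape [1, 3, H, W], values in [-1, 1], H and W divisible by 16.
image = torch.rand(1, 3, 256, 256) * 2 - 1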

with torch.inference_mode():
    latents = model.encode(image.to(device=device, dtype=torch.bfloat16))
    class_token = model.predict_class(latents)
    recon = model.decode(
        latents,
        height=int(image.shape[-2]),
        width=int(image.shape[-1]),
        inference_config=DinacAEInferenceConfig(num_steps=1),
    )
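
The usage example requires H and W to be divisible by 16. For images of arbitrary size, one option is to pad (or crop) to the next valid size before encoding. The helper below is a hypothetical sketch that only computes the target size and padding amounts; how the padding is filled (reflection, replication, constant) is left open, since the source does not recommend a strategy.

```python
def pad_to_multiple(height, width, multiple=16):
    """Return (padded_height, padded_width, pad_bottom, pad_right).

    The padded sizes are the smallest values >= the inputs that are
    divisible by `multiple`.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    pad_bottom = (-height) % multiple
    pad_right = (-width) % multiple
    return height + pad_bottom, width + pad_right, pad_bottom, pad_right


# A 500x333 image needs 12 rows and 3 columns of padding.
padded = pad_to_multiple(500, 333)
```

After decoding, the reconstruction can be cropped back to the original height and width.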

Details

  • DINAC-AE-D2 uses an 8-block ViT/DiT-style transformer encoder and an 8-block FCDM decoder.
  • Patch size is 16, model width is 896, and latent width is 128.
  • Total parameter count is 154.22M: 78.02M encoder, 61.93M decoder, and 14.26M DINO token/class alignment head.
  • The DINO alignment head predicts spatial patch tokens and a class-token output in DINOv2 ViT-B/14 feature space.
  • predict_class(latents) exposes the DINOv2 ViT-B/14 class-token feature directly from latents.
  • DINOv2-B is empirically less spatially smooth than DINOv3-B and preserves more high-frequency information.
  • Results viewer: https://huggingface.co/spaces/data-archetype/dinac_ae_d2-results
  • Related: DINAC-AE, SemDisDiffAE, full_capacitor, capacitor_decoder
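
The patch size and latent width listed above imply the latent size for a given image. The sketch below assumes one 128-channel latent vector per 16x16 patch and a channels-first layout; both are interpretations of "patch size 16" and "latent width 128", not confirmed details. The function names are hypothetical.

```python
PATCH_SIZE = 16        # from the Details list
LATENT_CHANNELS = 128  # "latent width" from the Details list


def latent_shape(height, width):
    """Assumed (channels, grid_height, grid_width) of the latent for one image."""
    if height % PATCH_SIZE or width % PATCH_SIZE:
        raise ValueError("height and width must be divisible by the patch size")
    return (LATENT_CHANNELS, height // PATCH_SIZE, width // PATCH_SIZE)


def compression_ratio(height, width):
    """Number of RGB input values per latent value."""
    channels, grid_h, grid_w = latent_shape(height, width)
    return (3 * height * width) / (channels * grid_h * grid_w)


shape_256 = latent_shape(256, 256)
ratio_512 = compression_ratio(512, 512)
```

Under these assumptions each patch maps 16 * 16 * 3 = 768 input values to 128 latent values, a 6x reduction in element count that does not depend on resolution.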

Citation

@misc{dinac_ae_d2,
  title   = {DINAC-AE-D2: a DINOv2-aligned class-token diffusion autoencoder},
  author  = {data-archetype},
  email   = {data-archetype@proton.me},
  year    = {2026},
  month   = jun,
  url     = {https://huggingface.co/data-archetype/dinac_ae_d2},
}