data-archetype/dinac_ae_d2

DINAC-AE-D2 is a close variant of DINAC-AE. It keeps the same patch-16 spatial latent interface, VP diffusion decoder, class-token prediction API, and one-step default reconstruction path, but changes the teacher alignment and encoder capacity:

DINO alignment target: DINOv2 ViT-B/14 feature space.
Encoder: 8 ViT/DiT-style transformer blocks instead of DINAC-AE's 6.
Decoder: unchanged 8-block FCDM decoder.

DINOv2-B is empirically less spatially smooth than DINOv3-B and preserves more high-frequency information. In downstream diffusion experiments, this variant has shown faster early convergence than the original DINAC-AE latent space.

2k PSNR Benchmark

Model	Mean PSNR (dB)	Std (dB)	Median (dB)	P5 (dB)	P95 (dB)
dinac_ae_d2	`35.59`	`4.87`	`35.40`	`27.89`	`43.51`
dinac_ae	`35.19`	`4.53`	`35.06`	`28.02`	`42.43`
FLUX.2 VAE	`36.28`	`4.53`	`36.07`	`28.89`	`43.63`

Evaluated on the same 2000 validation images as DINAC-AE. FLUX.2 numbers are reused from the existing DINAC-AE 2k benchmark and were not recomputed for this export.
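
For readers who want to reproduce the table columns on their own reconstructions, the statistics can be computed along the following lines. This is a minimal sketch, not the evaluation script behind the numbers above. It assumes PSNR is measured on pixel values in [-1, 1] (peak-to-peak range 2.0), which the card does not state, and it uses nearest-rank percentiles. `psnr_db`, `percentile`, and `summarize` are hypothetical helper names.

```python
import math
import statistics


def psnr_db(reference, reconstruction, data_range=2.0):
    """PSNR in dB between two equal-length flat sequences of pixel values.

    data_range=2.0 assumes pixels in [-1, 1]. This is an assumption: the
    model card does not state the range used for its benchmark.
    """
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


def percentile(sorted_values, q):
    """Nearest-rank percentile (q in [0, 100]) of an ascending-sorted list."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(psnr_values):
    """Mean, sample std, median, P5, and P95 of per-image PSNR values."""
    ordered = sorted(psnr_values)
    return {
        "mean": statistics.fmean(ordered),
        "std": statistics.stdev(ordered),
        "median": statistics.median(ordered),
        "p5": percentile(ordered, 5),
        "p95": percentile(ordered, 95),
    }
```

Other percentile conventions (for example linear interpolation) give slightly different P5 and P95 values, so small differences from the table are expected if the convention differs.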

DINAC-AE-D2 keeps DINAC-AE's reconstruction-focused autoencoder interface while using KL-like variance expansion and DINOv2 alignment to produce a learnable latent space that has shown faster downstream diffusion convergence.

The results viewer shows the 39-image reconstruction set with DINAC-AE-D2 and FLUX.2 VAE reconstructions, RGB differences, and latent PCA. The 39-image set gives a mean PSNR of 35.46 dB (min 25.61 dB, max 46.69 dB).

The DINAC-AE technical report describes the training recipe used for this model. DINAC-AE-D2 follows the same autoencoder training setup, with the teacher alignment changed to DINOv2 ViT-B/14 and the encoder depth increased from 6 to 8 blocks.

Encode Throughput

Measured on an NVIDIA GeForce RTX 5090 in bfloat16, averaging repeated batches per resolution.

Resolution	Batch Size	Model	Encode (ms/batch)	ms/image	Images/s	Peak VRAM (MiB)	Speedup vs FLUX.2	Peak VRAM Reduction vs FLUX.2
`256x256`	`128`	dinac_ae_d2	`69.56`	`0.543`	`1840.0`	`1606.5`	`4.92x`	`87.2%`
`256x256`	`128`	dinac_ae	`50.25`	`0.393`	`2547.4`	`1569.7`	`6.80x`	`87.5%`
`256x256`	`128`	FLUX.2 VAE	`341.94`	`2.671`	`374.3`	`12533.8`	`1.00x`	`0.0%`
`512x512`	`32`	dinac_ae_d2	`75.09`	`2.347`	`426.2`	`1606.7`	`4.74x`	`87.2%`
`512x512`	`32`	dinac_ae	`53.09`	`1.659`	`602.7`	`1570.0`	`6.70x`	`87.5%`
`512x512`	`32`	FLUX.2 VAE	`355.64`	`11.114`	`90.0`	`12533.8`	`1.00x`	`0.0%`

The DINOv2-aligned encoder is about 1.4x slower than DINAC-AE's DINOv3-aligned encoder (69.56 vs 50.25 ms/batch at 256x256) because it uses 8 transformer blocks instead of 6. It remains 4.7x to 4.9x faster than the FLUX.2 VAE encoder and uses about 87% less peak VRAM.
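
The derived columns in the table follow from the raw batch time and peak VRAM. The sketch below shows the arithmetic for one row, using the FLUX.2 VAE row at the same resolution as the baseline. `derived_columns` is a hypothetical helper name, not part of any benchmark script shipped with the model.

```python
def derived_columns(encode_ms_per_batch, batch_size, peak_vram_mib,
                    baseline_ms_per_batch, baseline_peak_vram_mib):
    """Derive per-image and relative-to-baseline columns from raw measurements."""
    return {
        # Time per image is the batch time split evenly across the batch.
        "ms_per_image": encode_ms_per_batch / batch_size,
        # Throughput: images in one batch divided by the batch time in seconds.
        "images_per_s": batch_size / (encode_ms_per_batch / 1000.0),
        # Speedup: how many times faster than the baseline batch time.
        "speedup": baseline_ms_per_batch / encode_ms_per_batch,
        # Peak VRAM saved relative to the baseline, in percent.
        "vram_reduction_pct": 100.0 * (1.0 - peak_vram_mib / baseline_peak_vram_mib),
    }


# dinac_ae_d2 at 256x256, batch size 128, against the FLUX.2 VAE row.
row = derived_columns(
    encode_ms_per_batch=69.56,
    batch_size=128,
    peak_vram_mib=1606.5,
    baseline_ms_per_batch=341.94,
    baseline_peak_vram_mib=12533.8,
)
print({name: round(value, 3) for name, value in row.items()})
```

Values recomputed from the rounded table entries can differ in the last digit from the published columns (for example images/s), presumably because the table was produced from unrounded timings.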

Latent Interface

encode() returns DINAC-AE-D2's own whitened latent space.
decode() expects that same whitened latent space and dewhitens internally.
predict_class() expects the same whitened latent space, dewhitens internally, and predicts a DINOv2-B class-token feature.
whiten() and dewhiten() are exposed for explicit control.
encode_posterior() returns the raw exported posterior before whitening.
DinacAEInferenceConfig.num_steps counts decoder evaluations directly: num_steps=1 means one NFE.
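
The relationship between the raw and whitened spaces can be pictured as an invertible transform applied after the encoder and undone before the decoder. The card does not specify the form of that transform, so the sketch below assumes a simple per-channel affine whitening with stored statistics. `AffineWhitener`, `mean`, and `std` are hypothetical names for illustration and are not the `dinac_ae` implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class AffineWhitener:
    """Per-channel affine whitening: z_white = (z - mean) / std.

    Hypothetical stand-in for the model's whiten()/dewhiten() pair. The real
    transform and its statistics ship with the exported weights.
    """

    mean: Sequence[float]
    std: Sequence[float]

    def whiten(self, latent: Sequence[float]) -> List[float]:
        return [(z - m) / s for z, m, s in zip(latent, self.mean, self.std)]

    def dewhiten(self, latent: Sequence[float]) -> List[float]:
        return [z * s + m for z, m, s in zip(latent, self.mean, self.std)]


whitener = AffineWhitener(mean=[0.5, -1.0], std=[2.0, 0.25])

raw = [1.5, -0.5]                    # one raw latent vector, one value per channel
white = whitener.whiten(raw)         # the space encode() is described as returning
restored = whitener.dewhiten(white)  # the step decode() is described as doing internally
```

Under this assumption, anything trained on the output of encode() sees the whitened values, and only code that calls encode_posterior() or dewhiten() explicitly ever handles the raw space.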

The export ships weights in float32. The recommended runtime path is bfloat16 for the main encoder, decoder, and class-token path, with float32 retained for whitening/dewhitening, normalization math, RoPE frequency construction, and VP diffusion schedule helpers.

Usage

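```python
import torch

from dinac_ae import DinacAE, DinacAEInferenceConfig

device = "cuda"

# Load the export and request the recommended bfloat16 runtime dtype.
model = DinacAE.from_pretrained(
    "data-archetype/dinac_ae_d2",
    device=device,
    dtype=torch.bfloat16,
)

image = ...  # [1, 3, H, W] in [-1, 1], H and W divisible by 16

with torch.inference_mode():
    # Whitened latents; decode() and predict_class() expect this same space.
    latents = model.encode(image.to(device=device, dtype=torch.bfloat16))
    # DINOv2-B class-token feature predicted from the latents.
    class_token = model.predict_class(latents)
    # num_steps=1 means a single decoder evaluation (one NFE).
    recon = model.decode(
        latents,
        height=int(image.shape[-2]),
        width=int(image.shape[-1]),
        inference_config=DinacAEInferenceConfig(num_steps=1),
    )
```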
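
Because the encoder works on 16x16 patches, input sizes need to be checked or cropped before calling encode(). The helpers below are a small sketch of that bookkeeping. They assume one latent vector of width 128 per patch (patch size and latent width are taken from the Details section, and the per-patch layout is an assumption). `latent_grid`, `crop_to_multiple`, and `compression_ratio` are hypothetical names.

```python
PATCH_SIZE = 16     # from the model card
LATENT_WIDTH = 128  # from the model card; assumed to be values per patch


def latent_grid(height: int, width: int, patch_size: int = PATCH_SIZE):
    """Return (rows, cols) of the patch grid, or raise if the size is invalid."""
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    if height % patch_size or width % patch_size:
        raise ValueError(f"{height}x{width} is not divisible by {patch_size}")
    return height // patch_size, width // patch_size


def crop_to_multiple(height: int, width: int, patch_size: int = PATCH_SIZE):
    """Largest (height, width) not exceeding the input that is divisible by patch_size."""
    return height - height % patch_size, width - width % patch_size


def compression_ratio(patch_size: int = PATCH_SIZE,
                      latent_width: int = LATENT_WIDTH,
                      channels: int = 3) -> float:
    """Input values per patch divided by latent values per patch."""
    return channels * patch_size * patch_size / latent_width


print(latent_grid(512, 512))       # patch grid for a 512x512 input
print(crop_to_multiple(500, 333))  # nearest valid size at or below 500x333
print(compression_ratio())         # RGB values per latent value
```

Cropping is only one option. Resizing or padding to a multiple of 16 works equally well with the same check.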

Details

DINAC-AE-D2 uses an 8-block ViT/DiT-style transformer encoder and an 8-block FCDM decoder.
Patch size is 16, model width is 896, and latent width is 128.
Total parameter count is 154.22M: 78.02M encoder, 61.93M decoder, and 14.26M DINO token/class alignment head.
The DINO alignment head predicts spatial patch tokens and a class-token output in DINOv2 ViT-B/14 feature space.
predict_class(latents) exposes the DINOv2 ViT-B/14 class-token feature directly from latents.
Results viewer: https://huggingface.co/spaces/data-archetype/dinac_ae_d2-results
Related: DINAC-AE, SemDisDiffAE, full_capacitor, capacitor_decoder

Citation

@misc{dinac_ae_d2,
  title   = {DINAC-AE-D2: a DINOv2-aligned class-token diffusion autoencoder},
  author  = {data-archetype},
  email   = {data-archetype@proton.me},
  year    = {2026},
  month   = jun,
  url     = {https://huggingface.co/data-archetype/dinac_ae_d2},
}
