data-archetype/dinac_ae_d2
DINAC-AE-D2 is a close variant of DINAC-AE. It keeps the same patch-16 spatial latent interface, VP diffusion decoder, class-token prediction API, and one-step default reconstruction path, but changes the teacher alignment and encoder capacity:
- DINO alignment target: DINOv2 ViT-B/14 feature space.
- Encoder: 8 ViT/DiT-style transformer blocks instead of DINAC-AE's 6.
- Decoder: unchanged 8-block FCDM decoder.
DINOv2-B is empirically less spatially smooth than DINOv3-B and preserves more high-frequency information. In downstream diffusion experiments, this variant has shown faster early convergence than the original DINAC-AE latent space.
2k PSNR Benchmark
| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | P5 (dB) | P95 (dB) |
|---|---|---|---|---|---|
| dinac_ae_d2 | 35.59 | 4.87 | 35.40 | 27.89 | 43.51 |
| dinac_ae | 35.19 | 4.53 | 35.06 | 28.02 | 42.43 |
| FLUX.2 VAE | 36.28 | 4.53 | 36.07 | 28.89 | 43.63 |
Evaluated on the same 2000 validation images as DINAC-AE. FLUX.2 numbers are
reused from the existing DINAC-AE 2k benchmark and were not recomputed for this
export.
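The benchmark columns can be reproduced from per-image PSNR values along the following lines. This is a minimal sketch, not the evaluation script used for this export: the helper names `psnr_db` and `summarize_psnr` are hypothetical, the `data_range` default assumes pixels rescaled to [0, 1], and the choice of population standard deviation and inclusive percentile interpolation is an assumption, since the source does not state which conventions were used.

```python
import math
import statistics


def psnr_db(reference, reconstruction, data_range=1.0):
    """PSNR in dB between two equal-length flat sequences of pixel values.

    data_range is the peak-to-peak signal range (assumed 1.0 here; use 2.0
    if the images are kept in [-1, 1]).
    """
    if len(reference) != len(reconstruction):
        raise ValueError("reference and reconstruction must have equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


def summarize_psnr(values):
    """Mean, std, median, 5th and 95th percentile of per-image PSNR values."""
    # n=20 yields cut points at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),  # population std (assumption)
        "median": statistics.median(values),
        "p5": cuts[0],
        "p95": cuts[-1],
    }


# Toy per-image PSNR values, not real benchmark data.
summary = summarize_psnr([30.0, 32.0, 34.0, 36.0, 38.0])
```

PSNR is computed per image first and the statistics are taken over the resulting list, which is why the table can report a spread (Std, P5, P95) rather than a single pooled number.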
DINAC-AE-D2 keeps DINAC-AE's reconstruction-focused autoencoder interface while using KL-like variance expansion and DINOv2 alignment to produce a learnable latent space that has shown faster downstream diffusion convergence.
The results viewer shows the 39-image reconstruction set with DINAC-AE-D2 and FLUX.2 VAE reconstructions, RGB differences, and latent PCA.
The 39-image set gives 35.46 dB mean PSNR (25.61 min, 46.69 max).
The DINAC-AE technical report describes the training recipe used for this model. DINAC-AE-D2 follows the same autoencoder training setup, with the teacher alignment changed to DINOv2 ViT-B/14 and the encoder depth increased from 6 to 8 blocks.
Encode Throughput
Measured on an NVIDIA GeForce RTX 5090 in bfloat16, averaging repeated
batches per resolution.
| Resolution | Batch Size | Model | Encode (ms/batch) | ms/image | Images/s | Peak VRAM (MiB) | Speedup vs FLUX.2 | Peak VRAM Reduction vs FLUX.2 |
|---|---|---|---|---|---|---|---|---|
| 256x256 | 128 | dinac_ae_d2 | 69.56 | 0.543 | 1840.0 | 1606.5 | 4.92x | 87.2% |
| 256x256 | 128 | dinac_ae | 50.25 | 0.393 | 2547.4 | 1569.7 | 6.80x | 87.5% |
| 256x256 | 128 | FLUX.2 VAE | 341.94 | 2.671 | 374.3 | 12533.8 | 1.00x | 0.0% |
| 512x512 | 32 | dinac_ae_d2 | 75.09 | 2.347 | 426.2 | 1606.7 | 4.74x | 87.2% |
| 512x512 | 32 | dinac_ae | 53.09 | 1.659 | 602.7 | 1570.0 | 6.70x | 87.5% |
| 512x512 | 32 | FLUX.2 VAE | 355.64 | 11.114 | 90.0 | 12533.8 | 1.00x | 0.0% |
The DINOv2-aligned encoder is slower than DINAC-AE's DINOv3-aligned encoder because it uses 8 transformer blocks instead of 6. It still encodes roughly 4.7x to 4.9x faster than the FLUX.2 VAE encoder and uses about 87% less peak VRAM at both measured resolutions.
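The derived columns of the throughput table follow from the measured batch time and peak VRAM by simple arithmetic. The sketch below shows that derivation; `derive_metrics` is a hypothetical helper name, and it assumes the baseline is measured at the same resolution and batch size, as is the case for every row pair in the table.

```python
def derive_metrics(encode_ms_per_batch, batch_size, peak_vram_mib,
                   baseline_ms_per_batch, baseline_vram_mib):
    """Derive per-image latency, throughput, speedup, and VRAM reduction."""
    ms_per_image = encode_ms_per_batch / batch_size
    return {
        "ms_per_image": ms_per_image,
        "images_per_s": 1000.0 / ms_per_image,
        # Speedup compares batch times at equal batch size.
        "speedup": baseline_ms_per_batch / encode_ms_per_batch,
        "vram_reduction_pct": 100.0 * (1.0 - peak_vram_mib / baseline_vram_mib),
    }


# dinac_ae_d2 at 256x256, batch 128, against the FLUX.2 VAE row.
row = derive_metrics(69.56, 128, 1606.5, 341.94, 12533.8)
```

Recomputing from the rounded table entries matches the published columns to within the last digit; small differences are expected if the original figures were derived from unrounded timings.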
Latent Interface
- `encode()` returns DINAC-AE-D2's own whitened latent space.
- `decode()` expects that same whitened latent space and dewhitens internally.
- `predict_class()` expects the same whitened latent space, dewhitens internally, and predicts a DINOv2-B class-token feature.
- `whiten()` and `dewhiten()` are exposed for explicit control.
- `encode_posterior()` returns the raw exported posterior before whitening.
- `DinacAEInferenceConfig.num_steps` counts decoder evaluations directly: `num_steps=1` means one NFE.
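To make the distinction between the raw posterior and the whitened latent concrete, here is a minimal sketch that assumes whitening is a per-channel affine normalization with fixed statistics. That form, the helper names `whiten_channels` and `dewhiten_channels`, and the example statistics are all assumptions for illustration; the source does not specify how the export computes its whitening, and the model's own `whiten()` and `dewhiten()` may behave differently.

```python
def whiten_channels(raw, mean, std):
    """Map raw per-channel latent values to a normalized space.

    Assumed form: (x - mean) / std per channel. Hypothetical, for
    illustration only.
    """
    return [(x - m) / s for x, m, s in zip(raw, mean, std)]


def dewhiten_channels(whitened, mean, std):
    """Inverse of whiten_channels: recover the raw latent values."""
    return [w * s + m for w, m, s in zip(whitened, mean, std)]


# Made-up statistics for a 3-channel latent at one spatial position.
mean = [0.5, -1.0, 2.0]
std = [2.0, 0.5, 4.0]
raw = [1.5, -0.75, -2.0]

whitened = whiten_channels(raw, mean, std)
restored = dewhiten_channels(whitened, mean, std)
```

Under this assumption the two functions are exact inverses, which mirrors the interface above: callers pass whitened latents everywhere, and `decode()` and `predict_class()` undo the normalization internally.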
The export ships weights in float32. The recommended runtime path is
bfloat16 for the main encoder, decoder, and class-token path, with float32
retained for whitening/dewhitening, normalization math, RoPE frequency
construction, and VP diffusion schedule helpers.
Usage
    import torch

    from dinac_ae import DinacAE, DinacAEInferenceConfig

    device = "cuda"

    # Weights ship in float32; load them in bfloat16 for the recommended runtime path.
    model = DinacAE.from_pretrained(
        "data-archetype/dinac_ae_d2",
        device=device,
        dtype=torch.bfloat16,
    )

    image = ...  # [1, 3, H, W] in [-1, 1], H and W divisible by 16

    with torch.inference_mode():
        # Whitened spatial latents.
        latents = model.encode(image.to(device=device, dtype=torch.bfloat16))
        # DINOv2-B class-token feature predicted from the latents.
        class_token = model.predict_class(latents)
        # One-step reconstruction at the input resolution.
        recon = model.decode(
            latents,
            height=int(image.shape[-2]),
            width=int(image.shape[-1]),
            inference_config=DinacAEInferenceConfig(num_steps=1),
        )
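The input constraint (H and W divisible by 16) and the latent size can be checked before calling the model. The sketch below is an illustration only: the helper names are hypothetical, and it assumes that the documented latent width of 128 is the channel count at each 16x16 patch position and that latents are laid out as `[B, C, H/16, W/16]`, neither of which the source states explicitly.

```python
PATCH_SIZE = 16        # patch size from the model details
LATENT_CHANNELS = 128  # latent width, assumed to be channels per position


def pad_to_multiple(size, multiple=PATCH_SIZE):
    """Smallest multiple of `multiple` that is >= size."""
    return -(-size // multiple) * multiple


def latent_shape(height, width, batch=1):
    """Expected latent shape under the assumed [B, C, H/16, W/16] layout."""
    if height % PATCH_SIZE or width % PATCH_SIZE:
        raise ValueError("height and width must be divisible by 16")
    return (batch, LATENT_CHANNELS, height // PATCH_SIZE, width // PATCH_SIZE)


def compression_ratio(height, width):
    """Ratio of RGB input values to latent values for one image."""
    _, channels, grid_h, grid_w = latent_shape(height, width)
    return (3 * height * width) / (channels * grid_h * grid_w)


shape_256 = latent_shape(256, 256)
padded = pad_to_multiple(500)
ratio = compression_ratio(512, 512)
```

Under these assumptions the ratio is independent of resolution: each 16x16 RGB patch holds 768 values and maps to 128 latent values.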
Details
- DINAC-AE-D2 uses an `8`-block ViT/DiT-style transformer encoder and an `8`-block FCDM decoder.
- Patch size is `16`, model width is `896`, and latent width is `128`.
- Total parameter count is `154.22M`: `78.02M` encoder, `61.93M` decoder, and `14.26M` DINO token/class alignment head.
- The DINO alignment head predicts spatial patch tokens and a class-token output in DINOv2 ViT-B/14 feature space.
- `predict_class(latents)` exposes the DINOv2 ViT-B/14 class-token feature directly from latents.
- DINOv2-B is empirically less spatially smooth than DINOv3-B and preserves more high-frequency information.
- Results viewer: https://huggingface.co/spaces/data-archetype/dinac_ae_d2-results
- Related: DINAC-AE, SemDisDiffAE, full_capacitor, capacitor_decoder
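The parameter counts above translate directly into weight storage for the float32 export and the recommended bfloat16 runtime path. The following is a back-of-the-envelope sketch with a hypothetical helper `weight_mib`; it counts parameters only and ignores buffers, activations, and any float32 copies kept for whitening or schedule helpers.

```python
# Parameter counts in millions, taken from the details list.
PARAMS_MILLIONS = {"encoder": 78.02, "decoder": 61.93, "alignment_head": 14.26}
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_mib(params_millions, dtype):
    """Approximate weight storage in MiB for a given parameter count and dtype."""
    return params_millions * 1e6 * BYTES_PER_PARAM[dtype] / 2 ** 20


total_millions = sum(PARAMS_MILLIONS.values())
export_mib = weight_mib(total_millions, "float32")
runtime_mib = weight_mib(total_millions, "bfloat16")
encoder_only_mib = weight_mib(PARAMS_MILLIONS["encoder"], "bfloat16")
```

The rounded components sum to 154.21M rather than the stated 154.22M, which is consistent with per-component rounding. The estimate comes to roughly 588 MiB in float32 and about half that in bfloat16.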
Citation
    @misc{dinac_ae_d2,
      title  = {DINAC-AE-D2: a DINOv2-aligned class-token diffusion autoencoder},
      author = {data-archetype},
      email  = {data-archetype@proton.me},
      year   = {2026},
      month  = jun,
      url    = {https://huggingface.co/data-archetype/dinac_ae_d2},
    }