### 2.1 Forward Process

Given a clean image \\(x_0\\), the forward process constructs a noisy sample at
continuous time \\(t \in [0, 1]\\):

$$x_t = \alpha_t \, x_0 + \sigma_t \, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, s^2 I)$$

where \\(s = 0.558\\) is the pixel-space noise standard deviation (estimated from
the dataset image distribution) and the VP constraint holds:
\\(\alpha_t^2 + \sigma_t^2 = 1\\).
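The forward process is a one-liner in practice. A minimal sketch (the function name and argument layout are illustrative, not from any released code):

```python
import numpy as np

def forward_process(x0, alpha_t, sigma_t, s=0.558, rng=None):
    """Construct x_t = alpha_t * x0 + sigma_t * eps with eps ~ N(0, s^2 I),
    where s is the pixel-space noise standard deviation."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.normal(0.0, s, size=x0.shape)  # N(0, s^2 I)
    return alpha_t * x0 + sigma_t * eps
```

At \\(t = 0\\) (so \\(\alpha_t = 1, \sigma_t = 0\\)) the sample reduces to the clean image.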
### 2.2 Log Signal-to-Noise Ratio

The schedule is parameterized through the log signal-to-noise ratio:

$$\lambda_t = \log \frac{\alpha_t^2}{\sigma_t^2}$$

which monotonically decreases as \\(t \to 1\\) (pure noise). From \\(\lambda_t\\)
we recover \\(\alpha_t\\) and \\(\sigma_t\\) via the sigmoid function:

$$\alpha_t = \sqrt{\sigma(\lambda_t)}, \qquad \sigma_t = \sqrt{\sigma(-\lambda_t)}$$
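The sigmoid parameterization satisfies the VP constraint automatically, since \\(\sigma(\lambda) + \sigma(-\lambda) = 1\\). A small sketch to check this numerically (hypothetical function names):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def alpha_sigma_from_logsnr(lam):
    """Recover (alpha_t, sigma_t) from the logSNR lambda_t.
    alpha^2 + sigma^2 = sigmoid(lam) + sigmoid(-lam) = 1 by construction."""
    return math.sqrt(sigmoid(lam)), math.sqrt(sigmoid(-lam))
```

Inverting back, \\(\log(\alpha^2/\sigma^2) = \log(e^{\lambda}) = \lambda\\), so the round trip is exact.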
### 2.3 Cosine-Interpolated Schedule

Following SiD2, the logSNR schedule uses cosine interpolation:

$$\lambda(t) = -2 \log \tan(a \cdot t + b)$$

where \\(a\\) and \\(b\\) are computed to satisfy the boundary conditions
\\(\lambda(0) = \lambda_\text{max} = 10\\) and
\\(\lambda(1) = \lambda_\text{min} = -10\\).
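The boundary conditions have closed-form solutions: \\(\lambda(0) = -2\log\tan(b)\\) gives \\(b = \arctan(e^{-\lambda_\text{max}/2})\\), and \\(\lambda(1) = -2\log\tan(a + b)\\) gives \\(a = \arctan(e^{-\lambda_\text{min}/2}) - b\\). A sketch (helper names are assumptions, not the report's code):

```python
import math

def cosine_schedule_coeffs(logsnr_max=10.0, logsnr_min=-10.0):
    """Solve lambda(t) = -2*log(tan(a*t + b)) for (a, b) such that
    lambda(0) = logsnr_max and lambda(1) = logsnr_min."""
    b = math.atan(math.exp(-0.5 * logsnr_max))
    a = math.atan(math.exp(-0.5 * logsnr_min)) - b
    return a, b

def logsnr(t, a, b):
    """Cosine-interpolated logSNR schedule, monotone decreasing in t."""
    return -2.0 * math.log(math.tan(a * t + b))
```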
### 2.4 X-Prediction Objective

The model predicts the clean image \\(\hat{x}_0 = f_\theta(x_t, t, z)\\)
conditioned on encoder latents \\(z\\).

**Schedule-invariant loss.** Following SiD2, the training loss is defined as
an integral over logSNR \\(\lambda\\), making it invariant to the choice of
noise schedule. Since timesteps are sampled uniformly
\\(t \sim \mathcal{U}(0,1)\\), the change of variable introduces a Jacobian
factor:

$$\mathcal{L} = \mathbb{E}_{t \sim \mathcal{U}(0,1)} \left[ \left(-\frac{d\lambda}{dt}\right) \cdot w(\lambda(t)) \cdot \| x_0 - \hat{x}_0 \|^2 \right]$$

**Sigmoid weighting.** The weighting function uses a sigmoid centered at bias
\\(b = -2.0\\), converting from \\(\varepsilon\\)-prediction to
\\(x\\)-prediction form:

$$\text{weight}(t) = -\frac{1}{2} \frac{d\lambda}{dt} \cdot e^b \cdot \sigma(\lambda(t) - b)$$
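For the cosine-interpolated schedule the Jacobian has the closed form \\(d\lambda/dt = -4a / \sin(2(at + b))\\), which is negative, so the overall weight is positive. A sketch of the pointwise weight (function names are illustrative; here `a`, `b` are the schedule coefficients and `bias` is the sigmoid bias, which is a separate quantity despite the report reusing the letter \\(b\\)):

```python
import math

def logsnr_and_slope(t, a, b):
    """lambda(t) = -2 log tan(a t + b) and its analytic derivative."""
    u = a * t + b
    lam = -2.0 * math.log(math.tan(u))
    dlam_dt = -4.0 * a / math.sin(2.0 * u)  # negative: logSNR decreases in t
    return lam, dlam_dt

def loss_weight(t, a, b, bias=-2.0):
    """weight(t) = -(1/2) * (dlambda/dt) * e^bias * sigmoid(lambda - bias)."""
    lam, dlam_dt = logsnr_and_slope(t, a, b)
    sig = 1.0 / (1.0 + math.exp(-(lam - bias)))
    return -0.5 * dlam_dt * math.exp(bias) * sig
```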
### 2.5 Sampling
The encoder outputs two sets of 128 channels:

- \\(\mu\\) — the clean signal (posterior mean)
- \\(\lambda\\) — per-element log signal-to-noise ratio

The posterior distribution is:

$$z = \alpha(\lambda) \, \mu + \sigma(\lambda) \, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$

where \\(\alpha = \sqrt{\sigma(\lambda)}\\) and
\\(\sigma = \sqrt{\sigma(-\lambda)}\\) (sigmoid parameterization). This is
equivalent to a Gaussian with mean \\(\alpha \mu\\) and variance
\\(\sigma^2\\).

Using a VP interpolation rather than simple additive noise decouples token
scale from stochasticity. With additive noise (\\(z = \mu + \sigma\varepsilon\\)),
the encoder faces gradient pressure to scale latents up to counter the noise,
since the SNR depends on the magnitude of \\(\mu\\). The VP formulation
(\\(z = \alpha\mu + \sigma\varepsilon\\) with \\(\alpha^2 + \sigma^2 = 1\\))
removes this coupling: the noise level is controlled entirely by the predicted
log-SNR, independent of the latent magnitude.
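The reparameterized posterior sample can be sketched as follows (illustrative names; `logsnr` stands in for the per-element predicted log-SNR channels):

```python
import numpy as np

def sample_posterior(mu, logsnr, rng=None):
    """z = alpha(lam) * mu + sigma(lam) * eps, eps ~ N(0, I), with
    alpha = sqrt(sigmoid(lam)) and sigma = sqrt(sigmoid(-lam)),
    applied element-wise (VP reparameterization)."""
    rng = np.random.default_rng() if rng is None else rng
    alpha = np.sqrt(1.0 / (1.0 + np.exp(-logsnr)))
    sigma = np.sqrt(1.0 / (1.0 + np.exp(logsnr)))
    eps = rng.standard_normal(mu.shape)
    return alpha * mu + sigma * eps
```

As the predicted log-SNR grows, \\(\alpha \to 1\\) and \\(\sigma \to 0\\), so the sample collapses to the posterior mean regardless of the scale of \\(\mu\\).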
### 3.2 Variance Expansion Loss

So that the decoder cannot ignore the stochastic component entirely, we adopt
a **variance expansion loss** inspired by VEL (Li et al., 2026,
[arXiv:2603.21085](https://arxiv.org/abs/2603.21085)):

$$\mathcal{L}_\text{var} = -\operatorname{mean}\!\bigl(\log(\sigma^2 + \delta)\bigr)$$

where \\(\sigma^2\\) is the posterior variance derived from the predicted
log-SNR and \\(\delta = 10^{-6}\\) for numerical stability. This loss
encourages non-zero posterior variance by penalizing small \\(\sigma^2\\).

VEL proposes the form \\(1/(\sigma^2 + \delta)\\) for variance expansion. We
found this to be too aggressive: the \\(1/\sigma^2\\) gradient pushes variance
up very rapidly, leading to excessive high-frequency noise in the latent
space. We use the \\(-\log(\sigma^2 + \delta)\\) form instead, which provides
a gentler, logarithmic penalty that stabilizes training.
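A sketch of the loss as used here, assuming the posterior variance \\(\sigma^2 = \sigma(-\lambda)\\) from §3.1 (function name illustrative):

```python
import numpy as np

def variance_expansion_loss(logsnr, delta=1e-6):
    """L_var = -mean(log(sigma^2 + delta)), where sigma^2 = sigmoid(-logsnr)
    is the per-element posterior variance. Large predicted log-SNR means
    small variance and thus a large penalty."""
    var = 1.0 / (1.0 + np.exp(logsnr))  # sigmoid(-logsnr)
    return -np.mean(np.log(var + delta))
```

The gradient with respect to \\(\sigma^2\\) is \\(-1/(\sigma^2 + \delta)\\), the same direction as VEL's form but with a magnitude that decays as variance grows, which is the "gentler" behavior described above.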
**For this checkpoint:** the variance expansion loss is active with weight
**1e-5**.
The total training loss is:

$$\mathcal{L}_\text{total} = \mathcal{L}_\text{recon} + 0.01 \cdot \mathcal{L}_\text{semantic} + 10^{-4} \cdot \mathcal{L}_\text{scale} + 10^{-5} \cdot \mathcal{L}_\text{var}$$

| Loss | Weight | Description |
|------|--------|-------------|
| \\(\mathcal{L}_\text{recon}\\) | 1.0 | SiD2 sigmoid-weighted x-prediction MSE (\\(b = -2.0\\)). Per-pixel \\((\hat{x}_0 - x_0)^2\\) averaged over (C, H, W) per sample, multiplied by \\(w(t) = -\tfrac{1}{2} \tfrac{d\lambda}{dt} e^b \sigma(\lambda - b)\\), then averaged over the batch |
| \\(\mathcal{L}_\text{semantic}\\) | 0.01 | Per-token \\(1 - \cos(\text{student}, \text{teacher})\\) averaged over all tokens and batch (see §4) |
| \\(\mathcal{L}_\text{scale}\\) | 0.0001 | Per-channel variance \\(\text{var}_c\\) estimated over (B, H, W), then \\((\log(\text{var}_c + \varepsilon) - \log(\text{target}))^2\\) averaged over channels. Target variance = 1.0 |
| \\(\mathcal{L}_\text{var}\\) | 1e-5 | Per-element \\(-\log(\sigma^2 + \delta)\\) where \\(\sigma^2\\) is the posterior variance, averaged over all dims (B, C, H, W). See §3.2 |
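The combination is a plain weighted sum; a sketch, assuming the four scalar terms have already been computed as above:

```python
def total_loss(recon, semantic, scale, var):
    """Weighted sum from the table: reconstruction dominates,
    auxiliary terms carry small weights (0.01, 1e-4, 1e-5)."""
    return recon + 0.01 * semantic + 1e-4 * scale + 1e-5 * var
```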
**Note on loss scales:** The decoder reconstruction loss has a small
effective magnitude due to the SiD2 VP x-prediction weighting (the Jacobian
\\(d\lambda/dt\\) and sigmoid weighting compress the per-sample loss scale). As a
result, all auxiliary loss weights must be kept correspondingly small to
avoid dominating the reconstruction objective.