### 2.1 Forward Process

Given a clean image \\(x_0\\), the forward process constructs a noisy sample at
continuous time \\(t \in [0, 1]\\):

$$x_t = \alpha_t \, x_0 + \sigma_t \, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, s^2 I)$$

where \\(s = 0.558\\) is the pixel-space noise standard deviation (estimated from
the dataset image distribution) and the VP constraint holds:
\\(\alpha_t^2 + \sigma_t^2 = 1\\).
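The forward process is a one-liner in practice. A minimal sketch (the function name and argument layout are illustrative, not from any released code):

```python
import numpy as np

def forward_process(x0, alpha_t, sigma_t, s=0.558, rng=None):
    """Construct x_t = alpha_t * x0 + sigma_t * eps with eps ~ N(0, s^2 I),
    where s is the pixel-space noise standard deviation."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.normal(0.0, s, size=x0.shape)  # N(0, s^2 I)
    return alpha_t * x0 + sigma_t * eps
```

At \\(t = 0\\) (so \\(\alpha_t = 1, \sigma_t = 0\\)) the sample reduces to the clean image.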
### 2.2 Log Signal-to-Noise Ratio

The schedule is parameterized through the log signal-to-noise ratio:

$$\lambda_t = \log \frac{\alpha_t^2}{\sigma_t^2}$$

which monotonically decreases as \\(t \to 1\\) (pure noise). From \\(\lambda_t\\)
we recover \\(\alpha_t\\) and \\(\sigma_t\\) via the sigmoid function:

$$\alpha_t = \sqrt{\sigma(\lambda_t)}, \qquad \sigma_t = \sqrt{\sigma(-\lambda_t)}$$
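The sigmoid parameterization satisfies the VP constraint automatically, since \\(\sigma(\lambda) + \sigma(-\lambda) = 1\\). A small sketch to check this numerically (hypothetical function names):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def alpha_sigma_from_logsnr(lam):
    """Recover (alpha_t, sigma_t) from the logSNR lambda_t.
    alpha^2 + sigma^2 = sigmoid(lam) + sigmoid(-lam) = 1 by construction."""
    return math.sqrt(sigmoid(lam)), math.sqrt(sigmoid(-lam))
```

Inverting back, \\(\log(\alpha^2/\sigma^2) = \log(e^{\lambda}) = \lambda\\), so the round trip is exact.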
### 2.3 Cosine-Interpolated Schedule

Following SiD2, the logSNR schedule uses cosine interpolation:

$$\lambda(t) = -2 \log \tan(a \cdot t + b)$$

where \\(a\\) and \\(b\\) are computed to satisfy the boundary conditions
\\(\lambda(0) = \lambda_\text{max} = 10\\) and
\\(\lambda(1) = \lambda_\text{min} = -10\\).
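The boundary conditions have closed-form solutions: \\(\lambda(0) = -2\log\tan(b)\\) gives \\(b = \arctan(e^{-\lambda_\text{max}/2})\\), and \\(\lambda(1) = -2\log\tan(a + b)\\) gives \\(a = \arctan(e^{-\lambda_\text{min}/2}) - b\\). A sketch (helper names are assumptions, not the report's code):

```python
import math

def cosine_schedule_coeffs(logsnr_max=10.0, logsnr_min=-10.0):
    """Solve lambda(t) = -2*log(tan(a*t + b)) for (a, b) such that
    lambda(0) = logsnr_max and lambda(1) = logsnr_min."""
    b = math.atan(math.exp(-0.5 * logsnr_max))
    a = math.atan(math.exp(-0.5 * logsnr_min)) - b
    return a, b

def logsnr(t, a, b):
    """Cosine-interpolated logSNR schedule, monotone decreasing in t."""
    return -2.0 * math.log(math.tan(a * t + b))
```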
### 2.4 X-Prediction Objective

The model predicts the clean image \\(\hat{x}_0 = f_\theta(x_t, t, z)\\)
conditioned on encoder latents \\(z\\).

**Schedule-invariant loss.** Following SiD2, the training loss is defined as
an integral over logSNR \\(\lambda\\), making it invariant to the choice of
noise schedule. Since timesteps are sampled uniformly
\\(t \sim \mathcal{U}(0,1)\\), the change of variable introduces a Jacobian
factor:

$$\mathcal{L} = \mathbb{E}_{t \sim \mathcal{U}(0,1)} \left[ \left(-\frac{d\lambda}{dt}\right) \cdot w(\lambda(t)) \cdot \| x_0 - \hat{x}_0 \|^2 \right]$$

**Sigmoid weighting.** The weighting function uses a sigmoid centered at bias
\\(b = -2.0\\), converting from \\(\varepsilon\\)-prediction to
\\(x\\)-prediction form:

$$\text{weight}(t) = -\frac{1}{2} \frac{d\lambda}{dt} \cdot e^b \cdot \sigma(\lambda(t) - b)$$
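For the cosine-interpolated schedule the Jacobian has the closed form \\(d\lambda/dt = -4a / \sin(2(at + b))\\), which is negative, so the overall weight is positive. A sketch of the pointwise weight (function names are illustrative; here `a`, `b` are the schedule coefficients and `bias` is the sigmoid bias, which is a separate quantity despite the report reusing the letter \\(b\\)):

```python
import math

def logsnr_and_slope(t, a, b):
    """lambda(t) = -2 log tan(a t + b) and its analytic derivative."""
    u = a * t + b
    lam = -2.0 * math.log(math.tan(u))
    dlam_dt = -4.0 * a / math.sin(2.0 * u)  # negative: logSNR decreases in t
    return lam, dlam_dt

def loss_weight(t, a, b, bias=-2.0):
    """weight(t) = -(1/2) * (dlambda/dt) * e^bias * sigmoid(lambda - bias)."""
    lam, dlam_dt = logsnr_and_slope(t, a, b)
    sig = 1.0 / (1.0 + math.exp(-(lam - bias)))
    return -0.5 * dlam_dt * math.exp(bias) * sig
```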
### 2.5 Sampling
The encoder outputs two sets of 128 channels:

- \\(\mu\\) — the clean signal (posterior mean)
- \\(\lambda\\) — per-element log signal-to-noise ratio

The posterior distribution is:

$$z = \alpha(\lambda) \, \mu + \sigma(\lambda) \, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$

where \\(\alpha = \sqrt{\sigma(\lambda)}\\) and
\\(\sigma = \sqrt{\sigma(-\lambda)}\\) (sigmoid parameterization). This is
equivalent to a Gaussian with mean \\(\alpha \mu\\) and variance
\\(\sigma^2\\).

Using a VP interpolation rather than simple additive noise decouples token
scale from stochasticity. With additive noise (\\(z = \mu + \sigma\varepsilon\\)),
the encoder faces gradient pressure to scale latents up to counter the noise,
since the SNR depends on the magnitude of \\(\mu\\). The VP formulation
(\\(z = \alpha\mu + \sigma\varepsilon\\) with \\(\alpha^2 + \sigma^2 = 1\\))
removes this coupling: the noise level is controlled entirely by the predicted
log-SNR, independent of the latent magnitude.
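The reparameterized posterior sample can be sketched as follows (illustrative names; `logsnr` stands in for the per-element predicted log-SNR channels):

```python
import numpy as np

def sample_posterior(mu, logsnr, rng=None):
    """z = alpha(lam) * mu + sigma(lam) * eps, eps ~ N(0, I), with
    alpha = sqrt(sigmoid(lam)) and sigma = sqrt(sigmoid(-lam)),
    applied element-wise (VP reparameterization)."""
    rng = np.random.default_rng() if rng is None else rng
    alpha = np.sqrt(1.0 / (1.0 + np.exp(-logsnr)))
    sigma = np.sqrt(1.0 / (1.0 + np.exp(logsnr)))
    eps = rng.standard_normal(mu.shape)
    return alpha * mu + sigma * eps
```

As the predicted log-SNR grows, \\(\alpha \to 1\\) and \\(\sigma \to 0\\), so the sample collapses to the posterior mean regardless of the scale of \\(\mu\\).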
### 3.2 Variance Expansion Loss

So that the decoder cannot ignore the stochastic component entirely, we adopt
a **variance expansion loss** inspired by VEL (Li et al., 2026,
[arXiv:2603.21085](https://arxiv.org/abs/2603.21085)):

$$\mathcal{L}_\text{var} = -\operatorname{mean}\!\bigl(\log(\sigma^2 + \delta)\bigr)$$

where \\(\sigma^2\\) is the posterior variance derived from the predicted
log-SNR and \\(\delta = 10^{-6}\\) for numerical stability. This loss
encourages non-zero posterior variance by penalizing small \\(\sigma^2\\).

VEL proposes the form \\(1/(\sigma^2 + \delta)\\) for variance expansion. We
found this to be too aggressive: the \\(1/\sigma^2\\) gradient pushes variance
up very rapidly, leading to excessive high-frequency noise in the latent
space. We use the \\(-\log(\sigma^2 + \delta)\\) form instead, which provides
a gentler, logarithmic penalty that stabilizes training.
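A sketch of the loss as used here, assuming the posterior variance \\(\sigma^2 = \sigma(-\lambda)\\) from §3.1 (function name illustrative):

```python
import numpy as np

def variance_expansion_loss(logsnr, delta=1e-6):
    """L_var = -mean(log(sigma^2 + delta)), where sigma^2 = sigmoid(-logsnr)
    is the per-element posterior variance. Large predicted log-SNR means
    small variance and thus a large penalty."""
    var = 1.0 / (1.0 + np.exp(logsnr))  # sigmoid(-logsnr)
    return -np.mean(np.log(var + delta))
```

The gradient with respect to \\(\sigma^2\\) is \\(-1/(\sigma^2 + \delta)\\), the same direction as VEL's form but with a magnitude that decays as variance grows, which is the "gentler" behavior described above.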
**For this checkpoint:** the variance expansion loss is active with weight
**1e-5**.
The total training loss is:

$$\mathcal{L}_\text{total} = \mathcal{L}_\text{recon} + 0.01 \cdot \mathcal{L}_\text{semantic} + 10^{-4} \cdot \mathcal{L}_\text{scale} + 10^{-5} \cdot \mathcal{L}_\text{var}$$

| Loss | Weight | Description |
|------|--------|-------------|
| \\(\mathcal{L}_\text{recon}\\) | 1.0 | SiD2 sigmoid-weighted x-prediction MSE (\\(b = -2.0\\)). Per-pixel \\((\hat{x}_0 - x_0)^2\\) averaged over (C, H, W) per sample, multiplied by \\(w(t) = -\tfrac{1}{2} \tfrac{d\lambda}{dt} e^b \sigma(\lambda - b)\\), then averaged over the batch |
| \\(\mathcal{L}_\text{semantic}\\) | 0.01 | Per-token \\(1 - \cos(\text{student}, \text{teacher})\\) averaged over all tokens and batch (see §4) |
| \\(\mathcal{L}_\text{scale}\\) | 0.0001 | Per-channel variance \\(\text{var}_c\\) estimated over (B, H, W), then \\((\log(\text{var}_c + \varepsilon) - \log(\text{target}))^2\\) averaged over channels. Target variance = 1.0 |
| \\(\mathcal{L}_\text{var}\\) | 1e-5 | Per-element \\(-\log(\sigma^2 + \delta)\\) where \\(\sigma^2\\) is the posterior variance, averaged over all dims (B, C, H, W). See §3.2 |
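The combination is a plain weighted sum; a sketch, assuming the four scalar terms have already been computed as above:

```python
def total_loss(recon, semantic, scale, var):
    """Weighted sum from the table: reconstruction dominates,
    auxiliary terms carry small weights (0.01, 1e-4, 1e-5)."""
    return recon + 0.01 * semantic + 1e-4 * scale + 1e-5 * var
```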
**Note on loss scales:** The decoder reconstruction loss has a small
effective magnitude due to the SiD2 VP x-prediction weighting (the Jacobian
\\(d\lambda/dt\\) and sigmoid weighting compress the per-sample loss scale). As a
result, all auxiliary loss weights must be kept correspondingly small to
avoid dominating the reconstruction objective.