data-archetype committed on
Commit 332dbaf · verified · 1 Parent(s): af4eb82

Upload technical_report_semantic.md with huggingface_hub

Files changed (1)
  1. technical_report_semantic.md +52 -63
technical_report_semantic.md CHANGED
@@ -196,62 +196,54 @@ SiD2 with an x-prediction objective.
 
 ### 2.1 Forward Process
 
- Given a clean image x₀, the forward process constructs a noisy sample at
- continuous time t ∈ [0, 1]:
 
- ```
- x_t = α_t · x₀ + σ_t · ε, ε ~ N(0, s²I)
- ```
 
- where s = 0.558 is the pixel-space noise standard deviation (estimated from
- the dataset image distribution) and the VP constraint holds: α²_t + σ²_t = 1.
 
 ### 2.2 Log Signal-to-Noise Ratio
 
 The schedule is parameterized through the log signal-to-noise ratio:
 
- ```
- λ_t = log(α²_t / σ²_t)
- ```
 
- which monotonically decreases as t → 1 (pure noise). From λ_t we recover
- α_t and σ_t via the sigmoid function:
 
- ```
- α_t = √σ(λ_t), σ_t = √σ(-λ_t)
- ```
 
 ### 2.3 Cosine-Interpolated Schedule
 
 Following SiD2, the logSNR schedule uses cosine interpolation:
 
- ```
- λ(t) = -2 log tan(a·t + b)
- ```
 
- where a and b are computed to satisfy the boundary conditions
- λ(0) = λ_max = 10 and λ(1) = λ_min = -10.
 
 ### 2.4 X-Prediction Objective
 
- The model predicts the clean image x̂₀ = f_θ(x_t, t, z) conditioned on
- encoder latents z.
 
 **Schedule-invariant loss.** Following SiD2, the training loss is defined as
- an integral over logSNR λ, making it invariant to the choice of noise schedule.
- Since timesteps are sampled uniformly t ~ U(0,1), the change of variable
- introduces a Jacobian factor:
 
- ```
- L = E_{t ~ U(0,1)} [ (-dλ/dt) · w(λ(t)) · ||x₀ - x̂₀||² ]
- ```
 
 **Sigmoid weighting.** The weighting function uses a sigmoid centered at bias
- b = -2.0, converting from ε-prediction to x-prediction form:
 
- ```
- weight(t) = -(1/2) · (dλ/dt) · e^b · σ(λ(t) - b)
- ```
 
 ### 2.5 Sampling
 
@@ -276,24 +268,25 @@ as an alternative to the traditional VAE KL penalty.
 
 The encoder outputs two sets of 128 channels:
 
- - **μ** — the clean signal (posterior mean)
- - **λ** — per-element log signal-to-noise ratio
 
 The posterior distribution is:
 
- ```
- z = α(λ) · μ + σ(λ) · ε, ε ~ N(0, I)
- ```
 
- where α = √σ(λ) and σ = √σ(-λ) (sigmoid parameterization). This is
- equivalent to a Gaussian with mean α·μ and variance σ².
 
 Using a VP interpolation rather than simple additive noise decouples token
- scale from stochasticity. With additive noise (`z = μ + σε`), the encoder
- faces gradient pressure to scale latents up to counter the noise — the SNR
- depends on the magnitude of μ. The VP formulation (`z = α·μ + σ·ε` with
- `α² + σ² = 1`) removes this coupling: the noise level is controlled
- entirely by the predicted log-SNR, independent of the latent magnitude.
 
 ### 3.2 Variance Expansion Loss
 
@@ -302,19 +295,17 @@ ignore the stochastic component entirely), we adopt a **variance expansion
 loss** inspired by VEL (Li et al., 2026,
 [arXiv:2603.21085](https://arxiv.org/abs/2603.21085)):
 
- ```
- L_var = -mean(log(σ² + δ))
- ```
 
- where σ² is the posterior variance derived from the predicted log-SNR and
- δ is a small epsilon (1e-6) for numerical stability. This loss encourages
- non-zero posterior variance by penalizing small σ².
 
- VEL proposes the form `1/(σ² + δ)` for variance expansion. We found this to
- be too aggressive — the `1/σ²` gradient pushes variance up very rapidly,
- leading to excessive high-frequency noise in the latent space. We use the
- `-log(σ² + δ)` form instead, which provides a gentler, logarithmic penalty
- that stabilizes training.
 
 **For this checkpoint:** the variance expansion loss is active with weight
 **1e-5**.
@@ -481,20 +472,18 @@ three purposes:
 
 The total training loss is:
 
- ```
- L_total = L_recon + 0.01 · L_semantic + 0.0001 · L_scale + 1e-5 · L_var
- ```
 
 | Loss | Weight | Description |
 |------|--------|-------------|
- | **Reconstruction** (L_recon) | 1.0 | SiD2 sigmoid-weighted x-prediction MSE (bias b = -2.0). Per-pixel `(x̂₀ - x₀)²` averaged over (C, H, W) per sample, multiplied by the SiD2 per-sample weight `w(t) = -½ · dλ/dt · e^b · σ(λ-b)`, then averaged over the batch |
- | **Semantic alignment** (L_semantic) | 0.01 | Per-token `(1 - cosine(student, teacher))` averaged over all tokens and batch (see §4) |
- | **Latent scale penalty** (L_scale) | 0.0001 | Per-channel variance `var_c` estimated over the batch and spatial dims (B, H, W), then `(log(var_c + ε) - log(target))²` averaged over channels. Target variance = 1.0 |
- | **Posterior variance expansion** (L_var) | 1e-5 | Per-element `-log(σ² + δ)` where σ² is the posterior variance derived from the predicted log-SNR, averaged over all dims (B, C, H, W). See §3.2 |
 
 **Note on loss scales:** The decoder reconstruction loss has a small
 effective magnitude due to the SiD2 VP x-prediction weighting (the Jacobian
- dλ/dt and sigmoid weighting compress the per-sample loss scale). As a
 result, all auxiliary loss weights must be kept correspondingly small to
 avoid dominating the reconstruction objective.
 
 
 
 ### 2.1 Forward Process
 
+ Given a clean image \\(x_0\\), the forward process constructs a noisy sample at
+ continuous time \\(t \in [0, 1]\\):
 
+ $$x_t = \alpha_t \, x_0 + \sigma_t \, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, s^2 I)$$
 
+ where \\(s = 0.558\\) is the pixel-space noise standard deviation (estimated from
+ the dataset image distribution) and the VP constraint holds:
+ \\(\alpha_t^2 + \sigma_t^2 = 1\\).
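As a concrete illustration, the forward process can be sketched in NumPy. This is a minimal sketch under the stated assumptions; the function and argument names are ours, not from the repository, and `s` defaults to the reported 0.558.

```python
import numpy as np

def forward_process(x0, alpha_t, sigma_t, s=0.558, rng=None):
    """Noise a clean image x0 at schedule values (alpha_t, sigma_t).

    Assumes the VP constraint alpha_t**2 + sigma_t**2 == 1 holds and that
    the noise has per-pixel standard deviation s (0.558 in the report).
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.normal(0.0, s, size=x0.shape)  # eps ~ N(0, s^2 I)
    return alpha_t * x0 + sigma_t * eps
```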
 
 ### 2.2 Log Signal-to-Noise Ratio
 
 The schedule is parameterized through the log signal-to-noise ratio:
 
+ $$\lambda_t = \log \frac{\alpha_t^2}{\sigma_t^2}$$
 
+ which monotonically decreases as \\(t \to 1\\) (pure noise). From \\(\lambda_t\\)
+ we recover \\(\alpha_t\\) and \\(\sigma_t\\) via the sigmoid function:
 
+ $$\alpha_t = \sqrt{\sigma(\lambda_t)}, \qquad \sigma_t = \sqrt{\sigma(-\lambda_t)}$$
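The sigmoid recovery above is a two-liner; a minimal sketch (the function name is ours):

```python
import math

def alpha_sigma_from_logsnr(lam):
    """Map a logSNR value to (alpha, sigma) via the sigmoid, so that
    alpha**2 + sigma**2 == 1 and log(alpha**2 / sigma**2) == lam."""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    return math.sqrt(sigmoid(lam)), math.sqrt(sigmoid(-lam))
```

The identity σ(λ) + σ(-λ) = 1 is what makes the VP constraint hold automatically.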
 
 ### 2.3 Cosine-Interpolated Schedule
 
 Following SiD2, the logSNR schedule uses cosine interpolation:
 
+ $$\lambda(t) = -2 \log \tan(a \cdot t + b)$$
 
+ where \\(a\\) and \\(b\\) are computed to satisfy the boundary conditions
+ \\(\lambda(0) = \lambda_\text{max} = 10\\) and
+ \\(\lambda(1) = \lambda_\text{min} = -10\\).
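Solving the boundary conditions in closed form gives b = arctan(exp(-λ_max/2)) and a = arctan(exp(-λ_min/2)) - b. A sketch of this (function names are ours, not the repository's):

```python
import math

def schedule_coeffs(lam_max=10.0, lam_min=-10.0):
    """Solve the boundary conditions of lambda(t) = -2*log(tan(a*t + b)):
    lambda(0) = lam_max gives b = atan(exp(-lam_max/2));
    lambda(1) = lam_min gives a = atan(exp(-lam_min/2)) - b."""
    b = math.atan(math.exp(-0.5 * lam_max))
    a = math.atan(math.exp(-0.5 * lam_min)) - b
    return a, b

def logsnr(t, a, b):
    """Cosine-interpolated logSNR schedule."""
    return -2.0 * math.log(math.tan(a * t + b))
```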
 
 ### 2.4 X-Prediction Objective
 
+ The model predicts the clean image \\(\hat{x}_0 = f_\theta(x_t, t, z)\\)
+ conditioned on encoder latents \\(z\\).
 
 **Schedule-invariant loss.** Following SiD2, the training loss is defined as
+ an integral over logSNR \\(\lambda\\), making it invariant to the choice of
+ noise schedule. Since timesteps are sampled uniformly
+ \\(t \sim \mathcal{U}(0,1)\\), the change of variable introduces a Jacobian
+ factor:
 
+ $$\mathcal{L} = \mathbb{E}_{t \sim \mathcal{U}(0,1)} \left[ \left(-\frac{d\lambda}{dt}\right) \cdot w(\lambda(t)) \cdot \| x_0 - \hat{x}_0 \|^2 \right]$$
 
 **Sigmoid weighting.** The weighting function uses a sigmoid centered at bias
+ \\(b = -2.0\\), converting from \\(\varepsilon\\)-prediction to
+ \\(x\\)-prediction form:
 
+ $$\text{weight}(t) = -\frac{1}{2} \frac{d\lambda}{dt} \cdot e^b \cdot \sigma(\lambda(t) - b)$$
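For the cosine schedule λ(t) = -2 log tan(a·t + b), the Jacobian is dλ/dt = -4a / sin(2(a·t + b)), so the per-sample weight can be computed directly. A hypothetical helper sketching this (not the repository's code; the schedule bias `bias` is the report's b = -2.0, distinct from the schedule intercept `b`):

```python
import math

def xpred_weight(t, a, b, bias=-2.0):
    """Per-sample weight: -(1/2) * dlambda/dt * e^bias * sigmoid(lambda - bias).

    Uses lambda(t) = -2*log(tan(a*t + b)), whose derivative is
    dlambda/dt = -4*a / sin(2*(a*t + b)) (negative: lambda decreases in t).
    """
    u = a * t + b
    lam = -2.0 * math.log(math.tan(u))
    dlam_dt = -4.0 * a / math.sin(2.0 * u)
    sigmoid = 1.0 / (1.0 + math.exp(-(lam - bias)))
    return -0.5 * dlam_dt * math.exp(bias) * sigmoid
```

Because dλ/dt is negative everywhere on the schedule, the leading minus sign keeps the weight positive.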
 
 ### 2.5 Sampling
 
 
 The encoder outputs two sets of 128 channels:
 
+ - \\(\mu\\) — the clean signal (posterior mean)
+ - \\(\lambda\\) — per-element log signal-to-noise ratio
 
 The posterior distribution is:
 
+ $$z = \alpha(\lambda) \, \mu + \sigma(\lambda) \, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$
 
+ where \\(\alpha = \sqrt{\sigma(\lambda)}\\) and
+ \\(\sigma = \sqrt{\sigma(-\lambda)}\\) (sigmoid parameterization). This is
+ equivalent to a Gaussian with mean \\(\alpha \mu\\) and variance
+ \\(\sigma^2\\).
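The posterior sampling step translates directly into code. A NumPy sketch (the function name and array layout are our assumptions):

```python
import numpy as np

def vp_posterior_sample(mu, log_snr, rng=None):
    """Sample z = alpha(lambda)*mu + sigma(lambda)*eps per element, with
    alpha = sqrt(sigmoid(lambda)) and sigma = sqrt(sigmoid(-lambda)),
    so alpha**2 + sigma**2 == 1 everywhere.

    mu, log_snr: same-shape arrays (the encoder's two 128-channel heads).
    """
    rng = np.random.default_rng() if rng is None else rng
    alpha = np.sqrt(1.0 / (1.0 + np.exp(-log_snr)))  # sqrt(sigmoid(lambda))
    sigma = np.sqrt(1.0 / (1.0 + np.exp(log_snr)))   # sqrt(sigmoid(-lambda))
    return alpha * mu + sigma * rng.standard_normal(mu.shape)
```

At very high predicted logSNR the sample collapses onto the mean; at logSNR 0 it splits the unit budget evenly between signal and noise.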
 
 Using a VP interpolation rather than simple additive noise decouples token
+ scale from stochasticity. With additive noise (\\(z = \mu + \sigma\varepsilon\\)),
+ the encoder faces gradient pressure to scale latents up to counter the noise
+ — the SNR depends on the magnitude of \\(\mu\\). The VP formulation
+ (\\(z = \alpha\mu + \sigma\varepsilon\\) with \\(\alpha^2 + \sigma^2 = 1\\))
+ removes this coupling: the noise level is controlled entirely by the predicted
+ log-SNR, independent of the latent magnitude.
 
 ### 3.2 Variance Expansion Loss
 
 
 loss** inspired by VEL (Li et al., 2026,
 [arXiv:2603.21085](https://arxiv.org/abs/2603.21085)):
 
+ $$\mathcal{L}_\text{var} = -\operatorname{mean}\!\bigl(\log(\sigma^2 + \delta)\bigr)$$
 
+ where \\(\sigma^2\\) is the posterior variance derived from the predicted
+ log-SNR and \\(\delta = 10^{-6}\\) for numerical stability. This loss
+ encourages non-zero posterior variance by penalizing small \\(\sigma^2\\).
 
+ VEL proposes the form \\(1/(\sigma^2 + \delta)\\) for variance expansion. We
+ found this to be too aggressive — the \\(1/\sigma^2\\) gradient pushes variance
+ up very rapidly, leading to excessive high-frequency noise in the latent
+ space. We use the \\(-\log(\sigma^2 + \delta)\\) form instead, which provides
+ a gentler, logarithmic penalty that stabilizes training.
 
 **For this checkpoint:** the variance expansion loss is active with weight
 **1e-5**.
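Under the VP parameterization the posterior variance is σ² = sigmoid(-λ), so the loss is a one-liner. A sketch (the function name is ours):

```python
import numpy as np

def variance_expansion_loss(log_snr, delta=1e-6):
    """-mean(log(sigma^2 + delta)), with sigma^2 = sigmoid(-log_snr)
    the per-element VP posterior variance."""
    var = 1.0 / (1.0 + np.exp(log_snr))  # sigmoid(-lambda)
    return float(-np.mean(np.log(var + delta)))
```

Large predicted logSNR means near-zero variance and a large penalty; the log keeps the gradient bounded compared with the 1/σ² form.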
 
 
 The total training loss is:
 
+ $$\mathcal{L}_\text{total} = \mathcal{L}_\text{recon} + 0.01 \cdot \mathcal{L}_\text{semantic} + 10^{-4} \cdot \mathcal{L}_\text{scale} + 10^{-5} \cdot \mathcal{L}_\text{var}$$
 
 | Loss | Weight | Description |
 |------|--------|-------------|
+ | \\(\mathcal{L}_\text{recon}\\) | 1.0 | SiD2 sigmoid-weighted x-prediction MSE (\\(b = -2.0\\)). Per-pixel \\((\hat{x}_0 - x_0)^2\\) averaged over (C, H, W) per sample, multiplied by \\(w(t) = -\tfrac{1}{2} \tfrac{d\lambda}{dt} e^b \sigma(\lambda - b)\\), then averaged over the batch |
+ | \\(\mathcal{L}_\text{semantic}\\) | 0.01 | Per-token \\(1 - \cos(\text{student}, \text{teacher})\\) averaged over all tokens and batch (see §4) |
+ | \\(\mathcal{L}_\text{scale}\\) | 0.0001 | Per-channel variance \\(\text{var}_c\\) estimated over (B, H, W), then \\((\log(\text{var}_c + \varepsilon) - \log(\text{target}))^2\\) averaged over channels. Target variance = 1.0 |
+ | \\(\mathcal{L}_\text{var}\\) | 1e-5 | Per-element \\(-\log(\sigma^2 + \delta)\\) where \\(\sigma^2\\) is the posterior variance, averaged over all dims (B, C, H, W). See §3.2 |
 
 **Note on loss scales:** The decoder reconstruction loss has a small
 effective magnitude due to the SiD2 VP x-prediction weighting (the Jacobian
+ \\(d\lambda/dt\\) and sigmoid weighting compress the per-sample loss scale). As a
 result, all auxiliary loss weights must be kept correspondingly small to
 avoid dominating the reconstruction objective.
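Of the auxiliary terms in the table, the semantic alignment loss is simple enough to sketch. A minimal NumPy illustration (the array names and the (B, N, D) token layout are our assumptions, not the repository's):

```python
import numpy as np

def semantic_alignment_loss(student, teacher, eps=1e-8):
    """Per-token 1 - cosine(student, teacher), averaged over tokens and batch.

    student, teacher: (B, N, D) arrays of token features."""
    s = student / (np.linalg.norm(student, axis=-1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))
```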