Dose-Response C1 (8M-0%) with SafeCLIP: All unsafe removed
Text-encoder ablation using SafeCLIP (77 tokens) in place of the default T5-Gemma-2B (256 tokens).
Condition
| | |
|---|---|
| Label | C1 (8M-0%) |
| Description | All unsafe images removed (N kept approximately fixed). |
| Training set size N | 7.94M |
| Unsafe fraction p | 0% |
| Unsafe count U | 0 |
Architecture
| | |
|---|---|
| Class | PRX (rectified-flow DiT) |
| Hidden size | 1792 |
| Depth | 16 |
| Heads | 28 |
| MLP ratio | 3.5 |
| Patch size | 32 px |
| Bottleneck | 256 |
| Resolution | 512×512 |
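The architecture values above imply a few derived quantities that are handy when sanity-checking a re-implementation. The sketch below only does the arithmetic. It assumes that the 32 px patches are cut directly from the 512×512 RGB image and that the MLP width is `hidden_size * mlp_ratio`. Neither assumption is confirmed by this card, so treat config.yaml as authoritative.

```python
# Derived quantities from the architecture table (arithmetic only).
hidden_size = 1792
num_heads = 28
mlp_ratio = 3.5
patch_size = 32    # pixels; assumed to apply directly to the input image
resolution = 512

# Per-head width of the attention layers.
head_dim = hidden_size // num_heads

# Assumed MLP inner width (hidden_size * mlp_ratio).
mlp_hidden = int(hidden_size * mlp_ratio)

# Number of image tokens per sample under the patching assumption.
patches_per_side = resolution // patch_size
num_image_tokens = patches_per_side ** 2

# Raw values per patch, assuming 3 colour channels.
patch_input_dim = patch_size * patch_size * 3

print(head_dim, mlp_hidden, num_image_tokens, patch_input_dim)
# prints: 64 6272 256 3072
```

Note that 1792 divides evenly by 28, so every head has the same width of 64.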
Text encoder
| | |
|---|---|
| Model | aimagelab/safeclip_vit-l_14 |
| Max prompt tokens | 77 |
| Dtype | bfloat16 |
Diffusion scheduler
| | |
|---|---|
| Type | x-prediction flow matching |
| Train timesteps | 1000 |
| Timestep shift | 3.0 |
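This card does not spell out how the timestep shift of 3.0 is applied. One common convention in flow-matching models remaps a uniform time t in [0, 1] as t' = s·t / (1 + (s − 1)·t). The sketch below illustrates that convention only. The function name `shift_timestep` is hypothetical, and whether PRX uses exactly this formula is an assumption to check against config.yaml and the framework code.

```python
def shift_timestep(t: float, shift: float = 3.0) -> float:
    """Remap t in [0, 1] with the commonly used shift formula.

    Assumption: this is one widespread convention, not necessarily
    the exact mapping used by the PRX scheduler.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return shift * t / (1.0 + (shift - 1.0) * t)


# The endpoints are fixed; interior points move toward t = 1.
for t in (0.0, 0.25, 0.5, 1.0):
    print(t, shift_timestep(t))
```

With a shift above 1, the midpoint t = 0.5 maps to 0.75, so more of the sampled timesteps land near t = 1. Which end of the interval corresponds to pure noise depends on the scheduler's convention.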
Training
| | |
|---|---|
| Iterations | 100,000 |
| Samples seen | ~25.60M |
| Global batch size | 256 |
| Microbatch (per GPU) | 32 |
| Hardware | 8× NVIDIA H200 |
| Precision | bfloat16 (amp_bf16) |
| Optimizer (transformer blocks) | Muon (lr=1e-4, momentum=0.95, nesterov, ns_steps=5, weight_decay=0) |
| Optimizer (other params) | AdamW (lr=1e-4, β=(0.9, 0.95), eps=1e-8, weight_decay=0) |
| LR schedule | 1,000-step linear warmup, constant after |
| EMA | decay 0.999, started at step 0 |
| Random seed | 42 |
| Trainer | Composer + FSDP |
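The training numbers above are mutually consistent, and the schedule and EMA settings are simple enough to sketch. The snippet below is a plain-Python illustration, not the PRX or Composer implementation. The helper names `lr_at` and `ema_update` are hypothetical, and the warmup is assumed to ramp linearly from zero, which this card does not state.

```python
# Batch accounting from the training table.
iterations = 100_000
global_batch = 256
microbatch_per_gpu = 32
num_gpus = 8
train_set_size = 7.94e6  # N for this condition

samples_seen = iterations * global_batch
# One microbatch per GPU already fills the global batch.
grad_accum_steps = global_batch // (microbatch_per_gpu * num_gpus)
# Approximate passes over the data, ignoring any oversampling.
approx_epochs = samples_seen / train_set_size


def lr_at(step: int, base_lr: float = 1e-4, warmup_steps: int = 1000) -> float:
    """Linear warmup followed by a constant rate.

    Assumption: the ramp starts at 0 for step 0.
    """
    return base_lr * min(1.0, step / warmup_steps)


def ema_update(ema: float, param: float, decay: float = 0.999) -> float:
    """One exponential-moving-average step for a single scalar weight."""
    return decay * ema + (1.0 - decay) * param


print(samples_seen, grad_accum_steps, round(approx_epochs, 2))
# prints: 25600000 1 3.22
print(lr_at(0), lr_at(500), lr_at(50_000))
```

An EMA decay of 0.999 corresponds to an averaging horizon of roughly 1 / (1 − 0.999) = 1,000 steps, the same order as the warmup length.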
Training data sources
The training set combines three image datasets, with per-condition filtering and oversampling.
Files
denoiser.pt — Consolidated EMA-denoiser checkpoint
config.yaml — Full training configuration
Framework
Trained with the PRX framework (Composer + FSDP). The full config.yaml is included for reproducibility.