krystv committed on
Commit 20523ee · verified · 1 Parent(s): c2b4760

Update README with VAE integration and verified datasets

Files changed (1)
  1. README.md +58 -51
README.md CHANGED
@@ -9,84 +9,92 @@ LiquidDiffusion is a **first-of-its-kind** image generation model that replaces
  ### Key Properties
  - ✅ **Zero attention layers** — fully convolutional + liquid time-gating
  - ✅ **Fully parallelizable** — no ODE solvers, no sequential scanning, no recurrence
- - ✅ **Pretrained VAE** — uses `stabilityai/sd-vae-ft-mse` for efficient latent-space training
  - ✅ **Fits 16GB VRAM** — tiny config runs 256px at batch=8 on T4 GPU
  - ✅ **Simple training** — Rectified Flow (MSE velocity prediction, no noise schedule)
- - ✅ **6 verified datasets** ready to use

- ## Quick Start

- Open the Colab notebook, pick your dataset from the dropdown, and run all cells:
-
- **`LiquidDiffusion_Training.ipynb`**
-
- ### Verified Datasets (all tested ✓)
-
- | Dataset | Size | Content |
- |---------|------|---------|
- | `nielsr/CelebA-faces` | 202K | Celebrity faces |
- | `huggan/flowers-102-categories` | 8K | Flowers |
- | `reach-vb/pokemon-blip-captions` | 833 | Pokemon art |
- | `huggan/anime-faces` | 21K | Anime faces |
- | `huggan/AFHQv2` | 16K | Cat/dog/wild animals |
- | `Norod78/cartoon-blip-captions` | 2K | Cartoon characters |
 
  ## Architecture

  ```
- Input (noisy latent, 4ch) → Conv Stem
- → Encoder [LiquidDiffusionBlock × N, with downsampling]
- → Bottleneck [LiquidDiffusionBlock × 2]
- → Decoder [LiquidDiffusionBlock × N, with upsampling + skip fusion]
- → Conv Head → Velocity prediction
  ```

- ### VAE Integration
- - **Encoder**: `stabilityai/sd-vae-ft-mse` (83M params, frozen)
- - **Latent space**: 4 channels, 8× spatial downscale
- - **256px image → 32×32×4 latent** (64× fewer pixels to process!)
- - **Pre-caching**: encode the dataset once, then train without the VAE on the GPU (saves ~160MB VRAM)

- ### ParallelCfCBlock (Novel Contribution)
-
- Based on CfC Eq.10: `x(t) = σ(-f·t) ⊙ g + (1 - σ(-f·t)) ⊙ h`

  ```python
- # Three CfC heads from a shared backbone
- gate = sigmoid(time_a(t_emb) * f(features) - time_b(t_emb))
- cfc_out = gate * g(features) + (1 - gate) * h(features)
-
- # Liquid relaxation residual
- α = exp(-softplus(ρ) * |t_emb_mean|)
- output = α * input + (1 - α) * cfc_out
  ```

- **Key insight**: the diffusion timestep `t` *is* the liquid time constant, so the CfC gate naturally adapts to the noise level.

  ## Model Configs

- | Config | Channels | Blocks | Params | 256px VRAM | Best For |
- |--------|----------|--------|--------|------------|----------|
- | tiny | [64, 128, 256] | [2, 2, 4] | ~23M | ~6 GB | Quick experiments, T4 |
- | small | [96, 192, 384] | [2, 3, 6] | ~69M | ~10 GB | Quality 256px, T4/A10G |

- ## Training Objective: Rectified Flow

  ```python
- x_t = (1 - t) * x0 + t * noise        # linear interpolation
- v_target = noise - x0                 # constant velocity
- loss = MSE(model(x_t, t), v_target)   # simple MSE — no noise schedule!
  ```

  ## References

  | Paper | Contribution |
  |-------|-------------|
  | [CfC Networks (Nature MI 2022)](https://arxiv.org/abs/2106.13898) | CfC Eq.10, parallelizable closed-form |
- | [LTC Networks (AAAI 2021)](https://arxiv.org/abs/2006.04439) | Liquid time-constant ODE, stability |
  | [LiquidTAD (2024)](https://arxiv.org/abs/2604.18274) | Parallel liquid relaxation |
  | [USM (CVPR 2025)](https://arxiv.org/abs/2504.13499) | U-Net + SSM for diffusion |
- | [DiffuSSM (2023)](https://arxiv.org/abs/2311.18257) | SSM beats attention in diffusion |
  | [Rectified Flow (ICLR 2023)](https://arxiv.org/abs/2209.03003) | Simple velocity training |

  ## Files
@@ -95,12 +103,11 @@ loss = MSE(model(x_t, t), v_target)   # simple MSE — no noise schedule!
  ├── liquid_diffusion/
  │   ├── __init__.py
  │   ├── model.py                        # Full model architecture
- │   └── trainer.py                      # Rectified Flow trainer + dataset utils
- ├── LiquidDiffusion_Training.ipynb      # Complete Colab notebook (VAE + 6 datasets)
  ├── test_model.py
  └── README.md
  ```

  ## License
-
  MIT

  ### Key Properties
  - ✅ **Zero attention layers** — fully convolutional + liquid time-gating
  - ✅ **Fully parallelizable** — no ODE solvers, no sequential scanning, no recurrence
+ - ✅ **Latent-space training** — uses a pretrained SD-VAE (`stabilityai/sd-vae-ft-mse`, 83.7M params, frozen)
  - ✅ **Fits 16GB VRAM** — tiny config runs 256px at batch=8 on T4 GPU
  - ✅ **Simple training** — Rectified Flow (MSE velocity prediction, no noise schedule)
+ - ✅ **6 verified datasets** — all tested and working, with streaming support

+ ## Quick Start (Colab)

+ 1. Open `LiquidDiffusion_Training.ipynb` in Colab
+ 2. Select a GPU runtime (T4)
+ 3. Pick a dataset from the dropdown (default: `huggan/AFHQv2` — animal faces)
+ 4. Run all cells → training starts, and samples are generated every 500 steps

  ## Architecture

  ```
+ Pixel Image (3×256×256)
+ → [Frozen SD-VAE Encode] → Latent (4×32×32)
+ → [LiquidDiffusion U-Net] → Velocity prediction (4×32×32)
+ → [Frozen SD-VAE Decode] → Generated Image (3×256×256)
  ```

+ Each **LiquidDiffusionBlock** contains:
+ 1. **AdaLN** — timestep conditioning via learned scale/shift
+ 2. **ParallelCfCBlock** — the core liquid neural network layer (CfC Eq.10)
+ 3. **MultiScaleSpatialMix** — 3×3 + 5×5 + 7×7 depthwise convs + global pooling (replaces attention)
+ 4. **FeedForward** — channel mixing via 1×1 conv
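
As an illustrative aside (not the repository's actual code), the AdaLN step in item 1 can be sketched in NumPy: normalize the feature map, then modulate it with a timestep-conditioned scale and shift. All names and shapes below are hypothetical.

```python
import numpy as np

def adaln_modulate(x, t_emb, W_scale, W_shift, eps=1e-5):
    """Illustrative AdaLN: normalize across channels, then apply a
    timestep-conditioned scale/shift (all names here are hypothetical)."""
    mu = x.mean(axis=0, keepdims=True)           # per-pixel channel mean
    var = x.var(axis=0, keepdims=True)           # per-pixel channel variance
    x_norm = (x - mu) / np.sqrt(var + eps)
    scale = W_scale @ t_emb                      # (C,) scale derived from t_emb
    shift = W_shift @ t_emb                      # (C,) shift derived from t_emb
    return x_norm * (1 + scale)[:, None, None] + shift[:, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 4))                   # (C, H, W) feature map
t_emb = rng.normal(size=16)                      # timestep embedding
out = adaln_modulate(x, t_emb,
                     W_scale=rng.normal(size=(8, 16)) * 0.01,
                     W_shift=rng.normal(size=(8, 16)) * 0.01)
print(out.shape)  # (8, 4, 4)
```

With `W_scale` and `W_shift` zero-initialized, the layer reduces to plain normalization at the start of training.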

+ ### The ParallelCfC Block

  ```python
+ # CfC Eq.10 adapted for images:
+ gate = σ(time_a(t_emb) · f(features) - time_b(t_emb))   # liquid time-gating
+ out = gate · g(features) + (1 - gate) · h(features)     # CfC interpolation
+ α = exp(-λ · |t_emb|)                                   # liquid relaxation
+ output = α · input + (1 - α) · out                      # time-aware residual
  ```
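
To make the pseudocode above concrete, here is a minimal NumPy sketch (an illustration, not the repo's implementation): `f`, `g`, `h` are taken as per-channel linear maps, and `time_a`, `time_b`, `lam` as fixed vectors rather than learned functions of the timestep embedding.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def parallel_cfc(x, t_emb, f, g, h, time_a, time_b, lam):
    """CfC Eq.10 gating over image features (illustrative shapes).
    x: (C, H, W) features; t_emb: scalar summary of the timestep embedding;
    f, g, h: (C, C) channel-mixing maps; time_a, time_b, lam: (C,) vectors."""
    def mix(M, v):  # apply a (C, C) channel-mixing map at every pixel
        return np.einsum("oc,chw->ohw", M, v)
    gate = sigmoid(time_a[:, None, None] * mix(f, x) - time_b[:, None, None])
    cfc_out = gate * mix(g, x) + (1.0 - gate) * mix(h, x)  # CfC interpolation
    alpha = np.exp(-lam * abs(t_emb))[:, None, None]       # liquid relaxation
    return alpha * x + (1.0 - alpha) * cfc_out             # time-aware residual

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
x = rng.normal(size=(C, H, W))
out = parallel_cfc(
    x, t_emb=0.5,
    f=rng.normal(size=(C, C)), g=rng.normal(size=(C, C)), h=rng.normal(size=(C, C)),
    time_a=rng.normal(size=C), time_b=rng.normal(size=C),
    lam=np.abs(rng.normal(size=C)),  # kept positive so alpha stays in (0, 1]
)
print(out.shape)  # (4, 8, 8)
```

Every operation is elementwise or a 1×1 channel mix, which is what makes the block fully parallel across pixels and timesteps.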

+ ## Verified Datasets

+ All tested and working, with streaming support:

+ | Dataset | Images | Description | Native Resolution |
+ |---------|--------|-------------|-------------------|
+ | `huggan/AFHQv2` | 16K | Animal faces (cats, dogs, wildlife) | 512×512 |
+ | `nielsr/CelebA-faces` | 202K | Celebrity faces | 178×218 |
+ | `huggan/flowers-102-categories` | 8K | Flower photographs | Variable |
+ | `reach-vb/pokemon-blip-captions` | 833 | Pokemon illustrations | 1280×1280 |
+ | `huggan/anime-faces` | 63K | Anime faces | 64×64 |
+ | `Norod78/cartoon-blip-captions` | ~3K | Cartoon characters | 512×512 |

+ ## VAE

+ Uses **stabilityai/sd-vae-ft-mse** (83.7M params, frozen during training):
+ - 4 latent channels, 8× spatial downscale
+ - PSNR 27.3 on LAION-Aesthetics (strong reconstruction quality)
+ - ~160MB VRAM in fp16
+ - Scaling factor: 0.18215
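
A quick sanity check on these numbers (a standalone sketch, not repo code):

```python
# SD-VAE bookkeeping: 4 latent channels, 8x spatial downscale.
img_h = img_w = 256
downscale = 8
lat_ch, lat_h, lat_w = 4, img_h // downscale, img_w // downscale
print((lat_ch, lat_h, lat_w))              # (4, 32, 32)
print((img_h * img_w) // (lat_h * lat_w))  # 64 (64x fewer spatial positions)

# Per Stable Diffusion convention, latents are multiplied by the scaling
# factor after encoding and divided by it again before decoding.
scaling_factor = 0.18215
```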

  ## Model Configs

+ | Config | Params | 256px VRAM (w/ VAE) | 512px VRAM |
+ |--------|--------|---------------------|------------|
+ | tiny | ~23M | ~6 GB | ~12 GB |
+ | small | ~69M | ~10 GB | ~20 GB |
+ | base | ~154M | ~16 GB | ~30 GB |

+ ## Training

+ **Objective**: Rectified Flow — simple MSE on velocity

  ```python
+ x_t = (1 - t) · x0 + t · noise        # linear interpolation
+ v_target = noise - x0                 # constant velocity
+ loss = MSE(model(x_t, t), v_target)   # that's it!
  ```
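
The snippet above runs almost verbatim; here is a self-contained NumPy version with a zero-predicting placeholder in place of the network (the `model` below is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=(2, 4, 32, 32))   # clean latents (batch of 2)
noise = rng.normal(size=x0.shape)
t = rng.uniform(size=(2, 1, 1, 1))     # per-sample timestep in [0, 1]

x_t = (1 - t) * x0 + t * noise         # linear interpolation
v_target = noise - x0                  # constant velocity along the path

def model(x, t):
    # Placeholder network (illustrative): always predicts zero velocity.
    return np.zeros_like(x)

loss = np.mean((model(x_t, t) - v_target) ** 2)   # plain MSE, no schedule
print(loss > 0)  # True (the zero model is wrong everywhere)
```

A network that output exactly `noise - x0` would drive this loss to zero.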

+ **Sampling**: Euler ODE integration, 25-50 steps
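
A minimal Euler integrator for this ODE might look as follows (a sketch under the assumption that the model predicts the velocity `noise - x0`; the notebook's actual sampler may differ). Because rectified-flow paths are straight, Euler with the exact constant velocity recovers the data regardless of step count:

```python
import numpy as np

def euler_sample(velocity, x1, steps=25):
    """Integrate dx/dt = v(x, t) from t=1 (pure noise) down to t=0 (data)."""
    x = x1.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        x = x - dt * velocity(x, t)   # one Euler step toward t=0
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 32, 32))     # the "data" latent we hope to recover
x1 = rng.normal(size=x0.shape)        # pure noise at t=1

oracle = lambda x, t: x1 - x0         # exact rectified-flow velocity
x_hat = euler_sample(oracle, x1, steps=25)
print(np.allclose(x_hat, x0))  # True: straight paths make Euler exact
```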

  ## References

  | Paper | Contribution |
  |-------|-------------|
  | [CfC Networks (Nature MI 2022)](https://arxiv.org/abs/2106.13898) | CfC Eq.10, parallelizable closed-form |
+ | [LTC Networks (AAAI 2021)](https://arxiv.org/abs/2006.04439) | Liquid time-constant ODE |
  | [LiquidTAD (2024)](https://arxiv.org/abs/2604.18274) | Parallel liquid relaxation |
  | [USM (CVPR 2025)](https://arxiv.org/abs/2504.13499) | U-Net + SSM for diffusion |
+ | [DiffuSSM (2023)](https://arxiv.org/abs/2311.18257) | SSM replaces attention in diffusion |
  | [Rectified Flow (ICLR 2023)](https://arxiv.org/abs/2209.03003) | Simple velocity training |

  ## Files

  ├── liquid_diffusion/
  │   ├── __init__.py
  │   ├── model.py                        # Full model architecture
+ │   └── trainer.py                      # Trainer + dataset utilities
+ ├── LiquidDiffusion_Training.ipynb      # Complete Colab notebook
  ├── test_model.py
  └── README.md
  ```

  ## License
  MIT