Split SKILL.md into SKILL.md (rules) + LEARNING.md (stories/mistakes) + TODO.md (next steps)
SKILL.md (CHANGED)

@@ -1,453 +1,140 @@
---
name: paper-reproduction
description: "Skill for reproducing ML research papers from scratch when no official code exists. Use this whenever a user asks to implement, reproduce, or replicate a paper, especially papers involving novel loss functions, custom training loops, or non-standard architectures that aren't covered by existing HF trainers. Also use when the user mentions 'paper reproduction', 'implement this paper', 'no official code', or describes a method from a specific arxiv paper. Covers: reading papers systematically, extracting hyperparameters, building custom training pipelines, handling library-specific gotchas."
---

# Paper Reproduction Skill

## When to use this skill

- User wants to reproduce/implement an ML paper
- No official code repository exists
- The paper uses custom training loops, novel losses, or non-standard architectures
- The method doesn't fit neatly into existing HF Trainer abstractions (SFT, DPO, GRPO)

---

## Phase 1: Paper Reading

### Extraction checklist

```
□ Loss function – exact math, every symbol defined
□ Architecture – layers, dims, activations, normalization
□ Optimizer – type, lr, betas, weight decay, scheduler
□ Batch size – for each phase/component separately
□ Training iterations – for each phase/component
□ Dataset preprocessing – normalization range, image size, augmentation
□ Evaluation protocol – metrics, number of samples, special setup
□ Hyperparameters per experiment – papers often have different configs per dataset
□ Algorithm pseudocode – follow exactly before improvising
□ GPU hardware used – what the authors trained on (often buried in appendix)
□ Training time – how long did the authors' runs take?
```

### Mistake I made: Incomplete appendix reading

I extracted most hyperparameters correctly from the NSGF++ paper but missed a critical detail about how geomloss handles image tensors. The paper says "GeomLoss package" but doesn't spell out that images must be flattened to (N, D) format for the `SamplesLoss` API. This caused the MNIST and CIFAR-10 experiments to crash immediately on GPU.

**Lesson**: When a paper references a specific library, read that library's documentation and test its API with the exact tensor shapes you'll use BEFORE writing the full pipeline.

---
## Phase 2: Library API Verification

### CRITICAL: Test third-party library APIs with your actual tensor shapes

This is the single biggest mistake pattern in paper reproduction. You read the paper, understand the math, implement everything – then it crashes because a library function expects `(N, D)` but you passed `(N, C, H, W)`.

**The rule**: Before building ANY training loop that uses a third-party library (geomloss, POT, torchsde, torchdiffeq, etc.), write a 10-line test script:

```python
import torch
from geomloss import SamplesLoss

# Test with EXACT shapes you'll use in training
loss_fn = SamplesLoss(loss="sinkhorn", p=2, blur=0.5, potentials=True)

# 2D case – works fine
x_2d = torch.randn(256, 2, requires_grad=True)
y_2d = torch.randn(256, 2)
F, G = loss_fn(x_2d, y_2d)  # ✅ OK

# Image case – THIS CRASHES
x_img = torch.randn(128, 1, 28, 28, requires_grad=True)
y_img = torch.randn(128, 1, 28, 28)
# F, G = loss_fn(x_img, y_img)  # ❌ ValueError: must be (N,D) or (B,N,D)

# Image case – FIXED by flattening
B = x_img.shape[0]
x_flat = x_img.view(B, -1)  # the view keeps gradient tracking; no requires_grad_ needed
y_flat = y_img.view(B, -1)
F, G = loss_fn(x_flat, y_flat)  # ✅ OK
```

### Mistake I made: geomloss tensor shape assumption

The `SamplesLoss` in geomloss requires inputs as `(N, D)` or `(B, N, D)` tensors. For 2D experiments with shape `(256, 2)` this works perfectly. For images with shape `(128, 1, 28, 28)` it crashes with:

```
ValueError: Input samples 'x' and 'y' should be encoded as (N,D) or (B,N,D) (batch) tensors.
```

**The fix**: Flatten images before passing to geomloss, reshape gradients back after. This pattern – flatten before library call, reshape after – applies to many optimal transport libraries (POT, geomloss, ott-jax).

---

## Phase 3: Architecture

### UNet skip connections

**The pattern that works**:
1. During the downward pass, push every intermediate activation onto a `skips` list
2. During the upward pass, pop from `skips` and concatenate
3. The number of pops must EXACTLY equal the number of pushes

**Mistake pattern**: Using a helper like `_get_num_res_blocks()` that infers block count from module list lengths. This is fragile – if the number of levels or blocks per level varies, the inference breaks.

**Better approach**: Store `num_res_blocks` as an instance variable at init time and use it directly.

### GroupNorm channel requirements

`nn.GroupNorm(32, channels)` requires `channels` to be divisible by 32. For small models (e.g., MNIST with `model_channels=32`), this is fine at the first level but may break at deeper levels if `channel_mult` creates channels not divisible by 32. A guard at construction time catches this early; see the sketch below.
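
One way to guard against this, as a minimal sketch (the `make_norm` helper name is hypothetical, not from the project code): pick the largest group count that actually divides the channel count instead of hard-coding 32.

```python
import torch.nn as nn

def make_norm(channels: int, max_groups: int = 32) -> nn.GroupNorm:
    # Hypothetical helper: choose the largest group count <= max_groups that
    # divides `channels`, so deeper levels with odd channel counts don't crash.
    for groups in range(min(max_groups, channels), 0, -1):
        if channels % groups == 0:
            return nn.GroupNorm(groups, channels)
    raise ValueError(f"no valid group count for channels={channels}")

norm = make_norm(48)  # 48 is not divisible by 32, so this returns GroupNorm(24, 48)
```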

---

## Phase 4: VRAM Estimation and GPU Memory

### Estimate VRAM BEFORE running – not after OOM

Papers report batch sizes that worked on their hardware (often A100 80GB or 8×V100). If your user has a T4 (16GB) or even a T4×2 (16GB per GPU, but single-GPU code only uses one), you must recalculate whether the paper's configs will fit.

### The Sinkhorn VRAM trap

The `tensorized` backend in geomloss computes a full N×N cost matrix. For N samples of dimension D:

- Memory ≈ O(N² × D) for the cost matrix + intermediate Sinkhorn iterations
- With `potentials=True` and `autograd.grad`, add another O(N × D) for gradient storage

**Concrete examples (fp32, single Sinkhorn call)**:

| N (batch) | D (flattened dim)    | Approx VRAM per call |
|-----------|----------------------|----------------------|
| 256       | 2 (2D points)        | ~1 MB                |
| 256       | 784 (MNIST 28×28)    | ~200 MB              |
| 128       | 3072 (CIFAR 3×32×32) | ~600 MB              |

But pool building calls Sinkhorn **twice per step** (self-potential + cross-potential) × **5 flow steps per batch** = 10 Sinkhorn calls per pool batch. With autograd overhead, 128×3072 easily eats 8+ GB, leaving no room for the 38M-param UNet on a 16GB T4.

**Mistake I made**: Used the paper's `sinkhorn.batch_size=128` for CIFAR-10. This OOMed immediately on T4. The paper's authors likely used A100s.

```yaml
# Paper config (A100 80GB):
sinkhorn.batch_size: 128
pool.num_batches: 2500
# Total pool entries: 128 × 2500 × 5 = 1.6M

# T4 16GB config:
sinkhorn.batch_size: 32
pool.num_batches: 10000
# Total pool entries: 32 × 10000 × 5 = 1.6M (same!)
```

Add a CLI override (`--sinkhorn-batch`) so users can tune without editing config files. The estimator sketched below shows the arithmetic behind these numbers.
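
To make the estimate concrete before a run, a back-of-envelope sketch (the ×10 overhead factor is this skill's own rough allowance for Sinkhorn iterations and autograd buffers, not a measured constant):

```python
def sinkhorn_vram_gb(n: int, d: int, overhead: float = 10.0) -> float:
    # Tensorized Sinkhorn materializes O(N^2 x D) fp32 intermediates (4 bytes each).
    return n * n * d * 4 * overhead / 1e9

print(sinkhorn_vram_gb(128, 3072))  # ~2.0 GB per call: 10 calls/pool batch won't fit beside a UNet on 16 GB
print(sinkhorn_vram_gb(32, 3072))   # ~0.13 GB per call: comfortable on a T4
```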

### Always call `torch.cuda.empty_cache()` between phases

Pool building uses GPU for Sinkhorn computation. Training uses GPU for the neural network. These are different memory patterns. After pool building, the Sinkhorn computation graph is no longer needed – but PyTorch's CUDA allocator may still hold that memory. Explicitly free it:

```python
def build_trajectory_pool(self):
    # ... build pool ...
    if self.device != "cpu":
        torch.cuda.empty_cache()  # Free Sinkhorn memory before training
    self.pool.finalize()
```

### Multi-GPU ≠ automatic parallelism

If the user has a T4×2 on Kaggle, your single-GPU code will only use ONE of the two GPUs. The second sits idle. Using both requires PyTorch DDP or model parallelism – which is a significant code change.

**Don't silently assume multi-GPU works.** Document this:

```
NOTE: This code uses a single GPU. If you have T4×2, only one GPU is used.
A single T4 (16GB) is sufficient – the second GPU is wasted without DDP.
```

### Pool memory lives on CPU

- MNIST: 1.92M entries × 784 dims × 4 bytes = **6 GB** on CPU
- CIFAR: 1.6M entries × 3072 dims × 4 bytes = **19.6 GB** on CPU

---

## Phase 5: Smoke Testing

### Test 2D experiments first with tiny configs

```bash
# Tiny run – should complete in <30 seconds
python main.py --experiment 2d --dataset 8gaussians --steps 5 --pool-batches 5 --train-iters 100

# Slightly larger – should complete in <5 minutes
python main.py --experiment 2d --dataset 8gaussians --steps 5 --pool-batches 20 --train-iters 2000
```

### Test image experiments separately with minimal configs

```bash
# MNIST smoke test – 2 pool batches, 5 training iters per phase
python main.py --experiment mnist --pool-batches 2 --train-iters 5

# If this crashes, fix before scaling up
```

**Mistake I made**: I tested 2D experiments thoroughly on CPU (both tiny and medium runs worked) but shipped the image experiments without testing them at all. The geomloss tensor shape bug affected ONLY the image path, so 2D success gave false confidence. The first GPU test of MNIST crashed immediately.

**Rule**: Test EVERY experiment type, not just the simplest one. If you have `{2d, mnist, cifar10}` experiments, test all three with minimal configs before declaring the code ready.

### Test all training phases, not just the first one

Even after fixing Phase 1, Phase 2 can still crash due to shared state (see the DataLoader trap in Phase 6). Run with `--train-iters 5 --pool-batches 2` to verify all 3 phases complete without errors. This takes <60 seconds on CPU for MNIST.

---

## Phase 6: Shared State Across Phases

### The DataLoader batch size trap

If a shared data provider caches a DataLoader built with one phase's batch size, the next phase silently receives batches of the wrong size:

```
RuntimeError: The size of tensor a (128) must match the size of tensor b (256) at non-singleton dimension 0
```

**The fix**: Track the batch size and recreate the DataLoader when it changes:

```python
def sample_target(self, n, device="cpu"):
    if not hasattr(self, "_loader") or self._batch_size != n:
        self._batch_size = n
        self._loader = get_image_dataloader(self.dataset_name, batch_size=n, train=True)
        self._iter = iter(self._loader)
    # ... sample from self._iter ...
```

**General rule**: When sharing a data provider across multiple consumers with different batch sizes, NEVER cache a DataLoader with a fixed batch size. Either recreate it on batch size change, or provide raw dataset access and let each consumer create its own DataLoader.

---

## Phase 7: Checkpointing

### Why this matters

Paper reproduction often requires training runs that exceed a single GPU session. Kaggle gives 9 hours per T4 session. MNIST NSGF++ with full paper config (100K+100K+40K iters) needs ~7-8 hours on T4 – tight. CIFAR-10 (200K+200K+40K) is impossible in one session.

Without checkpointing, a Kaggle timeout = all progress lost.

### Phase-level checkpointing

For multi-phase training, save a checkpoint after EACH phase completes:

```python
# After Phase 1 completes:
torch.save({
    "nsgf_model_state": nsgf_model.state_dict(),
    "phase": 1,
}, "checkpoints/phase1_complete.pt")

# After Phase 2 completes:
torch.save({
    "nsgf_model_state": nsgf_model.state_dict(),
    "nsf_model_state": nsf_model.state_dict(),
    "phase": 2,
}, "checkpoints/phase2_complete.pt")
```

Then implement `--resume-phase N` that loads the phase N-1 checkpoint and skips completed phases:

```bash
# Session 1: Run Phase 1 (gets interrupted or completes)
python main.py --experiment mnist

# Session 2: Skip Phase 1, start Phase 2
python main.py --experiment mnist --resume-phase 2

# Session 3: Skip Phases 1+2, run Phase 3 + inference
python main.py --experiment mnist --resume-phase 3
```

### Step-level checkpointing

Within a long phase, save every N steps so an interruption costs at most N steps of progress:

```python
if (step + 1) % checkpoint_every == 0:
    torch.save({
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "step": step + 1,
    }, "checkpoints/nsgf_checkpoint.pt")
```

### Important: checkpoint persistence on Kaggle

1. Save checkpoints to `/kaggle/working/nsgf-plusplus/checkpoints/`
2. Before session ends, commit the notebook output or copy checkpoints to a dataset
3. In the new session, restore checkpoints before running `--resume-phase`

---

## Phase 8: Debugging on GPU

### Common error patterns

| Error | Cause | Fix |
|-------|-------|-----|
| `ValueError: (N,D) or (B,N,D)` | Library expects flat tensors, got images | Flatten before library call |
| `RuntimeError: size of tensor a (X) must match size of tensor b (Y)` | Shared DataLoader with wrong batch size | Recreate DataLoader when batch size changes |
| `RuntimeError: shape mismatch` in UNet | Skip connection count wrong | Count pushes and pops manually |
| `CUDA OOM` during pool building (Sinkhorn) | Sinkhorn batch too large for GPU | Reduce `--sinkhorn-batch` (e.g. 128→32) |
| `CUDA OOM` during training | Training batch too large or model too big | Reduce training batch, increase grad accum |
| `CUDA OOM` at phase transition | Memory not freed between phases | Add `torch.cuda.empty_cache()` + `del pool` |
| Training loss plateaus high | Pool too small or too few iterations | Increase pool batches, more iters |
| W2 distance too high | Undertrained model | Full paper config: 200 batches, 20k iters |
| Only 1 of 2 GPUs used | Code is single-GPU, no DDP | Expected – use single GPU or add DDP |
| `KeyboardInterrupt` mid-training | Training too long at scale | Check `checkpoints/` for latest save |

### When the user runs on their hardware

If you're developing code that the user will run on their own GPU (Kaggle, Colab, local):

1. **Provide exact copy-paste commands**
2. **Document expected runtimes per hardware tier**
3. ...
4. **Document GPU requirements** – "MNIST fits on T4 16GB, CIFAR-10 needs `--sinkhorn-batch 32`"
5. **Document multi-GPU limitations** – "Single-GPU only. T4×2 wastes the second GPU."
6. **Test the exact commands yourself** – if you can't run on GPU, at least verify the command parses correctly on CPU

---

## Mistake Catalog

### Mistakes made during NSGF++ reproduction

1. **geomloss tensor shape bug** (CRITICAL)
   - **What**: `SamplesLoss` requires `(N,D)` tensors. Image experiments passed `(N,C,H,W)`.
   - **Impact**: MNIST and CIFAR-10 experiments crash immediately. 2D works fine, hiding the bug.
   - **Root cause**: Only tested 2D path. Didn't verify library API with image tensor shapes.
   - **Prevention**: Write a standalone API test script for every third-party library, testing with ALL tensor shapes you'll use.

2. **TrajectoryPool sampling performance** (MODERATE)
   - **What**: `torch.cat` called on entire pool every training step.
   - **Impact**: Training slower than necessary. At 512K pool entries, the cat+index is the bottleneck (~0.5s per step vs ~0.05s for the actual forward/backward).
   - **Root cause**: Didn't profile the training loop.
   - **Prevention**: Pre-concatenate the pool after building it. Profile before shipping.

3. **Incomplete experiment testing** (CRITICAL)
   - **What**: Tested 2D experiments only. Shipped MNIST/CIFAR untested.
   - **Impact**: User's first GPU run crashes. Wasted their Kaggle session time.
   - **Root cause**: False confidence from 2D success. Assumed same code path.
   - **Prevention**: Test EVERY experiment type with minimal configs. Different experiment types often exercise different code paths.

4. **No checkpoint saving** (MODERATE – became CRITICAL at scale)
   - **What**: No intermediate checkpoints during long training runs.
   - **Impact**: If training is interrupted (Kaggle timeout, OOM, accidental Ctrl+C), all progress is lost. MNIST full run is ~7 hours – losing that is devastating.
   - **Prevention**: Save checkpoints every N iterations. Save after each phase. Implement `--resume-phase` flag. Test resume actually works.

5. **UNet forward pass fragility** (LOW-MODERATE)
   - **What**: `_get_num_res_blocks()` infers block count from module list length division.
   - **Impact**: Could break silently with non-standard configs.
   - **Prevention**: Store config values as instance variables, don't infer from module counts.

6. **DataLoader batch size mismatch across phases** (CRITICAL)
   - **What**: Shared `DatasetLoader` caches a DataLoader with batch_size=256 from Phase 1. Phase 2 requests batch_size=128 but gets 256 back – tensor dimension mismatch crash.
   - **Impact**: Phase 2 (NSF) crashes immediately even after Phase 1 completes successfully.
   - **Root cause**: Lazy initialization pattern without invalidation.
   - **Prevention**: When sharing stateful objects across consumers with different configs, track all cached parameters and invalidate on change.

7. **CLI flag not overriding all training phases** (LOW)
   - **What**: `--train-iters` flag overrode NSGF and NSF iterations but NOT the phase predictor iterations (40,000 default). Smoke tests would hang on Phase 3 even with `--train-iters 5`.
   - **Impact**: Tests take much longer than expected.
   - **Root cause**: Forgot that 3-phase training means 3 iteration counts to override.
   - **Prevention**: When adding a CLI override, grep the config for ALL fields it should affect.

8. **CIFAR-10 Sinkhorn OOM on T4** (CRITICAL)
   - **What**: Paper uses `sinkhorn.batch_size=128` for CIFAR. Sinkhorn on 128 × 3072-dim (flattened 3×32×32) with `tensorized` backend computes a 128×128 cost matrix with 3072-dim vectors, plus autograd for potentials. This OOMs on T4 16GB during pool building.
   - **Impact**: CIFAR-10 experiment crashes before even starting training. User loses their Kaggle session.
   - **Root cause**: Used paper's hyperparameters without estimating VRAM for target hardware. Paper authors likely used A100 80GB.
   - **Prevention**: ALWAYS estimate VRAM before running. Sinkhorn with `tensorized` backend is O(N² × D). For CIFAR: 128² × 3072 × 4 bytes × ~10 (overhead) ≈ 2+ GB per call, ×10 calls per pool batch = too much. Reduce N: 32² × 3072 is 16× cheaper. Add `--sinkhorn-batch` CLI flag so users can tune without editing config.

9. **No GPU memory freed between phases** (MODERATE)
   - **What**: After pool building, the Sinkhorn computation graph's CUDA allocations remain cached even though they're no longer needed. Training then starts with less available VRAM.
   - **Impact**: Training phase might OOM even though pool building finished.
   - **Root cause**: PyTorch's CUDA allocator doesn't automatically return memory to the OS.
   - **Prevention**: `torch.cuda.empty_cache()` after pool building completes. Also `del pool` if the pool data was already finalized to separate tensors.

10. **Multi-GPU assumption** (LOW)
    - **What**: User has T4×2 on Kaggle. Code is single-GPU. Second GPU sits idle.
    - **Impact**: User pays for 2 GPUs but only uses 1. They might think the code is broken.
    - **Root cause**: Didn't document single-GPU limitation.
    - **Prevention**: Document GPU requirements explicitly. If multi-GPU is needed, implement DDP – but that's a significant scope change, so discuss with user first.

## Pre-flight checklist
```
□ All experiment types tested
□ All training phases tested end-to-end
□ Library APIs tested with exact tensor shapes per experiment
□ Shared state across phases verified
□ CLI flags override ALL relevant config values
□ VRAM estimated for target hardware
□ Training loop profiled – no O(N) operations per step where O(1) suffices
□ Memory estimated per experiment (pool size × data dim × 4 bytes)
□ Checkpointing implemented: every N steps + after each phase
□ --resume-phase tested and working (load checkpoint → skip phases → continue)
□ Clear CLI with sensible defaults and override flags for GPU-sensitive params
□ Expected runtimes documented per hardware tier
□ Multi-GPU limitations documented
□ Results directory created automatically
□ Requirements.txt includes ALL dependencies with minimum versions
```

## Principles

1. **Read the appendix first.** The appendix contains the actual implementation details. The main paper is the story; the appendix is the recipe.

2. **Test the boundaries, not just the happy path.** If your code handles 2D, MNIST, and CIFAR-10, test all three. The bug is always in the path you didn't test.

3. **Library APIs are opaque until tested.** Don't assume a function accepts your tensor shape just because it "makes sense." Write a 10-line test script.

4. **Pre-concatenate, don't re-concatenate.** Any data structure that's built once and sampled many times should be finalized into a single tensor after building.

8. **Multi-phase training = multiple separate trainers.** Don't try to be clever with a single training loop that switches phases. Each phase is a distinct trainer with its own optimizer. The previous phase's model goes to `eval()`.

9. **Shared objects across phases are landmines.** When a DataLoader, iterator, or cache is shared across training phases, any phase-specific parameter (batch size, number of workers, shuffle mode) can silently break later phases. Either don't share, or implement proper invalidation. Test by running all phases sequentially with different configs per phase.

10. **CLI overrides must be exhaustive.** If your config has N copies of a parameter (one per training phase), your CLI override must touch all N. Grep the config file for the parameter name to find all instances.

11. **Paper hyperparameters assume paper hardware.** If a paper reports batch_size=128 and trained on A100 80GB, that batch size may OOM on your T4 16GB. Always re-derive batch sizes from VRAM constraints, keeping the total samples seen (batch × iterations) the same.

12. **Estimate VRAM before running, not after OOM.** For Sinkhorn: O(N² × D). For model: count parameters × 4 bytes (fp32) × 3 (params + gradients + optimizer). For pool: stored on CPU but sampled minibatch goes to GPU. Write this down before your first GPU run.

---
name: paper-reproduction
description: "Skill for reproducing ML research papers from scratch when no official code exists. Use this whenever a user asks to implement, reproduce, or replicate a paper, especially papers involving novel loss functions, custom training loops, or non-standard architectures that aren't covered by existing HF trainers. Also use when the user mentions 'paper reproduction', 'implement this paper', 'no official code', or describes a method from a specific arxiv paper. Covers: reading papers systematically, extracting hyperparameters, building custom training pipelines, handling library-specific gotchas, VRAM estimation, checkpointing for multi-session training, and iterating on GPU results."
---

# Paper Reproduction Skill

Rules and procedures for reproducing ML research papers from scratch. All concrete mistakes, war stories, and examples live in [LEARNING.md](LEARNING.md). Next steps for this project live in [TODO.md](TODO.md).

---

## 1. Paper Reading

Read methodology sections (3, 4, 5) line by line. Read ALL appendices – they contain the actual recipe.

### Extraction checklist

```
□ Loss function – exact math, every symbol defined
□ Architecture – layers, dims, activations, normalization
□ Optimizer – type, lr, betas, weight decay, scheduler
□ Batch size – for each phase/component separately
□ Training iterations – for each phase/component separately
□ Dataset preprocessing – normalization range, image size, augmentation
□ Evaluation protocol – metrics, number of samples, special setup
□ Hyperparameters per experiment – papers often have different configs per dataset
□ Algorithm pseudocode – follow exactly before improvising
□ GPU hardware used – what the authors trained on (often buried in appendix)
□ Training time – how long did the authors' runs take?
```

---

## 2. Library API Verification

Before building ANY training loop that uses a third-party library (geomloss, POT, torchsde, torchdiffeq, etc.), write a 10-line test script that calls the library with the EXACT tensor shapes you'll use in every experiment. Not just the simplest one – all of them. If you have 2D points, MNIST images, and CIFAR images, test all three shapes.
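
A minimal probe along these lines, sketched for geomloss (the shape list is illustrative; swap in whatever your experiments actually use):

```python
import torch
from geomloss import SamplesLoss

loss_fn = SamplesLoss(loss="sinkhorn", p=2, blur=0.5)

# One entry per experiment type: every shape the pipeline will ever pass in.
shapes = [(256, 2), (128, 1, 28, 28), (128, 3, 32, 32)]

for shape in shapes:
    x, y = torch.randn(*shape), torch.randn(*shape)
    if x.dim() > 2:               # geomloss wants (N, D): flatten image tensors
        x, y = x.flatten(1), y.flatten(1)
    print(shape, "->", loss_fn(x, y).item())  # a crash here is cheap; mid-training it isn't
```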

---

## 3. VRAM Estimation

Estimate VRAM BEFORE running – not after OOM. Paper hyperparameters assume paper hardware.

**Formula for Sinkhorn (tensorized backend):** O(N² × D) per call. Pool building does ~10 calls per batch (2 potentials × 5 flow steps). Add model params × 4 bytes × 3 (params + grads + optimizer states).

**Rule:** If paper used A100 80GB and you have T4 16GB, re-derive batch sizes from VRAM constraints. Keep total samples seen (batch × iterations) constant by increasing iterations when you shrink batch.
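
Worked out for the CIFAR-10 case catalogued in LEARNING.md (a sketch; the ×10 overhead factor is this skill's rough allowance, not a measured constant):

```python
# Paper config: N=128 Sinkhorn batch, D = 3*32*32 = 3072 flattened dims, fp32
per_call_gb = 128**2 * 3072 * 4 * 10 / 1e9   # O(N^2 x D) x 4 bytes x ~10 overhead -> ~2.0 GB
unet_gb = 38e6 * 4 * 3 / 1e9                 # 38M params x 4 bytes x (params+grads+optim) -> ~0.46 GB
print(per_call_gb, unet_gb)                  # already tight on a 16 GB T4 once activations pile on

# T4-sized config: shrink N to 32 (16x less Sinkhorn memory), raise pool batches to compensate
print(32**2 * 3072 * 4 * 10 / 1e9)           # ~0.13 GB per call
```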

Add CLI override flags (e.g. `--sinkhorn-batch`) so users can tune without editing config.
---

## 4. Architecture

- UNet skip connections: count pushes during downward pass, pops during upward pass. They must match exactly.
- Store config values (`num_res_blocks`, `num_levels`) as instance variables at init. Never infer them from module list lengths.
- `nn.GroupNorm(32, channels)` requires channels divisible by 32. Assert this at init for all levels (see the sketch after this list).
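
A minimal init-time guard, as a sketch (the class is a stub showing only the checks, not a full UNet; `model_channels`/`channel_mult` follow the config names used elsewhere in this skill):

```python
import torch.nn as nn

class UNetStub(nn.Module):
    def __init__(self, model_channels=32, channel_mult=(1, 2, 4), num_res_blocks=2):
        super().__init__()
        self.num_res_blocks = num_res_blocks   # stored once, never re-inferred
        self.num_levels = len(channel_mult)
        for mult in channel_mult:
            ch = model_channels * mult
            # Fail at init, not mid-forward, if a level would break GroupNorm(32, ch).
            assert ch % 32 == 0, f"level channels {ch} not divisible by 32"

model = UNetStub(model_channels=32, channel_mult=(1, 2, 4))  # 32/64/128: all valid
```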
---

## 5. Multi-Phase Training

Each phase gets its own trainer with its own optimizer. Previous phase's model goes to `eval()`.

### Shared state rules

- Never cache a DataLoader with a fixed batch size if different phases use different batch sizes. Track cached params and invalidate on change (see the sketch after this list).
- `torch.cuda.empty_cache()` between phases. `del` large objects (pools, computation graphs) that won't be needed again.
- CLI overrides must touch ALL phases. If `--train-iters` should override 3 phases, grep the config for all 3 fields.
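
A minimal invalidation sketch (class and method names are illustrative, not from the project code):

```python
from torch.utils.data import DataLoader, Dataset

class SharedSampler:
    """Data provider shared across phases that may request different batch sizes."""
    def __init__(self, dataset: Dataset):
        self.dataset = dataset
        self._loader = None
        self._batch_size = None

    def sample(self, n: int):
        if self._loader is None or self._batch_size != n:
            self._batch_size = n   # batch size changed: rebuild instead of serving stale batches
            self._loader = DataLoader(self.dataset, batch_size=n, shuffle=True)
            self._iter = iter(self._loader)
        try:
            return next(self._iter)
        except StopIteration:      # dataset exhausted: start a new epoch
            self._iter = iter(self._loader)
            return next(self._iter)
```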
---

## 6. Checkpointing

### Phase-level (mandatory)

Save a checkpoint after each phase completes. Include all model state dicts accumulated so far. Implement `--resume-phase N` that loads the phase N-1 checkpoint and skips completed phases.
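
A sketch of the resume logic (file naming follows the `phaseN_complete.pt` convention in LEARNING.md; the helper name is hypothetical):

```python
import os
import torch

def maybe_resume(resume_phase: int, models: dict, ckpt_dir: str = "checkpoints") -> int:
    """Return the first phase to run, loading the prior phase's checkpoint if needed."""
    if resume_phase <= 1:
        return 1                                            # nothing to skip
    path = os.path.join(ckpt_dir, f"phase{resume_phase - 1}_complete.pt")
    state = torch.load(path, map_location="cpu")
    for key, model in models.items():                       # e.g. {"nsgf_model_state": nsgf_model}
        if key in state:
            model.load_state_dict(state[key])
    return resume_phase
```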

### Step-level (strongly recommended for phases > 10 min)

Save every N steps within a phase. Include model state, optimizer state, step number. Overwrite same file (keep latest only, unless you have disk space).
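
Since the file is overwritten in place, write-then-rename keeps an interruption from corrupting the only copy (a sketch; `save_step_ckpt` is a hypothetical helper):

```python
import os
import torch

def save_step_ckpt(model, optimizer, step, path="checkpoints/latest.pt"):
    tmp = path + ".tmp"
    torch.save({
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "step": step,
    }, tmp)
    os.replace(tmp, path)   # atomic rename: a crash mid-save can't destroy the last good checkpoint
```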

### Kaggle persistence

`/kaggle/working/` persists within a session but NOT across sessions. To carry checkpoints between sessions: commit notebook output, or copy checkpoints to a HF dataset, or download them before session ends.
---

## 7. Memory Management

- Trajectory pools / replay buffers live on CPU. Only the sampled minibatch goes to GPU via `.to(device)`.
- Pre-concatenate data structures after building: `finalize()` once → O(1) sampling per step. Never `torch.cat` the entire pool every step (see the sketch after this list).
- Call `torch.cuda.empty_cache()` after pool building and between any phases with different GPU memory patterns.
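
A minimal pool showing the build/finalize/sample split (class name and methods are illustrative):

```python
import torch

class TrajectoryPool:
    def __init__(self):
        self.chunks = []
        self.data = None

    def add(self, batch: torch.Tensor):
        self.chunks.append(batch.detach().cpu())   # pool lives on CPU

    def finalize(self):
        self.data = torch.cat(self.chunks)         # the ONE cat, paid at build time
        self.chunks = []                           # drop fragments so they can be freed

    def sample(self, n: int, device: str = "cuda") -> torch.Tensor:
        idx = torch.randint(len(self.data), (n,))
        return self.data[idx].to(device)           # only the minibatch touches the GPU
```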
---

## 8. Testing

### Before any GPU run:

1. Test EVERY experiment type with minimal configs – not just the simplest one (automatable; see the sketch after this list)
2. Test ALL training phases end-to-end – not just Phase 1
3. Test with `--train-iters 5 --pool-batches 2` – should complete in <60 seconds on CPU
4. Test `--resume-phase` actually works (save checkpoint → load → skip → continue)
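
Steps 1-3 can be scripted; a sketch (the flags follow this skill's `main.py` examples and are assumptions about your entry point):

```python
import subprocess

# Smoke-test every experiment type with the tiny config before any GPU session.
for exp in ["2d", "mnist", "cifar10"]:
    cmd = ["python", "main.py", "--experiment", exp,
           "--train-iters", "5", "--pool-batches", "2"]
    print("smoke test:", " ".join(cmd))
    subprocess.run(cmd, check=True)   # check=True raises the moment any path crashes
```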

### Before declaring code ready (pre-flight checklist):

```
□ All experiment types tested (2d, mnist, cifar10, etc.)
□ All training phases tested end-to-end
□ Library APIs tested with exact tensor shapes per experiment
□ Shared state across phases verified
□ CLI flags override ALL relevant config values
□ VRAM estimated for target hardware
□ Checkpointing works: save + resume + skip phases
□ No O(N) operations per training step where O(1) suffices
□ Expected runtimes documented per hardware tier
□ Multi-GPU limitations documented
□ Requirements.txt complete
```
---

## 9. Documentation for User

When the user runs on their own GPU (Kaggle, Colab, local):

1. Provide exact copy-paste commands
2. Document expected runtimes per hardware tier
3. Document GPU requirements and VRAM limits per experiment
4. Document what the code does NOT support (single-GPU only, no DDP, etc.)
5. If training exceeds one session, provide session-by-session commands with `--resume-phase`
---

## 10. Maintaining LEARNING.md

When a new mistake happens or a new principle is discovered:

1. Add the mistake to the **Mistake Catalog** in LEARNING.md with: What, Impact, Root cause, Prevention
2. If the mistake reveals a general principle, add it to the **Principles** section
3. If the mistake would have been caught by a pre-flight check, add that check to the checklist in section 8 above
4. Keep SKILL.md lean (rules only). LEARNING.md holds the stories and evidence.
|