Revise experimental results documentation

Updated the experimental results documentation to reflect new findings and reorganized the structure for clarity.

docs/experimental_results.md +187 -130

@@ -1,182 +1,239 @@
-# Experimental Results
-## 1. Stress Test — Task Switch (Quantitative)
-### Setup
-- **Model**: DistilBERT-base-uncased + NestedLoRALinear (max_rank=16)
-- **Protocol**: MRPC x 60 steps then SST-2 x 60 steps (shock at step 60)
-- **Seeds**: 0, 1, 2 (same seed = same batch order for baseline and unified)
-- **Baseline**: Same architecture, rank=16 fixed, no controller
-- **Hardware**: Google Colab, T4 GPU
-### Results
-|                        | Baseline (r=16 fixed) | Unified (orbital) | Delta    |
-|------------------------|-----------------------|-------------------|----------|
-| SST-2 Acc (new task)   | 0.736                 | 0.740             | +0.004   |
-| MRPC F1 (retention)    | 0.526                 | 0.515             | -0.011   |
-| Effective rank         | 16.0                  | 13.6              |          |
-| Rank saving            | 0%                    | 15%               |          |
-### Per-seed detail
-| Seed | Baseline SST-2 | Unified SST-2 | Baseline MRPC | Unified MRPC | Eff rank | Transitions |
-|------|----------------|---------------|---------------|--------------|----------|-------------|
-| 0    | 0.759          | 0.760         | 0.588         | 0.595        | 13.7     | 6           |
-| 1    | 0.649          | 0.664         | 0.783         | 0.781        | 13.2     | 6           |
-| 2    | 0.799          | 0.795         | 0.207         | 0.169        | 13.8     | 8           |
-### Rank traces
-**Seed 0:**
-```
-[  0] r4  r4  r4  r4  r8  r8  r16 r16 r16 r16
-[ 10] r16 r16 r16 r16 r16 r16 r16 r16 r16 r16
-...
-[ 60] <<<SHOCK r16 r16 r16 r16 r16 r16 r16 r16 r16 r16
-[ 70] r16 r8  r8  r8  r8  r8  r8  r8  r8  r8
-[ 80] r4  r4  r4  r4  r4  r4  r4  r4  r4  r8
-[ 90] r8  r8  r8  r16 r16 r16 r16 r16 r16 r16
-```
-**Seed 1 (cleanest trajectory):**
-```
-[  0] r4  r4  r4  r8  r8  r8  r8  r16 r16 r16
-[ 10] r16 r16 r16 r16 r16 r16 r16 r16 r16 r16
-...
-[ 60] <<<SHOCK r16 r16 r16 r16 r16 r16 r16 r16 r8  r8
-[ 70] r8  r8  r8  r8  r4  r4  r4  r4  r4  r4
-[ 80] r4  r4  r4  r4  r4  r4  r4  r4  r4  r4
-[ 90] r4  r4  r8  r16 r16 r16 r16 r16 r16 r16
-```
-**Seed 2:**
-```
-[  0] r4  r8  r8  r8  r8  r8  r16 r16 r16 r16
-[ 10] r16 r16 r16 r16 r16 r16 r16 r16 r16 r16
-...
-[ 60] <<<SHOCK r8  r8  r16 r16 r16 r16 r16 r16 r16 r16
-[ 70] r16 r16 r16 r16 r8  r8  r8  r8  r8  r8
-[ 80] r8  r8  r8  r4  r4  r4  r4  r4  r4  r4
-[ 90] r8  r8  r8  r8  r8  r16 r16 r16 r16 r16
-```
-### Interpretation
-All three seeds show the same pattern post-shock:
-1. Controller detects the distribution shift (loss spike after task switch)
-2. Descends through orbitals: r16 to r8 to r4
-3. Stabilizes at ground state for 10-18 steps
-4. Re-ascends when new task complexity demands capacity: r4 to r8 to r16
-The baseline stays at r=16 for all 120 steps regardless of the shock.
-## 2. Stable Task — Single Task Parity (Quantitative)
-### Setup
-- **Model**: DistilBERT-base-uncased + NestedLoRALinear (max_rank=16)
-- **Task**: MRPC only, 120 steps
-- **Seeds**: 0, 1, 2
-- **Baseline**: Same architecture, rank=16 fixed
-### Results
-| Seed | Baseline F1 | Unified F1 | Delta  |
-|------|-------------|------------|--------|
-| 0    | 0.806       | 0.808      | +0.002 |
-| 1    | 0.822       | 0.826      | +0.004 |
-| 2    | 0.824       | 0.824      | +0.000 |
-| **Mean** | **0.818 +/- 0.008** | **0.820 +/- 0.008** | **+0.002** |
-The controller correctly identifies that no intervention is needed on a stable task and remains at r=16 for nearly all steps. Parity confirmed.
-## 3. Rank Dynamics under Disturbance (Qualitative — Tinker)
-### Setup
-- **Model**: Qwen/Qwen3-4B-Instruct-2507
-- **Task**: GLUE CoLA (classification, autoregressive formulation)
-- **Environment**: Tinker (black-box — loss not directly observable)
-- **Hardware**: Cloud GPU (T4-class)
-- **Training length**: ~60 steps per method
-This setup reflects API-based / enterprise fine-tuning, where internal loss signals are not exposed.
-### Methods compared
-| Method               | Category              | Control logic           |
-|----------------------|-----------------------|-------------------------|
-| Standard LoRA        | Baseline              | Fixed rank              |
-| Schedule-free        | Baseline+             | Fixed rank, optimized LR|
-| AdaLoRA-like         | Open-loop adaptive    | Rank = f(step)          |
-| Unified-LoRA         | Closed-loop continuous| Rank = f(stress)        |
-### Observations
-**AdaLoRA-like**: monotonic decreasing trajectory from rank=32 to ~24. No reaction to shocks. Adaptive offline, but blind to real training state.
-**Standard / Schedule-free LoRA**: flat trajectory at fixed rank. No dynamics, no adaptation.
-**Unified-LoRA**: non-monotonic trajectory. Starts from rank=6, grows to ~31, immediate reaction to injected disturbances at steps ~20, ~30, ~45. No unstable oscillations.
-### Disturbance rejection
-| Method                  | Shock reaction | Stability | Recovery  |
-|-------------------------|----------------|-----------|-----------|
-| Standard / Schedule-free| None           | Passive   | —         |
-| AdaLoRA-like            | Indirect       | Partial   | Limited   |
-| Unified-LoRA            | Immediate      | Stable    | Immediate |
-Only Unified-LoRA exhibits disturbance rejection — a property of closed-loop control systems, absent in open-loop approaches.
-## 4. Architecture Evolution — What Didn't Work
-### Separate adapters (V1-V4)
-Four versions of the controller were tested with independent adapter matrices per rank (r=4, r=8, r=16 as separate nn.Linear pairs):
-| Version        | Mean F1 | Delta vs baseline | Saving | Problem                              |
-|----------------|---------|-------------------|--------|--------------------------------------|
-| V1 Homeostatic | 0.850   | +0.002*           | 62%    | No baseline in same run              |
-| V2 State-Aware | 0.812   | -0.036            | 46%    | Cold start on transitions            |
-| V3 State Ctrl  | 0.817   | -0.031            | 47%    | Stuck at r=8 on 2/3 seeds           |
-| V4 Trend-Aware | 0.821   | -0.027            | 14%    | Never activated on 2/3 seeds         |
-*V1 baseline was from a different run, not directly comparable.
-**Root cause**: switching between separate adapters means the new adapter has independent weights that never benefited from training at the previous rank. Every transition is a partial cold start.
-**Solution**: nested orbital architecture (single A/B pair, rank via slicing). This eliminated the cold start entirely and achieved parity with baseline.
-### Other approaches that didn't help on clean data
-- Adaptive rank per-layer (gradient EMA): no performance benefit
-- Fluid dynamics metrics (shock, vorticity, swirl): too conservative
-- Budget redistribution across layers: winner-takes-all problem
-- Fixed-threshold hysteresis: controller either never activated or got stuck
-- Vincolo StabilityController integration: zero shock events on stable training
-## 5. Black-Box Compatibility
-The controller operates without access to:
-- Gradients
-- Internal activations
-- Optimizer state
-- Per-layer information
-It observes only the loss trajectory. This makes it compatible with API-based fine-tuning platforms (Azure OpenAI, Tinker) where the training loop is exposed but model internals are not.
-Computational overhead: O(1) per step. No SVD, no matrix decomposition.
-## Open Questions
-- Scale validation on 7B+ models (Tinker experiments in progress)
-- Minimum shock magnitude required for measurable controller benefit
-- Adaptive LR modulation as black-box analog of rank control (for platforms where rank is fixed at creation)

+# Experimental Results
+Core result: parity with the fixed-rank baseline at ~15% lower effective rank, plus a dynamic response to task-switch shocks.
+## 1. Stress Test — Task Switch
+### Setup
+- **Model**: DistilBERT-base-uncased + NestedLoRALinear (max_rank=16)
+- **Protocol**: MRPC x 60 steps → SST-2 x 60 steps (shock at step 60)
+- **Seeds**: 0, 1, 2
+- **Baseline**: same architecture, fixed rank=16
+- **Hardware**: Colab T4
+### Results
+|                     | Baseline (r=16) | Orbital LoRA |
+|---------------------|-----------------|--------------|
+| SST-2 Accuracy      | 0.736           | 0.740        |
+| MRPC F1 (retention) | 0.526           | 0.515        |
+| Effective rank      | 16.0            | 13.6         |
+Parity with ~15% rank saving
+### Behavior
+Post-shock:
+1. detect → descend (r16 → r4)
+2. stabilize
+3. re-ascend (r4 → r16)
+Baseline: no reaction (fixed r=16)
+## 2. Stable Task — Parity
+### Setup
+- **Task**: MRPC only (120 steps)
+- **Seeds**: 0, 1, 2
+- **Baseline**: fixed r=16
+### Results
+| Seed     | Baseline F1 | Orbital F1 |
+|----------|-------------|------------|
+| 0        | 0.806       | 0.808      |
+| 1        | 0.822       | 0.826      |
+| 2        | 0.824       | 0.824      |
+| **Mean** | 0.818       | 0.820      |
+No degradation on stable training
+## 3. Rank Dynamics (Black-box — Tinker)
+### Methods
+| Method        | Control     |
+|---------------|-------------|
+| Standard LoRA | Fixed rank  |
+| AdaLoRA-like  | Open-loop   |
+| Orbital LoRA  | Closed-loop |
+### Disturbance response
+| Method       | Reaction  | Stability | Recovery  |
+|--------------|-----------|-----------|-----------|
+| Standard     | None      | Passive   | —         |
+| AdaLoRA-like | Indirect  | Partial   | Limited   |
+| Orbital LoRA | Immediate | Stable    | Immediate |
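+## 4. Architecture Insight
+**Root cause**: switching between separate per-rank adapters causes a partial cold start at every transition.
+**Fix**: nested slicing of a single A/B pair removes the cold start and restores parity with the baseline.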
+## 5. Black-box compatibility
+- Uses only loss signal.
+- No gradients required.
+- O(1) overhead.
+## Next
+- 7B+ validation (ongoing)
+- LR controller integration
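
The closed-loop behavior described in the diff (detect a loss spike, descend through r16, r8, r4, stabilize, then re-ascend, using only the loss signal at O(1) cost per step) can be illustrated with a minimal sketch. This is not the controller from the repository: the class name `LossOnlyRankController`, the parameters `ema_decay`, `spike_ratio`, and `calm_steps`, and the spike rule itself are assumptions made for illustration. The helpers `effective_rank` and `rank_saving` likewise assume that "effective rank" means the mean rank over training steps, which is consistent with the reported 13.6 of 16 giving roughly 15% saving but is not stated in the document.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

RANKS = (4, 8, 16)  # orbital levels seen in the rank traces (r4, r8, r16)


@dataclass
class LossOnlyRankController:
    """Hypothetical closed-loop controller that observes only the scalar loss."""

    ema_decay: float = 0.9       # assumed smoothing factor for the loss baseline
    spike_ratio: float = 1.5     # assumed: loss above 1.5x the EMA counts as a shock
    calm_steps: int = 5          # assumed: quiet steps required before re-ascending
    level: int = len(RANKS) - 1  # start at the top orbital (r16)
    ema: Optional[float] = None
    calm: int = 0

    def step(self, loss: float) -> int:
        """Consume one loss value and return the rank to use next (O(1) work)."""
        if self.ema is None:
            self.ema = loss  # the first observation only initializes the baseline
            return RANKS[self.level]
        if loss > self.spike_ratio * self.ema:
            # Shock detected: descend one orbital and restart the calm counter.
            self.level = max(self.level - 1, 0)
            self.calm = 0
        else:
            # Quiet step: re-ascend one orbital after enough consecutive calm steps.
            self.calm += 1
            if self.calm >= self.calm_steps:
                self.level = min(self.level + 1, len(RANKS) - 1)
                self.calm = 0
        self.ema = self.ema_decay * self.ema + (1.0 - self.ema_decay) * loss
        return RANKS[self.level]


def effective_rank(trace: Sequence[int]) -> float:
    """Mean rank over a trace (assumed definition of 'effective rank')."""
    return sum(trace) / len(trace)


def rank_saving(eff_rank: float, max_rank: int = 16) -> float:
    """Fraction of rank saved relative to a fixed-rank baseline."""
    return 1.0 - eff_rank / max_rank


if __name__ == "__main__":
    controller = LossOnlyRankController()
    losses = [1.0] * 10 + [3.0] + [1.0] * 12  # synthetic loss with one spike
    trace = [controller.step(loss) for loss in losses]
    eff = effective_rank(trace)
    print(trace)
    print(f"effective rank {eff:.2f}, saving {rank_saving(eff):.1%}")
```

The sketch reacts on the very step where the spike appears, whereas the recorded traces show the descent starting several steps after the shock, so the real controller presumably uses a slower or smoothed trigger. The synthetic loss sequence is illustrative only and is not data from the experiments.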