Simo76 committed on
Commit d8f43b7 · 1 Parent(s): 02433bd

Revise experimental results documentation

Updated the experimental results documentation to reflect new findings and reorganized the structure for clarity.

Files changed (1)
  1. docs/experimental_results.md +187 -130
docs/experimental_results.md CHANGED
@@ -1,182 +1,239 @@
- # Experimental Results

- ## 1. Stress Test: Task Switch (Quantitative)
-
- ### Setup

- - **Model**: DistilBERT-base-uncased + NestedLoRALinear (max_rank=16)
- - **Protocol**: MRPC x 60 steps then SST-2 x 60 steps (shock at step 60)
- - **Seeds**: 0, 1, 2 (same seed = same batch order for baseline and unified)
- - **Baseline**: Same architecture, rank=16 fixed, no controller
- - **Hardware**: Google Colab, T4 GPU

- ### Results

- | Metric | Baseline (r=16 fixed) | Unified (orbital) | Delta |
- |------------------------|-----------------------|-------------------|----------|
- | SST-2 Acc (new task) | 0.736 | 0.740 | +0.004 |
- | MRPC F1 (retention) | 0.526 | 0.515 | -0.011 |
- | Effective rank | 16.0 | 13.6 | |
- | Rank saving | 0% | 15% | |

- ### Per-seed detail

- | Seed | Baseline SST-2 | Unified SST-2 | Baseline MRPC | Unified MRPC | Eff rank | Transitions |
- |------|----------------|---------------|---------------|--------------|----------|-------------|
- | 0 | 0.759 | 0.760 | 0.588 | 0.595 | 13.7 | 6 |
- | 1 | 0.649 | 0.664 | 0.783 | 0.781 | 13.2 | 6 |
- | 2 | 0.799 | 0.795 | 0.207 | 0.169 | 13.8 | 8 |

- ### Rank traces

- **Seed 0:**
- ```
- [ 0] r4 r4 r4 r4 r8 r8 r16 r16 r16 r16
- [ 10] r16 r16 r16 r16 r16 r16 r16 r16 r16 r16
- ...
- [ 60] <<<SHOCK r16 r16 r16 r16 r16 r16 r16 r16 r16 r16
- [ 70] r16 r8 r8 r8 r8 r8 r8 r8 r8 r8
- [ 80] r4 r4 r4 r4 r4 r4 r4 r4 r4 r8
- [ 90] r8 r8 r8 r16 r16 r16 r16 r16 r16 r16
- ```

- **Seed 1 (cleanest trajectory):**
- ```
- [ 0] r4 r4 r4 r8 r8 r8 r8 r16 r16 r16
- [ 10] r16 r16 r16 r16 r16 r16 r16 r16 r16 r16
- ...
- [ 60] <<<SHOCK r16 r16 r16 r16 r16 r16 r16 r16 r8 r8
- [ 70] r8 r8 r8 r8 r4 r4 r4 r4 r4 r4
- [ 80] r4 r4 r4 r4 r4 r4 r4 r4 r4 r4
- [ 90] r4 r4 r8 r16 r16 r16 r16 r16 r16 r16
- ```

- **Seed 2:**
- ```
- [ 0] r4 r8 r8 r8 r8 r8 r16 r16 r16 r16
- [ 10] r16 r16 r16 r16 r16 r16 r16 r16 r16 r16
- ...
- [ 60] <<<SHOCK r8 r8 r16 r16 r16 r16 r16 r16 r16 r16
- [ 70] r16 r16 r16 r16 r8 r8 r8 r8 r8 r8
- [ 80] r8 r8 r8 r4 r4 r4 r4 r4 r4 r4
- [ 90] r8 r8 r8 r8 r8 r16 r16 r16 r16 r16
- ```

- ### Interpretation

- All three seeds show the same pattern post-shock:
- 1. Controller detects the distribution shift (loss spike after task switch)
- 2. Descends through orbitals: r16 to r8 to r4
- 3. Stabilizes at ground state (r4) for 7-18 steps (9, 18, and 7 steps for seeds 0, 1, and 2 in the traces above)
- 4. Re-ascends when new task complexity demands capacity: r4 to r8 to r16

- The baseline stays at r=16 for all 120 steps regardless of the shock.
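The descend, stabilize, re-ascend pattern can be illustrated with a loss-only controller over the r4/r8/r16 ladder. This is a minimal sketch under stated assumptions, not the controller used in these experiments: the class name `LossOnlyRankController`, the EMA-based spike test, and the `spike_ratio`, `patience`, and `ema_decay` values are all hypothetical.

```python
class LossOnlyRankController:
    """Hypothetical closed-loop rank controller that observes only the loss."""

    LADDER = (4, 8, 16)  # rank levels ("orbitals") from the traces above

    def __init__(self, spike_ratio=1.5, patience=3, ema_decay=0.9):
        self.spike_ratio = spike_ratio  # loss / EMA ratio treated as a shock (assumed)
        self.patience = patience        # calm steps needed before climbing one level (assumed)
        self.ema_decay = ema_decay      # smoothing factor for the loss EMA (assumed)
        self.level = 0                  # start at the lowest rank, as in the traces
        self.ema = None
        self.calm_steps = 0

    @property
    def rank(self):
        return self.LADDER[self.level]

    def update(self, loss):
        """Consume one loss value and return the rank to use next. O(1) work per step."""
        if self.ema is None:
            self.ema = loss  # first observation only initializes the EMA
            return self.rank
        spike = loss > self.spike_ratio * self.ema
        self.ema = self.ema_decay * self.ema + (1 - self.ema_decay) * loss
        if spike:
            # Shock detected: descend one level and restart the calm counter.
            self.level = max(self.level - 1, 0)
            self.calm_steps = 0
        else:
            self.calm_steps += 1
            if self.calm_steps >= self.patience:
                # Loss has been calm long enough: re-ascend one level.
                self.level = min(self.level + 1, len(self.LADDER) - 1)
                self.calm_steps = 0
        return self.rank


# Synthetic loss curve: calm, a two-step shock, then calm again.
losses = [1.0] * 8 + [3.0, 3.0] + [1.0] * 6
controller = LossOnlyRankController()
trace = [controller.update(loss) for loss in losses]
print(" ".join(f"r{r}" for r in trace))
```

The sketch keeps a constant amount of state and needs no gradients, activations, or optimizer state. Real traces will differ, since the actual thresholds and stress signal are not given here.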
74
 
75
 
76
- ## 2. Stable Task — Single Task Parity (Quantitative)
77
 
78
- ### Setup
79
 
80
- - **Model**: DistilBERT-base-uncased + NestedLoRALinear (max_rank=16)
81
- - **Task**: MRPC only, 120 steps
82
- - **Seeds**: 0, 1, 2
83
- - **Baseline**: Same architecture, rank=16 fixed
84
 
85
- ### Results
86
 
87
- | Seed | Baseline F1 | Unified F1 | Delta |
88
- |------|-------------|------------|--------|
89
- | 0 | 0.806 | 0.808 | +0.002 |
90
- | 1 | 0.822 | 0.826 | +0.004 |
91
- | 2 | 0.824 | 0.824 | +0.000 |
92
- | **Mean** | **0.818 +/- 0.008** | **0.820 +/- 0.008** | **+0.002** |
93
 
94
- The controller correctly identifies that no intervention is needed on a stable task and remains at r=16 for nearly all steps. Parity confirmed.
95
 
96
 
97
- ## 3. Rank Dynamics under Disturbance (Qualitative — Tinker)
98
 
99
- ### Setup
100
 
101
- - **Model**: Qwen/Qwen3-4B-Instruct-2507
102
- - **Task**: GLUE CoLA (classification, autoregressive formulation)
103
- - **Environment**: Tinker (black-box — loss not directly observable)
104
- - **Hardware**: Cloud GPU (T4-class)
105
- - **Training length**: ~60 steps per method
106
 
107
- This setup reflects API-based / enterprise fine-tuning, where internal loss signals are not exposed.
108
 
109
- ### Methods compared
110
 
111
- | Method | Category | Control logic |
112
- |----------------------|-----------------------|-------------------------|
113
- | Standard LoRA | Baseline | Fixed rank |
114
- | Schedule-free | Baseline+ | Fixed rank, optimized LR|
115
- | AdaLoRA-like | Open-loop adaptive | Rank = f(step) |
116
- | Unified-LoRA | Closed-loop continuous| Rank = f(stress) |
117
 
118
- ### Observations
 
119
 
120
- **AdaLoRA-like**: monotonic decreasing trajectory from rank=32 to ~24. No reaction to shocks. Adaptive offline, but blind to real training state.
121
 
122
- **Standard / Schedule-free LoRA**: flat trajectory at fixed rank. No dynamics, no adaptation.
123
 
124
- **Unified-LoRA**: non-monotonic trajectory. Starts from rank=6, grows to ~31, immediate reaction to injected disturbances at steps ~20, ~30, ~45. No unstable oscillations.
125
 
126
- ### Disturbance rejection
 
 
127
 
128
- | Method | Shock reaction | Stability | Recovery |
129
- |-------------------------|----------------|-----------|-----------|
130
- | Standard / Schedule-free| None | Passive | — |
131
- | AdaLoRA-like | Indirect | Partial | Limited |
132
- | Unified-LoRA | Immediate | Stable | Immediate |
133
 
134
- Only Unified-LoRA exhibits disturbance rejection — a property of closed-loop control systems, absent in open-loop approaches.
 
 
135
 
136
 
137
- ## 4. Architecture Evolution — What Didn't Work
 
 
138
 
139
- ### Separate adapters (V1-V4)
140
 
141
- Four versions of the controller were tested with independent adapter matrices per rank (r=4, r=8, r=16 as separate nn.Linear pairs):
142
 
143
- | Version | Mean F1 | Delta vs baseline | Saving | Problem |
144
- |----------------|---------|-------------------|--------|--------------------------------------|
145
- | V1 Homeostatic | 0.850 | +0.002* | 62% | No baseline in same run |
146
- | V2 State-Aware | 0.812 | -0.036 | 46% | Cold start on transitions |
147
- | V3 State Ctrl | 0.817 | -0.031 | 47% | Stuck at r=8 on 2/3 seeds |
148
- | V4 Trend-Aware | 0.821 | -0.027 | 14% | Never activated on 2/3 seeds |
149
 
150
- *V1 baseline was from a different run, not directly comparable.
151
 
152
- **Root cause**: switching between separate adapters means the new adapter has independent weights that never benefited from training at the previous rank. Every transition is a partial cold start.
153
 
154
- **Solution**: nested orbital architecture (single A/B pair, rank via slicing). This eliminated the cold start entirely and achieved parity with baseline.
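To make "single A/B pair, rank via slicing" concrete, here is a dependency-free sketch. It is not the repository's `NestedLoRALinear`: the class name `NestedLoRASketch`, the plain-list storage, and the random initialization of both factors are assumptions made for illustration (standard LoRA zero-initializes one factor, and a real layer would use tensors).

```python
import random


class NestedLoRASketch:
    """Illustrative only: one shared A/B pair; the active rank is a prefix slice."""

    def __init__(self, d_in, d_out, max_rank=16, seed=0):
        rng = random.Random(seed)
        self.d_out = d_out
        self.max_rank = max_rank
        # Component k owns row k of A (length d_in) and row k of B (length d_out).
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(max_rank)]
        self.B = [[rng.gauss(0.0, 0.02) for _ in range(d_out)] for _ in range(max_rank)]

    def delta(self, x, rank):
        """Low-rank update for one input vector, using only the first `rank` components."""
        if not 1 <= rank <= self.max_rank:
            raise ValueError(f"rank must be in [1, {self.max_rank}]")
        out = [0.0] * self.d_out
        for a_row, b_row in zip(self.A[:rank], self.B[:rank]):
            h = sum(a * xi for a, xi in zip(a_row, x))  # projection onto this component
            for j, b in enumerate(b_row):
                out[j] += h * b
        return out


layer = NestedLoRASketch(d_in=8, d_out=4)
x = [1.0] * 8
low = layer.delta(x, rank=4)    # uses components 0..3
full = layer.delta(x, rank=16)  # uses components 0..15, including the same 0..3

# Pretend a training step at rank 4 changed component 0.
layer.B[0] = [b + 1.0 for b in layer.B[0]]
full_after = layer.delta(x, rank=16)  # the rank-16 path sees that change too
```

Because every rank reads a prefix of the same parameters, whatever is learned at r=4 is already part of the r=8 and r=16 updates, which is the property the separate-adapter versions lacked.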
155
 
156
- ### Other approaches that didn't help on clean data
157
 
158
- - Adaptive rank per-layer (gradient EMA): no performance benefit
159
- - Fluid dynamics metrics (shock, vorticity, swirl): too conservative
160
- - Budget redistribution across layers: winner-takes-all problem
161
- - Fixed-threshold hysteresis: controller either never activated or got stuck
162
- - Vincolo StabilityController integration: zero shock events on stable training
163
 
164
 
165
- ## 5. Black-Box Compatibility
166
 
167
- The controller operates without access to:
168
- - Gradients
169
- - Internal activations
170
- - Optimizer state
171
- - Per-layer information
172
 
173
- It observes only the loss trajectory. This makes it compatible with API-based fine-tuning platforms (Azure OpenAI, Tinker) where the training loop is exposed but model internals are not.
174
 
175
- Computational overhead: O(1) per step. No SVD, no matrix decomposition.

- ## Open Questions

- - Scale validation on 7B+ models (Tinker experiments in progress)
- - Minimum shock magnitude required for measurable controller benefit
- - Adaptive LR modulation as black-box analog of rank control (for platforms where rank is fixed at creation)

+ # Experimental Results
+
+ Core result: parity with the fixed-rank baseline at ~15% lower effective rank (13.6 vs. 16.0), with dynamic response to task-switch shocks.
+
+ ## 1. Stress Test: Task Switch
+
+ ### Setup
+
+ - Model: DistilBERT-base-uncased + NestedLoRALinear (max_rank=16)
+ - Protocol: MRPC x 60 steps → SST-2 x 60 steps (shock at step 60)
+ - Seeds: 0, 1, 2
+ - Baseline: same architecture, fixed rank=16
+ - Hardware: Colab T4
+
+ ### Results
+
+ | Metric | Baseline (r=16) | Orbital LoRA |
+ |---------------------|-----------------|--------------|
+ | SST-2 Accuracy | 0.736 | 0.740 |
+ | MRPC F1 (retention) | 0.526 | 0.515 |
+ | Effective rank | 16.0 | 13.6 |
+
+ Task metrics stay close to baseline (+0.004 SST-2 accuracy, -0.011 MRPC F1) with ~15% rank saving.
+
+ ### Behavior
+
+ Post-shock:
+
+ 1. detect → descend (r16 → r4)
+ 2. stabilize
+ 3. re-ascend (r4 → r16)
+
+ Baseline: no reaction (fixed r=16)
+
+ ## 2. Stable Task: Parity
+
+ ### Setup
+
+ - Task: MRPC only (120 steps)
+ - Seeds: 0, 1, 2
+ - Baseline: fixed r=16
+
+ ### Results
+
+ | Seed | Baseline F1 | Orbital F1 |
+ |------|-------------|------------|
+ | 0 | 0.806 | 0.808 |
+ | 1 | 0.822 | 0.826 |
+ | 2 | 0.824 | 0.824 |
+ | Mean | 0.818 | 0.820 |
+
+ No degradation on stable training
+
+ ## 3. Rank Dynamics (Black-box, Tinker)
+
+ ### Methods
+
+ | Method | Control |
+ |---------------|-------------|
+ | Standard LoRA | Fixed rank |
+ | AdaLoRA-like | Open-loop |
+ | Orbital LoRA | Closed-loop |
+
+ ### Disturbance response
+
+ | Method | Reaction | Stability | Recovery |
+ |--------------|-----------|-----------|-----------|
+ | Standard | None | Passive | n/a |
+ | AdaLoRA-like | Indirect | Partial | Limited |
+ | Orbital LoRA | Immediate | Stable | Immediate |
+
+ ## 4. Architecture Insight
+
+ Root cause: switching between separate per-rank adapters made every rank transition a partial cold start.
+
+ Fix: nested slicing (single A/B pair, rank selected by slicing) → no cold start → parity restored.
+
+ ## 5. Black-box compatibility
+
+ - Uses only the loss signal.
+ - No gradients required.
+ - O(1) overhead per step.
+
+ ## Next
+
+ - 7B+ validation (ongoing)
+ - LR controller integration