TheDrummer committed on
Commit 2168604 · verified · 1 Parent(s): 6d67d36

Update README.md

Files changed (1): README.md +14 -1
README.md CHANGED
@@ -110,6 +110,18 @@ WIP
  - The duplicated layers on all layer types (except one) are extra sensitive. `post_attention_layernorm` interestingly had some changes in the upscale's duplicated layers, unlike Cydonia, where the latter layers were completely unchanged.
  - The duplicated layers in `o_proj` are less sensitive for some reason.
 
+ # [Eureka?] Top-to-Bottom Linear Gradient Observations
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/cQnZwfMF0ZBLc_aDOamK-.png)
+ - Take note of a few things:
+   - Top layers = ending layers (nearer to the output)
+   - Bottom layers = starting layers (nearer to the input)
+ - Training a non-upscaled model affects the top layers first and slowly descends to the bottom layers over time.
+ - Training an upscaled model with a slice of layers duplicated twice does two things:
+   - The duplicated slices EACH have their own gradient.
+   - There's a 'ceiling value' for each of these duplicated slices.
+ - Even when Tunguska's duplicated slices are nearly saturated, the resulting model remains coherent and even performant.
+ - Takeaway? Saturating these duplicated layers MIGHT be a good goal to pursue.
+
  # Further Experimentation
  Given how the duplicated layers seem to have a stabilizing effect, it begs the question:
 
@@ -127,4 +139,5 @@ Given how the duplicated layers seem to have a stabilizing effect, it begs the q
  ### Can you replicate this effect on normal models by freezing layers?
 
  ### We've so far hypothesized that training 'slowly fills' the duplicated layers. If we intentionally undercook, will the duplicated layers look *underfilled*, or can they be filled up with just a few steps? In other words, can a single update (or just a few) reconnect the duplicated layers?
- - Are we really repairing the 'neurons' step-by-step, or have they been significantly rearranged by the first (few?) steps?
+ - Are we really repairing the 'neurons' step-by-step, or have they been significantly rearranged by the first (few?) steps?
+ - Or maybe this is false given the top-to-bottom gradient observation.
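As a point of reference for the setup above: a minimal sketch of what "a slice of layers duplicated twice" could look like, assuming a Llama/Mistral-style `model.model.layers` layout. The repo id, slice indices, and repetition pattern are illustrative guesses, not the actual Tunguska recipe (real upscales are typically built with a mergekit passthrough config rather than by hand).

```python
# Sketch: naive depth upscale that repeats a slice of decoder layers.
# All indices and the repo id below are hypothetical.
import copy
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("base-model")  # placeholder repo id
layers = model.model.layers

start, end = 8, 16  # hypothetical slice to duplicate
upscaled = list(layers[:end])
for _ in range(2):  # append two extra copies of the slice ("duplicated twice")
    upscaled.extend(copy.deepcopy(layers[start:end]))
upscaled.extend(layers[end:])

model.model.layers = nn.ModuleList(upscaled)
model.config.num_hidden_layers = len(upscaled)
```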
 
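The top-to-bottom gradient and the 'ceiling value' on the duplicated slices are statements about how far each layer drifts from its pre-training weights. Below is a minimal sketch of that per-layer comparison, assuming two Hugging Face checkpoints with identical architectures; the repo ids are placeholders, and a per-layer L2 norm of the weight delta is just one reasonable metric, not necessarily the one behind the plot above.

```python
# Sketch: per-layer weight drift between an upscaled base and its finetune.
# Repo ids are placeholders; both checkpoints must share parameter names.
import torch
from collections import defaultdict
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("upscaled-base", torch_dtype=torch.float32)
tuned = AutoModelForCausalLM.from_pretrained("upscaled-finetune", torch_dtype=torch.float32)

base_params = dict(base.named_parameters())
drift = defaultdict(float)

for name, p_tuned in tuned.named_parameters():
    if ".layers." not in name:
        continue  # skip embeddings / lm_head for this per-layer view
    layer_idx = int(name.split(".layers.")[1].split(".")[0])
    drift[layer_idx] += (p_tuned.detach() - base_params[name].detach()).norm().item()

# Bottom layers print first; a top-to-bottom gradient should show up as larger
# deltas at the highest indices, tapering off toward index 0.
for idx in sorted(drift):
    print(f"layer {idx:3d}  delta_norm={drift[idx]:.4f}")
```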
 
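For the "can you replicate this effect on normal models by freezing layers?" question, here is a minimal sketch of how that experiment could be wired up before handing the model to a trainer, again assuming a Llama/Mistral-style layout; the frozen slice is illustrative.

```python
# Sketch: freeze a contiguous slice of layers in a normal (non-upscaled)
# model to see whether it mimics the stabilizing effect of duplicated layers.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("base-model")  # placeholder repo id

freeze_start, freeze_end = 8, 16  # hypothetical slice to hold fixed
for idx, layer in enumerate(model.model.layers):
    if freeze_start <= idx < freeze_end:
        layer.requires_grad_(False)  # no gradients flow into these weights

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")
```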