- The duplicated layers on all layer types (except one) are extra sensitive. `post_attention_layernorm` interestingly had some changes in the upscale's duplicated layers, unlike Cydonia, where the latter layers were completely unchanged.
- The duplicated layers in `o_proj` are less sensitive for some reason.
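
One way to sanity-check these per-type sensitivities is to diff the finetune against the upscaled base and average the weight deltas by tensor type. A minimal sketch, assuming both checkpoints load with `transformers`; the repo names are placeholders, not the actual models:

```python
import torch
from collections import defaultdict
from transformers import AutoModelForCausalLM

# Placeholder repo names; substitute the actual upscaled base and its finetune.
base = AutoModelForCausalLM.from_pretrained("upscaled-base", torch_dtype=torch.float32)
tuned = AutoModelForCausalLM.from_pretrained("upscaled-finetune", torch_dtype=torch.float32)

# Mean absolute weight delta, grouped by tensor type
# (o_proj, post_attention_layernorm, ...).
deltas = defaultdict(list)
tuned_params = dict(tuned.named_parameters())
for name, p_base in base.named_parameters():
    kind = name.split(".")[-2] if "." in name else name  # 'o_proj' from '...o_proj.weight'
    deltas[kind].append((tuned_params[name].detach() - p_base.detach()).abs().mean().item())

for kind, vals in sorted(deltas.items()):
    print(f"{kind:>28}: {sum(vals) / len(vals):.3e}")
```
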
# [Eureka?] Top-to-Bottom Linear Gradient Observations

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/cQnZwfMF0ZBLc_aDOamK-.png)

- Take note of a few things:
  - Top layers = Ending layers (nearer to output)
  - Bottom layers = Starting layers (nearer to input)
- Training a non-upscaled model affects the top layers first and slowly descends to the bottom layers over time.
- Training an upscaled model with a slice of layers duplicated twice does two things:
  - The duplicated slices EACH have their own gradient.
  - There's a 'ceiling value' for each of these duplicated slices.
- Even when Tunguska's duplicated slices are nearly saturated, the resulting model remains coherent and even performant.
- Takeaway? Saturating these duplicated layers MIGHT be a good goal to pursue (see the saturation sketch below).
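
The 'ceiling'/saturation idea above can be made concrete by comparing each duplicated layer against its source twin inside the finetune: in the raw upscale every pair is identical (cosine similarity of exactly 1.0), and how far that similarity falls, and where it plateaus, is one reading of the ceiling. A sketch with a hypothetical slice mapping and a placeholder model name, assuming a Llama-style `model.layers.N` naming scheme:

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical (source layer, duplicated copy) pairs; the real mapping depends
# on how the upscale's passthrough slices were laid out.
DUPLICATE_PAIRS = [(8, 24), (9, 25), (10, 26)]

model = AutoModelForCausalLM.from_pretrained("upscaled-finetune", torch_dtype=torch.float32)
params = dict(model.named_parameters())

def layer_tensors(idx: int) -> dict:
    """Collect one decoder layer's tensors, keyed by their in-layer name."""
    prefix = f"model.layers.{idx}."
    return {n[len(prefix):]: p.detach() for n, p in params.items() if n.startswith(prefix)}

# Training pulls each copy away from its source; where the similarity stops
# falling is a candidate measure of the 'ceiling'.
for src, dup in DUPLICATE_PAIRS:
    a, b = layer_tensors(src), layer_tensors(dup)
    sims = [torch.nn.functional.cosine_similarity(a[k].flatten(), b[k].flatten(), dim=0).item()
            for k in a]
    print(f"layer {src} vs copy {dup}: mean cosine similarity {sum(sims) / len(sims):.4f}")
```
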
# Further Experimentation

Given how the duplicated layers seem to have a stabilizing effect, it raises a few questions:

### Can you replicate this effect on normal models by freezing layers?
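
A minimal sketch of that freezing setup, assuming a Llama/Mistral-style `model.model.layers` module list; the model name and frozen range are placeholders:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("some-base-model")  # placeholder name

# Freeze a contiguous slice of decoder layers (hypothetical range) so they
# receive no updates, mimicking the 'untouched' duplicated layers.
for idx in range(8, 16):
    for p in model.model.layers[idx].parameters():
        p.requires_grad_(False)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.1%}")
```

Any trainer that respects `requires_grad` (including the stock `transformers` `Trainer`) will then leave those layers exactly as the upscale left its duplicated slice.
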
### We've so far hypothesized that training 'slowly fills' the duplicated layers. If we intentionally undercook, will the duplicated layers look *underfilled*, or can you fill them up with a few steps? In other words, can a single/few updates to the model reconnect the duplicated layers?
- Are we really repairing the 'neurons' step-by-step, or have they been significantly rearranged by the first (few?) steps?
- Or maybe this is false, given the top-to-bottom gradient observation.
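
One hedged way to probe this: track a duplicated layer against its step-0 weights across saved checkpoints. A gradual similarity decay would favor 'slow filling'; a sharp drop within the first steps would favor early rearrangement. The checkpoint paths, layer index, and `model.layers` naming are all assumptions here:

```python
import torch
from transformers import AutoModelForCausalLM

CHECKPOINTS = ["ckpt-step-0", "ckpt-step-50", "ckpt-step-500"]  # placeholder paths
DUP_LAYER = 24  # assumed index inside the duplicated slice

def layer_vector(path: str) -> torch.Tensor:
    """Flatten one decoder layer's weights into a single vector."""
    m = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float32)
    return torch.cat([p.detach().flatten() for p in m.model.layers[DUP_LAYER].parameters()])

ref = layer_vector(CHECKPOINTS[0])
for ckpt in CHECKPOINTS[1:]:
    sim = torch.nn.functional.cosine_similarity(ref, layer_vector(ckpt), dim=0).item()
    print(f"{ckpt}: cosine similarity to step 0 = {sim:.6f}")
```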
|