TheDrummer committed f6eabc1 (parent: 6f8a728)
Update README.md

README.md CHANGED
@@ -89,4 +89,10 @@ WIP
- The duplicated layers seem to be playing catch-up with the original layers. The change difference on the duplicated layers in Sample B shows that they are more responsive to training than the original layers. There are no indications that the duplicated layers are 'slowing down', given that the gradient pattern remains the same in Samples A, B, and especially C (where C can be considered the rate of change). A sketch of how such a per-layer change difference could be computed follows this list.
- In some layer types, the change difference for the first two layers is overwhelmingly large compared to the other layers, to the point that it dilutes the entire heatmap!
- The duplicated layers in all layer types (except one) are extra sensitive. `post_attention_layernorm` interestingly had some changes in the upscale's duplicated layers, unlike Cydonia, where the latter layers were completely unchanged.
- The duplicated layers in `o_proj` are less sensitive for some reason.
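To make the 'change difference' above concrete, here is a minimal sketch of comparing two checkpoints layer by layer. The checkpoint names are placeholders, and mean absolute weight delta is an assumed metric rather than the exact measurement behind the heatmaps.

```python
# Minimal sketch: per-layer "change difference" between two checkpoints.
# Assumptions: placeholder checkpoint names, mean absolute weight delta as the metric.
from transformers import AutoModelForCausalLM

before = AutoModelForCausalLM.from_pretrained("upscaled-base")      # placeholder
after = AutoModelForCausalLM.from_pretrained("upscaled-sample-b")   # placeholder

after_params = dict(after.named_parameters())
changes = {}  # layer type -> {layer index -> change}

for name, p_before in before.named_parameters():
    if ".layers." not in name:
        continue  # skip embeddings, final norm, lm_head
    # e.g. "model.layers.17.self_attn.o_proj.weight" -> layer 17, type "self_attn.o_proj"
    _, rest = name.split(".layers.", 1)
    layer_idx, layer_type = rest.split(".", 1)
    layer_type = layer_type.rsplit(".weight", 1)[0]
    delta = (after_params[name].float() - p_before.float()).abs().mean().item()
    changes.setdefault(layer_type, {})[int(layer_idx)] = delta

# One row per layer type, one column per layer index -- the heatmap layout.
for layer_type, per_layer in sorted(changes.items()):
    row = " ".join(f"{per_layer[i]:.1e}" for i in sorted(per_layer))
    print(f"{layer_type:30s} {row}")
```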

# Further Experiments

Given how the duplicated layers seem to have a stabilizing effect, it begs the question: What if we duplicate only ONE layer? What about five layers? (A rough sketch of a single-layer duplication is included after the questions below.)
- Will fewer empty layers dampen the stabilizing effect?
- Will the few empty layers get 'filled' quickly? Will the 600MB dataset be enough?
- Will there be a greater concentration of weight change in the duplicated layers?
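As a point of reference for the first question, here is a minimal sketch of what duplicating a single decoder layer could look like, assuming a Llama/Mistral-style stack exposed at `model.model.layers`. The model name, output path, and layer index are placeholders, and an actual upscale would more likely be produced with a merge tool than with this in-place edit.

```python
# Minimal sketch: duplicate ONE decoder layer in place.
# Assumptions: Llama/Mistral-style architecture, placeholder names, arbitrary layer index.
import copy
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("base-model", torch_dtype=torch.bfloat16)

dup_index = 20                                 # which layer to duplicate (arbitrary choice)
layers = model.model.layers                    # nn.ModuleList holding the decoder layers
layers.insert(dup_index + 1, copy.deepcopy(layers[dup_index]))

model.config.num_hidden_layers = len(layers)   # keep the config in sync with the new depth
model.save_pretrained("single-layer-upscale")  # placeholder output path
```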