TheDrummer committed f6eabc1 (parent: 6f8a728)
Update README.md

README.md CHANGED
@@ -89,4 +89,10 @@ WIP
- The duplicated layers seem to be playing catch-up with the original layers. The change difference on the duplicated layers in Sample B shows that they are more responsive to training than the original layers. There are no indications that the duplicated layers are 'slowing down', given that the gradient pattern remains the same in Samples A, B, and especially C (where C can be considered the rate of change). A sketch of how such a per-layer change difference could be computed follows this list.
- In some layer types, the change difference for the first two layers is overwhelmingly large compared to the other layers, to the point that it dilutes the entire heatmap!
- The duplicated layers in all layer types (except one) are extra sensitive. `post_attention_layernorm` interestingly had some changes in the upscale's duplicated layers, unlike Cydonia, where the latter layers were completely unchanged.
- The duplicated layers in `o_proj` are less sensitive for some reason.
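To make the 'change difference' above concrete, here is a minimal sketch of comparing two checkpoints layer by layer. The checkpoint names are placeholders, and mean absolute weight delta is an assumed metric rather than the exact measurement behind the heatmaps.

```python
# Minimal sketch: per-layer "change difference" between two checkpoints.
# Assumptions: placeholder checkpoint names, mean absolute weight delta as the metric.
from transformers import AutoModelForCausalLM

before = AutoModelForCausalLM.from_pretrained("upscaled-base")      # placeholder
after = AutoModelForCausalLM.from_pretrained("upscaled-sample-b")   # placeholder

after_params = dict(after.named_parameters())
changes = {}  # layer type -> {layer index -> change}

for name, p_before in before.named_parameters():
    if ".layers." not in name:
        continue  # skip embeddings, final norm, lm_head
    # e.g. "model.layers.17.self_attn.o_proj.weight" -> layer 17, type "self_attn.o_proj"
    _, rest = name.split(".layers.", 1)
    layer_idx, layer_type = rest.split(".", 1)
    layer_type = layer_type.rsplit(".weight", 1)[0]
    delta = (after_params[name].float() - p_before.float()).abs().mean().item()
    changes.setdefault(layer_type, {})[int(layer_idx)] = delta

# One row per layer type, one column per layer index -- the heatmap layout.
for layer_type, per_layer in sorted(changes.items()):
    row = " ".join(f"{per_layer[i]:.1e}" for i in sorted(per_layer))
    print(f"{layer_type:30s} {row}")
```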

# Further Experiments

Given how the duplicated layers seem to have a stabilizing effect, it begs the question: What if we duplicate only ONE layer? What about five layers? (A rough sketch of a single-layer duplication is included after the questions below.)
- Will fewer empty layers dampen the stabilizing effect?
- Will the few empty layers get 'filled' quickly? Will the 600MB dataset be enough?
- Will there be a greater concentration of weight change in the duplicated layers?
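As a point of reference for the first question, here is a minimal sketch of what duplicating a single decoder layer could look like, assuming a Llama/Mistral-style stack exposed at `model.model.layers`. The model name, output path, and layer index are placeholders, and an actual upscale would more likely be produced with a merge tool than with this in-place edit.

```python
# Minimal sketch: duplicate ONE decoder layer in place.
# Assumptions: Llama/Mistral-style architecture, placeholder names, arbitrary layer index.
import copy
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("base-model", torch_dtype=torch.bfloat16)

dup_index = 20                                 # which layer to duplicate (arbitrary choice)
layers = model.model.layers                    # nn.ModuleList holding the decoder layers
layers.insert(dup_index + 1, copy.deepcopy(layers[dup_index]))

model.config.num_hidden_layers = len(layers)   # keep the config in sync with the new depth
model.save_pretrained("single-layer-upscale")  # placeholder output path
```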