- The duplicated layers on all layer types (except one) are extra sensitive. `post_attention_layernorm` interestingly had some changes in the upscale's duplicated layers, unlike Cydonia, where the latter layers were completely unchanged.
- The duplicated layers in `o_proj` are less sensitive for some reason.
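
One way to sanity-check these per-type sensitivities is to diff the finetune against the upscaled base and average the weight deltas by tensor type. A minimal sketch, assuming both checkpoints load with `transformers`; the repo names are placeholders, not the actual models:

```python
import torch
from collections import defaultdict
from transformers import AutoModelForCausalLM

# Placeholder repo names; substitute the actual upscaled base and its finetune.
base = AutoModelForCausalLM.from_pretrained("upscaled-base", torch_dtype=torch.float32)
tuned = AutoModelForCausalLM.from_pretrained("upscaled-finetune", torch_dtype=torch.float32)

# Mean absolute weight delta, grouped by tensor type
# (o_proj, post_attention_layernorm, ...).
deltas = defaultdict(list)
tuned_params = dict(tuned.named_parameters())
for name, p_base in base.named_parameters():
    kind = name.split(".")[-2] if "." in name else name  # 'o_proj' from '...o_proj.weight'
    deltas[kind].append((tuned_params[name].detach() - p_base.detach()).abs().mean().item())

for kind, vals in sorted(deltas.items()):
    print(f"{kind:>28}: {sum(vals) / len(vals):.3e}")
```
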
# [Eureka?] Top-to-Bottom Linear Gradient Observations

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/cQnZwfMF0ZBLc_aDOamK-.png)

- Take note of a few things:
  - Top layers = Ending layers (nearer to output)
  - Bottom layers = Starting layers (nearer to input)
- Training a non-upscaled model affects the top layers first and slowly descends to the bottom layers over time.
- Training an upscaled model with a slice of layers duplicated twice does two things:
  - The duplicated slices EACH have their own gradient.
  - There's a 'ceiling value' for each of these duplicated slices.
- Even when Tunguska's duplicated slices are nearly saturated, the resulting model remains coherent and even performant.
- Takeaway? Saturating these duplicated layers MIGHT be a good goal to pursue (see the saturation sketch below).
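
The 'ceiling'/saturation idea above can be made concrete by comparing each duplicated layer against its source twin inside the finetune: in the raw upscale every pair is identical (cosine similarity of exactly 1.0), and how far that similarity falls, and where it plateaus, is one reading of the ceiling. A sketch with a hypothetical slice mapping and a placeholder model name, assuming a Llama-style `model.layers.N` naming scheme:

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical (source layer, duplicated copy) pairs; the real mapping depends
# on how the upscale's passthrough slices were laid out.
DUPLICATE_PAIRS = [(8, 24), (9, 25), (10, 26)]

model = AutoModelForCausalLM.from_pretrained("upscaled-finetune", torch_dtype=torch.float32)
params = dict(model.named_parameters())

def layer_tensors(idx: int) -> dict:
    """Collect one decoder layer's tensors, keyed by their in-layer name."""
    prefix = f"model.layers.{idx}."
    return {n[len(prefix):]: p.detach() for n, p in params.items() if n.startswith(prefix)}

# Training pulls each copy away from its source; where the similarity stops
# falling is a candidate measure of the 'ceiling'.
for src, dup in DUPLICATE_PAIRS:
    a, b = layer_tensors(src), layer_tensors(dup)
    sims = [torch.nn.functional.cosine_similarity(a[k].flatten(), b[k].flatten(), dim=0).item()
            for k in a]
    print(f"layer {src} vs copy {dup}: mean cosine similarity {sum(sims) / len(sims):.4f}")
```
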
# Further Experimentation

Given how the duplicated layers seem to have a stabilizing effect, it raises a few questions:

### Can you replicate this effect on normal models by freezing layers?
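
A minimal sketch of that freezing setup, assuming a Llama/Mistral-style `model.model.layers` module list; the model name and frozen range are placeholders:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("some-base-model")  # placeholder name

# Freeze a contiguous slice of decoder layers (hypothetical range) so they
# receive no updates, mimicking the 'untouched' duplicated layers.
for idx in range(8, 16):
    for p in model.model.layers[idx].parameters():
        p.requires_grad_(False)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.1%}")
```

Any trainer that respects `requires_grad` (including the stock `transformers` `Trainer`) will then leave those layers exactly as the upscale left its duplicated slice.
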
### We've so far hypothesized that training 'slowly fills' the duplicated layers. If we intentionally undercook, will the duplicated layers look *underfilled*, or can you fill them up with a few steps? In other words, can a single/few updates to the model reconnect the duplicated layers?
- Are we really repairing the 'neurons' step-by-step, or have they been significantly rearranged by the first (few?) steps?
- Or maybe this is false, given the top-to-bottom gradient observation.
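
One hedged way to probe this: track a duplicated layer against its step-0 weights across saved checkpoints. A gradual similarity decay would favor 'slow filling'; a sharp drop within the first steps would favor early rearrangement. The checkpoint paths, layer index, and `model.layers` naming are all assumptions here:

```python
import torch
from transformers import AutoModelForCausalLM

CHECKPOINTS = ["ckpt-step-0", "ckpt-step-50", "ckpt-step-500"]  # placeholder paths
DUP_LAYER = 24  # assumed index inside the duplicated slice

def layer_vector(path: str) -> torch.Tensor:
    """Flatten one decoder layer's weights into a single vector."""
    m = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float32)
    return torch.cat([p.detach().flatten() for p in m.model.layers[DUP_LAYER].parameters()])

ref = layer_vector(CHECKPOINTS[0])
for ckpt in CHECKPOINTS[1:]:
    sim = torch.nn.functional.cosine_similarity(ref, layer_vector(ckpt), dim=0).item()
    print(f"{ckpt}: cosine similarity to step 0 = {sim:.6f}")
```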
|