TheDrummer committed
Commit 9d9c564 · verified · Parent: e1e3a7d

Update README.md

Files changed (1): README.md (+5 -5)
README.md CHANGED
@@ -71,20 +71,20 @@ slices:
  **Visualization Issue: Some layer types like `input_layernorm` look unchanged because the highest value (usually layer 0) diluted the entire heatmap**
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/xdH_7fy9HuhSzaSE2-h4X.png)

- ## Sample A: Tunguska 39B 1 Epoch vs. its base
+ ## Sample A: Tunguska 39B 1st Epoch vs. its base
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/X3-bHyQg03-QvZFvOhGp7.png)

- ## Sample B: Tunguska 39B 2 Epochs vs. its base
+ ## Sample B: Tunguska 39B 2nd Epoch vs. its base
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/-dRSeXmPXdE3_g67iKT0K.png)

- ## Sample C: Tunguska 39B 1 Epoch vs 2 Epochs
+ ## Sample C: Tunguska 39B 2nd Epoch vs. 1st Epoch
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/cjKf37TrSJHmq0S0_PZyE.png)

  # Glossary
  WIP

  # Impressions
- - Using the same LR on a larger model should have been more destructive. In reality, the duplicated layers must have saved the upscale from becoming dumber than Cydonia. The larger model seems to have preserved more of its smarts thanks to the 'empty, extra' duplicate layers.
+ - Using the same LR on a larger model should have been more destructive. In practice, the duplicated layers must have saved the upscale from becoming dumber than Cydonia. The larger model seems to have preserved more of its smarts thanks to the 'empty, extra' duplicate layers.
  - The upscale clearly has an effect on training. Controls A & B show a smooth gradient, unlike Samples A, B, and C, where the duplicated layers perturb the heatmap.
- - The duplicated layers seem to be playing catch-up with the original layers. The change difference on the duplicate layers in Sample B shows that they are more responsive to training than the original layers. There are no indications that the duplicate layers are 'slowing down', given that the gradient pattern remains the same in Samples A, B, and C (where C can be considered the rate of change).
+ - The duplicated layers seem to be playing catch-up with the original layers. The change difference on the duplicate layers in Sample B shows that they are more responsive to training than the original layers. There are no indications that the duplicate layers are 'slowing down', given that the gradient pattern remains the same in Samples A, B, and especially C (where C can be considered the rate of change).
  - In some layer types, the change differences for the first two layers are overwhelmingly large compared to the other layers, to the point that they diluted the entire heatmap!
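
The "change difference" heatmaps above come from comparing checkpoint weights layer by layer. The measurement script isn't part of this commit, so here is a minimal sketch of one way to compute such a per-layer, per-module difference; the file paths and the single-file safetensors layout are assumptions for illustration.

```python
# Hypothetical sketch: per-layer mean absolute weight change between a base
# checkpoint and a finetuned one (paths and layout are not from this commit).
from collections import defaultdict

from safetensors.torch import load_file

base = load_file("base/model.safetensors")    # hypothetical path
tuned = load_file("tuned/model.safetensors")  # hypothetical path

# change[module_type][layer_idx] = mean |tuned - base| over that weight tensor
change = defaultdict(dict)
for name, w in base.items():
    if "layers." not in name or name not in tuned:
        continue  # skip embeddings, final norm, lm_head, missing tensors
    layer = int(name.split("layers.")[1].split(".")[0])
    module = name.split(f"layers.{layer}.")[1].removesuffix(".weight")
    change[module][layer] = (tuned[name].float() - w.float()).abs().mean().item()

for module in sorted(change):
    row = change[module]
    print(module, [round(row[i], 6) for i in sorted(row)])
```

Sample C's "rate of change" reading corresponds to running the same comparison between the 1st- and 2nd-epoch checkpoints instead of against the base.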
 
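For the visualization issue flagged above, where layer 0 dilutes rows like `input_layernorm`, one common fix is to cap the color scale at a high percentile instead of the raw maximum. A sketch of that normalization, with placeholder data and an arbitrary percentile choice:

```python
# Sketch: clip the color scale at the 99th percentile so one outlier layer
# (e.g. layer 0) no longer washes out the rest of the heatmap.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
matrix = rng.random((8, 64))   # placeholder: module types x layer indices
matrix[:, 0] *= 50.0           # mimic the dominant layer-0 values

vmax = np.percentile(matrix, 99)  # robust upper bound instead of matrix.max()
plt.imshow(matrix, aspect="auto", cmap="viridis", vmax=vmax)
plt.xlabel("layer index")
plt.ylabel("module type")
plt.colorbar(label="mean |delta|, clipped at p99")
plt.show()
```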