# Usage
- Metharme format (Mistral works too, but untested); see the example template below
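
As a quick reference, this is roughly what the Metharme prompt template looks like (the `<|system|>`, `<|user|>`, and `<|model|>` tokens are the standard Metharme ones; the sample text and line breaks are illustrative only, not prescribed by this model card):

```
<|system|>Enter roleplay mode. You are Alice, a patient librarian.
<|user|>Hi! Can you recommend a mystery novel?
<|model|>
```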

---

# Upscaled Tuning Experiment Write-Up Thingy

My cute attempt at being a siyantis :3 uwu ~

```
...
34.5% 41.8% 23.7%
```

---

# How did the finetune go?

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/uo_GtNKPZ_KaWJCoAcB92.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/SmloL6rPXe9jJuyQNpHKG.png)

---

# Weight Difference Visualization
- Control A: Nemo x Rocinante
- Control B: Small x Cydonia
- Control C: Upscaled Nemo x Theia
- Sample A: 39B Upscale x Tunguska 1 Epoch
- Sample B: 39B Upscale x Tunguska 2 Epochs
- Sample C: Tunguska 1 Epoch x Tunguska 2 Epochs
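
For context, here is a minimal sketch of how per-layer difference heatmaps like the ones below can be produced. This is my own illustration, not the actual script behind these images; the repo ids are placeholders, and mean absolute per-tensor difference is one reasonable choice of metric among several:

```python
# Sketch: per-layer weight-difference heatmap between two checkpoints.
# The repo ids below are placeholders; substitute the pair being compared.
import re
from collections import defaultdict

import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("org/base-model", torch_dtype=torch.float32)
tuned = AutoModelForCausalLM.from_pretrained("org/tuned-model", torch_dtype=torch.float32)

# diffs[layer_type][layer_index] = mean |delta| for that weight tensor
diffs = defaultdict(dict)
tuned_params = dict(tuned.named_parameters())
for name, p_base in base.named_parameters():
    # Matches e.g. "model.layers.12.self_attn.q_proj.weight"
    m = re.match(r"model\.layers\.(\d+)\.(.+)\.weight", name)
    if m is None:
        continue  # skip embeddings, final norm, lm_head
    idx, ltype = int(m.group(1)), m.group(2)
    diffs[ltype][idx] = (tuned_params[name] - p_base).abs().mean().item()

# One heatmap row per layer type, one column per layer index.
ltypes = sorted(diffs)
n_layers = 1 + max(max(d) for d in diffs.values())
grid = torch.zeros(len(ltypes), n_layers)
for r, lt in enumerate(ltypes):
    for c, v in diffs[lt].items():
        grid[r, c] = v

plt.imshow(grid.numpy(), aspect="auto")
plt.yticks(range(len(ltypes)), ltypes)
plt.xlabel("layer index")
plt.colorbar(label="mean |Δweight|")
plt.tight_layout()
plt.show()
```

(For 22B/39B checkpoints this is memory-hungry; streaming tensors shard by shard with `safetensors.safe_open` avoids loading both full models at once.)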

**Note the layer sequence and other labels here, since they will be unreadable on the 39B heatmaps.**

**Visualization issue: some layer types like `input_layernorm` look unchanged because the highest value (usually layer 0) diluted the entire heatmap.**

## Control A (Nemo 12B & Rocinante 12B, similar training)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/EZN8Ci2_vAGmdq0WUyrpN.png)

## Control B (Small 22B & Cydonia 22B, similar training)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/xdH_7fy9HuhSzaSE2-h4X.png)

## Control C (Upscaled Nemo 21B & Theia 21B, similar training)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/RTvz5g8_fd5g8ZMLmawlv.png)

## Sample A: Tunguska 39B 1st Epoch vs. its 39B base
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f2fd1c25b848bd061b5c2e/X3-bHyQg03-QvZFvOhGp7.png)

# Impressions
- Using the same LR on a larger model should have been more destructive. In practice, the duplicated layers must have saved the upscale from becoming dumber than Cydonia: the larger model seems to have preserved more of its smarts thanks to the 'empty, extra' duplicate layers.
- The upscale clearly has an effect on training. Controls A & B show a smooth gradient, unlike Control C and Samples A, B, and C, where the duplicated layers perturb the heatmap.
- The duplicated layers seem to be playing catch-up with the original layers. The change difference in the duplicate layers of Sample B shows that they are more responsive to training than the original layers. There are no indications that the duplicate layers are 'slowing down', given that the gradient pattern remains the same in Samples A, B, and especially C (where C can be read as the rate of change).
- In some layer types, the change difference for the first two layers is overwhelmingly large compared to the rest; so much so that it diluted the entire heatmap! A log-scaled colormap would help here; see the sketch after this list.
- The duplicated layers in all layer types (except one) are extra sensitive. Interestingly, `post_attention_layernorm` had some changes in the upscale's duplicated layers, unlike Cydonia, where the latter layers were completely unchanged.
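
As a possible fix for the dilution issue (a hypothetical tweak to the plotting step, not something applied to the images above), a log-scaled colormap stops the giant early-layer values from washing out the rest:

```python
# Sketch: log-scale the colormap so layer-0 outliers no longer dominate.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

grid = np.load("diff_grid.npy")      # hypothetical saved |Δweight| grid
floor = grid[grid > 0].min()         # LogNorm cannot take zeros
plt.imshow(np.clip(grid, floor, None), aspect="auto",
           norm=LogNorm(vmin=floor, vmax=grid.max()))
plt.xlabel("layer index")
plt.colorbar(label="mean |Δweight| (log scale)")
plt.show()
```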