- Takeaways
  - These slices of layers are more connected to each other than to the rest of the model.
  - [Question] Does this mean that the **original layer** before the slice is the one holding the whole duplicated slice together?
  - [Question] What if we interleave original and duplicate layers? Will that result in a more balanced, responsive upscale? (See Proposed Upscale Technique at the bottom.)
  - Saturating these duplicated layers MIGHT be a good goal to pursue; a rough way to measure saturation is sketched below.
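
A minimal sketch of one way to quantify "saturation": compare each duplicated layer's weights against the original layer it was copied from after finetuning. The checkpoint path and (original, duplicate) index pairs below are placeholder assumptions, not values from our runs:

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical checkpoint path; substitute a real finetuned upscale.
model = AutoModelForCausalLM.from_pretrained(
    "./upscaled-finetuned", torch_dtype=torch.bfloat16
)
layers = model.model.layers

def cos_sim(a: torch.Tensor, b: torch.Tensor) -> float:
    # Flatten weights to vectors and compare in fp32 for numerical stability.
    return torch.nn.functional.cosine_similarity(
        a.flatten().float(), b.flatten().float(), dim=0
    ).item()

# Assumed (original, duplicate) index pairs in the merged stack.
pairs = [(19, 20), (19, 21)]
for orig, dupe in pairs:
    p_orig = dict(layers[orig].named_parameters())
    p_dupe = dict(layers[dupe].named_parameters())
    for name in ("self_attn.o_proj.weight", "mlp.down_proj.weight"):
        # ~1.0 means the duplicate is still a near-copy of its source;
        # lower means it has diverged, i.e. it is getting "filled".
        print(f"L{orig} vs L{dupe} {name}: {cos_sim(p_orig[name], p_dupe[name]):.4f}")
```
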
# Further Experimentation

Given how the duplicated layers seem to have a stabilizing effect, it begs the question:

### We've so far hypothesized that training 'slowly fills' the duplicated layers. If we intentionally undercook, will the duplicated layers look *underfilled*, or can we fill them up with a few steps? In other words, can a single update (or a few) to the model reconnect the duplicated layers?
- Are we really repairing the 'neurons' step-by-step, or have they been significantly rearranged by the first (few?) steps? A checkpoint-by-checkpoint probe is sketched below.
  - Or maybe this is false, given the top-bottom gradient observation.
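
A minimal sketch of how we might test this, assuming checkpoints saved at a handful of early steps. Every path, step count, and duplicate position here is an assumption for illustration:

```python
import torch
from transformers import AutoModelForCausalLM

def flat_layer(model, i):
    # All weights of decoder layer i, flattened into one vector.
    return torch.cat(
        [p.detach().flatten().float() for p in model.model.layers[i].parameters()]
    )

# Hypothetical paths and duplicate positions; adjust to the actual run.
base = AutoModelForCausalLM.from_pretrained("./merged-step0", torch_dtype=torch.bfloat16)
dupe_indices = [20, 21]
base_vecs = {i: flat_layer(base, i) for i in dupe_indices}

for ckpt in ("./ckpt-step1", "./ckpt-step10", "./ckpt-step100"):
    step = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16)
    for i in dupe_indices:
        v = flat_layer(step, i)
        # Relative drift of the duplicated layer from its merged (step-0) state.
        drift = ((v - base_vecs[i]).norm() / base_vecs[i].norm()).item()
        print(f"{ckpt} L{i}: drift={drift:.4f}")
```

If most of the drift lands in the very first checkpoint, that would favor the 'rearranged early' reading over gradual filling.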

# Proposed Upscale Technique
```yaml
merge_method: passthrough
slices:
  # Layers 0-18 untouched (NOTE: mergekit layer_range is end-exclusive).
  - sources:
      - layer_range: [0, 19]
        model: unsloth/Mistral-Small-Instruct-2409
  # Original L19
  - sources:
      - layer_range: [19, 20]
        model: unsloth/Mistral-Small-Instruct-2409
  # Dupe A of L19
  - sources:
      - layer_range: [19, 20]
        model: unsloth/Mistral-Small-Instruct-2409
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0
  # Dupe B of L19
  - sources:
      - layer_range: [19, 20]
        model: unsloth/Mistral-Small-Instruct-2409
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0
  # Original L20
  - sources:
      - layer_range: [20, 21]
        model: unsloth/Mistral-Small-Instruct-2409
  # Dupe A of L20
  - sources:
      - layer_range: [20, 21]
        model: unsloth/Mistral-Small-Instruct-2409
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0
  # Dupe B of L20
  - sources:
      - layer_range: [20, 21]
        model: unsloth/Mistral-Small-Instruct-2409
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0
  # ... REPEAT UNTIL 41
  # Original tail: layers 41 through the end of the model.
  - sources:
      - layer_range: [41, 56]
        model: unsloth/Mistral-Small-Instruct-2409
```
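
Zeroing `o_proj` and `down_proj` makes each duplicated layer's attention and MLP blocks contribute nothing to the residual stream at first, so every dupe starts out as an identity layer and only becomes active as training fills those projections back in. Assuming mergekit is installed, a config like this would be run with something like `mergekit-yaml proposed-upscale.yaml ./out-model` (the file and output names here are placeholders).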

```
O = original
X = duplicate

Previous Technique
OOOOOOOOOOOOOOOOOOXXXXXXXXXXXXXXXXXXXOOOOOOOOOO
Proposed Technique
OOOOOOOOOOOXXOXXOXXOXXOXXOXXOXXOXXOXXOOOOOOOOOO
```