TheDrummer committed 8e48aac (verified) · 1 parent: f98632f

Update README.md

Files changed (1): README.md (+78 -2)
README.md CHANGED
@@ -124,7 +124,7 @@ WIP
  - Takeaways
  - These slices of layers are more connected to each other than to the model's entirety.
  - [Question] Does this mean that the **original layer** before the slice is the one holding that whole duplicated slice together?
- - [Question] What if we interleave original and duplicate layers? Will that result in a more balanced, responsive upscale?
+ - [Question] What if we interleave original and duplicate layers? Will that result in a more balanced, responsive upscale? (See Proposed Upscale Technique at the bottom)
  - Saturating these duplicated layers MIGHT be a good goal to pursue.
 
  # Further Experimentation
@@ -145,4 +145,80 @@ Given how the duplicated layers seem to have a stabilizing effect, it begs the q
 
  ### We've so far hypothesized that training 'slowly fills' the duplicated layers. If we intentionally undercook, will the duplicated layers look *underfilled*, or can we fill them up with a few steps? In other words, can a single update (or just a few) reconnect the duplicated layers?
  - Are we really repairing the 'neurons' step-by-step, or have they been significantly rearranged by the first (few?) steps?
- - Or maybe this is false given the top-bottom gradient observation.
+ - Or maybe this is false given the top-bottom gradient observation.
+
+ # Proposed Upscale Technique
+ ```yaml
+ merge_method: passthrough
+ # NOTE: mergekit's layer_range is end-exclusive, so [19, 20] selects layer 19 only.
+ slices:
+   - sources:
+       - layer_range: [0, 19] # layers 0-18 untouched
+         model: unsloth/Mistral-Small-Instruct-2409
+   # Original L19
+   - sources:
+       - layer_range: [19, 20]
+         model: unsloth/Mistral-Small-Instruct-2409
+   # Dupe A of L19
+   - sources:
+       - layer_range: [19, 20]
+         model: unsloth/Mistral-Small-Instruct-2409
+         parameters:
+           scale:
+             - filter: o_proj
+               value: 0.0
+             - filter: down_proj
+               value: 0.0
+             - value: 1.0
+   # Dupe B of L19
+   - sources:
+       - layer_range: [19, 20]
+         model: unsloth/Mistral-Small-Instruct-2409
+         parameters:
+           scale:
+             - filter: o_proj
+               value: 0.0
+             - filter: down_proj
+               value: 0.0
+             - value: 1.0
+   # Original L20
+   - sources:
+       - layer_range: [20, 21]
+         model: unsloth/Mistral-Small-Instruct-2409
+   # Dupe A of L20
+   - sources:
+       - layer_range: [20, 21]
+         model: unsloth/Mistral-Small-Instruct-2409
+         parameters:
+           scale:
+             - filter: o_proj
+               value: 0.0
+             - filter: down_proj
+               value: 0.0
+             - value: 1.0
+   # Dupe B of L20
+   - sources:
+       - layer_range: [20, 21]
+         model: unsloth/Mistral-Small-Instruct-2409
+         parameters:
+           scale:
+             - filter: o_proj
+               value: 0.0
+             - filter: down_proj
+               value: 0.0
+             - value: 1.0
+   # ... REPEAT UNTIL 41
+   - sources:
+       - layer_range: [41, 56] # layers 41-55 untouched (56-layer model)
+         model: unsloth/Mistral-Small-Instruct-2409
+ ```
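+
+ Why zero out `o_proj` and `down_proj` specifically? Both halves of a decoder layer write back into the residual stream through exactly those two projections, so scaling them to 0.0 turns each duplicate into an identity function at initialization. A minimal PyTorch sketch with a toy pre-norm layer (not Mistral's exact module, just the same write paths) to illustrate:
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class ToyDecoderLayer(nn.Module):
+     """Toy pre-norm decoder layer: x + Attn(x) + MLP(x)."""
+     def __init__(self, dim: int = 64):
+         super().__init__()
+         self.attn_norm = nn.LayerNorm(dim)
+         self.mlp_norm = nn.LayerNorm(dim)
+         self.q_proj = nn.Linear(dim, dim, bias=False)
+         self.k_proj = nn.Linear(dim, dim, bias=False)
+         self.v_proj = nn.Linear(dim, dim, bias=False)
+         self.o_proj = nn.Linear(dim, dim, bias=False)         # attention's only write path
+         self.up_proj = nn.Linear(dim, 4 * dim, bias=False)
+         self.down_proj = nn.Linear(4 * dim, dim, bias=False)  # MLP's only write path
+
+     def forward(self, x):
+         h = self.attn_norm(x)
+         a = torch.softmax(self.q_proj(h) @ self.k_proj(h).transpose(-2, -1)
+                           / h.shape[-1] ** 0.5, dim=-1) @ self.v_proj(h)
+         x = x + self.o_proj(a)    # zeroed o_proj => this branch adds nothing
+         x = x + self.down_proj(torch.relu(self.up_proj(self.mlp_norm(x))))  # likewise
+         return x
+
+ layer = ToyDecoderLayer()
+ with torch.no_grad():             # emulate `scale: 0.0` on the two filters
+     layer.o_proj.weight.zero_()
+     layer.down_proj.weight.zero_()
+
+ x = torch.randn(2, 5, 64)
+ assert torch.equal(layer(x), x)   # the duplicate passes the residual stream through untouched
+ ```
+
+ Gradients still reach the zeroed projections (their inputs are nonzero), so training can 'fill' the duplicates from a clean no-op starting point rather than from a disruptive verbatim copy.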
+
+ ```
+ O = original
+ X = duplicate
+
+ Previous Technique
+ OOOOOOOOOOOXXOXXOXXOXXOXXOXXOXXOXXOXXOOOOOOOOOO
+ OOOOOOOOOOOOOOOOOOXXXXXXXXXXXXXXXXXXXOOOOOOOOOO
+ Proposed Technique
+ OOOOOOOOOOOXXOXXOXXOXXOXXOXXOXXOXXOXXOOOOOOOOOO
+ ```
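+
+ The `# ... REPEAT UNTIL 41` stanza expands to 22 near-identical blocks (layers 19 through 40), which is tedious and error-prone to write by hand. A throwaway generator along these lines (a hypothetical helper, not part of mergekit) can emit the full config:
+
+ ```python
+ # Prints the full passthrough config sketched above, one stanza per slice.
+ MODEL = "unsloth/Mistral-Small-Instruct-2409"  # 56 decoder layers
+
+ def plain(start: int, end: int) -> str:
+     # end-exclusive, matching mergekit's layer_range semantics
+     return (f"  - sources:\n"
+             f"      - layer_range: [{start}, {end}]\n"
+             f"        model: {MODEL}\n")
+
+ def zeroed_dupe(layer: int) -> str:
+     # duplicate of a single layer with o_proj/down_proj scaled to zero
+     return (f"  - sources:\n"
+             f"      - layer_range: [{layer}, {layer + 1}]\n"
+             f"        model: {MODEL}\n"
+             f"        parameters:\n"
+             f"          scale:\n"
+             f"            - filter: o_proj\n"
+             f"              value: 0.0\n"
+             f"            - filter: down_proj\n"
+             f"              value: 0.0\n"
+             f"            - value: 1.0\n")
+
+ parts = ["merge_method: passthrough\nslices:\n", plain(0, 19)]
+ for layer in range(19, 41):        # triple layers 19..40: original + two no-op dupes
+     parts.append(f"  # Original L{layer}\n" + plain(layer, layer + 1))
+     parts.append(f"  # Dupe A of L{layer}\n" + zeroed_dupe(layer))
+     parts.append(f"  # Dupe B of L{layer}\n" + zeroed_dupe(layer))
+ parts.append(plain(41, 56))        # layers 41..55 unchanged
+ print("".join(parts))
+ ```
+
+ Save the output as the config file and feed it to `mergekit-yaml` as usual.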