davidxmle committed on
Commit 823d117
1 Parent(s): f379686

Update README.md

  1. README.md +295 -4
README.md CHANGED
@@ -39,19 +39,19 @@ datasets:
39
  # Llama-3-8B-Special-Tokens-Adjusted
40
  - Original Model creator: [Meta](https://huggingface.co/meta-llama)
41
  - Original model: [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B)
 
42
  - Built with Meta Llama 3
43
  - Created by [David Xue](https://www.linkedin.com/in/david-xue-uva/) from [Astronomer](https://astronomer.io)
44
 
45
  ## Description
46
  - This is the same model as [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B), except that the input embedding matrix and the output embeddings (the LM head) have been adjusted for certain tokens that were left untrained. Those untrained tokens caused widespread issues for people fine-tuning this base model, whether they added their own tokens or used the existing special tokens.
47
 
48
- - ## Why We Made This Model
49
 
50
  The Llama 3 base (non-instruct) model, while powerful, shipped with a significant oversight: some of the special tokens intended for instruction following were left untrained, which can derail further fine-tuning. This was first noted by Daniel Han on X, highlighting a critical but fixable flaw in a widely used model.
51
 
52
- The primary goal of releasing a patched version of this model was to address this issue so that all users can utilize the model without facing training instabilities, such as sudden gradient explosions or `NaN` gradients, or having to fix the model themselves before fine-tuning.
53
 
54
- We aim to ensure that the machine learning community could continue using this model without needing to perform repetitive manual checks or corrections, thus speeding up the research and application deployment processes.
55
 
56
  ## Details of the Adjustment
57
 
@@ -59,7 +59,298 @@ The [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-
59
 
60
  The untrained tokens can be found by locating the rows of the embedding matrix whose values are all zero, which implies they were not trained during Meta's pretraining of the model. Such untrained tokens can cause serious numerical issues, such as gradient explosions or `NaN` gradients, during downstream fine-tuning on specific tasks.
61
 
62
- See here for a list of the tokens we found that has fit the "untrained" profile described:
63
 
64
  Once these untrained tokens are identified, the average embedding of the trained tokens can be calculated by summing the embedding values of the trained tokens for each feature (column) and dividing by the number of trained tokens. This is done for both the input and output matrices.
65
 
 
39
  # Llama-3-8B-Special-Tokens-Adjusted
40
  - Original Model creator: [Meta](https://huggingface.co/meta-llama)
41
  - Original model: [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B)
42
+ - Use of this model must comply with the [Llama 3 Community License](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE).
43
  - Built with Meta Llama 3
44
  - Created by [David Xue](https://www.linkedin.com/in/david-xue-uva/) from [Astronomer](https://astronomer.io)
45
 
46
  ## Description
47
  - This is the same model as [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B), except that the input embedding matrix and the output embeddings (the LM head) have been adjusted for certain tokens that were left untrained. Those untrained tokens caused widespread issues for people fine-tuning this base model, whether they added their own tokens or used the existing special tokens.
48
 
49
+ ## Why We Made This Model
50
 
51
  The Llama 3 base (non-instruct) model, while powerful, shipped with a significant oversight: some of the special tokens intended for instruction following were left untrained, which can derail further fine-tuning. This was first noted by Daniel Han on X, highlighting a critical but fixable flaw in a widely used model.
52
 
53
+ The primary goal of releasing a patched version of this model is to address this issue so that the community can use the Llama 3 model without running into training instabilities, such as sudden gradient explosions or `NaN` gradients, and without having to go through a complicated process to fix the model themselves before fine-tuning.
54
 
 
55
 
56
  ## Details of the Adjustment
57
 
 
59
 
60
  The untrained tokens can be found by locating the rows of the embedding matrix whose values are all zero, which implies they were not trained during Meta's pretraining of the model. Such untrained tokens can cause serious numerical issues, such as gradient explosions or `NaN` gradients, during downstream fine-tuning on specific tasks.
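
The all-zero-row check described above can be sketched as follows. This is a minimal illustration on a made-up NumPy matrix, not the script used to produce this model; the function name `find_untrained_rows` and the toy values are assumptions introduced here for demonstration only.

```python
import numpy as np


def find_untrained_rows(embeddings: np.ndarray) -> np.ndarray:
    """Return the indices of rows whose values are all exactly zero.

    Hypothetical helper: `embeddings` is a (vocab_size, hidden_size) matrix.
    """
    is_zero_row = np.all(embeddings == 0, axis=1)
    return np.flatnonzero(is_zero_row)


# Toy 5-token, 3-dimensional embedding matrix (made-up values).
toy = np.array([
    [0.1, -0.2, 0.3],
    [0.0, 0.0, 0.0],   # untrained: every value is zero
    [0.5, 0.0, -0.1],  # trained: a single zero entry does not count
    [0.0, 0.0, 0.0],   # untrained
    [-0.4, 0.2, 0.0],
])

untrained = find_untrained_rows(toy)
print(untrained)  # [1 3]
```

On a real checkpoint, the returned indices could then be mapped back to token strings with the tokenizer to obtain a list like the one below.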
61
 
62
+
63
+ <details>
64
+ <summary>See here for a list of the tokens we found that fit the "untrained" profile described above:</summary>
65
+ ['À',
66
+ 'Á',
67
+ 'õ',
68
+ 'ö',
69
+ '÷',
70
+ 'ø',
71
+ 'ù',
72
+ 'ú',
73
+ 'û',
74
+ 'ü',
75
+ 'ý',
76
+ 'þ',
77
+ 'ÿ',
78
+ '">ččĊ',
79
+ ';čččĊ',
80
+ 'ĉTokenNameIdentifier',
81
+ 'ĠForCanBeConverted',
82
+ 'ĠForCanBeConvertedToF',
83
+ 'PostalCodesNL',
84
+ '$PostalCodesNL',
85
+ 'useRalative',
86
+ 'Û±Û',
87
+ 'аÑĢакÑĤ',
88
+ 'аÑĤиÑģÑı',
89
+ 'иÑĤиÑģÑı',
90
+ 'еÑĢиÑģÑĤи',
91
+ 'ávajÃŃcÃŃ',
92
+ 'илакÑĤи',
93
+ 'илаÑģÑı',
94
+ 'ÑĭÑŁN',
95
+ 'ÐİÑĭÑŁN',
96
+ 'ÐİÑĭÑŁNÐİÑĭÑŁN',
97
+ 'ıldıģında',
98
+ '<|reserved_special_token_0|>',
99
+ '<|reserved_special_token_1|>',
100
+ '<|reserved_special_token_2|>',
101
+ '<|reserved_special_token_3|>',
102
+ '<|start_header_id|>',
103
+ '<|end_header_id|>',
104
+ '<|reserved_special_token_4|>',
105
+ '<|eot_id|>',
106
+ '<|reserved_special_token_5|>',
107
+ '<|reserved_special_token_6|>',
108
+ '<|reserved_special_token_7|>',
109
+ '<|reserved_special_token_8|>',
110
+ '<|reserved_special_token_9|>',
111
+ '<|reserved_special_token_10|>',
112
+ '<|reserved_special_token_11|>',
113
+ '<|reserved_special_token_12|>',
114
+ '<|reserved_special_token_13|>',
115
+ '<|reserved_special_token_14|>',
116
+ '<|reserved_special_token_15|>',
117
+ '<|reserved_special_token_16|>',
118
+ '<|reserved_special_token_17|>',
119
+ '<|reserved_special_token_18|>',
120
+ '<|reserved_special_token_19|>',
121
+ '<|reserved_special_token_20|>',
122
+ '<|reserved_special_token_21|>',
123
+ '<|reserved_special_token_22|>',
124
+ '<|reserved_special_token_23|>',
125
+ '<|reserved_special_token_24|>',
126
+ '<|reserved_special_token_25|>',
127
+ '<|reserved_special_token_26|>',
128
+ '<|reserved_special_token_27|>',
129
+ '<|reserved_special_token_28|>',
130
+ '<|reserved_special_token_29|>',
131
+ '<|reserved_special_token_30|>',
132
+ '<|reserved_special_token_31|>',
133
+ '<|reserved_special_token_32|>',
134
+ '<|reserved_special_token_33|>',
135
+ '<|reserved_special_token_34|>',
136
+ '<|reserved_special_token_35|>',
137
+ '<|reserved_special_token_36|>',
138
+ '<|reserved_special_token_37|>',
139
+ '<|reserved_special_token_38|>',
140
+ '<|reserved_special_token_39|>',
141
+ '<|reserved_special_token_40|>',
142
+ '<|reserved_special_token_41|>',
143
+ '<|reserved_special_token_42|>',
144
+ '<|reserved_special_token_43|>',
145
+ '<|reserved_special_token_44|>',
146
+ '<|reserved_special_token_45|>',
147
+ '<|reserved_special_token_46|>',
148
+ '<|reserved_special_token_47|>',
149
+ '<|reserved_special_token_48|>',
150
+ '<|reserved_special_token_49|>',
151
+ '<|reserved_special_token_50|>',
152
+ '<|reserved_special_token_51|>',
153
+ '<|reserved_special_token_52|>',
154
+ '<|reserved_special_token_53|>',
155
+ '<|reserved_special_token_54|>',
156
+ '<|reserved_special_token_55|>',
157
+ '<|reserved_special_token_56|>',
158
+ '<|reserved_special_token_57|>',
159
+ '<|reserved_special_token_58|>',
160
+ '<|reserved_special_token_59|>',
161
+ '<|reserved_special_token_60|>',
162
+ '<|reserved_special_token_61|>',
163
+ '<|reserved_special_token_62|>',
164
+ '<|reserved_special_token_63|>',
165
+ '<|reserved_special_token_64|>',
166
+ '<|reserved_special_token_65|>',
167
+ '<|reserved_special_token_66|>',
168
+ '<|reserved_special_token_67|>',
169
+ '<|reserved_special_token_68|>',
170
+ '<|reserved_special_token_69|>',
171
+ '<|reserved_special_token_70|>',
172
+ '<|reserved_special_token_71|>',
173
+ '<|reserved_special_token_72|>',
174
+ '<|reserved_special_token_73|>',
175
+ '<|reserved_special_token_74|>',
176
+ '<|reserved_special_token_75|>',
177
+ '<|reserved_special_token_76|>',
178
+ '<|reserved_special_token_77|>',
179
+ '<|reserved_special_token_78|>',
180
+ '<|reserved_special_token_79|>',
181
+ '<|reserved_special_token_80|>',
182
+ '<|reserved_special_token_81|>',
183
+ '<|reserved_special_token_82|>',
184
+ '<|reserved_special_token_83|>',
185
+ '<|reserved_special_token_84|>',
186
+ '<|reserved_special_token_85|>',
187
+ '<|reserved_special_token_86|>',
188
+ '<|reserved_special_token_87|>',
189
+ '<|reserved_special_token_88|>',
190
+ '<|reserved_special_token_89|>',
191
+ '<|reserved_special_token_90|>',
192
+ '<|reserved_special_token_91|>',
193
+ '<|reserved_special_token_92|>',
194
+ '<|reserved_special_token_93|>',
195
+ '<|reserved_special_token_94|>',
196
+ '<|reserved_special_token_95|>',
197
+ '<|reserved_special_token_96|>',
198
+ '<|reserved_special_token_97|>',
199
+ '<|reserved_special_token_98|>',
200
+ '<|reserved_special_token_99|>',
201
+ '<|reserved_special_token_100|>',
202
+ '<|reserved_special_token_101|>',
203
+ '<|reserved_special_token_102|>',
204
+ '<|reserved_special_token_103|>',
205
+ '<|reserved_special_token_104|>',
206
+ '<|reserved_special_token_105|>',
207
+ '<|reserved_special_token_106|>',
208
+ '<|reserved_special_token_107|>',
209
+ '<|reserved_special_token_108|>',
210
+ '<|reserved_special_token_109|>',
211
+ '<|reserved_special_token_110|>',
212
+ '<|reserved_special_token_111|>',
213
+ '<|reserved_special_token_112|>',
214
+ '<|reserved_special_token_113|>',
215
+ '<|reserved_special_token_114|>',
216
+ '<|reserved_special_token_115|>',
217
+ '<|reserved_special_token_116|>',
218
+ '<|reserved_special_token_117|>',
219
+ '<|reserved_special_token_118|>',
220
+ '<|reserved_special_token_119|>',
221
+ '<|reserved_special_token_120|>',
222
+ '<|reserved_special_token_121|>',
223
+ '<|reserved_special_token_122|>',
224
+ '<|reserved_special_token_123|>',
225
+ '<|reserved_special_token_124|>',
226
+ '<|reserved_special_token_125|>',
227
+ '<|reserved_special_token_126|>',
228
+ '<|reserved_special_token_127|>',
229
+ '<|reserved_special_token_128|>',
230
+ '<|reserved_special_token_129|>',
231
+ '<|reserved_special_token_130|>',
232
+ '<|reserved_special_token_131|>',
233
+ '<|reserved_special_token_132|>',
234
+ '<|reserved_special_token_133|>',
235
+ '<|reserved_special_token_134|>',
236
+ '<|reserved_special_token_135|>',
237
+ '<|reserved_special_token_136|>',
238
+ '<|reserved_special_token_137|>',
239
+ '<|reserved_special_token_138|>',
240
+ '<|reserved_special_token_139|>',
241
+ '<|reserved_special_token_140|>',
242
+ '<|reserved_special_token_141|>',
243
+ '<|reserved_special_token_142|>',
244
+ '<|reserved_special_token_143|>',
245
+ '<|reserved_special_token_144|>',
246
+ '<|reserved_special_token_145|>',
247
+ '<|reserved_special_token_146|>',
248
+ '<|reserved_special_token_147|>',
249
+ '<|reserved_special_token_148|>',
250
+ '<|reserved_special_token_149|>',
251
+ '<|reserved_special_token_150|>',
252
+ '<|reserved_special_token_151|>',
253
+ '<|reserved_special_token_152|>',
254
+ '<|reserved_special_token_153|>',
255
+ '<|reserved_special_token_154|>',
256
+ '<|reserved_special_token_155|>',
257
+ '<|reserved_special_token_156|>',
258
+ '<|reserved_special_token_157|>',
259
+ '<|reserved_special_token_158|>',
260
+ '<|reserved_special_token_159|>',
261
+ '<|reserved_special_token_160|>',
262
+ '<|reserved_special_token_161|>',
263
+ '<|reserved_special_token_162|>',
264
+ '<|reserved_special_token_163|>',
265
+ '<|reserved_special_token_164|>',
266
+ '<|reserved_special_token_165|>',
267
+ '<|reserved_special_token_166|>',
268
+ '<|reserved_special_token_167|>',
269
+ '<|reserved_special_token_168|>',
270
+ '<|reserved_special_token_169|>',
271
+ '<|reserved_special_token_170|>',
272
+ '<|reserved_special_token_171|>',
273
+ '<|reserved_special_token_172|>',
274
+ '<|reserved_special_token_173|>',
275
+ '<|reserved_special_token_174|>',
276
+ '<|reserved_special_token_175|>',
277
+ '<|reserved_special_token_176|>',
278
+ '<|reserved_special_token_177|>',
279
+ '<|reserved_special_token_178|>',
280
+ '<|reserved_special_token_179|>',
281
+ '<|reserved_special_token_180|>',
282
+ '<|reserved_special_token_181|>',
283
+ '<|reserved_special_token_182|>',
284
+ '<|reserved_special_token_183|>',
285
+ '<|reserved_special_token_184|>',
286
+ '<|reserved_special_token_185|>',
287
+ '<|reserved_special_token_186|>',
288
+ '<|reserved_special_token_187|>',
289
+ '<|reserved_special_token_188|>',
290
+ '<|reserved_special_token_189|>',
291
+ '<|reserved_special_token_190|>',
292
+ '<|reserved_special_token_191|>',
293
+ '<|reserved_special_token_192|>',
294
+ '<|reserved_special_token_193|>',
295
+ '<|reserved_special_token_194|>',
296
+ '<|reserved_special_token_195|>',
297
+ '<|reserved_special_token_196|>',
298
+ '<|reserved_special_token_197|>',
299
+ '<|reserved_special_token_198|>',
300
+ '<|reserved_special_token_199|>',
301
+ '<|reserved_special_token_200|>',
302
+ '<|reserved_special_token_201|>',
303
+ '<|reserved_special_token_202|>',
304
+ '<|reserved_special_token_203|>',
305
+ '<|reserved_special_token_204|>',
306
+ '<|reserved_special_token_205|>',
307
+ '<|reserved_special_token_206|>',
308
+ '<|reserved_special_token_207|>',
309
+ '<|reserved_special_token_208|>',
310
+ '<|reserved_special_token_209|>',
311
+ '<|reserved_special_token_210|>',
312
+ '<|reserved_special_token_211|>',
313
+ '<|reserved_special_token_212|>',
314
+ '<|reserved_special_token_213|>',
315
+ '<|reserved_special_token_214|>',
316
+ '<|reserved_special_token_215|>',
317
+ '<|reserved_special_token_216|>',
318
+ '<|reserved_special_token_217|>',
319
+ '<|reserved_special_token_218|>',
320
+ '<|reserved_special_token_219|>',
321
+ '<|reserved_special_token_220|>',
322
+ '<|reserved_special_token_221|>',
323
+ '<|reserved_special_token_222|>',
324
+ '<|reserved_special_token_223|>',
325
+ '<|reserved_special_token_224|>',
326
+ '<|reserved_special_token_225|>',
327
+ '<|reserved_special_token_226|>',
328
+ '<|reserved_special_token_227|>',
329
+ '<|reserved_special_token_228|>',
330
+ '<|reserved_special_token_229|>',
331
+ '<|reserved_special_token_230|>',
332
+ '<|reserved_special_token_231|>',
333
+ '<|reserved_special_token_232|>',
334
+ '<|reserved_special_token_233|>',
335
+ '<|reserved_special_token_234|>',
336
+ '<|reserved_special_token_235|>',
337
+ '<|reserved_special_token_236|>',
338
+ '<|reserved_special_token_237|>',
339
+ '<|reserved_special_token_238|>',
340
+ '<|reserved_special_token_239|>',
341
+ '<|reserved_special_token_240|>',
342
+ '<|reserved_special_token_241|>',
343
+ '<|reserved_special_token_242|>',
344
+ '<|reserved_special_token_243|>',
345
+ '<|reserved_special_token_244|>',
346
+ '<|reserved_special_token_245|>',
347
+ '<|reserved_special_token_246|>',
348
+ '<|reserved_special_token_247|>',
349
+ '<|reserved_special_token_248|>',
350
+ '<|reserved_special_token_249|>',
351
+ '<|reserved_special_token_250|>']
352
+ </details>
353
+
354
 
355
  Once these untrained tokens are identified, the average embedding of the trained tokens can be calculated by summing the embedding values of the trained tokens for each feature (column) and dividing by the number of trained tokens. This is done for both the input and output matrices.
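
The averaging step can be sketched in the same spirit. The example below is a hedged illustration on toy NumPy matrices: the helper names (`mean_of_trained_rows`, `fill_untrained_rows`) are hypothetical, and writing the mean back into the untrained rows is an assumption about how the average is used, not something confirmed by the text shown here.

```python
import numpy as np


def mean_of_trained_rows(embeddings: np.ndarray):
    """Column-wise mean over rows that are not all zero (hypothetical helper)."""
    untrained = np.all(embeddings == 0, axis=1)
    n_trained = np.count_nonzero(~untrained)
    if n_trained == 0:
        raise ValueError("no trained rows to average over")
    # Sum each feature/column over the trained rows, then divide by their count.
    column_sums = embeddings[~untrained].sum(axis=0)
    return column_sums / n_trained, untrained


def fill_untrained_rows(embeddings: np.ndarray) -> np.ndarray:
    """Return a copy with all-zero rows set to the trained mean (assumed step)."""
    mean, untrained = mean_of_trained_rows(embeddings)
    patched = embeddings.copy()
    patched[untrained] = mean
    return patched


# Toy stand-ins for the input embedding matrix and the output (LM head) matrix.
toy_input = np.array([[1.0, 2.0], [0.0, 0.0], [3.0, 6.0]])
toy_output = np.array([[2.0, 0.0], [0.0, 0.0], [4.0, 8.0]])

patched_input = fill_untrained_rows(toy_input)
patched_output = fill_untrained_rows(toy_output)
print(patched_input[1])  # [2. 4.]
```

Each matrix is averaged independently, matching the note that the procedure is applied to both the input and output matrices.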
356