CalamitousFelicitousness committed
Commit 19129ec
1 Parent(s): 318cddf

Upload README.md

Files changed (1): README.md (+431, -0)

README.md (new file):
---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen2.5-32B
datasets:
- anthracite-org/kalo-opus-instruct-22k-no-refusal
- Nopm/Opus_WritingStruct
- Gryphe/Sonnet3.5-SlimOrcaDedupCleaned
- Gryphe/Sonnet3.5-Charcard-Roleplay
- Gryphe/ChatGPT-4o-Writing-Prompts
- Epiculous/Synthstruct-Gens-v1.1-Filtered-n-Cleaned
- Epiculous/SynthRP-Gens-v1.1-Filtered-n-Cleaned
- nothingiisreal/Reddit-Dirty-And-WritingPrompts
- allura-org/Celeste-1.x-data-mixture
- cognitivecomputations/dolphin-2.9.3
tags:
- generated_from_trainer
model-index:
- name: EVA-Qwen2.5-32B-SFFT-v0.1
  results: []
---

# This repo contains a copy of the original model quantized to FP8. Original: [EVA-UNIT-01/EVA-Qwen2.5-32B-v0.1](https://huggingface.co/EVA-UNIT-01/EVA-Qwen2.5-32B-v0.1)
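The card itself does not include loading code, so here is a minimal, hypothetical sketch of serving this FP8 checkpoint with vLLM, a common backend for FP8 repos. The repo id is a placeholder and the context length is illustrative; neither comes from the original card.

```python
# Hypothetical sketch (not from the original card): loading this FP8 checkpoint with vLLM.
# Assumes a recent vLLM build and a GPU with FP8 support (e.g. Hopper/Ada).
from vllm import LLM

MODEL_ID = "<this-repo-id>"  # placeholder: substitute this repository's actual id

# vLLM should pick up the FP8 quantization from the checkpoint config; max_model_len is illustrative.
llm = LLM(model=MODEL_ID, max_model_len=8192)
```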

# EVA Qwen2.5-32B v0.1

<p>
An RP/storywriting specialist model, a full-parameter finetune of Qwen2.5-32B on a mixture of synthetic and natural data.<br>
It uses the Celeste 70B 0.1 data mixture, greatly expanding it to improve the versatility, creativity and "flavor" of the resulting model.<br>
</p>

<p>Version notes for 0.1: An additional round of cleaning for the datasets; new subsets of 4o-WritingPrompts and Charcards, picking the most diverse samples from them; a small added subset of SystemChat2.0 to improve instruction following; and a slightly increased sequence length. Additionally, the training config mistake from 32B 0.0 was fixed: layernorm layers stay frozen this time. Unfreezing them caused a positivity bias to appear in 32B 0.0 for some reason.</p>

<p>
<p>Prompt format is ChatML; a formatting sketch in code follows the preset links below.</p><br>
<h3>Recommended sampler values:</h3>
<ul>
<li>Temperature: 1</li>
<li>Typical-P: 0.9</li>
<li>Min-P: 0.05</li>
<li>Top-A: 0.2</li>
<li>Repetition Penalty: 1.03</li>
</ul>

<h3>Recommended SillyTavern presets (via CalamitousFelicitousness):</h3>

- [Context](https://huggingface.co/EVA-UNIT-01/EVA-Yi-1.5-9B-32K-V1/blob/main/%5BChatML%5D%20Roleplay-v1.9%20Context.json)
- [Instruct and System Prompt](https://huggingface.co/EVA-UNIT-01/EVA-Yi-1.5-9B-32K-V1/blob/main/%5BChatML%5D%20Roleplay-v1.9%20Instruct.json)
</p>
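A minimal sketch of what the ChatML format looks like in code, assuming the Transformers tokenizer from the original repo (this FP8 copy should ship the same chat template). The messages are illustrative, and the sampler dictionary simply restates the values above; pass only the keys your backend supports (Top-A in particular is frontend-specific).

```python
# Sketch under the assumptions stated above: building a ChatML prompt for this model.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EVA-UNIT-01/EVA-Qwen2.5-32B-v0.1")  # original repo's tokenizer

messages = [
    {"role": "system", "content": "You are a roleplay and storywriting assistant."},
    {"role": "user", "content": "Describe a rainy harbor town at dusk."},
]

# apply_chat_template renders ChatML, i.e. blocks of the form:
#   <|im_start|>role\n...<|im_end|>\n
# with a trailing "<|im_start|>assistant\n" when add_generation_prompt=True.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)

# Recommended sampler values from this card; only the keys your backend exposes will apply.
sampler_settings = {
    "temperature": 1.0,
    "typical_p": 0.9,
    "min_p": 0.05,
    "top_a": 0.2,                # frontend-specific; not available in most backends
    "repetition_penalty": 1.03,
}
```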

<p>
<br>
<h3>
Training data:
</h3>
<ul>
<li>Celeste 70B 0.1 data mixture minus Opus Instruct subset. See that model's <a href=https://huggingface.co/nothingiisreal/L3.1-70B-Celeste-V0.1-BF16>card</a> for details.</li>
<li>Kalomaze's Opus_Instruct_25k dataset, filtered for refusals.</li>
<li>A subset (1k rows) of ChatGPT-4o-WritingPrompts by Gryphe</li>
<li>A subset (2k rows) of Sonnet3.5-Charcards-Roleplay by Gryphe</li>
<li>Synthstruct and SynthRP datasets by Epiculous</li>
<li>A subset from Dolphin-2.9.3, including a filtered version of not_samantha and a small subset of systemchat.</li>
</ul>
<h3>
Training time and hardware:
</h3>
<ul><li>7 hours on 8xH100 SXM, provided by <a href=https://featherless.ai/>FeatherlessAI</a></li></ul><br>
</p>
<p>The model was trained by Kearm and Auri.</p>
<h4>Special thanks:</h4><ul>
<li><b>to <a href=https://featherless.ai/>FeatherlessAI</a> for generously providing an 8xH100 SXM node for training this model</b></li>
<li>to Gryphe, Lemmy, Kalomaze, Nopm, Epiculous and CognitiveComputations for the data</li>
<li>and to Allura-org for support, feedback, beta-testing and quality control of EVA models.</li></ul>

[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
<details><summary>See axolotl config</summary>

axolotl version: `0.4.1`
```yaml
base_model: Qwen/Qwen2.5-32B

load_in_8bit: false
load_in_4bit: false
strict: false

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true

# plugins:
#   - axolotl.integrations.spectrum.SpectrumPlugin

# spectrum_top_fraction: 0.5
# # Optional if using a pre-scanned model as your base_model. Useful if using a model mirror
# spectrum_model_name: Qwen/Qwen2.5-32B

datasets:
  - path: datasets/deduped_Synthstruct-Gens_processed_sharegpt_converted_cleaned.jsonl
    type: sharegpt
  - path: datasets/opus-instruct-22k-no_refusals-filtered.jsonl
    type: sharegpt
  - path: datasets/Celeste_Filtered.jsonl
    type: sharegpt
  - path: datasets/Sonnet3-5-charcard-names-filtered-sharegpt.jsonl
    type: sharegpt
  - path: datasets/deduped_SynthRP-Gens_processed_09-25-2024-ShareGPT_converted_cleaned.jsonl
    type: sharegpt
  - path: datasets/Gryphe-4o-WP-filtered-sharegpt.jsonl
    type: sharegpt
  - path: datasets/deduped_not_samantha_norefusals.jsonl
    type: sharegpt
  - path: datasets/SystemChat_subset_filtered_sharegpt.jsonl
    type: sharegpt

chat_template: chatml
shuffle_merged_datasets: true
val_set_size: 0.001
output_dir: ./EVA-Qwen2.5-32B-SFFT-v0.1

sequence_len: 9216
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

# adapter: qlora
# lora_model_dir:
# lora_r: 64
# lora_alpha: 128
# lora_dropout: 0.05
# lora_target_linear: true
# peft_use_dora: true

unfrozen_parameters:
- ^lm_head.weight$
- ^model.embed_tokens.weight$
# mlp.down_proj layers
- model.layers.63.mlp.down_proj
- model.layers.49.mlp.down_proj
- model.layers.48.mlp.down_proj
- model.layers.45.mlp.down_proj
- model.layers.44.mlp.down_proj
- model.layers.47.mlp.down_proj
- model.layers.46.mlp.down_proj
- model.layers.43.mlp.down_proj
- model.layers.8.mlp.down_proj
- model.layers.11.mlp.down_proj
- model.layers.19.mlp.down_proj
- model.layers.35.mlp.down_proj
- model.layers.20.mlp.down_proj
- model.layers.52.mlp.down_proj
- model.layers.39.mlp.down_proj
- model.layers.62.mlp.down_proj
- model.layers.50.mlp.down_proj
- model.layers.29.mlp.down_proj
- model.layers.16.mlp.down_proj
- model.layers.28.mlp.down_proj
- model.layers.53.mlp.down_proj
- model.layers.30.mlp.down_proj
- model.layers.31.mlp.down_proj
- model.layers.32.mlp.down_proj
- model.layers.7.mlp.down_proj
- model.layers.36.mlp.down_proj
- model.layers.12.mlp.down_proj
- model.layers.18.mlp.down_proj
- model.layers.37.mlp.down_proj
- model.layers.38.mlp.down_proj
- model.layers.14.mlp.down_proj
- model.layers.13.mlp.down_proj
# mlp.gate_proj layers
- model.layers.43.mlp.gate_proj
- model.layers.61.mlp.gate_proj
- model.layers.60.mlp.gate_proj
- model.layers.44.mlp.gate_proj
- model.layers.62.mlp.gate_proj
- model.layers.28.mlp.gate_proj
- model.layers.29.mlp.gate_proj
- model.layers.45.mlp.gate_proj
- model.layers.37.mlp.gate_proj
- model.layers.35.mlp.gate_proj
- model.layers.59.mlp.gate_proj
- model.layers.36.mlp.gate_proj
- model.layers.30.mlp.gate_proj
- model.layers.48.mlp.gate_proj
- model.layers.38.mlp.gate_proj
- model.layers.27.mlp.gate_proj
- model.layers.31.mlp.gate_proj
- model.layers.34.mlp.gate_proj
- model.layers.58.mlp.gate_proj
- model.layers.33.mlp.gate_proj
- model.layers.39.mlp.gate_proj
- model.layers.26.mlp.gate_proj
- model.layers.32.mlp.gate_proj
- model.layers.46.mlp.gate_proj
- model.layers.42.mlp.gate_proj
- model.layers.49.mlp.gate_proj
- model.layers.57.mlp.gate_proj
- model.layers.50.mlp.gate_proj
- model.layers.47.mlp.gate_proj
- model.layers.56.mlp.gate_proj
- model.layers.63.mlp.gate_proj
- model.layers.55.mlp.gate_proj
# mlp.up_proj layers
- model.layers.61.mlp.up_proj
- model.layers.60.mlp.up_proj
- model.layers.32.mlp.up_proj
- model.layers.59.mlp.up_proj
- model.layers.58.mlp.up_proj
- model.layers.57.mlp.up_proj
- model.layers.44.mlp.up_proj
- model.layers.28.mlp.up_proj
- model.layers.35.mlp.up_proj
- model.layers.36.mlp.up_proj
- model.layers.29.mlp.up_proj
- model.layers.31.mlp.up_proj
- model.layers.34.mlp.up_proj
- model.layers.55.mlp.up_proj
- model.layers.49.mlp.up_proj
- model.layers.30.mlp.up_proj
- model.layers.53.mlp.up_proj
- model.layers.43.mlp.up_proj
- model.layers.56.mlp.up_proj
- model.layers.33.mlp.up_proj
- model.layers.54.mlp.up_proj
- model.layers.62.mlp.up_proj
- model.layers.27.mlp.up_proj
- model.layers.51.mlp.up_proj
- model.layers.52.mlp.up_proj
- model.layers.37.mlp.up_proj
- model.layers.45.mlp.up_proj
- model.layers.26.mlp.up_proj
- model.layers.42.mlp.up_proj
- model.layers.50.mlp.up_proj
- model.layers.48.mlp.up_proj
- model.layers.39.mlp.up_proj
# self_attn.k_proj layers
- model.layers.63.self_attn.k_proj
- model.layers.55.self_attn.k_proj
- model.layers.60.self_attn.k_proj
- model.layers.7.self_attn.k_proj
- model.layers.12.self_attn.k_proj
- model.layers.13.self_attn.k_proj
- model.layers.57.self_attn.k_proj
- model.layers.29.self_attn.k_proj
- model.layers.14.self_attn.k_proj
- model.layers.51.self_attn.k_proj
- model.layers.53.self_attn.k_proj
- model.layers.54.self_attn.k_proj
- model.layers.22.self_attn.k_proj
- model.layers.61.self_attn.k_proj
- model.layers.18.self_attn.k_proj
- model.layers.30.self_attn.k_proj
- model.layers.9.self_attn.k_proj
- model.layers.24.self_attn.k_proj
- model.layers.23.self_attn.k_proj
- model.layers.25.self_attn.k_proj
- model.layers.10.self_attn.k_proj
- model.layers.58.self_attn.k_proj
- model.layers.56.self_attn.k_proj
- model.layers.15.self_attn.k_proj
- model.layers.32.self_attn.k_proj
- model.layers.28.self_attn.k_proj
- model.layers.8.self_attn.k_proj
- model.layers.59.self_attn.k_proj
- model.layers.11.self_attn.k_proj
- model.layers.48.self_attn.k_proj
- model.layers.16.self_attn.k_proj
- model.layers.50.self_attn.k_proj
# self_attn.o_proj layers
- model.layers.15.self_attn.o_proj
- model.layers.23.self_attn.o_proj
- model.layers.31.self_attn.o_proj
- model.layers.30.self_attn.o_proj
- model.layers.18.self_attn.o_proj
- model.layers.24.self_attn.o_proj
- model.layers.17.self_attn.o_proj
- model.layers.28.self_attn.o_proj
- model.layers.34.self_attn.o_proj
- model.layers.33.self_attn.o_proj
- model.layers.25.self_attn.o_proj
- model.layers.12.self_attn.o_proj
- model.layers.14.self_attn.o_proj
- model.layers.29.self_attn.o_proj
- model.layers.16.self_attn.o_proj
- model.layers.26.self_attn.o_proj
- model.layers.22.self_attn.o_proj
- model.layers.27.self_attn.o_proj
- model.layers.35.self_attn.o_proj
- model.layers.20.self_attn.o_proj
- model.layers.13.self_attn.o_proj
- model.layers.36.self_attn.o_proj
- model.layers.19.self_attn.o_proj
- model.layers.37.self_attn.o_proj
- model.layers.21.self_attn.o_proj
- model.layers.11.self_attn.o_proj
- model.layers.54.self_attn.o_proj
- model.layers.5.self_attn.o_proj
- model.layers.38.self_attn.o_proj
- model.layers.6.self_attn.o_proj
- model.layers.8.self_attn.o_proj
- model.layers.9.self_attn.o_proj
# self_attn.q_proj layers
- model.layers.1.self_attn.q_proj
- model.layers.2.self_attn.q_proj
- model.layers.3.self_attn.q_proj
- model.layers.45.self_attn.q_proj
- model.layers.54.self_attn.q_proj
- model.layers.35.self_attn.q_proj
- model.layers.48.self_attn.q_proj
- model.layers.61.self_attn.q_proj
- model.layers.52.self_attn.q_proj
- model.layers.50.self_attn.q_proj
- model.layers.60.self_attn.q_proj
- model.layers.56.self_attn.q_proj
- model.layers.58.self_attn.q_proj
- model.layers.42.self_attn.q_proj
- model.layers.59.self_attn.q_proj
- model.layers.44.self_attn.q_proj
- model.layers.55.self_attn.q_proj
- model.layers.57.self_attn.q_proj
- model.layers.41.self_attn.q_proj
- model.layers.36.self_attn.q_proj
- model.layers.39.self_attn.q_proj
- model.layers.4.self_attn.q_proj
- model.layers.43.self_attn.q_proj
- model.layers.34.self_attn.q_proj
- model.layers.46.self_attn.q_proj
- model.layers.49.self_attn.q_proj
- model.layers.40.self_attn.q_proj
- model.layers.25.self_attn.q_proj
- model.layers.51.self_attn.q_proj
- model.layers.17.self_attn.q_proj
- model.layers.37.self_attn.q_proj
- model.layers.53.self_attn.q_proj
# self_attn.v_proj layers
- model.layers.55.self_attn.v_proj
- model.layers.31.self_attn.v_proj
- model.layers.47.self_attn.v_proj
- model.layers.45.self_attn.v_proj
- model.layers.49.self_attn.v_proj
- model.layers.48.self_attn.v_proj
- model.layers.15.self_attn.v_proj
- model.layers.30.self_attn.v_proj
- model.layers.7.self_attn.v_proj
- model.layers.44.self_attn.v_proj
- model.layers.29.self_attn.v_proj
- model.layers.51.self_attn.v_proj
- model.layers.50.self_attn.v_proj
- model.layers.14.self_attn.v_proj
- model.layers.54.self_attn.v_proj
- model.layers.32.self_attn.v_proj
- model.layers.43.self_attn.v_proj
- model.layers.10.self_attn.v_proj
- model.layers.46.self_attn.v_proj
- model.layers.38.self_attn.v_proj
- model.layers.57.self_attn.v_proj
- model.layers.22.self_attn.v_proj
- model.layers.39.self_attn.v_proj
- model.layers.6.self_attn.v_proj
- model.layers.23.self_attn.v_proj
- model.layers.58.self_attn.v_proj
- model.layers.53.self_attn.v_proj
- model.layers.40.self_attn.v_proj
- model.layers.24.self_attn.v_proj
- model.layers.9.self_attn.v_proj
- model.layers.25.self_attn.v_proj
- model.layers.5.self_attn.v_proj


wandb_project: EVA-Qwen2.5-32B-SFFT-v0.1
wandb_entity:
wandb_watch:
wandb_name: Unit-01
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 3
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.00005
max_grad_norm: 3

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: "unsloth"
# gradient_checkpointing_kwargs:
#   use_reentrant: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 20
evals_per_epoch: 4
saves_per_epoch: 2
save_safetensors: true
hub_model_id:
hub_strategy:
debug:
deepspeed: deepspeed_configs/zero3_bf16.json
weight_decay: 0.1
# fsdp:
#   - full_shard
#   - auto_wrap
# fsdp_config:
#   fsdp_limit_all_gathers: true
#   fsdp_sync_module_states: false
#   fsdp_offload_params: true
#   fsdp_cpu_ram_efficient_loading: true
#   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
#   fsdp_transformer_layer_cls_to_wrap: Qwen2DecoderLayer
#   fsdp_activation_checkpointing: true
#   fsdp_state_dict_type: SHARDED_STATE_DICT # Changed from FULL_STATE_DICT
#   fsdp_sharding_strategy: FULL_SHARD
#   fsdp_forward_prefetch: false # Added
#   fsdp_backward_prefetch: "BACKWARD_PRE" # Added
#   fsdp_backward_prefetch_limit: 1 # Added
#   fsdp_mixed_precision: BF16 # Added
```

</details>