---
license: gemma
datasets:
- Mielikki/Erebus-87k
- allura-org/r_shortstories_24k
base_model:
- UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3
pipeline_tag: text-generation
library_name: transformers
---
### exl2 quant (measurement.json in main branch)
---
### check revisions for quants
---

<img src="image_27.png" alt="A beautiful witch writing a book with a quill">
<sub>Image by CalamitousFelicitouness</sub>

---

# Gemma-2-9B Sugarquill v0

An experimental continued pretrain of Gemma-2-9B-It-SPPO-Iter3 on assorted short story data from the web.
I was trying to diversify Gemma's prose without completely destroying its smarts. I think I half-succeeded? This model could have used another epoch of training, but even so it's already more creative and descriptive than its base model, without becoming too silly. It doesn't seem to have degraded much in terms of core abilities either.
Should be usable both for RP and raw-completion storywriting.
I originally planned to use this in a merge, but I feel like this model is interesting enough to be released on its own as well.

Model was trained by Auri.

Dedicated to Cahvay, who has wanted a Gemma finetune from me for months now, and to La Rata, who loves storywriter models.

GGUFs by Prodeus: https://huggingface.co/allura-org/G2-9B-Sugarquill-v0-GGUF

**Training notes**

This model was trained for 2 epochs on 10k rows (~18.7M tokens), taken equally from the Erebus-87k and r_shortstories_24k datasets. It was trained on an 8xH100 SXM node for 30 minutes with rsLoRA.
I got complete nonsense reported to my wandb during this run, and logging stopped altogether after step 13 for some reason. It seems to be directly related to Gemma, as my training setup worked flawlessly for Qwen.
Thanks to Kearm for helping set up LLaMA-Factory on that node, and to Featherless for providing it for EVA-Qwen2.5 training (and, unknowingly lol, for this model's as well).
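
For reference, here's a minimal sketch of how a 50/50 mix like the one described above could be assembled with the `datasets` library. This is an illustration, not the actual preprocessing script; the seed, split names, and output path are assumptions.

```python
# Hypothetical reconstruction of the sugarquill-10k mix: 5k rows from each
# source dataset, shuffled together. In practice the two sources may expose
# different column names and would need normalizing to one text field first.
from datasets import load_dataset, concatenate_datasets

erebus = load_dataset("Mielikki/Erebus-87k", split="train").shuffle(seed=42).select(range(5000))
shorts = load_dataset("allura-org/r_shortstories_24k", split="train").shuffle(seed=42).select(range(5000))

mix = concatenate_datasets([erebus, shorts]).shuffle(seed=42)
mix.to_json("sugarquill-10k.jsonl")  # then point data/dataset_info.json at this file
```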

**Format**

The model responds to Gemma instruct formatting, exactly like its base model.

```
<bos><start_of_turn>user
{user message}<end_of_turn>
<start_of_turn>model
{response}<end_of_turn><eos>
```
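
If you're using `transformers`, the bundled Gemma chat template should reproduce this layout. A minimal sketch (the repo id is inferred from the GGUF link above and may differ):

```python
from transformers import AutoTokenizer

# Assumed repo id; swap in wherever the safetensors weights actually live.
tokenizer = AutoTokenizer.from_pretrained("allura-org/G2-9B-Sugarquill-v0")

messages = [{"role": "user", "content": "Write a short story about a witch and her quill."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # <bos><start_of_turn>user ... <end_of_turn>\n<start_of_turn>model\n
```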

**Training config**
<details><summary>See LLaMA-Factory config</summary>

```yaml
### Model
model_name_or_path: UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3
#ref_model: # Reference model for RL (optional, for everything besides SimPO, which doesn't take it at all)
#ref_model_quantization_bit: 8 # 8 or 4

### Method
stage: pt # pt, sft, rm, ppo, kto, dpo (includes orpo and simpo)
do_train: true
finetuning_type: lora # full, freeze or lora
lora_target: all
#pref_beta: 0.1
#pref_loss: simpo # sigmoid (dpo), orpo, simpo, ipo, hinge

### Reward model
#reward_model: RLHFlow/ArmoRM-Llama3-8B-v0.1 # or sfairXC/FsfairX-Gemma2-RM-v0.1 or nvidia/Llama-3.1-Nemotron-70B-Reward-HF
#reward_model_type: full # full, lora, api
#reward_model_adapters: # Path to RM LoRA adapter(s) if using a LoRA RM
#reward_model_quantization_bit: 8 # 4 or 8

### Freeze
#freeze_trainable_layers: # Number of trainable layers for freeze (partial-parameter) fine-tuning. Positive means train the n last layers, negative the n first
#freeze_trainable_modules: # Name(s) of trainable modules for freeze (partial-parameter) fine-tuning. Use commas to separate
#freeze_extra_modules: # Name(s) of modules apart from hidden layers to be set as trainable. Use commas to separate

### LoRA
#loraplus_lr_ratio: 8.0
#loraplus_lr_embedding:
use_dora: false
use_rslora: true
lora_rank: 64 # 64 is optimal for most trains on instruct; if training on base, use rslora or dora
lora_alpha: 32
lora_dropout: 0.05
#pissa_init: true
#pissa_iter: 16
#pissa_convert: true

### QLoRA
quantization_bit: 8 # 2,3,4,5,6,8 in HQQ, 4 or 8 in bnb
quantization_method: hqq # bitsandbytes or hqq

### DeepSpeed
deepspeed: examples/deepspeed/ds_z2_config.json # ds_z3_config.json or ds_z2_config.json, which is required for HQQ on multi-GPU

### Dataset
dataset: sugarquill-10k # define in data/dataset_info.json
cutoff_len: 8192
max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 16
#template: chatml

### Output
output_dir: saves/gemma/lora/sugarquill-1
logging_steps: 3
save_steps: 50
plot_loss: true
compute_accuracy: true
overwrite_output_dir: true

### Train
per_device_train_batch_size: 1 # Effective b/s == per-device b/s * grad accum steps * number of GPUs
gradient_accumulation_steps: 8
learning_rate: 3.0e-5
optim: paged_adamw_8bit # paged_adamw_8bit or adamw_torch usually
num_train_epochs: 2.0
lr_scheduler_type: cosine # cosine, constant or linear
warmup_ratio: 0.05
bf16: true
ddp_timeout: 180000000
packing: true
max_grad_norm: 1.0

### Opts
flash_attn: fa2 # auto, disabled, sdpa, fa2 | Gemma will fall back to eager
enable_liger_kernel: true # Pretty much a must-have if it works
#use_unsloth: true # May not work with multi-GPU
#use_adam_mini: true # Comment out optim if using this

### Eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 0.05

### Misc
include_num_input_tokens_seen: true
ddp_find_unused_parameters: false # Stupid thing tries to start distributed training otherwise
upcast_layernorm: true

### Inference for PPO
#max_new_tokens: 512
#temperature: 0.8
#top_k: 0
#top_p: 0.8

### Tracking
report_to: wandb # or tensorboard or mlflow | LOGIN BEFORE STARTING TRAIN OR ELSE IT WILL CRASH
run_name: G2-9B-Sugarquill-1

### Merge Adapter
#export_dir: models/G2-9B-Sugarquill
#export_size: 4
#export_device: gpu
#export_legacy_format: false

```

</details>
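
As a sanity check on the config above: with `per_device_train_batch_size: 1`, `gradient_accumulation_steps: 8`, and the 8 GPUs of the H100 node, the effective batch size works out to 1 × 8 × 8 = 64 packed sequences of up to 8192 tokens each.

For completion-style storywriting, a plain `transformers` generation call is enough. A hedged sketch (repo id assumed as before; sampling settings are just a starting point, not the author's recommendation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allura-org/G2-9B-Sugarquill-v0"  # assumed location of the merged weights
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Raw completion: feed the opening of a story and let the model continue it.
inputs = tokenizer("The witch dipped her quill and began to write:", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```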