---
license: mit
datasets:
- allura-org/Celeste-Filtered
- allura-org/neon-41k
- EVA-UNIT-01/Lilith-v0.2
language:
- en
base_model:
- THUDM/GLM-4-9B-0414
---

---

# GLM-4-9B-0414 Neon v2

RP finetune of GLM-4-9B-0414. Feels nice, lots of personality, if a bit quirky sometimes. Nice prose, not too Claude-ish or Gemini-ish. Doesn't seem to like overly long system prompts or character cards, though. Seems to like JSON-formatted system prompts.

The model was trained by Auri.

---

**Training notes**

The model was trained on a dataset of 77M tokens of synthetic RP and short-story generation data. Training took around 11 hours on a 2x RTX 3090 workstation, generously provided by [OwenArli](https://huggingface.co/OwenArli). I went with sane defaults for the training config: QLoRA plus CCE and sequence parallelism gave a nice chunk of memory savings, and a 16k sequence length fit on 48GB with some room to spare. Eval/Loss seems to be broken and I'm not sure why; otherwise it trained smoothly.

Huge thanks to [ArliAI](https://www.arliai.com/) for providing compute and collaborating on this run!

**Format**

The model responds to GLM4 instruct formatting, exactly like its base model. Backends struggle to add the BOS token automatically, so you'll need to do it yourself. The Jinja template should work for chat completions.

```
[gMASK]<sop><|system|>
{system_prompt}<|user|>
{prompt}<|assistant|>
```

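If you're assembling the prompt yourself, here's a minimal sketch (untested; the model path is a placeholder) that uses the tokenizer's bundled chat template and then double-checks that the `[gMASK]<sop>` prefix actually made it in:

```python
# Minimal sketch: build a GLM4-formatted prompt and verify the BOS prefix.
# "model_id" is a placeholder -- point it at wherever you keep this model.
from transformers import AutoTokenizer

model_id = "path/to/GLM-9B-Neon-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are Neon, a quirky roleplay partner."},
    {"role": "user", "content": "Set the scene: a rainy, neon-lit street."},
]

# The Jinja chat template should emit [gMASK]<sop> plus the role tags shown above.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Some backends drop the prefix on raw text completions -- add it back if so.
if not prompt.startswith("[gMASK]<sop>"):
    prompt = "[gMASK]<sop>" + prompt

print(prompt)
```
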
**Recommended Samplers**

Nothing special, just classics.

```
Temperature - 1
Min-P - 0.1
Repetition Penalty - 1.03
```

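For reference, this is roughly how those values map onto a plain `transformers` `generate()` call; a sketch under assumptions (placeholder model path, and `min_p` needs a fairly recent `transformers` release), not a tested recipe:

```python
# Rough sketch of the recommended samplers with vanilla transformers generation.
# Placeholder model path; min_p requires a reasonably recent transformers release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/GLM-9B-Neon-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Write a short scene set in a neon-lit arcade."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,          # Temperature - 1
    min_p=0.1,                # Min-P - 0.1
    repetition_penalty=1.03,  # Repetition Penalty - 1.03
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
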
**Training config**
<details><summary>See Axolotl config</summary>

```yaml
# Model
base_model: /home/owen/models/GLM-4-9B-0414
strict: false
model_type: AutoModelForCausalLM

# Liger Kernels and CCE (optimization)
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: false
liger_rms_norm: false
liger_glu_activation: false
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true

# Output and HuggingFace
output_dir: ./GLM-9B-Neon-v2
hub_model_id: AuriAetherwiing/GLM-9B-Neon-v2-LoRA
hf_use_auth_token: true
hub_strategy: "all_checkpoints"

# WandB
wandb_project: allura-org
wandb_entity:
wandb_name: GLM-9B-Neon-v2

# === Data Configuration ===

# Data
#chat_template: chatml
#train_on_inputs: false
group_by_length: false
datasets:
  - path: ./Neon/neon.jsonl
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value
  - path: ./Neon/S2.jsonl
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value
  - path: ./Neon/SystemChat_subset_filtered_sharegpt_utf8fix.jsonl
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value

dataset_prepared_path: ./lora_last_run_prepared

## Evaluation
val_set_size: 0.01
evals_per_epoch: 2
eval_table_size:
eval_max_new_tokens: 128

# Technical aspects
sequence_len: 16384
save_safetensors: true
saves_per_epoch: 2
logging_steps: 1
#special_tokens:
#  pad_token: <pad>
# Quantization
bf16: auto
fp16:
tf32: false
## For LoRA
load_in_8bit: false
load_in_4bit: true

# LoRA
peft_use_rslora: false
peft_use_dora: false # better but slower
adapter: qlora # lora or qlora
lora_model_dir:
lora_r: 64 # 64 is optimal for most trains on instruct
lora_alpha: 64
lora_dropout: 0.1
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:

# loraplus_lr_ratio: 8 # works to converge faster but is kinda cancer bc makes model unstable
#loraplus_lr_embedding:

# Training hyperparameters
# max_steps:
num_epochs: 1

# Anti Overfit and Stability
weight_decay: 0.01
max_grad_norm: 1.0

## Learning Rate
warmup_ratio: 0.05
learning_rate: 1e-5
lr_scheduler: rex
#lr_scheduler_kwargs:
#  min_lr: 0.0000024
optimizer: adamw_torch # usually adamw_torch or paged_adamw_8bit

## Batch Size
gradient_accumulation_steps: 32 # More effective batch size - stabler train, usually. MBS also speeds it up.
micro_batch_size: 1 # Batch size per gpu = micro_batch_size * gradient_accumulation_steps
eval_batch_size: 1

# Optimizations
pad_to_sequence_len: true
sample_packing: true
eval_sample_packing: false
flash_attention: true
xformers_attention:
gradient_checkpointing:
gradient_checkpointing_kwargs:
  use_reentrant: false

# Set to a divisor (> 1) of the number of GPUs available
#sequence_parallel_degree: 2 # Split sequences across 4 GPUs
# Optional; strides across the key dimension. Larger values use more memory but should make training faster.
#heads_k_stride: 1
# Optional; one of "varlen_llama3", "batch_ring", "batch_zigzag", "batch_stripe". Defaults to
# "varlen_llama3" when `sample_packing: true`, and "batch_ring" otherwise.
#ring_attn_func:

# deepspeed: /home/owen/axolotl/deepspeed_configs/zero3_bf16_cpuoffload_all.json

fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: false
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Glm4DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_activation_checkpointing: true
```

</details>