---
license: mit
datasets:
- allura-org/Celeste-Filtered
- allura-org/neon-41k
- EVA-UNIT-01/Lilith-v0.2
language:
- en
base_model:
- THUDM/GLM-4-9B-0414
---

---

# GLM-4-9B-0414 Neon v2

RP finetune of GLM-4-9B-0414. Feels nice, lots of personality, if a bit quirky sometimes. Nice prose, not too Claude-ish or Gemini-ish. Doesn't seem to like overly long system prompts or character cards, though. Seems to like JSON-formatted system prompts.

The model was trained by Auri.

---

**Training notes**

The model was trained on a dataset of 77M tokens of synthetic RP and short-story generation data. Training took around 11 hours on a 2x RTX 3090 workstation, generously provided by [OwenArli](https://huggingface.co/OwenArli). I went with sane defaults for the training config: QLoRA plus CCE and sequence parallelism gave a nice chunk of memory savings, and a 16k sequence length fit on 48GB with some room to spare. Eval/Loss seems to be broken and I'm not sure why; otherwise it trained smoothly.

Huge thanks to [ArliAI](https://www.arliai.com/) for providing compute and collaborating on this run!

**Format**

The model responds to GLM4 instruct formatting, exactly like its base model. Backends struggle to add the BOS token automatically, so you'll need to do it yourself. The Jinja template should work for chat completions.

```
[gMASK]<sop><|system|>
{system_prompt}<|user|>
{prompt}<|assistant|>
```

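If you're assembling the prompt yourself, here's a minimal sketch (untested; the model path is a placeholder) that uses the tokenizer's bundled chat template and then double-checks that the `[gMASK]<sop>` prefix actually made it in:

```python
# Minimal sketch: build a GLM4-formatted prompt and verify the BOS prefix.
# "model_id" is a placeholder -- point it at wherever you keep this model.
from transformers import AutoTokenizer

model_id = "path/to/GLM-9B-Neon-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are Neon, a quirky roleplay partner."},
    {"role": "user", "content": "Set the scene: a rainy, neon-lit street."},
]

# The Jinja chat template should emit [gMASK]<sop> plus the role tags shown above.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Some backends drop the prefix on raw text completions -- add it back if so.
if not prompt.startswith("[gMASK]<sop>"):
    prompt = "[gMASK]<sop>" + prompt

print(prompt)
```
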
**Recommended Samplers**

Nothing special, just classics.

```
Temperature - 1
Min-P - 0.1
Repetition Penalty - 1.03
```

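For reference, this is roughly how those values map onto a plain `transformers` `generate()` call; a sketch under assumptions (placeholder model path, and `min_p` needs a fairly recent `transformers` release), not a tested recipe:

```python
# Rough sketch of the recommended samplers with vanilla transformers generation.
# Placeholder model path; min_p requires a reasonably recent transformers release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/GLM-9B-Neon-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Write a short scene set in a neon-lit arcade."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,          # Temperature - 1
    min_p=0.1,                # Min-P - 0.1
    repetition_penalty=1.03,  # Repetition Penalty - 1.03
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
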
**Training config**
<details><summary>See Axolotl config</summary>

```yaml
# Model
base_model: /home/owen/models/GLM-4-9B-0414
strict: false
model_type: AutoModelForCausalLM

# Liger Kernels and CCE (optimization)
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: false
liger_rms_norm: false
liger_glu_activation: false
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true

# Output and HuggingFace
output_dir: ./GLM-9B-Neon-v2
hub_model_id: AuriAetherwiing/GLM-9B-Neon-v2-LoRA
hf_use_auth_token: true
hub_strategy: "all_checkpoints"

# WandB
wandb_project: allura-org
wandb_entity:
wandb_name: GLM-9B-Neon-v2

# === Data Configuration ===

# Data
#chat_template: chatml
#train_on_inputs: false
group_by_length: false
datasets:
  - path: ./Neon/neon.jsonl
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value
  - path: ./Neon/S2.jsonl
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value
  - path: ./Neon/SystemChat_subset_filtered_sharegpt_utf8fix.jsonl
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value

dataset_prepared_path: ./lora_last_run_prepared

## Evaluation
val_set_size: 0.01
evals_per_epoch: 2
eval_table_size:
eval_max_new_tokens: 128

# Technical aspects
sequence_len: 16384
save_safetensors: true
saves_per_epoch: 2
logging_steps: 1
#special_tokens:
#  pad_token: <pad>
# Quantization
bf16: auto
fp16:
tf32: false
## For LoRA
load_in_8bit: false
load_in_4bit: true

# LoRA
peft_use_rslora: false
peft_use_dora: false # better but slower
adapter: qlora # lora or qlora
lora_model_dir:
lora_r: 64 # 64 is optimal for most trains on instruct
lora_alpha: 64
lora_dropout: 0.1
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:

# loraplus_lr_ratio: 8 # works to converge faster but is kinda cancer bc makes model unstable
#loraplus_lr_embedding:

# Training hyperparameters
# max_steps:
num_epochs: 1

# Anti Overfit and Stability
weight_decay: 0.01
max_grad_norm: 1.0

## Learning Rate
warmup_ratio: 0.05
learning_rate: 1e-5
lr_scheduler: rex
#lr_scheduler_kwargs:
#  min_lr: 0.0000024
optimizer: adamw_torch # usually adamw_torch or paged_adamw_8bit

## Batch Size
gradient_accumulation_steps: 32 # More effective batch size - stabler train, usually. MBS also speeds it up.
micro_batch_size: 1 # Batch size per gpu = micro_batch_size * gradient_accumulation_steps
eval_batch_size: 1

# Optimizations
pad_to_sequence_len: true
sample_packing: true
eval_sample_packing: false
flash_attention: true
xformers_attention:
gradient_checkpointing:
gradient_checkpointing_kwargs:
  use_reentrant: false

# Set to a divisor (> 1) of the number of GPUs available
#sequence_parallel_degree: 2 # Split sequences across 4 GPUs
# Optional; strides across the key dimension. Larger values use more memory but should make training faster.
#heads_k_stride: 1
# Optional; one of "varlen_llama3", "batch_ring", "batch_zigzag", "batch_stripe". Defaults to
# "varlen_llama3" when `sample_packing: true`, and "batch_ring" otherwise.
#ring_attn_func:

# deepspeed: /home/owen/axolotl/deepspeed_configs/zero3_bf16_cpuoffload_all.json

fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: false
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Glm4DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_activation_checkpointing: true
```

</details>