End of training

Files changed:
- README.md: +29 -181
- adapter_model.bin: +1 -1
README.md CHANGED

````diff
@@ -20,123 +20,15 @@ should probably proofread and complete it, then remove this comment. -->
 
 axolotl version: `0.6.0`
 ```yaml
-# base_model: meta-llama/Llama-3.2-1B-Instruct
-# # Automatically upload checkpoint and final model to HF
-# # hub_model_id: kweinmeister/Llama-3.2-1B-Instruct-MetaMathQA
-# hub_model_id: kweinmeister/Llama-3.2-1B-Instruct-gsm8k
-
-# load_in_8bit: false
-# load_in_4bit: true
-# strict: false
-
-
-# datasets:
-#   - path: openai/gsm8k
-#     type: alpaca_chat.load_qa
-#     name: "main"
-#     train_on_split: "train"
-
-
-# # datasets:
-# #   - path: meta-math/MetaMathQA
-# #     type:
-# #       field_instruction: query
-# #       field_output: response
-
-# val_set_size: 0.1
-# # output_dir: "/mnt/disks/gcs/axolotl/outputs/out"
-# output_dir: "/mnt/disks/gcs/axolotl/outputs/gsm8k-out"
-# # output_dir: "/mnt/disks/gcs/axolotl/outputs/MetaMathQA-out"
-
-# adapter: qlora
-# lora_model_dir:
-
-# sequence_len: 2048
-# sample_packing: true
-# eval_sample_packing: true
-# pad_to_sequence_len: true
-
-# lora_r: 32
-# lora_alpha: 16
-# lora_dropout: 0.05
-# lora_fan_in_fan_out:
-# lora_target_modules:
-#   - gate_proj
-#   - down_proj
-#   - up_proj
-#   - q_proj
-#   - v_proj
-#   - k_proj
-#   - o_proj
-
-# wandb_project:
-# wandb_entity:
-# wandb_watch:
-# wandb_name:
-# wandb_log_model:
-
-# gradient_accumulation_steps: 4
-# micro_batch_size: 2
-# num_epochs: 3
-# # optimizer: adamw_bnb_8bit
-# optimizer: adamw_torch
-# lr_scheduler: cosine
-# learning_rate: 2e-5
-
-# train_on_inputs: false
-# group_by_length: false
-# bf16: auto
-# fp16:
-# tf32: false
-
-# # gradient_checkpointing: true
-# gradient_checkpointing: false
-# early_stopping_patience:
-# resume_from_checkpoint:
-# local_rank:
-# logging_steps: 1
-# xformers_attention:
-# flash_attention: true
-
-# loss_watchdog_threshold: 5.0
-# loss_watchdog_patience: 3
-
-# warmup_steps: 10
-# evals_per_epoch: 4
-# eval_table_size:
-# eval_max_new_tokens: 128
-# saves_per_epoch: 1
-# debug:
-# deepspeed:
-# weight_decay: 0.0
-# # fsdp:
-# # fsdp_config:
-# fsdp:
-#   - full_shard
-#   - auto_wrap
-# fsdp_config:
-#   fsdp_limit_all_gathers: true
-#   fsdp_sync_module_states: true
-#   fsdp_offload_params: true
-#   fsdp_use_orig_params: false
-#   fsdp_cpu_ram_efficient_loading: true
-#   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
-#   fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
-#   fsdp_state_dict_type: FULL_STATE_DICT
-#   fsdp_sharding_strategy: FULL_SHARD
-#   fsdp_activation_checkpointing: true
-# special_tokens:
-#   # pad_token: "<|end_of_text|>"
-# special_tokens:
-#   bos_token: "<|begin_of_text|>"
-#   eos_token: "<|eot_id|>"
-#   pad_token: "<|finetune_right_pad_id|>"
-
 base_model: google/gemma-2-27b-it
-
-
+model_type: AutoModelForCausalLM
+tokenizer_type: AutoTokenizer
 hub_model_id: kweinmeister/gemma-2-27b-it-dolly-15k
 
+# https://github.com/vllm-project/vllm/issues/10590
+bnb_config_kwargs:
+  bnb_4bit_quant_storage: uint8
+
 load_in_8bit: false
 load_in_4bit: true
 strict: false
@@ -152,7 +44,6 @@ val_set_size: 0.1
 output_dir: "/mnt/disks/gcs/axolotl/outputs/dolly-15k-out"
 
 adapter: qlora
-
 lora_r: 32
 lora_alpha: 16
 lora_dropout: 0.05
@@ -160,26 +51,22 @@ lora_target_linear: true
 
 sequence_len: 2048
 sample_packing: true
-
+eval_sample_packing: false
 pad_to_sequence_len: true
 
 gradient_accumulation_steps: 4
-micro_batch_size:
+micro_batch_size: 1
 num_epochs: 3
-# optimizer: adamw_bnb_8bit
 optimizer: adamw_torch
 lr_scheduler: cosine
 learning_rate: 2e-5
 
-
 train_on_inputs: false
 group_by_length: false
 bf16: auto
 fp16:
-tf32:
-
+tf32: true
 
-# gradient_checkpointing: false
 gradient_checkpointing: true
 early_stopping_patience:
 resume_from_checkpoint:
@@ -188,56 +75,17 @@ logging_steps: 1
 xformers_attention:
 flash_attention: false
 
-# loss_watchdog_threshold: 5.0
-# loss_watchdog_patience: 3
-
-
 warmup_ratio: 0.1
 evals_per_epoch: 4
 eval_max_new_tokens: 128
 saves_per_epoch: 1
 debug:
-# deepspeed:
-weight_decay: 0.0
-
 deepspeed: deepspeed_configs/zero1.json
+weight_decay: 0.0
 
 fsdp:
 fsdp_config:
-
-#   - full_shard
-#   - auto_wrap
-
-# fsdp_config:
-#   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
-#   fsdp_backward_prefetch: BACKWARD_PRE
-#   fsdp_cpu_ram_efficient_loading: true
-#   fsdp_forward_prefetch: false
-#   fsdp_offload_params: true
-#   fsdp_sharding_strategy: FULL_SHARD
-#   fsdp_state_dict_type: SHARDED_STATE_DICT
-#   fsdp_transformer_layer_cls_to_wrap: GemmaDecoderLayer
-#   fsdp_sync_module_states: true
-#   fsdp_use_orig_params: true
-
-
-# fsdp_config:
-#   fsdp_limit_all_gathers: true
-#   fsdp_sync_module_states: true
-#   fsdp_offload_params: true
-#   fsdp_use_orig_params: false
-#   fsdp_cpu_ram_efficient_loading: true
-#   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
-#   fsdp_transformer_layer_cls_to_wrap: GemmaDecoderLayer
-#   fsdp_state_dict_type: FULL_STATE_DICT
-#   fsdp_sharding_strategy: FULL_SHARD
-#   fsdp_activation_checkpointing: true
-# special_tokens:
-#   # pad_token: "<|end_of_text|>"
-# special_tokens:
-#   bos_token: "<|begin_of_text|>"
-#   eos_token: "<|eot_id|>"
-#   pad_token: "<|finetune_right_pad_id|>"
+
 ```
 
 </details><br>
@@ -246,7 +94,7 @@ fsdp_config:
 
 This model is a fine-tuned version of [google/gemma-2-27b-it](https://huggingface.co/google/gemma-2-27b-it) on the databricks/databricks-dolly-15k dataset.
 It achieves the following results on the evaluation set:
-- Loss: 1.
+- Loss: 1.4649
 
 ## Model description
 
@@ -266,35 +114,35 @@ More information needed
 
 The following hyperparameters were used during training:
 - learning_rate: 2e-05
-- train_batch_size:
-- eval_batch_size:
+- train_batch_size: 1
+- eval_batch_size: 1
 - seed: 42
 - distributed_type: multi-GPU
 - num_devices: 2
 - gradient_accumulation_steps: 4
-- total_train_batch_size:
-- total_eval_batch_size:
+- total_train_batch_size: 8
+- total_eval_batch_size: 2
 - optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
 - lr_scheduler_type: cosine
-- lr_scheduler_warmup_steps:
+- lr_scheduler_warmup_steps: 46
 - num_epochs: 3
 
 ### Training results
 
 | Training Loss | Epoch | Step | Validation Loss |
 |:-------------:|:------:|:----:|:---------------:|
-|
-| 3.
-|
-|
-| 1.
-| 1.
-| 1.
-| 1.
-| 1.
-| 1.
-| 1.
-| 1.
+| 4.0853 | 0.0065 | 1 | 2.5485 |
+| 3.4071 | 0.2524 | 39 | 2.1938 |
+| 1.9159 | 0.5049 | 78 | 1.6474 |
+| 1.6968 | 0.7573 | 117 | 1.5546 |
+| 1.7757 | 1.0129 | 156 | 1.5193 |
+| 1.7768 | 1.2654 | 195 | 1.4965 |
+| 1.3735 | 1.5178 | 234 | 1.4835 |
+| 1.7285 | 1.7702 | 273 | 1.4744 |
+| 1.6601 | 2.0259 | 312 | 1.4701 |
+| 1.6477 | 2.2783 | 351 | 1.4657 |
+| 1.3795 | 2.5307 | 390 | 1.4645 |
+| 1.6575 | 2.7832 | 429 | 1.4649 |
 
 
 ### Framework versions
````
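A quick cross-check of the autogenerated hyperparameters against the config: the effective batch size is the per-device micro batch size times gradient accumulation times the number of GPUs, and the reported 46 warmup steps are roughly `warmup_ratio: 0.1` of the total optimizer steps implied by the results table. The sketch below is illustrative only and not part of the commit; the steps-per-epoch figure is estimated from the last results row.

```python
# Sanity check of the reported hyperparameters (illustrative, not part of the commit).
micro_batch_size = 1   # micro_batch_size in the config / train_batch_size in the card
grad_accum = 4         # gradient_accumulation_steps
num_devices = 2        # num_devices reported in the card

effective_batch = micro_batch_size * grad_accum * num_devices
print(effective_batch)                      # 8  -> total_train_batch_size

steps_per_epoch = 429 / 2.7832              # last results row: step 429 at epoch 2.7832
warmup_steps = round(0.1 * steps_per_epoch * 3)
print(warmup_steps)                         # 46 -> lr_scheduler_warmup_steps
```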
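The commit also replaces the QLoRA adapter weights tracked below. As a usage illustration only (nothing in this commit ships inference code), here is a minimal sketch of loading that adapter on top of the base model with `transformers` and `peft`; the 4-bit settings mirror the `load_in_4bit` / `adapter: qlora` config above, while the prompt and generation arguments are placeholders.

```python
# Hypothetical usage sketch -- not part of this commit.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "google/gemma-2-27b-it"                     # base model from the config
adapter_id = "kweinmeister/gemma-2-27b-it-dolly-15k"  # this repository

# Load the base model in 4-bit, mirroring the QLoRA training setup.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)  # attach the LoRA adapter

prompt = "What kind of dataset is databricks-dolly-15k?"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```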
adapter_model.bin CHANGED

```diff
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:661b80aaae193a2bc65f5ebb67429f6c202da3bca1f700c37e0d8c4737584c7c
 size 456822394
```
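The updated Git LFS pointer above records the new adapter's SHA-256 object id and unchanged size. To confirm that a downloaded `adapter_model.bin` matches this revision, one option is to hash it locally (the local path is an assumption):

```python
import hashlib

# Compare a downloaded adapter_model.bin against the sha256 oid in the LFS pointer.
expected = "661b80aaae193a2bc65f5ebb67429f6c202da3bca1f700c37e0d8c4737584c7c"

h = hashlib.sha256()
with open("adapter_model.bin", "rb") as f:            # assumed local path
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
print(h.hexdigest() == expected)                      # True if the file matches this commit
```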