Built with Axolotl

See axolotl config

axolotl version: 0.13.0.dev0

base_model: gs://vertex-model-garden-restricted-us/gemma3/gemma-3-12b-it

# gemma3 doesn't seem to play nice with ddp
ddp_find_unused_parameters: true
experimental_skip_move_to_device: true  # prevent OOM by NOT putting model to GPU before sharding

plugins:
- axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
- axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
liger_fused_linear_cross_entropy: true

chat_template: gemma3
eot_tokens:
- <end_of_turn>

dataset_prepared_path: last_run_prepared
output_dir: /workspace/outputs/out

sequence_len: 8192
sample_packing: true
eval_sample_packing: true

use_kernels: true

micro_batch_size: 2
eval_batch_size: 1
gradient_accumulation_steps: 1
num_epochs: 3
optimizer: adamw_torch_fused
learning_rate: 1e-5

lr_scheduler: cosine

bf16: true
tf32: true
logging_steps: 1
flash_attention: true

gradient_checkpointing: true
activation_offloading: true

val_set_size: 0
eval_strategy: "epoch"
save_strategy: 'no'
include_tokens_per_second: true
save_safetensors: true
use_tensorboard: true

fsdp_version: 1
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Gemma3DecoderLayer
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sharding_strategy: SHARD_GRAD_OP
  fsdp_backward_prefetch: BACKWARD_PRE
  final_state_dict_type: FULL_STATE_DICT

tmp/output_dir/gcs/fine-tuning-e28a4df1-aece-4fdc-b956-740e307dc840/postprocess/node-0/checkpoints/final

This model is a fine-tuned version of gs://vertex-model-garden-restricted-us/gemma3/gemma-3-12b-it, trained on the gs://fine-tuning-e28a4df1-aece-4fdc-b956-740e307dc840/copymediadatatask/execution_artifacts/clean_train.jsonl dataset. It achieves the following results on the evaluation set:

  • Loss: 0.1692
  • Memory/max Active (GiB): 34.85
  • Memory/max Allocated (GiB): 34.85
  • Memory/device Reserved (GiB): 71.78

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-07
  • train_batch_size: 2
  • eval_batch_size: 1
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • total_train_batch_size: 16
  • total_eval_batch_size: 8
  • optimizer: adamw_torch_fused (OptimizerNames.ADAMW_TORCH_FUSED) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 39
  • training_steps: 1323
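
The batch-size and step figures above can be cross-checked with a little arithmetic, and the learning-rate curve can be sketched from the scheduler type and warmup length. The snippet below is a minimal sketch, not the trainer's own code: the function names are hypothetical, and the schedule assumes the common linear-warmup-then-cosine-decay shape with a single half cycle. The peak value is passed in as a parameter and set here to the 1e-07 reported in the list above.

```python
import math


def effective_batch_size(micro_batch_size, grad_accum_steps, num_devices):
    """Sequences consumed per optimizer step across all devices."""
    return micro_batch_size * grad_accum_steps * num_devices


def cosine_lr(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Values taken from the hyperparameter list: 2 per device, 1 accumulation step, 8 devices.
total_train_batch_size = effective_batch_size(2, 1, 8)

# 1323 training steps over 3 epochs.
steps_per_epoch = 1323 // 3

if __name__ == "__main__":
    print("total_train_batch_size:", total_train_batch_size)
    print("steps_per_epoch:", steps_per_epoch)
    for step in (0, 39, 441, 882, 1323):
        print(step, cosine_lr(step, peak_lr=1e-07, warmup_steps=39, total_steps=1323))
```

Under these assumptions the rate peaks at step 39 and reaches zero at step 1323; the epoch boundaries fall at steps 441, 882, and 1323.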

Training results

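| Training Loss | Epoch | Step | Validation Loss | Active (GiB) | Allocated (GiB) | Reserved (GiB) |
|:-------------:|:-----:|:----:|:---------------:|:------------:|:---------------:|:--------------:|
| No log        | 0     | 0    | 0.9626          | 34.82        | 34.82           | 58.44          |
| 0.172         | 1.0   | 441  | 0.1805          | 34.85        | 34.85           | 62.36          |
| 0.1608        | 2.0   | 882  | 0.1705          | 34.85        | 34.85           | 71.78          |
| 0.1587        | 3.0   | 1323 | 0.1692          | 34.85        | 34.85           | 71.78          |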
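
To read the results at a glance, it can help to compute how much the validation loss falls between evaluations and how reserved memory compares with allocated memory. The following is a rough sketch using only the numbers in the table; the helper names are hypothetical, and it assumes the memory columns correspond to the usual allocator statistics, where reserved memory includes blocks held by the caching allocator but not currently in use.

```python
# (step, validation_loss, allocated_gib, reserved_gib) copied from the results table.
ROWS = [
    (0, 0.9626, 34.82, 58.44),
    (441, 0.1805, 34.85, 62.36),
    (882, 0.1705, 34.85, 71.78),
    (1323, 0.1692, 34.85, 71.78),
]


def relative_drops(losses):
    """Fractional decrease between each pair of consecutive evaluations."""
    return [(prev - cur) / prev for prev, cur in zip(losses, losses[1:])]


def reserved_ratio(allocated_gib, reserved_gib):
    """How many times larger the reserved pool is than the allocated memory."""
    return reserved_gib / allocated_gib


if __name__ == "__main__":
    losses = [row[1] for row in ROWS]
    for (step, *_), drop in zip(ROWS[1:], relative_drops(losses)):
        print(f"step {step}: validation loss fell by {drop:.1%}")
    final = ROWS[-1]
    print(f"reserved/allocated at end: {reserved_ratio(final[2], final[3]):.2f}x")
```

Most of the improvement occurs in the first epoch, while the later epochs contribute comparatively small gains.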

Framework versions

  • Transformers 4.55.4
  • Pytorch 2.7.1+cu126
  • Datasets 4.0.0
  • Tokenizers 0.21.4

Safetensors · Model size: 12B params · Tensor type: BF16