metadata

license: cc-by-nc-4.0
base_model: lightblue/suzume-llama-3-8B-multilingual
tags:
  - generated_from_trainer
model-index:
  - name: >-
      workspace/llm_training/axolotl/llama3-multilingual-orpo/output_mitsu_half_borda
    results: []

See axolotl config

axolotl version: 0.4.0

base_model: lightblue/suzume-llama-3-8B-multilingual
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer  # PreTrainedTokenizerFast

load_in_8bit: false
load_in_4bit: false
strict: false

rl: orpo
orpo_alpha: 0.1
remove_unused_columns: false

chat_template: chatml
datasets:
  - path: lightblue/mitsu_tophalf_borda
    type: orpo.chat_template
    conversation: llama-3
dataset_prepared_path: /workspace/llm_training/axolotl/llama3-multilingual-orpo/prepared_mitsu_half_borda
val_set_size: 0.02
output_dir: /workspace/llm_training/axolotl/llama3-multilingual-orpo/output_mitsu_half_borda

sequence_len: 8192
sample_packing: false
pad_to_sequence_len: true

use_wandb: true
wandb_project: axolotl
wandb_entity: peterd
wandb_name: mitsu_half_borda

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 8e-6

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch: 20
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed: /workspace/axolotl/deepspeed_configs/zero3_bf16.json
weight_decay: 0.0
special_tokens:
  pad_token: <|end_of_text|>
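
The config above selects ORPO (`rl: orpo`) with `orpo_alpha: 0.1`. As a rough illustration of what that setting controls, the sketch below computes the ORPO objective for a single preference pair, assuming `orpo_alpha` plays the role of the weight on the odds-ratio term described in the ORPO paper. The function names and the scalar inputs are hypothetical and are not taken from axolotl. Real trainers work on batched token log-probabilities.

```python
import math


def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logp).

    avg_logp is the length-normalised log-likelihood of a response and
    must be strictly negative so that p < 1.
    """
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(nll_chosen: float,
              avg_logp_chosen: float,
              avg_logp_rejected: float,
              alpha: float = 0.1) -> float:
    """SFT loss on the chosen response plus a weighted odds-ratio penalty.

    alpha is assumed to correspond to `orpo_alpha` in the config.
    """
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))
    return nll_chosen + alpha * odds_ratio_term


# The penalty shrinks as the chosen response becomes more likely than the rejected one.
print(orpo_loss(1.0, avg_logp_chosen=-0.2, avg_logp_rejected=-1.5))
print(orpo_loss(1.0, avg_logp_chosen=-1.5, avg_logp_rejected=-0.2))
```

With a small weight such as 0.1, the supervised term on the chosen response dominates and the odds-ratio term acts as a mild preference signal.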

workspace/llm_training/axolotl/llama3-multilingual-orpo/output_mitsu_half_borda

This model is a fine-tuned version of lightblue/suzume-llama-3-8B-multilingual, trained with ORPO on the lightblue/mitsu_tophalf_borda dataset (see the axolotl config above). It achieves the following results on the evaluation set:

Loss: 0.0935

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

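Training used the lightblue/mitsu_tophalf_borda dataset. According to the axolotl config, 2% of it (val_set_size: 0.02) was held out as the evaluation set.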

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 8e-06
train_batch_size: 1
eval_batch_size: 1
seed: 42
distributed_type: multi-GPU
num_devices: 4
gradient_accumulation_steps: 8
total_train_batch_size: 32
total_eval_batch_size: 4
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 10
num_epochs: 1
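
The derived values in the list above can be reproduced from the base settings. The sketch below recomputes the total train batch size and evaluates a linear-warmup, cosine-decay learning rate at a few steps. `TOTAL_STEPS = 41` is an assumption estimated from the training results table (step 39 at epoch 0.94) and is not stated in the card. The schedule formula follows the usual warmup-plus-cosine shape and may differ in detail from what the trainer ran.

```python
import math

# Values taken from the hyperparameter list
MICRO_BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 8
NUM_DEVICES = 4
BASE_LR = 8e-6
WARMUP_STEPS = 10

# Assumption: roughly 39 / 0.94 optimizer steps in one epoch
TOTAL_STEPS = 41

effective_batch = MICRO_BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_DEVICES


def lr_at(step: int) -> float:
    """Learning rate after `step` optimizer steps (linear warmup, cosine decay)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print("total_train_batch_size:", effective_batch)
for step in (0, 5, 10, 25, TOTAL_STEPS):
    print(step, lr_at(step))
```

Because warmup takes 10 of roughly 41 steps, about a quarter of the run happens below the peak learning rate.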

Training results

Training Loss	Epoch	Step	Validation Loss
7.6299	0.02	1	7.7014
7.041	0.07	3	3.9786
0.6089	0.15	6	0.1393
0.1308	0.22	9	0.1244
0.1051	0.29	12	0.1112
0.1021	0.36	15	0.1063
0.0861	0.44	18	0.1026
0.1031	0.51	21	0.0979
0.0996	0.58	24	0.0967
0.0923	0.65	27	0.0960
0.1025	0.73	30	0.0944
0.1103	0.8	33	0.0939
0.0919	0.87	36	0.0937
0.104	0.94	39	0.0935
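
As a quick sanity check on the table above, the sketch below copies the (step, validation loss) pairs and confirms that the last evaluation is also the best one, which matches the reported evaluation loss of 0.0935. The variable names are illustrative only.

```python
# (step, validation_loss) pairs copied from the training results table
eval_log = [
    (1, 7.7014), (3, 3.9786), (6, 0.1393), (9, 0.1244),
    (12, 0.1112), (15, 0.1063), (18, 0.1026), (21, 0.0979),
    (24, 0.0967), (27, 0.0960), (30, 0.0944), (33, 0.0939),
    (36, 0.0937), (39, 0.0935),
]

# Evaluation point with the lowest validation loss
best_step, best_loss = min(eval_log, key=lambda row: row[1])

# True if validation loss never increases between consecutive evaluations
never_increases = all(
    later[1] <= earlier[1] for earlier, later in zip(eval_log, eval_log[1:])
)

print(best_step, best_loss, never_increases)
```

The training loss column is noisier than the validation column, which is expected with a micro batch size of 1 and logging at every step.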

Framework versions

Transformers 4.38.2
Pytorch 2.2.1+cu121
Datasets 2.18.0
Tokenizers 0.15.0