
POC-NEW-Meta-Llama-3-8B-MEDAL-flash-attention-2-cosine-evaldata

This model is a fine-tuned version of meta-llama/Meta-Llama-3-8B on the generator dataset. It achieves the following results on the evaluation set:

  • Loss: 2.2356

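With standard per-token cross-entropy, the final evaluation loss of 2.2356 corresponds to a perplexity of roughly exp(2.2356) ≈ 9.35. A minimal sketch of the conversion (assuming the reported loss is the mean token-level cross-entropy):

import math

eval_loss = 2.2356                  # final validation loss reported above
perplexity = math.exp(eval_loss)    # per-token perplexity, assuming mean cross-entropy loss
print(round(perplexity, 2))         # ~9.35
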
Model description

Article on the SFTTrainer approach used for this fine-tune: https://medium.com/@frankmorales_91352/sfttrainer-a-comprehensive-exploration-of-its-concept-advantages-limitations-history-and-19ab0926e74e

Training and evaluation data

Evaluation notebook: https://github.com/frank-morales2020/MLxDL/blob/main/Meta_Llama_3_8B_for_MEDAL_EVALUATOR_evaldata_NEW_POC.ipynb

Training procedure

Fine-tuning notebook: https://github.com/frank-morales2020/MLxDL/blob/main/FineTuning_LLM_Meta_Llama_3_8B_for_MEDAL_EVALDATA_PONEW.ipynb

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0002
  • train_batch_size: 3
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 24
  • optimizer: AdamW (adamw_torch_fused) with betas=(0.9, 0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • lr_scheduler_warmup_ratio: 0.03
  • lr_scheduler_warmup_steps: 1500
  • num_epochs: 0.5

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="/content/gdrive/MyDrive/model/POC-NEW-Meta-Llama-3-8B-MEDAL-flash-attention-2-cosine-evaldata",
    num_train_epochs=0.5,                       # half an epoch for the POC run
    per_device_train_batch_size=3,              # batch size per device during training
    gradient_accumulation_steps=8,              # steps before each backward/update pass
    gradient_checkpointing=True,                # save memory at the cost of extra compute
    gradient_checkpointing_kwargs={"use_reentrant": True},
    optim="adamw_torch_fused",                  # fused AdamW optimizer
    learning_rate=2e-4,                         # learning rate, based on the QLoRA paper
    max_grad_norm=1.0,                          # max gradient norm, based on the QLoRA paper
    warmup_ratio=0.03,                          # warmup ratio, based on the QLoRA paper
    warmup_steps=1500,
    weight_decay=0.01,
    lr_scheduler_type="constant",               # constant learning-rate scheduler
    bf16=True,                                  # train in bfloat16 precision
    tf32=True,                                  # allow TF32 matmuls
    logging_steps=100,                          # log every 100 steps
    logging_dir="/content/gdrive/MyDrive/model/POC-NEW-Meta-Llama-3-8B-MEDAL-flash-attention-2-cosine-evaldata/logs",
    evaluation_strategy="steps",                # evaluate every eval_steps
    eval_steps=100,
    save_strategy="steps",                      # checkpoint every save_steps
    save_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    report_to="tensorboard",                    # log metrics to TensorBoard
    push_to_hub=True,                           # push the model to the Hugging Face Hub
)
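
The framework versions below include PEFT, so the fine-tune is published as a LoRA adapter on top of the base model. A minimal sketch of how these TrainingArguments could be wired into trl's SFTTrainer is shown below; the LoraConfig values, dataset variables, text column name, and maximum sequence length are illustrative assumptions rather than the exact configuration from the fine-tuning notebook linked above.

# A minimal sketch, assuming trl's SFTTrainer with a LoRA (PEFT) adapter.
# LoraConfig values, dataset variables, text column name, and max_seq_length
# are illustrative assumptions -- see the fine-tuning notebook for the real setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import SFTTrainer

model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,                # matches bf16=True above
    attn_implementation="flash_attention_2",   # matches the model name
    device_map="auto",
)

peft_config = LoraConfig(                      # hypothetical adapter settings
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=args,                                 # the TrainingArguments defined above
    train_dataset=train_dataset,               # MEDAL training split (assumed variable)
    eval_dataset=eval_dataset,                 # MEDAL evaluation split (assumed variable)
    peft_config=peft_config,
    dataset_text_field="text",                 # assumed text column name
    max_seq_length=512,                        # assumed maximum sequence length
)
trainer.train()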

Training results

Training Loss | Epoch  | Step | Validation Loss
2.4484        | 0.0207 | 100  | 2.3720
2.3535        | 0.0415 | 200  | 2.3370
2.3303        | 0.0622 | 300  | 2.3204
2.3153        | 0.0830 | 400  | 2.3081
2.3041        | 0.1037 | 500  | 2.2982
2.2904        | 0.1245 | 600  | 2.2917
2.2954        | 0.1452 | 700  | 2.2845
2.2795        | 0.1660 | 800  | 2.2790
2.2772        | 0.1867 | 900  | 2.2751
2.2769        | 0.2075 | 1000 | 2.2711
2.2711        | 0.2282 | 1100 | 2.2678
2.2722        | 0.2489 | 1200 | 2.2644
2.2690        | 0.2697 | 1300 | 2.2610
2.2651        | 0.2904 | 1400 | 2.2586
2.2625        | 0.3112 | 1500 | 2.2550
2.2579        | 0.3319 | 1600 | 2.2516
2.2532        | 0.3527 | 1700 | 2.2501
2.2560        | 0.3734 | 1800 | 2.2471
2.2509        | 0.3942 | 1900 | 2.2450
2.2482        | 0.4149 | 2000 | 2.2433
2.2470        | 0.4357 | 2100 | 2.2406
2.2404        | 0.4564 | 2200 | 2.2395
2.2377        | 0.4771 | 2300 | 2.2372
2.2373        | 0.4979 | 2400 | 2.2356

Framework versions

  • PEFT 0.11.1
  • Transformers 4.41.2
  • Pytorch 2.3.0+cu121
  • Datasets 2.20.0
  • Tokenizers 0.19.1
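
Since this repository contains a PEFT (LoRA) adapter rather than full model weights, it is loaded on top of the base model for inference. A minimal sketch is shown below; the Hub namespace placeholder, prompt, and generation settings are assumptions.

# A minimal sketch of loading the adapter for inference with PEFT.
# Replace <namespace> with the Hub namespace this adapter is published under;
# the prompt and generation settings are illustrative.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_id = "<namespace>/POC-NEW-Meta-Llama-3-8B-MEDAL-flash-attention-2-cosine-evaldata"
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

prompt = "The patient presented with elevated WBC and"   # illustrative MEDAL-style input
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))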