training config differences (same dataset)
current batches:
nv3[v0] (1700) | nv4[v1-2k] (4000) | nv4[v1-210k] (b1b2: 4000)
Same setup as https://huggingface.co/distill-lab/distill-n4_00-01_combined_cls_v1b2, but trained for 100 epochs instead of 20.
metrics:
***** train metrics *****
epoch                    = 100.0
total_flos               = 334833095087GF
train_loss               = 0.0776
train_runtime            = 4:53:00.40
train_samples_per_second = 56.955
train_steps_per_second   = 0.893

***** eval metrics *****
epoch                    = 100.0
eval_accuracy            = 0.7487
eval_loss                = 1.9947
eval_runtime             = 0:00:12.56
eval_samples_per_second  = 140.622
eval_steps_per_second    = 2.945
(no significant accuracy jump over the 20-epoch run; train_loss of 0.0776 vs. eval_loss of 1.9947 suggests the extra epochs mostly overfit. This was just to see what happens.)
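A minimal sketch for re-checking eval_accuracy from the saved checkpoint. The checkpoint path matches --output_dir of the run below; the "image" column name and "validation" split are assumptions (only the "star" label column is confirmed by the training flags):

import torch
from datasets import load_dataset
from transformers import AutoImageProcessor, AutoModelForImageClassification

ckpt = "distill-n4_00-01_combined_cls_v1b2-100e"  # local --output_dir of the run below
processor = AutoImageProcessor.from_pretrained(ckpt)
model = AutoModelForImageClassification.from_pretrained(ckpt).eval()

# split name and image column are assumed; adjust to the dataset's actual schema
ds = load_dataset("distill-lab/COMBINE_nai-distill_00-01_eagle.library", split="validation")

correct = 0
for ex in ds:
    inputs = processor(images=ex["image"], return_tensors="pt")
    with torch.no_grad():
        pred = model(**inputs).logits.argmax(-1).item()
    label = ex["star"]
    if isinstance(label, str):  # map string labels through the model's label2id if needed
        label = model.config.label2id[label]
    correct += int(pred == label)
print(f"accuracy: {correct / len(ds):.4f}")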
BASE_MODEL = "facebook/dinov2-with-registers-large"
DATASET = "distill-lab/COMBINE_nai-distill_00-01_eagle.library"
TASK = "classification"
# training on a single card, so using a larger batch size
cmd = f"""python -m trainlib.hf_trainer.cli \
--model_name_or_path {BASE_MODEL} \
--dataset_name {DATASET} \
--output_dir distill-n4_00-01_combined_cls_v1b2-100e \
--remove_unused_columns False \
--label_column_name star \
--task {TASK} \
--do_train \
--do_eval \
--eval_strategy steps \
--eval_steps 100 \
--learning_rate 1e-5 \
--num_train_epochs 100 \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 48 \
--logging_strategy steps \
--logging_steps 2 \
--save_total_limit 1 \
--seed 1337 \
--lr_scheduler_type cosine \
--dataloader_num_workers 16 \
--ignore_mismatched_sizes True
"""
# Hub-push flags, defined but not appended to cmd in this run
rest = f"""
--push_to_hub True \
--push_to_hub_organization distill-lab \
--hub_model_id nai-distill_00-01_combined_eagle_{TASK} \
--hub_strategy end
"""
print(cmd)
!{cmd}
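To push in the same run, the extra flags can be appended to the command, e.g. (a sketch, assuming trainlib.hf_trainer.cli accepts these Hub flags):

full_cmd = cmd.rstrip() + " " + rest.strip()
print(full_cmd)
!{full_cmd}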