Can't reproduce the model

by mobicham

Hello, thank you for making this model available!

I tried to reproduce the model using the settings from the model card. The resulting model works fine (follows instructions, properly stops at the eos_token, etc.), but I don't get the same results on arc_challenge@25 (eval call sketched below the numbers):

- phi-2   : 0.6109
- phi-sft : 0.6280
- attempt : 0.616
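
For context, this is roughly how I run the 25-shot eval; a minimal sketch assuming the EleutherAI lm-evaluation-harness v0.4 Python API, with the checkpoint path as a placeholder:

# Eval sketch using EleutherAI's lm-evaluation-harness (pip install lm-eval).
# "my-phi-sft-checkpoint" is a placeholder path, not the actual checkpoint.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=my-phi-sft-checkpoint,dtype=bfloat16,trust_remote_code=True",
    tasks=["arc_challenge"],
    num_fewshot=25,  # the @25 in arc_challenge@25
)
print(results["results"]["arc_challenge"])  # acc / acc_norm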

In terms of training settings, the only difference is that I use gradient accumulation to mimic a batch size of 64 on a single GPU:

from transformers import TrainingArguments

n_epochs             = 2
batch_size           = 1   # per-device batch size
effective_batch_size = 64  # target global batch size from the model card
grad_acc             = max(1, effective_batch_size // batch_size)

training_arguments = TrainingArguments(
    output_dir=".",
    num_train_epochs=n_epochs,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=grad_acc,  # 64 accumulation steps here
    logging_steps=1,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.001,
    max_steps=-1,        # train for the full n_epochs
    save_steps=10000000, # effectively disables intermediate checkpoints
    bf16=True,
)
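
For completeness, these arguments go into a standard Trainer loop; a minimal sketch, where train_dataset and tokenizer are placeholders (not the actual data/tokenizer from the model card):

# Training-loop sketch; train_dataset and tokenizer are placeholders.
from transformers import Trainer, DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_arguments,
    train_dataset=train_dataset,  # placeholder: tokenized SFT data
    data_collator=data_collator,
)
trainer.train()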

I am using the model from the refs/pr/23 revision in bfloat16, training only the attention and MLP layers (weights + biases); everything else is frozen (sketch after the loading code). Training the whole model or using LoRA gives worse results.

import torch
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    flash_attn=True, flash_rotary=True, fused_dense=True,  # phi-2 remote-code kwargs
    device_map='cuda',
    revision="refs/pr/23",
)
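
And this is roughly how I restrict training to attention + MLP; a sketch where the "mixer"/"mlp" substring filters are my assumption about the module names in the phi custom code, so worth double-checking against model.named_parameters():

# Freeze everything, then unfreeze attention + MLP (weights and biases).
# The "mixer" / "mlp" name filters are assumptions about the phi remote
# code's module naming; verify with model.named_parameters().
for param in model.parameters():
    param.requires_grad = False

for name, param in model.named_parameters():
    if ("mixer" in name) or ("mlp" in name):
        param.requires_grad = True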

Any tips are highly appreciated, thank you in advance!
