Fine-Tuning Phi-2 using QLoRA and Flash Attention 2 does not converge after recent updates

#86
by h4rz3rk4s3 - opened

Hi folks,

When fine-tuning Phi-2 with SFTTrainer using QLoRA and Flash Attention 2, the model does not converge and starts with quite a high initial loss of around 4.3. The loss fluctuates, but stays between 4.2 and 4.3 even after 42 training steps.

I'm running this code in Google Colab on an A100, with the following libraries installed:

!pip uninstall -y transformers
!pip install git+https://github.com/huggingface/transformers
!pip install trl[peft]
!pip install bitsandbytes loralib
!pip install wandb
!pip install datasets
!pip install accelerate
#!pip install deepspeed
#!pip install -U optimum
!pip install -U flash-attn
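
Since transformers comes straight from main, something like the following can be used to record which versions actually end up in the runtime (a minimal sketch, just for logging):

import accelerate
import bitsandbytes
import flash_attn
import peft
import transformers
import trl

# Log the exact versions in the Colab runtime, since the transformers build
# comes from the main branch and can change between runs.
for lib in (transformers, trl, peft, bitsandbytes, accelerate, flash_attn):
    print(lib.__name__, lib.__version__)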

These are my Training Arguments:

import argparse

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", type=str, default="/path/to/Phi-2")
    parser.add_argument("--dataset_name", type=str, default="/path/to/training_data")
    #parser.add_argument("--subset", type=str, default="data/finetune")
    parser.add_argument("--split", type=str, default="train")
    #parser.add_argument("--size_valid_set", type=int, default=4000)
    parser.add_argument("--streaming", action="store_true")
    #parser.add_argument("--shuffle_buffer", type=int, default=5000)
    parser.add_argument("--seq_length", type=int, default=1024)
    parser.add_argument("--max_steps", type=int, default=1000)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--gradient_accumulation_steps", type=int, default=8)
    parser.add_argument("--eos_token_id", type=int, default=2)
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    parser.add_argument("--lr_scheduler_type", type=str, default="linear")  # For the Llama-2 model use cosine
    parser.add_argument("--num_warmup_steps", type=int, default=100)
    parser.add_argument("--weight_decay", type=float, default=0.1)
    parser.add_argument("--local_rank", type=int, default=0)
    parser.add_argument("--neftune_noise_alpha", type=int, default=5)
    parser.add_argument("--fp16", action="store_true", default=False)
    parser.add_argument("--bf16", action="store_true", default=True)
    parser.add_argument("--gradient_checkpointing", action="store_true", default=True)
    #parser.add_argument("--use_reentrant", default=False)
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--num_workers", type=int, default=4)
    parser.add_argument("--output_dir", type=str, default="/path/to/checkpoints")
    parser.add_argument("--log_freq", default=1, type=int)
    parser.add_argument("--eval_freq", default=300, type=int)
    parser.add_argument("--save_freq", default=300, type=int)

    args, unknown = parser.parse_known_args()

    return args

This is my Training Function, where I'm also initializing QLoRA and Phi-2:

import os

import torch
from accelerate import Accelerator
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer


def run_training(args, train_data, val_data):
    print("Loading the model")

    lora_config = LoraConfig(
        r=64,
        lora_alpha=128,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "dense"
        ]
    )

    train_data.start_iteration = 0

    print("Starting main loop")

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False,
    )

    training_args = TrainingArguments(
        output_dir=args.output_dir,
        dataloader_drop_last=True,
        evaluation_strategy="steps",
        #torch_compile=True,
        max_steps=args.max_steps,
        eval_steps=args.eval_freq,
        save_steps=args.save_freq,
        logging_steps=args.log_freq,
        per_device_train_batch_size=args.batch_size,
        per_device_eval_batch_size=args.batch_size,
        learning_rate=args.learning_rate,
        lr_scheduler_type=args.lr_scheduler_type,
        warmup_steps=args.num_warmup_steps,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        gradient_checkpointing=args.gradient_checkpointing,
        fp16=args.fp16,
        bf16=args.bf16,
        optim="paged_adamw_32bit",
        weight_decay=args.weight_decay,
        neftune_noise_alpha=args.neftune_noise_alpha,
        #deepspeed="ds_config_zero3.json",
        run_name="Phi-2-ParlaMint-reduced-QLoRA",
        report_to="wandb",
        ddp_find_unused_parameters=False,
    )

    model = AutoModelForCausalLM.from_pretrained(
        args.model_path,
        quantization_config=bnb_config,
        trust_remote_code=True,
        use_flash_attention_2=True,
        device_map={"": Accelerator().process_index},
    )

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=train_data,
        eval_dataset=val_data,
        peft_config=lora_config,
        packing=True,
    )

    print_trainable_parameters(trainer.model)  # helper defined elsewhere in the script

    print("Training...")
    trainer.train()

    print("Saving last checkpoint of the model")
    trainer.model.save_pretrained(os.path.join(args.output_dir, "final_checkpoint/"))
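
For completeness, the entry point tying these together looks roughly like this (a sketch only; create_datasets is a placeholder for the ParlaMint loading code, which is not shown in this post):

from transformers import set_seed

if __name__ == "__main__":
    args = get_args()
    set_seed(args.seed)

    # create_datasets is a placeholder for the dataset loading/tokenization
    # code, which is omitted here.
    train_data, val_data = create_datasets(args)
    run_training(args, train_data, val_data)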

Could you help me out here? Am I doing something wrong, or am I missing something?

The dataset I'm using is a reduced version of the ParlaMint Corpus (this is all part of my Master's thesis), and I haven't uploaded it to the Hub yet. I did use the same dataset for a Llama-2 fine-tuning run, where the loss converged quite nicely.

Microsoft org

Hello @h4rz3rk4s3 !

Could you please try the very same script you are using again, but with the latest revision?

We might have found the source of the problem. Phi never used softmax_scale when it was trained with Flash-Attention; setting it to 1.0 seems to corrupt the outputs when Flash-Attention is used.
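
For reference, softmax_scale is the scaling argument the modeling code passes to flash_attn_func: leaving it at None lets Flash-Attention apply its default 1/sqrt(head_dim) scaling, while forcing 1.0 overrides it. The snippet below is only a toy illustration of that argument (made-up shapes, not the actual modeling_phi.py code):

import torch
from flash_attn import flash_attn_func

# Toy tensors in the (batch, seqlen, num_heads, head_dim) layout flash-attn expects;
# 32 heads of size 80 roughly match Phi-2's hidden size of 2560.
q = torch.randn(1, 16, 32, 80, dtype=torch.bfloat16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

out_default = flash_attn_func(q, k, v, causal=True, softmax_scale=None)  # default 1/sqrt(head_dim) scaling
out_forced = flash_attn_func(q, k, v, causal=True, softmax_scale=1.0)    # the override that corrupts outputs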

Regards,
Gustavo.

@gugarosa Yes, one moment. I will let the script run again.

@gugarosa It started with a loss of 3.25 and is still fluctuating around that value after 10 steps. I will keep it running for a bit and update you.

Thanks for the help so far!

Wait a second, I think I misunderstood you. I just saw that you updated modeling_phi.py. I first thought you were referring to an update in transformers. I'll download it again, run the script and update you.

Microsoft org
edited Jan 16

No worries! The idea is to merge it into transformers, but we can do it here for quicker debugging.

@gugarosa No improvement. The loss still starts and idles around 3.25. I will try running it with refs/pr/23 later, as suggested in a different discussion, and see if that works for me.
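
In case it helps anyone else, trying refs/pr/23 just means pinning that Hub revision when loading the model, roughly like this (using the Hub repo id instead of my local path, since revision only applies to Hub downloads; bnb_config and Accelerator come from the script above):

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    revision="refs/pr/23",  # pin the Hub revision suggested in the other discussion
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_flash_attention_2=True,
    device_map={"": Accelerator().process_index},
)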

Thanks again for the help!

Microsoft org

No problems! Thanks for the update!! We will continue investigating as well.

I have a similar issue: models fine-tuned (on some math problems) after the update have deteriorated. Before the update, fp16 mixed-precision training (with HF accelerate) produced fine-tuned models with an average accuracy of 63%. After the update, fp16 training no longer works (the loss gives NaNs), and bf16 training results in a model with an average accuracy of 55%. Is this because the original model was trained with fp16, so bf16 is expected to perform worse (and there is currently some issue with fp16 mixed-precision training)?

Microsoft org

Could you please re-run with the latest update (FP16)? We updated the modeling_phi.py file and disabled the auto-casting on the Attention layer. This is the same fix that the previous code had.
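
The idea, sketched very roughly below (a generic attention_forward, not the real modeling_phi.py implementation), is that the attention math runs with AMP autocast turned off, so fp16 mixed-precision training does not overflow inside it:

import torch

def attention_forward(q, k, v):
    # Sketch only: disable autocast around the attention math and upcast to fp32
    # (causal masking omitted for brevity).
    with torch.autocast(device_type="cuda", enabled=False):
        q, k, v = q.float(), k.float(), v.float()
        scores = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return scores @ v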

For the BF16, I think it is acceptable since the pre-trained model was trained with FP16.

It is indeed training with fp16 now, just as it did before the update. Thanks for the quick fix!

Microsoft org

No problems! Please let me know if you see anything else.

gugarosa changed discussion status to closed
