Training VRAM for 70B 32K

#1 by grimulkan

Hi, and thanks for a very nice paper. From the paper, it appears you managed to achieve an effective batch size of 64 on 8x A100, described as gradient accumulation = 8 × batch size = 8. Was that statement in the paper only for the smaller models, or did it apply to the 70B model at 32K context as well?

It also looks like you used LoRA (as opposed to QLoRA or GPTQ-LoRA), and it doesn't seem like you used bitsandbytes 8-bit loading (`load_in_8bit`), so I'm curious what batch size you achieved with 8x A100 using your method. If you really managed batch size = 8, that's very impressive and significantly cheaper than any other long-context training method around today, including quantization-based methods!

I also see no reason why your method cannot be combined with quantization for even more efficient training.
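For concreteness, the kind of combination I have in mind is sketched below, using the Hugging Face transformers/peft/bitsandbytes stack. This is not your training code; the rank, target modules, and other hyperparameters are just placeholders.

```python
# Rough sketch: LoRA adapters on top of a 4-bit quantized base model (QLoRA-style).
# Illustrative only -- not the LongLoRA authors' code; all settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",            # assumed model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # grad checkpointing, norms in fp32, etc.

lora_config = LoraConfig(
    r=8,                                    # placeholder rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the LoRA adapters are trainable
```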

Hi,

Thanks for your question.

Yes, it applies to the 70B model at 32K context as well. To be more clear, batch_size_per_gpu is 1 and gradient_acc_steps is 8. We use 8x A100 GPUs. Thus, the global effective batch size is 1 x 8 x 8 = 64. We use DeepSpeed with ZeRO stage 3 and FlashAttention-2.

The training script is launched like this:
```bash
torchrun --nproc_per_node=8 --master_port=6034 fine-tune.py \
    --model_name_or_path /dataset/pretrained-models/Llama-2-70b-hf \
    --bf16 True \
    --output_dir /dataset/models/ \
    --cache_dir /dataset/datasets/redpajama \
    --num_train_epochs 1 \
    --model_max_length 32768 \
    --use_flash_attn True \
    --low_rank_training True \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 16 \
    --learning_rate 2e-5 \
    --weight_decay 0.0 \
    --warmup_steps 20 \
    --lr_scheduler_type "constant_with_warmup" \
    --logging_steps 1 \
    --deepspeed configs/default_offload_opt_param.json \
    --tf32 True \
    --max_steps 1000
```
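The DeepSpeed config enables ZeRO stage 3 with CPU offload of both optimizer states and parameters. For readers without the repo at hand, a config of that shape typically looks roughly like the sketch below; this is an illustrative example, not necessarily the exact contents of configs/default_offload_opt_param.json.

```python
# Illustrative ZeRO-3 CPU-offload DeepSpeed config (values are placeholders, not the
# exact file from the repo). The HF Trainer accepts either a JSON path or a dict
# like this via TrainingArguments(deepspeed=...).
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,                                 # shard params, grads, optimizer states
        "offload_optimizer": {"device": "cpu", "pin_memory": True},  # optimizer states to CPU RAM
        "offload_param": {"device": "cpu", "pin_memory": True},      # parameters to CPU RAM
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",          # filled from --gradient_accumulation_steps 8
    "train_micro_batch_size_per_gpu": "auto",       # filled from --per_device_train_batch_size 1
    "train_batch_size": "auto",                     # 1 x 8 x 8 = 64 effective
    "gradient_clipping": "auto",
}
```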

The GPU memory cost on our machine is shown in the attached screenshot:
[image.png: GPU memory usage]

Regards,
Yukang Chen
