required setup for training

#2
by kmfoda - opened

Thanks so much for this, @Stancld! This is amazing. Are you able to share your training setup?

I'm trying to fine-tune this on another dataset, but I'm getting OOM errors on just 1 GPU when max_source_length is set to 16384. I also run into the same problem when I use 4 GPUs and DeepSpeed model parallelism.

Hey @kmfoda, I trained the model using 2 A100 40GB GPUs and had to use gradient checkpointing, otherwise I was getting OOM as well. Let me know if it helps. :]
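For reference, a minimal sketch of the two equivalent ways to turn gradient checkpointing on with the Hugging Face Trainer; the output path is just a placeholder and the remaining arguments mirror the command further down:

from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments

# Gradient checkpointing trades extra compute for memory by recomputing
# activations during the backward pass instead of storing them.
model = AutoModelForSeq2SeqLM.from_pretrained("google/long-t5-tglobal-large")
model.gradient_checkpointing_enable()

# Alternatively, let the Trainer enable it via the training arguments
# (this is what the --gradient_checkpointing flag of run_summarization.py toggles).
training_args = Seq2SeqTrainingArguments(
    output_dir="/tmp/longt5_pubmed",        # placeholder path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,
    bf16=True,
    gradient_checkpointing=True,
)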

Hey Daniel, thank you for providing these model details. Is there any documentation out there on how to train this model using the Trainer API? I tried it, but the results do not look good. Also, on the tokenization step, is it required to add the summarize prefix, as in the following?

if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "

Thank you.

Hey @Jorgeutd, there's no prefix for the LongT5 model, as it uses a different pre-training technique from T5.
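For illustration, a minimal tokenization sketch without any prefix (a sketch only, not the exact preprocessing from run_summarization.py; the article/abstract column names are those of ccdv/pubmed-summarization):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-large")

def preprocess(examples):
    # No "summarize: " prefix for LongT5 -- the document is tokenized as-is.
    model_inputs = tokenizer(
        examples["article"], max_length=16384, truncation=True
    )
    # Tokenize the target summaries as labels.
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["abstract"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs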

I used the command below to run training using HF Seq2Seq Trainer :]

run_summarization.py \
    --model_name_or_path google/long-t5-tglobal-large \
    --do_train --do_eval --do_predict \
    --dataset_name ccdv/pubmed-summarization \
    --max_source_length 16384 --max_target_length 512 \
    --per_device_train_batch_size 1 --gradient_accumulation_steps 64 \
    --optim adafactor --learning_rate 0.001 --lr_scheduler_type constant \
    --num_train_epochs 20 --gradient_checkpointing --bf16=True \
    --per_device_eval_batch_size 2 --predict_with_generate \
    --generation_num_beams 1 --generation_max_length 512 \
    --output_dir /tmp/longt5_pubmed --run_name LongT5-pubmed-16k-512-bs_128 \
    --report_to all --logging_steps 10 --eval_steps 500 --evaluation_strategy steps \
    --ddp_find_unused_parameters=False --no_cuda=False
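For reference, with 2 GPUs this command corresponds to an effective batch size of 1 (per device) × 64 (gradient accumulation) × 2 (GPUs) = 128, which is what the bs_128 in the run name refers to.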

Thank you, Daniel.

Thanks @Stancld, very helpful. Gradient checkpointing does work, but I find it increases training time 4x. Did you find this as well? Alternatively, did you ever consider model partitioning using DeepSpeed for LongT5? I'm facing issues doing so and wondering whether it's because of a limitation of DeepSpeed only working with full-attention models.

Yes, gradient checkpointing unfortunately slows down training; however, I'm really surprised by how big the difference is!
I haven't tried training with DeepSpeed, but it definitely deserves a try. That said, I don't have much experience with it :/

Yeah, it is surprising given that the original gradient checkpointing paper says training should only slow down by about 20%. I'll debug this and try to get DeepSpeed working, and I'll let you know if I make any progress. Thanks for the help!
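For anyone attempting the DeepSpeed route, a minimal sketch of a ZeRO stage-2 config passed through the Trainer; the stage choice, the placeholder output path, and the "auto" values are assumptions rather than a setup verified with LongT5 (it also requires the deepspeed package to be installed):

from transformers import Seq2SeqTrainingArguments

# Hypothetical ZeRO stage-2 config; the "auto" values are filled in by the
# Hugging Face DeepSpeed integration from the matching TrainingArguments fields.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {"stage": 2},
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

training_args = Seq2SeqTrainingArguments(
    output_dir="/tmp/longt5_pubmed_ds",     # placeholder path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,
    bf16=True,
    gradient_checkpointing=True,
    deepspeed=ds_config,                    # can also be a path to a JSON config file
)

Since deepspeed is a regular TrainingArguments field, run_summarization.py also accepts --deepspeed ds_config.json on the command line.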

kmfoda changed discussion status to closed

@Stancld, when using DDP, I encountered the following error in the middle of an epoch while trying to run your command. I used run_summarization.py. Is there anything particular about step 13? Any suggestions for solving this error?

1%|█ | 13/1872 [16:51<43:45:35, 84.74s/it]
Traceback (most recent call last):
  File "run_summarization.py", line 737, in <module>
    main()
  File "run_summarization.py", line 656, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/miniconda3/envs/t5long/lib/python3.8/site-packages/transformers/trainer.py", line 1409, in train
    return inner_training_loop(
  File "/home/miniconda3/envs/t5long/lib/python3.8/site-packages/transformers/trainer.py", line 1649, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/miniconda3/envs/t5long/lib/python3.8/site-packages/transformers/trainer.py", line 2345, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/miniconda3/envs/t5long/lib/python3.8/site-packages/transformers/trainer.py", line 2377, in compute_loss
    outputs = model(**inputs)
  File "/home//miniconda3/envs/t5long/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/miniconda3/envs/t5long/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 947, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 6

Hardware:
6 Quadro RTX 6000 and 2 A100-40GB GPUs, but I only used the 2 A100-40GB GPUs for this task.

Env:
transformers==4.20.1
torch==1.11.0+cu113

To Reproduce:
CUDA_VISIBLE_DEVICES=4,5 python -m torch.distributed.launch --nproc_per_node 2 --master_port 56666 run_summarization.py \
    --model_name_or_path Stancld/longt5-tglobal-large-16384-pubmed-3k_steps \
    --do_train --do_eval --do_predict \
    --dataset_name ccdv/pubmed-summarization \
    --max_source_length 16384 --max_target_length 512 \
    --per_device_train_batch_size 1 --gradient_accumulation_steps 64 \
    --optim adafactor --learning_rate 0.001 --lr_scheduler_type constant --num_train_epochs 1 --gradient_checkpointing \
    --bf16=True --per_device_eval_batch_size 2 --predict_with_generate --generation_num_beams 1 --generation_max_length 512 \
    --output_dir ./tmp/longt5_pubmed --run_name LongT5-pubmed-16k-512-bs_128 --report_to all \
    --logging_steps 100 --eval_steps 2000 --evaluation_strategy steps --ddp_find_unused_parameters=False --no_cuda=False

Here is the failed wandb run.

I tried to run with 1 GPU, and it works for 50+ steps without the error above.

@whaleloops: you can try turning on --ddp_find_unused_parameters=True; this works for me (but it looks like it slows things down a bit as well).
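In Python terms, that corresponds to the following training argument, which the Trainer forwards to torch.nn.parallel.DistributedDataParallel (a sketch with a placeholder output path):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="/tmp/longt5_pubmed",        # placeholder path
    gradient_checkpointing=True,
    ddp_find_unused_parameters=True,        # works around the missing-grad error, at some speed cost
)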
