Does gradient checkpointing work with this model?

#4
by oshizo - opened

Thank you for publishing such a wonderful model.
I am running into an issue where setting gradient_checkpointing=True in TrainingArguments does not seem to reduce VRAM usage during training.
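For reference, this is roughly how I am enabling it (the model id is a placeholder for this repository's model, and the dataset/Trainer details are omitted):

```python
from transformers import AutoModelForCausalLM, TrainingArguments

# Placeholder model id; in my case it is the model from this repository.
model = AutoModelForCausalLM.from_pretrained("<this-model-id>")

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,  # expected to trade extra compute for lower activation memory
)
# Passing these args to Trainer and training shows no drop in VRAM
# compared to gradient_checkpointing=False.
```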

My understanding may not be thorough, but when I compare the source code of modeling_gpt_neox.py with modeling_gpt_neox_japanese.py, the latter appears to be missing the conditional on self.gradient_checkpointing seen here:
https://github.com/huggingface/transformers/blob/118e9810687dd713b6be07af79e80eeb1d916908/src/transformers/models/gpt_neox/modeling_gpt_neox.py#L546
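To illustrate what I mean: that branch wraps each layer call in torch.utils.checkpoint during training. Here is a simplified, self-contained sketch of the pattern (my own paraphrase with a toy module, not the actual transformers source):

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TinyDecoderStack(nn.Module):
    """Toy stand-in for a decoder stack, only to illustrate the missing branch."""

    def __init__(self, hidden_size=64, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(hidden_size, hidden_size) for _ in range(num_layers))
        self.gradient_checkpointing = False  # the flag that gradient_checkpointing_enable() toggles

    def forward(self, hidden_states):
        for layer in self.layers:
            if self.gradient_checkpointing and self.training:
                # Do not store this layer's activations; recompute them during backward.
                hidden_states = checkpoint(layer, hidden_states, use_reentrant=False)
            else:
                hidden_states = layer(hidden_states)
        return hidden_states
```

If the Japanese model's layer loop lacks the equivalent branch, I would expect the flag to have no effect, which matches what I observe.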

Is this an intentional modification or perhaps an oversight? I would appreciate any insights you might have regarding this.

transformers v4.29.1

ABEJA, Inc. org

Thanks for looking at all the details and asking the question.
The difference regarding gradient_checkpointing is not intentional. At the time we submitted our pull request, GPT NeoX had the same implementation; gradient checkpointing in GPT NeoX was corrected in the following commit, which is why the two files now differ:
https://github.com/huggingface/transformers/commit/225c36fbe5ae2bdb1880da52e093c7e53596a7d1

Thank you for your response! I now understand the situation.
It would be helpful if you could add support for gradient_checkpointing, or emit a warning when the flag is set to True on a model that does not support it.
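For example, something along these lines would already help. This is just a hypothetical sketch, not existing transformers behaviour; it assumes the supports_gradient_checkpointing attribute that PreTrainedModel subclasses declare:

```python
import warnings

def warn_if_checkpointing_is_a_no_op(model, training_args):
    """Hypothetical helper: warn when gradient_checkpointing=True would do nothing."""
    if training_args.gradient_checkpointing and not getattr(
        model, "supports_gradient_checkpointing", False
    ):
        warnings.warn(
            f"{model.__class__.__name__} does not implement gradient checkpointing; "
            "gradient_checkpointing=True will not reduce VRAM usage."
        )
```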

ABEJA, Inc. org

We cannot promise a completion date, but we have started preparing a PR. Thank you for reminding us of this update opportunity!
