Proper dataset prep for Causal LM training task?
I have a question about how to properly train a GPT-2-like transformer for a causal LM task. This question applies both to fine-tuning and to training a model from scratch.
Following along with the tutorial at https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py#L480 , the authors manually group/resize the examples in their dataset to match the model's context length. Then, when they set up their `Trainer` instance, they use the `default_data_collator` supplied by the library.
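For reference, the grouping step I'm referring to looks roughly like this (my paraphrase of the script's `group_texts` logic; `block_size` and the dataset variable names are just placeholders):

```python
from itertools import chain

block_size = 1024  # GPT-2's context length

def group_texts(examples):
    # Concatenate all tokenized sequences in the batch, then cut into fixed-size blocks
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the trailing remainder that doesn't fill a whole block
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM, the labels are just a copy of the input ids
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset = tokenized_dataset.map(group_texts, batched=True)
```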
`transformers` also supplies a `DataCollatorForLanguageModeling`, and in my scripts I'm using that class instead. However, at train time I get a `ValueError` related to the GPT-2 tokenizer not having a padding token. I can "solve" this problem by adding the special pad token to the tokenizer myself and calling `model.resize_token_embeddings`.
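Concretely, the workaround in my script looks roughly like this (simplified; I'm using the stock `gpt2` checkpoint here purely for illustration):

```python
from transformers import (
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    DataCollatorForLanguageModeling,
)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Workaround: GPT-2 ships without a pad token, so add one and
# resize the embedding matrix to account for the new token id
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
model.resize_token_embeddings(len(tokenizer))

# mlm=False selects the causal LM behavior: the collator copies
# input_ids into labels (with padding positions masked out)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```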
My question is basically: is this approach of using a `DataCollatorForLanguageModeling` instance plus adding the special padding token correct / good practice? Or should I be using the `default_data_collator` and restructuring my dataset prior to training? Aren't the two approaches supposed to be functionally equivalent?
Thanks in advance.