Proper dataset prep for Causal LM training task?

#34
by dlo3 - opened

I have a question about how to properly train a GPT-2-like transformer for a causal LM task. This question applies both to fine-tuning and to training a model from scratch.

Following along with the tutorial supplied at https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py#L480 , the authors seem to manually group and chunk the tokenized examples so that each one matches the model's context length. Then, when they set up their Trainer instance, they use the default_data_collator supplied by the library.
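For context, here is a minimal sketch of that grouping idea as I understand it (not the script's exact code; the block_size value, the "text" column name, and the raw_datasets variable are placeholders for my own setup):

```python
from itertools import chain

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
block_size = 1024  # placeholder; normally the model's context length


def group_texts(examples):
    # Concatenate all tokenized sequences, then cut them into fixed-size blocks,
    # dropping the leftover tail that doesn't fill a full block.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM, labels are just a copy of input_ids; the model shifts
    # them internally when computing the loss.
    result["labels"] = result["input_ids"].copy()
    return result


# Assuming raw_datasets is a datasets.DatasetDict with a "text" column:
# tokenized = raw_datasets.map(lambda ex: tokenizer(ex["text"]), batched=True, remove_columns=["text"])
# lm_datasets = tokenized.map(group_texts, batched=True)
```

Because every example then has exactly block_size tokens and already carries labels, no padding is needed and the default_data_collator just stacks them into tensors.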

transformers also supplies a DataCollatorForLanguageModeling. In my scripts, I'm using this class. However, at train time, I get a ValueError because the GPT-2 tokenizer has no padding token. I can "solve" this problem by adding the special pad token to the tokenizer myself and calling model.resize_token_embeddings.
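Concretely, my workaround looks roughly like this (a sketch, assuming the stock gpt2 checkpoint; the "[PAD]" token choice is just what I happened to add):

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# GPT-2 ships without a pad token, so the collator raises a ValueError when
# it tries to pad variable-length examples. Adding one and resizing the
# embedding matrix makes the error go away.
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
model.resize_token_embeddings(len(tokenizer))

# mlm=False produces causal-LM labels: a copy of input_ids with padding
# positions set to -100 so they are ignored by the loss.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```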

My question is basically: is this approach of using a DataCollatorForLanguageModeling instance plus adding the special padding token correct / good practice? Or should I be using the default_data_collator and restructuring my dataset prior to training, as in the script above? Aren't the two approaches supposed to be functionally equivalent?

Thanks in advance.

dlo3 changed discussion title from Proper training method for Causal LM task? to Proper dataset prep for Causal LM training task?
dlo3 changed discussion status to closed
