Training script for cosmo-1b?

#6
by vdmbrsv - opened

Is there training code for cosmo-1b?

Hugging Face TB Research org
•
edited Mar 25, 2024

Hi @loubnabnl, thanks for pointing to the YAML file. I have two questions about the data preprocessing.

  1. The Cosmopedia data is in prompt-text format. For pretraining, do you simply concatenate the prompt and the text to form a document?
  2. I noticed the datasets in the YAML file have different folder names: tokenized_text_document, tokenized_completion_document, tokenized_train_prompt_document, and tokenized_script_document. Does this mean different data preparation methods were used for different subsets?

Thanks a lot!

Hugging Face TB Research org
  • We train only on the text column; the prompts are not used.
  • No, we didn't do any post-processing. The folder names differ only because the target columns had different names at the time, but they have all been renamed to text in Cosmopedia.
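
For readers who want to reproduce this step, here is a minimal sketch of "train on the target column only, ignore the prompt". It is not the actual cosmo-1b preprocessing code. It assumes records are plain dicts, and the older column names `completion` and `script` are guesses inferred from the folder names in the question; the function name `to_pretraining_documents` is hypothetical.

```python
# Hypothetical target-column names. "text" is the current name in Cosmopedia;
# "completion" and "script" are guesses inferred from the folder names above.
TARGET_COLUMNS = ("text", "completion", "script")


def to_pretraining_documents(records):
    """Return one document per record, taken from the target column only.

    The "prompt" field is deliberately ignored: it is not concatenated
    with the generated text.
    """
    documents = []
    for record in records:
        for column in TARGET_COLUMNS:
            if column in record:
                documents.append(record[column])
                break
        else:
            # No known target column was found in this record.
            raise KeyError(f"no target column in record with keys {sorted(record)}")
    return documents


if __name__ == "__main__":
    sample = [
        {"prompt": "Write a textbook section on graphs.", "text": "A graph is a set of nodes..."},
        {"prompt": "Write a video script on cells.", "script": "Today we look at cells..."},
    ]
    for doc in to_pretraining_documents(sample):
        print(doc)
```

Since the columns are now all named text, the lookup over several names should only matter for older copies of the data; with the current dataset, reading the text column directly should be enough.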