Thanks for the models and data, I have some questions.

#2
by NickyNicky - opened

Hi,

I would like to know which template you trained with (ChatML?).

How long was the training, and was it done with LoRA or full-parameter training?

Thank you so much!

Hi,

Thanks for your interest!

Template
Since we are training a base model instead of a chat model, we use a wide range of templates to diversify the format. We utilize templates from the Flan collection to diversify instruction-response pairs and templates from AdaptLLM to concatenate the raw text (context) with downstream instruction-response pairs. We are currently working on open-sourcing the code for templifying the data, so please stay tuned!
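To make this concrete, here is a minimal Python sketch of concatenating a raw text (context) with instruction-response pairs under randomly sampled templates. The template strings and the `templify` helper are illustrative assumptions, not the actual Flan or AdaptLLM templates; they only show the kind of format diversification described above.

```python
import random

# Hypothetical templates for illustration only; the real ones come from
# the Flan collection and AdaptLLM.
PAIR_TEMPLATES = [
    "Question: {instruction}\nAnswer: {response}",
    "{instruction}\n{response}",
    "Q: {instruction}\nA: {response}",
]
CONTEXT_TEMPLATES = [
    "{context}\n\n{qa}",
    "Read the following text and answer the questions.\n\n{context}\n\n{qa}",
]

def templify(context: str, pairs: list) -> str:
    """Concatenate raw text (context) with its instruction-response pairs,
    sampling templates to diversify the format."""
    qa = "\n\n".join(
        random.choice(PAIR_TEMPLATES).format(**pair) for pair in pairs
    )
    return random.choice(CONTEXT_TEMPLATES).format(context=context, qa=qa)

# Example:
# templify("Mitochondria are ...", [{"instruction": "What do mitochondria do?", "response": "..."}])
```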

Training Settings
For pre-training, we train all parameters. The training time for Instruction Pre-Training is the same as for Vanilla Pre-Training. We present the training details in Table 10 in the Appendix, where we train the 500M model for 5 days and the 1.3B model for 10 days.

[Screenshot: Table 10, training details]

Hi, for pre-training, do you split and concatenate all samples to the max sequence length?

Hi, we follow the same settings as GPT-3: packing multiple examples into one sequence until the max sequence length is reached. Each example is a one/few-shot instruction-augmented text.

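For illustration, here is a minimal sketch of this packing scheme. The `tokenizer` interface (an `encode` method and an `eos_token_id`) and the function name are assumptions for the example; the point is that examples are concatenated back-to-back and chunked at the max sequence length, as in GPT-3.

```python
def pack_examples(examples, tokenizer, max_seq_len=2048):
    """Pack tokenized examples back-to-back into fixed-length sequences.
    Each example is a one/few-shot instruction-augmented text."""
    sequences, buffer = [], []
    for text in examples:
        # append the example's tokens plus an end-of-text separator
        buffer.extend(tokenizer.encode(text) + [tokenizer.eos_token_id])
        # emit full sequences; examples may be split across boundaries
        while len(buffer) >= max_seq_len:
            sequences.append(buffer[:max_seq_len])
            buffer = buffer[max_seq_len:]
    return sequences
```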

Thank you for your reply! Another question: for the Mix PT experiment, does this mean mixing raw docs with the SFT data used to fine-tune the instruction synthesizer? What about mixing raw docs with the instruction data generated by the instruction synthesizer?

Hi, "mixing raw docs with the instruction data generated by the instruction synthesizer" is exactly what we did in the general pre-training from scratch, as described in section 2.2:

"Considering the large amount of data required for general pre-training from scratch, we only convert part of the raw corpora into instruction-augmented corpora, leaving the rest unchanged. Additionally, we mix the corpora with the data used for fine-tuning the instruction synthesizer to enhance task diversity."

For instance, if there are N raw texts, we may convert 1/5 of them into instruction-augmented texts, leaving the remaining 4/5 unchanged. We also mix these with the SFT data used for fine-tuning the instruction synthesizer.
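As a rough sketch of this mix (the function names, the `synthesize` stand-in for the instruction synthesizer, and the 0.2 ratio following the 1/5 example above are all illustrative):

```python
import random

def build_pretraining_corpus(raw_texts, sft_data, synthesize, augment_ratio=0.2):
    """Convert a fraction of raw texts into instruction-augmented texts,
    keep the rest unchanged, and mix in the synthesizer's SFT data."""
    raw_texts = list(raw_texts)
    random.shuffle(raw_texts)
    n_augment = int(len(raw_texts) * augment_ratio)
    augmented = [synthesize(t) for t in raw_texts[:n_augment]]
    unchanged = raw_texts[n_augment:]
    corpus = augmented + unchanged + list(sft_data)
    random.shuffle(corpus)
    return corpus
```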
