Thanks for the models and data, I have some questions.

#2
by NickyNicky - opened

Hi,

I would like to know which template you trained with (ChatML?).

How long was the training, and was it done with LoRA or full-parameter training?

Thank you so much!

Hi,

Thanks for your interest!

Template
Since we are training a base model instead of a chat model, we use a wide range of templates to diversify the format. We utilize templates from the Flan collection to diversify instruction-response pairs and templates from AdaptLLM to concatenate the raw text (context) with downstream instruction-response pairs. We are currently working on open-sourcing the code for templifying the data, so please stay tuned!
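To make this concrete, here is a minimal Python sketch of concatenating a raw text (context) with instruction-response pairs under randomly sampled templates. The template strings and the `templify` helper are illustrative assumptions, not the actual Flan or AdaptLLM templates; they only show the kind of format diversification described above.

```python
import random

# Hypothetical templates for illustration only; the real ones come from
# the Flan collection and AdaptLLM.
PAIR_TEMPLATES = [
    "Question: {instruction}\nAnswer: {response}",
    "{instruction}\n{response}",
    "Q: {instruction}\nA: {response}",
]
CONTEXT_TEMPLATES = [
    "{context}\n\n{qa}",
    "Read the following text and answer the questions.\n\n{context}\n\n{qa}",
]

def templify(context: str, pairs: list) -> str:
    """Concatenate raw text (context) with its instruction-response pairs,
    sampling templates to diversify the format."""
    qa = "\n\n".join(
        random.choice(PAIR_TEMPLATES).format(**pair) for pair in pairs
    )
    return random.choice(CONTEXT_TEMPLATES).format(context=context, qa=qa)

# Example:
# templify("Mitochondria are ...", [{"instruction": "What do mitochondria do?", "response": "..."}])
```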

Training Settings
For pre-training, we train all parameters. The training time for Instruction Pre-Training is the same as for Vanilla Pre-Training. We present the training details in Table 10 in the Appendix, where we train the 500M model for 5 days and the 1.3B model for 10 days.

[Screenshot: Table 10, training details]

Hi, for pre-training, do you split and concatenate all samples to the max sequence length?

Hi, we follow the same settings as GPT-3: packing multiple examples into one sequence until the max sequence length is reached. Each example is a one/few-shot instruction-augmented text.

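For illustration, here is a minimal sketch of this packing scheme. The `tokenizer` interface (an `encode` method and an `eos_token_id`) and the function name are assumptions for the example; the point is that examples are concatenated back-to-back and chunked at the max sequence length, as in GPT-3.

```python
def pack_examples(examples, tokenizer, max_seq_len=2048):
    """Pack tokenized examples back-to-back into fixed-length sequences.
    Each example is a one/few-shot instruction-augmented text."""
    sequences, buffer = [], []
    for text in examples:
        # append the example's tokens plus an end-of-text separator
        buffer.extend(tokenizer.encode(text) + [tokenizer.eos_token_id])
        # emit full sequences; examples may be split across boundaries
        while len(buffer) >= max_seq_len:
            sequences.append(buffer[:max_seq_len])
            buffer = buffer[max_seq_len:]
    return sequences
```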

Thank you for your reply! Another question: for the Mix PT experiment, does this mean mixing raw docs with the SFT data used to fine-tune the instruction synthesizer? What about mixing raw docs with the instruction data generated by the instruction synthesizer?

Hi, "mixing raw docs with the instruction data generated by the instruction synthesizer" is exactly what we did in the general pre-training from scratch, as described in section 2.2:

"Considering the large amount of data required for general pre-training from scratch, we only convert part of the raw corpora into instruction-augmented corpora, leaving the rest unchanged. Additionally, we mix the corpora with the data used for fine-tuning the instruction synthesizer to enhance task diversity."

For instance, if there are N raw texts, we may convert 1/5 of them into instruction-augmented texts, leaving the remaining 4/5 unchanged. We also mix these with the SFT data used for fine-tuning the instruction synthesizer.
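As a rough sketch of this mix (the function names, the `synthesize` stand-in for the instruction synthesizer, and the 0.2 ratio following the 1/5 example above are all illustrative):

```python
import random

def build_pretraining_corpus(raw_texts, sft_data, synthesize, augment_ratio=0.2):
    """Convert a fraction of raw texts into instruction-augmented texts,
    keep the rest unchanged, and mix in the synthesizer's SFT data."""
    raw_texts = list(raw_texts)
    random.shuffle(raw_texts)
    n_augment = int(len(raw_texts) * augment_ratio)
    augmented = [synthesize(t) for t in raw_texts[:n_augment]]
    unchanged = raw_texts[n_augment:]
    corpus = augmented + unchanged + list(sft_data)
    random.shuffle(corpus)
    return corpus
```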
