Papers
arxiv:2406.14491

Instruction Pre-Training: Language Models are Supervised Multitask Learners

Published on Jun 20
Β· Submitted by daixuancheng on Jun 21
#2 Paper of the day
Authors:
,
,

Abstract

Unsupervised multitask pre-training has been the critical method behind the recent success of language models (LMs). However, supervised multitask learning still holds significant promise, as scaling it in the post-training stage trends towards better generalization. In this paper, we explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train LMs. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. In pre-training from scratch, Instruction Pre-Training not only consistently enhances pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B. Our model, code, and data are available at https://github.com/microsoft/LMOps.

Community

Paper author Paper submitter
β€’
edited Jul 15

πŸ€— We share our data and models with example usages, feel free to open any issues or discussions! πŸ€—

Hi, What is the difference between instruction fine-tuning and instruction pre-training (in terms of training) discussed in the paper (except the fact that in IFT, we normally use parameter efficient techniques like LoRA only to update a portion of parameters)?

Β·

Hi, thanks for your interest. Except for the pre-training data, Instruction Pre-Training keeps all other pre-training settings the same as Vanilla Pre-Training. In our experiment with instruction tuning, we tune all the parameters, but I think the PEFT method would also be applicable!

This comment has been hidden

Hi - is the dataset you released the 200M one?

Β·

Hi,

This is the dataset we use to train the instruction synthesizer. We've been thinking about how to upload the pre-training data (including the 200M instruction-response pairs), but the dataset is too largeπŸ€”.

Hi, thanks for your work! I get few interesting insights from this.

  1. Based on your results, can I say that we can replace "pretrain on the raw corpora -> instruction tuning" with "instruction tuning" directly? I do not see large differences between typical instruction tuning with your proposed instruction pretraining, except that in your approach, you directly train using the instruction-response pairs.

  2. Do you perform any verifications to the generated instruction-response pairs?

Β·

Hi,

Thanks for your question!

Q1: Can I say that we can replace "pretrain on the raw corpora -> instruction tuning" with "instruction tuning" directly?

This is a promising approach worth trying. However, it may come with two limitations:

  1. Lack of Knowledge Source: Instruction Pre-training does not train on the instruction-response pairs alone. Instead, it trains on the concatenation of raw text and synthesized pairs, formatting the context-based task completion (e.g., reading comprehension), hoping to learn the knowledge embedded in the raw text. Vanilla instruction tuning (those without the raw text) tends to teach the pre-trained base model to follow instructions rather than learning new knowledge.

For example, in the following image, vanilla pre-training trains on the raw texts, instruction tuning trains on instruction-response pairs, whereas instruction pre-training trains on the instruction-augmented texts.
image.png

  1. Data Limitation: As far as I know, the existing datasets for instruction tuning are significantly smaller than those available for pre-training.

Q2: Do you perform any verifications of the generated instruction-response pairs?

Yes, in section 5 of our paper, we have checked the synthesized pairs in terms of context relevance, response accuracy, and task diversity.

image.png

In Table 3, the result of instruct-pt on MED is 61.3 and fin is 74.7, but in Table 4, the results are reversed. Is this a mistake?

Β·

Thanks for your careful review. The domain names in Table 4 should be reversed.

Have you given any thought to training a version on a higher ratio of code?

I am interested in finding a way to generate more up-to-date datasets on code libraries, examples etc... as many of today's LLMs are using quite dated knowledge thus often result in generating code with deprecated libraries and patterns. Just wondering if a more code-tuned version of this might do the trick.

Β·

Hi,

Thanks for your suggestion. You could try using our instruction synthesizer to generate instruction-response pairs based on new code materials, such as updated code textbooks. We've observed that the instruction synthesizer can generate relevant code tasks when the input text is related to the coding domain. This approach might help in creating more up-to-date datasets for code libraries and examples.

Cann you share the configurations, computation and time for instruction-response pairs generation with each raw dataset as well as for the pre-instruction training? It would help for us to work with new domains of dataset. Thank you!

Β·

Thanks for the question. We've added pre-training suggestions in the Advanced Usage section of our instruction synthesizer.

Using the vLLM inference code, on a single A100-80GB GPU, it takes about 1 day to synthesize instruction-response pairs for 1 billion tokens of raw corpora.

For domain-adaptive continual pre-training, we recommend setting M = 3 and mixing the instruction-augmented corpora with general instructions from OpenOrca at a 1:1 ratio (counted by tokens).

Other training details are presented in Table 10 in the Appendix of our paper:

image.png

Hi! From what I understand, you train the model by computing the loss on all tokens. Have you tried training the model by computing the loss only on the answers, as is commonly done during the Alignment SFT phase? Have you conducted any ablation studies on this? Or do you have any insights on why it might be better to compute the loss on all tokens?

Β·

Hi, thanks for the question. We have not tried computing the loss only on the answers yet for two reasons:

  • Keeping It Simple: We want our method to be very easy for everyone (including us) to use. By computing the loss on all tokens, we can just convert the data and train it like Vanilla Pre-training with familiar training setups.

  • Fair Comparison: If we only compute the loss on the answers, the number of tokens the model learns from is much smaller. So, if we run Vanilla Pre-training and Instruction Pre-training for the same number of steps with the same batch size, Instruction Pre-training ends up with fewer trained tokens, making it an unfair comparison.

Anyway, it’s a promising idea and worth trying since it has worked well in instruction tuning.

This comment has been hidden

Hello, I reviewed the paper you suggested. Since I haven’t read through all the content of the paper, I am a bit confused.

In the general-instruction-augmented-corpora, it seems that bos and eos tokens are not used.

However, in the ft-instruction-synthesizer-collection, it seems that bos and eos tokens are used to structure the data.

Is there a difference between the two?

Β·

Thanks for your question.

Q1: In the general-instruction-augmented-corpora, it seems that BOS and EOS tokens are not used.

BOS and EOS tokens are indeed used during pre-training. Although these tokens are not shown in the released dataset, our pre-training code automatically adds BOS and EOS token IDs during tokenization.

Following GPT-3, we pack multiple instruction-augmented texts into one sequence until the maximum sequence length is reached. Suppose the tokenized IDs for a single templatified instruction-augmented text (which represents an M-shot example) are TEXT_IDs_N:

An input sequence for pre-training would look like this:
bos_token_id, TEXT_IDs_1, eos_token_id, TEXT_IDs_2, eos_token_id, ... , eos_token_id, TEXT_IDs_N, eos_token_id

Q2: In the ft-instruction-synthesizer-collection, it seems that BOS and EOS tokens are used to structure the data.

When fine-tuning the instruction synthesizer, BOS and EOS tokens are also used to separate examples. Since we combined multiple examples to create a few-shot setting, we explicitly denote the presence of BOS and EOS in the script to ensure users remember to add them. Unlike pre-training, we don’t need to add BOS and EOS token IDs during tokenization in fine-tuning.

Overall, BOS and EOS tokens are utilized in both fine-tuning the instruction synthesizer and pre-training the LMs, but in different ways.

Sign up or log in to comment

Models citing this paper 29

Browse 29 models citing this paper

Datasets citing this paper 8

Browse 8 datasets citing this paper

Spaces citing this paper 30

Collections including this paper 26