instruction-pretrain
committed on
Update README.md
README.md
CHANGED
@@ -193,6 +193,36 @@ Let's try our method in continual pre-training for a quick start---it works easi
Feel free to ask questions or share suggestions at [this page](https://huggingface.co/papers/2406.14491); we will reply ASAP🤗!

Thanks for your interest.

## FAQ on Continual Pre-Training from Llama3

**Q1: Do you use the official Llama3 instruction prompt for pre-training?**

No. The official Llama3 instruction prompt is designed for the [instruction-tuned model](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), whereas our continual pre-training is conducted on the [pre-trained base model](https://huggingface.co/meta-llama/Meta-Llama-3-8B), where only the BOS (`<|begin_of_text|>`) and EOS (`<|end_of_text|>`) tokens are required.

**Q2: For the general instructions from OpenOrca, do you concatenate each instruction with its output using '\n'?**

No. As mentioned in the pre-training suggestions, we use a single space to concatenate each question with its response for the general instruction data from OpenOrca. OpenOrca's data is already templated with diverse natural language templates (such as those with `\n`), so a space is sufficient to format the data.

Note that our templated instruction-augmented texts need no additional concatenation.

**Q3: What about the system prompts in OpenOrca?**

We simply discard the system prompts.

**To put it all together, the text before tokenization looks like this:**

`general_instruction_response_text = "<|begin_of_text|>{question} {response}<|end_of_text|>"`

or

`instruction_augmented_text = "<|begin_of_text|>{instruction augmented text}<|end_of_text|>"`
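
The two templates above can be sketched in Python as follows. This is a minimal illustration, not the authors' actual preprocessing code: the function names are hypothetical, and the record field names `question` and `response` are assumed to match the OpenOrca dataset columns.

```python
# Special-token strings for the Llama3 base model, as quoted in the FAQ.
BOS = "<|begin_of_text|>"
EOS = "<|end_of_text|>"


def build_general_instruction_text(question: str, response: str) -> str:
    """Join an OpenOrca question and response with a single space."""
    return f"{BOS}{question} {response}{EOS}"


def build_instruction_augmented_text(augmented_text: str) -> str:
    """Wrap an already-templated instruction-augmented text; no extra separator."""
    return f"{BOS}{augmented_text}{EOS}"


def build_from_openorca_record(record: dict) -> str:
    """Hypothetical helper: field names are assumptions about the dataset schema.

    Any system prompt in the record is ignored (discarded), per Q3.
    """
    return build_general_instruction_text(record["question"], record["response"])
```

For example, `build_general_instruction_text("What is 2+2?", "4")` returns the string `<|begin_of_text|>What is 2+2? 4<|end_of_text|>`.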

Because the BOS and EOS tokens are already written into the text, you don't need to add the BOS and EOS token ids again during tokenization. The tokenization code looks like this:

`text_ids = tokenizer(text, add_special_tokens=False, **kwargs).input_ids`
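
As a rough sketch, the tokenization call above can be wrapped in a small helper together with a sanity check that BOS and EOS each appear exactly once. Both function names are hypothetical, and the `tokenizer` argument is assumed to behave like a Hugging Face tokenizer: callable, returning an object with an `input_ids` attribute.

```python
from typing import Any, List


def encode_pretraining_text(tokenizer: Any, text: str, **kwargs: Any) -> List[int]:
    """Tokenize text that already contains the BOS/EOS strings."""
    # add_special_tokens=False stops the tokenizer from adding BOS/EOS ids
    # on top of the ones already written into the text.
    return tokenizer(text, add_special_tokens=False, **kwargs).input_ids


def has_single_bos_and_eos(ids: List[int], bos_id: int, eos_id: int) -> bool:
    """Hypothetical check: one BOS at the start, one EOS at the end, no repeats."""
    return (
        len(ids) >= 2
        and ids[0] == bos_id
        and ids[-1] == eos_id
        and ids.count(bos_id) == 1
        and ids.count(eos_id) == 1
    )
```

With a tokenizer loaded through `AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")`, passing `tokenizer.bos_token_id` and `tokenizer.eos_token_id` to the check should confirm that the special tokens were not duplicated. It is worth running this on a few samples before launching training.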

## Citation
If you find our work helpful, please cite us: