Training config link is broken

hello new here kind of wanted to learn more and figure out some things some of this is way over my head I feel like I need a book for dummy's, but in your bust words how do i get the lib for this and use it I get lost alot 🤔

jon-tow

Feb 1, 2024

@davidgortega re:

Im planing to continue pre-training with a dataset of 100M tokens for two epochs. Do you think it would be enough to learn it?

If the domain of your data is relatively close to the pre-training dataset (see datasets metadata), it should be enough. Otherwise, it is hard to tell 😅 I'd also suggest fine-tuning the released checkpoint as opposed to continued pre-training from the pre-cooldown version since it's only for 200M tokens.

davidgortega

Feb 1, 2024

@jon-tow thanks for the reply.

Its wiki data in a specific domain (like fandom). I hope it works.
The problem of fine-tuning after the cool down is that training raw data with an empty prompt alone does not work as I expect. I have to combine empty prompts with synthetic instruct data generated from my corpus to learn a little and the outcome hallucinates a bit too much. Maybe you have a recipe?

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment