joelniklaus (HF Staff) committed
Commit 230393c · 1 Parent(s): 49ef907

added conclusions paragraph

app/src/content/chapters/conclusions.mdx CHANGED
@@ -1,5 +1,6 @@
 ## Conclusions
 
+We ran 65 experiments, generated over 750 billion tokens, and spent more than 74,000 GPU hours to figure out what actually matters for synthetic pretraining data. The answer is surprisingly simple: **prompt design is the single biggest lever**. Structured formats like Table, Math, FAQ, and Tutorial consistently beat both curated web baselines and prior synthetic methods, producing our best configuration, FinePhrase. You don't need a large rephrasing model to get there: a 1B model is sufficient for most prompts, and even low-quality source data works fine when paired with a strong mix-in dataset. In fact, template diversity matters more than template polish; a messier model that produces varied outputs can outperform a polished one that repeats the same structure. SmolLM2-1.7B emerged as the best rephrasing model across all prompts, beating larger models from other families. And we found no reliable proxy metric that can replace training and evaluating a model, meaning there is no shortcut around the full pipeline. We open-source all infrastructure, prompts, and benchmarking code through DataTrove so others can build on these findings without reinventing the plumbing.
 
 ### Next Steps
 