joelniklaus HF Staff committed on
Commit
9894e4e
·
1 Parent(s): 8be4608

added analysis with differently sized student models

app/src/content/assets/data/benchmark-results.csv CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:dc7f86e2cd5b311eb1fec66972254890034acd866037b5096f1551ef877fe72e
3
- size 1598658
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:531d5bf89cf3ca78b3813c791ae0096d816c9f50b16e851f4c9e9ed03b576462
3
+ size 2000388
app/src/content/chapters/2-setup.mdx CHANGED
@@ -62,7 +62,7 @@ With the datasets defined, we need a consistent way to tell whether one configur
62
 
63
  ### How We Measure Success
64
 
65
- To evaluate each configuration, we follow the ablation methodology from FineWeb [@fineweb]: train a 1.2B parameter language model with a Qwen2-style architecture [@qwen2] (details in the [Appendix](#details-on-the-experiments)) on 20B tokens and evaluate on 12 benchmarks across six categories using 3-shot prompting with a single seed:
66
  <Sidenote>
67
  Since our model is small and trained on only 20B tokens, we use the **cloze format** (CF) for most tasks rather than standard multiple-choice. CF frames evaluation as next-token prediction, which gives more reliable signal for smaller models that may struggle with instruction following or multiple-choice formatting.
68
  </Sidenote>
 
62
 
63
  ### How We Measure Success
64
 
65
+ To evaluate each configuration, we follow the ablation methodology from FineWeb [@fineweb]: train a 1.7B parameter language model with a Qwen2-style architecture [@qwen2] (details in the [Appendix](#details-on-the-experiments)) on 20B tokens and evaluate on 12 benchmarks across six categories using 3-shot prompting with a single seed:
66
  <Sidenote>
67
  Since our model is small and trained on only 20B tokens, we use the **cloze format** (CF) for most tasks rather than standard multiple-choice. CF frames evaluation as next-token prediction, which gives more reliable signal for smaller models that may struggle with instruction following or multiple-choice formatting.
68
  </Sidenote>
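The cloze-format scoring described in the sidenote can be sketched in a few lines: score each answer continuation by its (length-normalized) log-likelihood and take the argmax, instead of asking the model to emit a choice letter. This is a minimal illustration with a toy stand-in for the model's log-probability function, not our evaluation harness:

```javascript
// Cloze-format (CF) scoring sketch: pick the continuation with the highest
// length-normalized log-likelihood. logprobFn is a stand-in for a real model.
function clozeScore(context, choices, logprobFn) {
  let best = null, bestScore = -Infinity;
  for (const choice of choices) {
    const score = logprobFn(context, choice) / choice.length; // normalize by length
    if (score > bestScore) { bestScore = score; best = choice; }
  }
  return best;
}

// Toy stand-in: pretend the model strongly prefers "Paris".
const toyLogprob = (ctx, choice) => (choice === "Paris" ? -2.0 : -8.0);
console.log(clozeScore("The capital of France is", ["Paris", "Lyon", "Nice"], toyLogprob)); // "Paris"
```

Length normalization matters here: without it, shorter continuations would be systematically favored.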
app/src/content/chapters/3-experiments.mdx CHANGED
@@ -186,7 +186,9 @@ The one exception is [guided_rewrite](#guided_rewrite_original), where the 4B mo
186
  This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
187
  The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), bigger models don't buy you better synthetic data. This aligns with findings from @demystifyingsynth, who showed that scaling generators from 8B to 70B parameters did not yield superior pretraining data, and with SwallowMath-v2 [@swallowmathv2], which reports no downstream gains on math data from scaling the rewriter from Qwen3-30B-A3B to Qwen3-235B-A22B. This is great news for cost: you can use cheap, fast models for most rephrasing tasks.
188
 
189
- That raises an interesting follow-up. REWIRE claims that you specifically need large models to salvage low-quality data. Does that hold up?
 
 
190
 
191
  #### Do we need better models for rephrasing low-quality data?
192
 
 
186
  This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
187
  The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), bigger models don't buy you better synthetic data. This aligns with findings from @demystifyingsynth, who showed that scaling generators from 8B to 70B parameters did not yield superior pretraining data, and with SwallowMath-v2 [@swallowmathv2], which reports no downstream gains on math data from scaling the rewriter from Qwen3-30B-A3B to Qwen3-235B-A22B. This is great news for cost: you can use cheap, fast models for most rephrasing tasks.
188
 
189
+ That said, those conclusions all come from a single 1.7B student model. We revisit whether a larger student would change the picture in the [Analyses](#is-our-proxy-model-too-small) section.
190
+
191
+ REWIRE claims that you specifically need large models to salvage low-quality data. Does that hold up?
192
 
193
  #### Do we need better models for rephrasing low-quality data?
194
 
app/src/content/chapters/4-analyses.mdx CHANGED
@@ -1,11 +1,13 @@
1
  import HtmlEmbed from "../../components/HtmlEmbed.astro";
2
  import Note from "../../components/Note.astro";
3
  import Wide from "../../components/Wide.astro";
 
 
4
 
5
 
6
  ## Analyses
7
 
8
- The experiments tell us *what* works. Now let's zoom out and ask *why*. We look at the cost of running these experiments, whether cheap proxy metrics can replace expensive training runs, what the rephrased outputs actually look like, and why a messier model sometimes wins.
9
 
10
  ### Is More Compute Worth It?
11
 
@@ -72,7 +74,188 @@ The correlation matrix tells us that quality scores are weak predictors, but not
72
 
73
  **Edu-scores tell the opposite story.** Most experiments *decrease* the edu-score through rephrasing, particularly those starting from high-quality sources (FineWeb-Edu-HQ has high baseline edu-scores). The edu-score classifier penalizes format changes like tables, FAQs, and math notation that our best prompts produce. This is a case where the proxy metric actively misleads: the "quality degradation" measured by edu-score corresponds to format transformations that *improve* downstream performance.
74
 
75
- So quality scores designed for filtering web data don't transfer to synthetic data. Maybe looking at the outputs more directly helps. For instance, does the length of the rephrased output tell us anything?
 
76
 
77
  ### Do Chatty Models Make Better Data?
78
 
@@ -160,6 +343,7 @@ The lesson: for pretraining data, diversity beats consistency. A model that does
160
  <Note title="Summary: Analyses" variant="info">
161
  **Cost**: Small models with simple prompts dominate the Pareto frontier. Invest in prompt design, not model size.<br/>
162
  **Quality scores**: Neither edu-score nor DCLM-score reliably predicts downstream performance for synthetic data. There is no shortcut to training and evaluating.<br/>
 
163
  **Verbosity**: Output length has no meaningful relationship with performance. What matters is content, not compression ratio.<br/>
164
  **Diversity**: Template collapse hurts more than noisy outputs. A messier model that produces varied text can outperform a polished one that repeats the same template.
165
  </Note>
 
1
  import HtmlEmbed from "../../components/HtmlEmbed.astro";
2
  import Note from "../../components/Note.astro";
3
  import Wide from "../../components/Wide.astro";
4
+ import Accordion from "../../components/Accordion.astro";
5
+ import Sidenote from "../../components/Sidenote.astro";
6
 
7
 
8
  ## Analyses
9
 
10
+ The experiments tell us *what* works. Now let's zoom out and ask *why*. We look at the cost of running these experiments, whether cheap proxy metrics can replace expensive training runs, whether our proxy model is too small to reveal quality differences, what the rephrased outputs actually look like, and why a messier model sometimes wins.
11
 
12
  ### Is More Compute Worth It?
13
 
 
74
 
75
  **Edu-scores tell the opposite story.** Most experiments *decrease* the edu-score through rephrasing, particularly those starting from high-quality sources (FineWeb-Edu-HQ has high baseline edu-scores). The edu-score classifier penalizes format changes like tables, FAQs, and math notation that our best prompts produce. This is a case where the proxy metric actively misleads: the "quality degradation" measured by edu-score corresponds to format transformations that *improve* downstream performance.
76
 
77
+ So quality scores designed for filtering web data don't transfer to synthetic data. If we can't shortcut the evaluation, we should at least make sure the evaluation itself is trustworthy. One obvious concern: all our model-size experiments used a 1.7B student. What if that student is simply too small to tell good data from great data?
78
+
79
+ ### Is Our Proxy Model Too Small?
80
+
81
+ In the [model size experiment](#does-the-model-size-matter) we found that generator size barely matters past 1B, but every one of those runs used a single 1.7B student. A student that small might cap out on all the data mixes equally, making 1B and 27B generator data look the same when a bigger student could tell them apart. To check, we trained students at four sizes on identical data mixes.
82
+
83
+ <Accordion title="Student preset sizes and training settings">
84
+
85
+ | Preset | Parameters | hidden | intermediate | tp | recompute layer | micro batch | eval batch |
86
+ |--------|---------------|--------|--------------|----|-----------------|-------------|------------|
87
+ | 0.5B | 483,714,048 | 1024 | 3072 | 1 | off | 4 | 32 |
88
+ | 1.7B | 1,672,071,168 | 2048 | 6144 | 1 | off | 2 | 16 |
89
+ | 2.9B | 2,860,792,320 | 2560 | 9216 | 1 | on | 1 | 8 |
90
+ | 6.2B | 6,162,714,624 | 4096 | 12288 | 2 | on | 1 | 4 |
91
+
92
+ `tp` is tensor-parallel width. Recompute layer toggles activation checkpointing, which trades extra compute for lower memory during training. Micro batch and eval batch are the micro-batch size during training and the batch size for evaluation runs.
93
+
94
+ </Accordion>
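The parameter counts in the table line up with the appendix architecture (28 layers, 16 query heads, 8 KV heads). A minimal sketch that reproduces the 1.7B figure, assuming tied input/output embeddings and bias-free attention projections (my inference from the numbers, not stated in the table):

```javascript
// Parameter-count sketch for a Qwen2-style decoder, assuming tied embeddings
// and no attention biases (these assumptions make the 1.7B preset match exactly).
function qwen2StyleParams({ layers, hidden, intermediate, vocab, nHeads, nKvHeads }) {
  const headDim = hidden / nHeads;
  const kvDim = nKvHeads * headDim;          // grouped-query attention: fewer KV heads
  const attn = hidden * hidden               // q_proj
             + hidden * kvDim                // k_proj
             + hidden * kvDim                // v_proj
             + hidden * hidden;              // o_proj
  const mlp = 3 * hidden * intermediate;     // gate, up, down projections
  const norms = 2 * hidden;                  // two RMSNorms per layer
  return layers * (attn + mlp + norms)
       + vocab * hidden                      // tied input/output embedding
       + hidden;                             // final RMSNorm
}

const params = qwen2StyleParams({
  layers: 28, hidden: 2048, intermediate: 6144,
  vocab: 128256, nHeads: 16, nKvHeads: 8,
});
console.log(params); // 1672071168, the 1.7B preset in the table
```

The other presets presumably vary layer and head counts as well, so the same function needs their full configs to reproduce those rows.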
95
+
96
+ {/*
97
+ <Sidenote>
98
+ Why stop at 6.2B? Flash Attention 2 [@flashattention2] in our stack can't go past hidden size 4096 with 16 attention heads without running out of memory. And as we'll see, the 6.2B student doesn't separate generators any better than the 2.9B, so there's little reason to push further.
99
+ </Sidenote>
100
+ */}
101
+
102
+ We swept Gemma-3 generators (270M through 27B) on three prompts ([guided_rewrite](#guided_rewrite_original), [math](#math), [tutorial](#tutorial)), always mixing with FineWeb-Edu-HQ. Pick a student size and prompt below:
103
+
104
+ <HtmlEmbed
105
+ id="student-capacity-generator-sweep"
106
+ src="d3-benchmark-comparison.html"
107
+ desc="Gemma-3 generator sweep at different student capacities. Use Student to switch capacity and Setup to switch prompt."
108
+ config={{
109
+ setupDimension: {
110
+ label: "Student",
111
+ groups: {
112
+ "0.5B": ["Guided rewrite (0.5B)", "Math (0.5B)", "Tutorial (0.5B)"],
113
+ "1.7B": ["Guided rewrite (1.7B)", "Math (1.7B)", "Tutorial (1.7B)"],
114
+ "2.9B": ["Guided rewrite (2.9B)", "Math (2.9B)", "Tutorial (2.9B)"],
115
+ "6.2B": ["Guided rewrite (6.2B)", "Math (6.2B)", "Tutorial (6.2B)"]
116
+ }
117
+ },
118
+ setups: {
119
+ "Guided rewrite (0.5B)": {
120
+ datasets: {
121
+ "mix-fw_edu_hq-guided_rewrite_original_270m_hq-0.5b": "Gemma-3 270M",
122
+ "mix-fw_edu_hq-guided_rewrite_original_1b_hq-0.5b": "Gemma-3 1B",
123
+ "mix-fw_edu_hq-guided_rewrite_original_4b_hq-0.5b": "Gemma-3 4B",
124
+ "mix-fw_edu_hq-guided_rewrite_original_12b_hq-0.5b": "Gemma-3 12B",
125
+ "mix-fw_edu_hq-guided_rewrite_original_27b_hq-0.5b": "Gemma-3 27B",
126
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
127
+ }
128
+ },
129
+ "Guided rewrite (1.7B)": {
130
+ datasets: {
131
+ "mix-fw_edu_hq-guided_rewrite_original_270m_hq": "Gemma-3 270M",
132
+ "mix-fw_edu_hq-guided_rewrite_original_1b_hq": "Gemma-3 1B",
133
+ "mix-fw_edu_hq-guided_rewrite_original_4b_hq": "Gemma-3 4B",
134
+ "mix-fw_edu_hq-guided_rewrite_original_12b_hq": "Gemma-3 12B",
135
+ "mix-fw_edu_hq-guided_rewrite_original_27b_hq": "Gemma-3 27B",
136
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
137
+ }
138
+ },
139
+ "Guided rewrite (2.9B)": {
140
+ datasets: {
141
+ "mix-fw_edu_hq-guided_rewrite_original_270m_hq-2.9b": "Gemma-3 270M",
142
+ "mix-fw_edu_hq-guided_rewrite_original_1b_hq-2.9b": "Gemma-3 1B",
143
+ "mix-fw_edu_hq-guided_rewrite_original_4b_hq-2.9b": "Gemma-3 4B",
144
+ "mix-fw_edu_hq-guided_rewrite_original_12b_hq-2.9b": "Gemma-3 12B",
145
+ "mix-fw_edu_hq-guided_rewrite_original_27b_hq-2.9b": "Gemma-3 27B",
146
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
147
+ }
148
+ },
149
+ "Guided rewrite (6.2B)": {
150
+ datasets: {
151
+ "mix-fw_edu_hq-guided_rewrite_original_270m_hq-6.2b": "Gemma-3 270M",
152
+ "mix-fw_edu_hq-guided_rewrite_original_1b_hq-6.2b": "Gemma-3 1B",
153
+ "mix-fw_edu_hq-guided_rewrite_original_4b_hq-6.2b": "Gemma-3 4B",
154
+ "mix-fw_edu_hq-guided_rewrite_original_12b_hq-6.2b": "Gemma-3 12B",
155
+ "mix-fw_edu_hq-guided_rewrite_original_27b_hq-6.2b": "Gemma-3 27B",
156
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
157
+ }
158
+ },
159
+ "Math (0.5B)": {
160
+ datasets: {
161
+ "mix-fw_edu_hq-math_270m_hq-0.5b": "Gemma-3 270M",
162
+ "mix-fw_edu_hq-math_1b_hq-0.5b": "Gemma-3 1B",
163
+ "mix-fw_edu_hq-math_4b_hq-0.5b": "Gemma-3 4B",
164
+ "mix-fw_edu_hq-math_12b_hq-0.5b": "Gemma-3 12B",
165
+ "mix-fw_edu_hq-math_27b_hq-0.5b": "Gemma-3 27B",
166
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
167
+ }
168
+ },
169
+ "Math (1.7B)": {
170
+ datasets: {
171
+ "mix-fw_edu_hq-math_270m_hq": "Gemma-3 270M",
172
+ "mix-fw_edu_hq-math_1b_hq": "Gemma-3 1B",
173
+ "mix-fw_edu_hq-math_4b_hq": "Gemma-3 4B",
174
+ "mix-fw_edu_hq-math_12b_hq": "Gemma-3 12B",
175
+ "mix-fw_edu_hq-math_27b_hq": "Gemma-3 27B",
176
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
177
+ }
178
+ },
179
+ "Math (2.9B)": {
180
+ datasets: {
181
+ "mix-fw_edu_hq-math_270m_hq-2.9b": "Gemma-3 270M",
182
+ "mix-fw_edu_hq-math_1b_hq-2.9b": "Gemma-3 1B",
183
+ "mix-fw_edu_hq-math_4b_hq-2.9b": "Gemma-3 4B",
184
+ "mix-fw_edu_hq-math_12b_hq-2.9b": "Gemma-3 12B",
185
+ "mix-fw_edu_hq-math_27b_hq-2.9b": "Gemma-3 27B",
186
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
187
+ }
188
+ },
189
+ "Math (6.2B)": {
190
+ datasets: {
191
+ "mix-fw_edu_hq-math_270m_hq-6.2b": "Gemma-3 270M",
192
+ "mix-fw_edu_hq-math_1b_hq-6.2b": "Gemma-3 1B",
193
+ "mix-fw_edu_hq-math_4b_hq-6.2b": "Gemma-3 4B",
194
+ "mix-fw_edu_hq-math_12b_hq-6.2b": "Gemma-3 12B",
195
+ "mix-fw_edu_hq-math_27b_hq-6.2b": "Gemma-3 27B",
196
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
197
+ }
198
+ },
199
+ "Tutorial (0.5B)": {
200
+ datasets: {
201
+ "mix-fw_edu_hq-tutorial_270m_hq-0.5b": "Gemma-3 270M",
202
+ "mix-fw_edu_hq-tutorial_1b_hq-0.5b": "Gemma-3 1B",
203
+ "mix-fw_edu_hq-tutorial_4b_hq-0.5b": "Gemma-3 4B",
204
+ "mix-fw_edu_hq-tutorial_12b_hq-0.5b": "Gemma-3 12B",
205
+ "mix-fw_edu_hq-tutorial_27b_hq-0.5b": "Gemma-3 27B",
206
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
207
+ }
208
+ },
209
+ "Tutorial (1.7B)": {
210
+ datasets: {
211
+ "mix-fw_edu_hq-tutorial_270m_hq": "Gemma-3 270M",
212
+ "mix-fw_edu_hq-tutorial_1b_hq": "Gemma-3 1B",
213
+ "mix-fw_edu_hq-tutorial_4b_hq": "Gemma-3 4B",
214
+ "mix-fw_edu_hq-tutorial_12b_hq": "Gemma-3 12B",
215
+ "mix-fw_edu_hq-tutorial_27b_hq": "Gemma-3 27B",
216
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
217
+ }
218
+ },
219
+ "Tutorial (2.9B)": {
220
+ datasets: {
221
+ "mix-fw_edu_hq-tutorial_270m_hq-2.9b": "Gemma-3 270M",
222
+ "mix-fw_edu_hq-tutorial_1b_hq-2.9b": "Gemma-3 1B",
223
+ "mix-fw_edu_hq-tutorial_4b_hq-2.9b": "Gemma-3 4B",
224
+ "mix-fw_edu_hq-tutorial_12b_hq-2.9b": "Gemma-3 12B",
225
+ "mix-fw_edu_hq-tutorial_27b_hq-2.9b": "Gemma-3 27B",
226
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
227
+ }
228
+ },
229
+ "Tutorial (6.2B)": {
230
+ datasets: {
231
+ "mix-fw_edu_hq-tutorial_270m_hq-6.2b": "Gemma-3 270M",
232
+ "mix-fw_edu_hq-tutorial_1b_hq-6.2b": "Gemma-3 1B",
233
+ "mix-fw_edu_hq-tutorial_4b_hq-6.2b": "Gemma-3 4B",
234
+ "mix-fw_edu_hq-tutorial_12b_hq-6.2b": "Gemma-3 12B",
235
+ "mix-fw_edu_hq-tutorial_27b_hq-6.2b": "Gemma-3 27B",
236
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
237
+ }
238
+ }
239
+ }
240
+ }}
241
+ />
242
+
243
+ **A small student squashes differences between generators.** At 0.5B, the spread across generator sizes is about half of what we see at 2.9B (on [guided_rewrite](#guided_rewrite_original): 0.012 vs 0.024 macro spread at 10k steps). Bump the student to 2.9B and a clear ranking appears: 270M lowest, 1B in the middle, larger generators on top (12B often wins on [guided_rewrite](#guided_rewrite_original) and [tutorial](#tutorial)). Going further to 6.2B doesn't help much: the total spread stays about the same, but the ordering among large generators gets noisier (4B can edge out 12B).
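For reference, the macro spread quoted here is simply the gap between the best and worst generator at a fixed training step. A minimal sketch (the scores below are hypothetical placeholders, not the actual sweep results):

```javascript
// Macro spread: gap between best and worst generator at one training step.
// The scores here are hypothetical placeholders, not real sweep numbers.
function macroSpread(scoresByGenerator) {
  const vals = Object.values(scoresByGenerator);
  return Math.max(...vals) - Math.min(...vals);
}

const hypothetical = { "270M": 0.131, "1B": 0.140, "4B": 0.143, "12B": 0.145, "27B": 0.143 };
console.log(macroSpread(hypothetical).toFixed(3)); // "0.014"
```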
244
+
245
+ **Math is the exception.** At 0.5B, the 1B generator is actually the *best* in the sweep, beating 12B and 27B. That lead fades at larger student sizes (12B takes over at 2.9B and 6.2B), but 1B stays surprisingly competitive. Switch the chart to GSM8K and the pattern gets even sharper: [math](#math) data from the small Gemma is unusually strong for how cheap it is.
246
+
247
+ **2.9B is the sweet spot.** It cleanly separates 270M, 1B, and larger generators without the extra cost of 6.2B, which barely widens the spread.
248
+
249
+ <Sidenote>
250
+ We ran this student sweep later in the project. The earlier experiments still use the 1.7B student because those runs were planned or launched before these results existed.
251
+ </Sidenote>
252
+
253
+
254
+ **The 1.7B student hid differences above 1B.** On [guided_rewrite](#guided_rewrite_original), the gap between 1B and the best 4B+ generator is just +0.009 at the 1.7B student, easy to write off as noise. At 2.9B that same gap jumps to +0.017, at 6.2B it's +0.013. [Tutorial](#tutorial) tells the same story (+0.004 β†’ +0.006 β†’ +0.014). So "bigger generators don't help" was partly the 1.7B student squashing those differences. **With a bigger student, three tiers show up:** 270M is clearly worst, 1B sits in the middle, and 4B+ generators form a top group. The gap from 1B to the top is real, just smaller than the jump from 270M to 1B.
255
+
256
+ **Bigger students get more out of the same data.** @rewire report the same pattern: their rewritten data adds +1.0pp at 1B, +1.3pp at 3B, and +2.5pp at 7B over filtered web data alone. We see it too: average macro score climbs from 0.109 (0.5B student) to 0.143 (1.7B) to 0.150 (2.9B) to 0.157 (6.2B), and the generator spread roughly doubles from 0.5B to 2.9B.
257
+
258
+ All of the above is about *who* generates the data and *how big* the student is. But what about the data itself? Let's start with the simplest property: how long is the output?
259
 
260
  ### Do Chatty Models Make Better Data?
261
 
 
343
  <Note title="Summary: Analyses" variant="info">
344
  **Cost**: Small models with simple prompts dominate the Pareto frontier. Invest in prompt design, not model size.<br/>
345
  **Quality scores**: Neither edu-score nor DCLM-score reliably predicts downstream performance for synthetic data. There is no shortcut to training and evaluating.<br/>
346
+ **Proxy model size**: A 2.9B student reveals three tiers (270M {'<'} 1B {'<'} 4B+) that the 1.7B student compressed. Generator gains above 1B are real but smaller than student-side gains. Student scale is the bigger lever.<br/>
347
  **Verbosity**: Output length has no meaningful relationship with performance. What matters is content, not compression ratio.<br/>
348
  **Diversity**: Template collapse hurts more than noisy outputs. A messier model that produces varied text can outperform a polished one that repeats the same template.
349
  </Note>
app/src/content/chapters/7-conclusions.mdx CHANGED
@@ -13,7 +13,7 @@ Beyond faster generation, we answered several questions about best practices but
13
  - **Generation parameters**: What influence do temperature, `top_p`, and other sampling settings have on rephrasing quality?
14
  - **Context extension**: Does chunked rollouts context extension during mid-training improve downstream performance?
15
  - **Best-of-N filtering**: Can we generate multiple rollouts per example and score them to keep only the best one?
16
- - **Scaling to larger models**: REWIRE [@rewire] reports larger gains for bigger models trained on their data. Can we reproduce this?
17
  - **Automatic prompt optimization**: Does prompt optimization with tools like DSPy [@dspy] improve rephrasing performance?
18
  - **Longer pretraining**: Our ablations trained for 21B tokens. Do the same findings hold at 100B+ token scales, and do prompt rankings shift with longer training?
19
  - **Source filtering**: Should we filter documents before or after rephrasing? For instance, applying a math prompt to non-mathematical documents likely wastes compute and adds noise.
 
13
  - **Generation parameters**: What influence do temperature, `top_p`, and other sampling settings have on rephrasing quality?
14
  - **Context extension**: Does chunked rollouts context extension during mid-training improve downstream performance?
15
  - **Best-of-N filtering**: Can we generate multiple rollouts per example and score them to keep only the best one?
16
+ - **Scaling to larger models**: Our student sweep (0.5B to 6.2B) confirms that larger students extract more value from synthetic data, consistent with @rewire, and reveals generator differences above 1B that smaller students hide. Training at even larger scales (10B+) could amplify these gaps further.
17
  - **Automatic prompt optimization**: Does prompt optimization with tools like DSPy [@dspy] improve rephrasing performance?
18
  - **Longer pretraining**: Our ablations trained for 21B tokens. Do the same findings hold at 100B+ token scales, and do prompt rankings shift with longer training?
19
  - **Source filtering**: Should we filter documents before or after rephrasing? For instance, applying a math prompt to non-mathematical documents likely wastes compute and adds noise.
app/src/content/chapters/8-appendix.mdx CHANGED
@@ -2,7 +2,7 @@
2
 
3
  ### Details on the experiments
4
 
5
- For our ablations we train a 1.2B parameter language model using a Qwen2-style [@qwen2] architecture with 28 layers, a hidden dimension of 2048, 16 attention heads with 8 key-value heads (grouped-query attention [@gqa]), and an intermediate size of 6144. The model uses the Llama 3.2 [@llama3] tokenizer (`hynky/Llama-3.2-1B-no-bos`) with a vocabulary size of 128,256 tokens. We train on 64 NVIDIA H100 80GB GPUs across 8 nodes with pure data parallelism (DP=64), a global batch size of 512, and a sequence length of 4,096 tokens, which accumulates to approximately 21 billion tokens over 10,000 steps. We use the AdamW [@adamw] optimizer with a learning rate of 5×10⁻⁴, β₁=0.9, β₂=0.95, weight decay of 0.1, and gradient clipping at 1.0. All training uses bfloat16 precision with Flash Attention 2 [@flashattention2], fused operations (RMS normalization and rotary embeddings [@rope]), and document masking to prevent cross-document attention. We aim to rephrase at least 10B tokens per experiment, but because completion lengths vary wildly across prompts we sometimes fall short; in those cases we train on some of the data twice.
6
 
7
  ### Prompts
8
 
 
2
 
3
  ### Details on the experiments
4
 
5
+ For our ablations we train a 1.7B parameter language model using a Qwen2-style [@qwen2] architecture with 28 layers, a hidden dimension of 2048, 16 attention heads with 8 key-value heads (grouped-query attention [@gqa]), and an intermediate size of 6144. The model uses the Llama 3.2 [@llama3] tokenizer (`hynky/Llama-3.2-1B-no-bos`) with a vocabulary size of 128,256 tokens. We train on 64 NVIDIA H100 80GB GPUs across 8 nodes with pure data parallelism (DP=64), a global batch size of 512, and a sequence length of 4,096 tokens, which accumulates to approximately 21 billion tokens over 10,000 steps. We use the AdamW [@adamw] optimizer with a learning rate of 5×10⁻⁴, β₁=0.9, β₂=0.95, weight decay of 0.1, and gradient clipping at 1.0. All training uses bfloat16 precision with Flash Attention 2 [@flashattention2], fused operations (RMS normalization and rotary embeddings [@rope]), and document masking to prevent cross-document attention. We aim to rephrase at least 10B tokens per experiment, but because completion lengths vary wildly across prompts we sometimes fall short; in those cases we train on some of the data twice.
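The ~21 billion token figure follows directly from the stated batch size, sequence length, and step count; a quick sanity check:

```javascript
// Token budget implied by the training settings quoted above.
const globalBatch = 512;   // sequences per optimizer step
const seqLen = 4096;       // tokens per sequence
const steps = 10000;       // total optimizer steps
const totalTokens = globalBatch * seqLen * steps;
console.log(totalTokens);  // 20971520000, i.e. ≈ 21 billion tokens
```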
6
 
7
  ### Prompts
8
 
app/src/content/embeds/d3-benchmark-comparison.html CHANGED
@@ -9,6 +9,13 @@
9
  // full form: display is required, rest optional
10
  },
11
  "setups": { "Setup Label": { "datasets": {...} }, ... }, // optional, multi-setup mode with dropdown + average
 
12
  "defaultMetric": "agg_score_macro", // optional, default: "agg_score_macro"
13
  "defaultView": "bar", // optional, "bar" | "line", default: "bar"
14
  "defaultSetup": "average", // optional, setup name or "average", default: "average" when β‰₯2 setups
@@ -236,10 +243,36 @@
236
  // ─── SETUP SUPPORT ───
237
  const SETUPS = cfg.setups || null;
238
  const setupNames = SETUPS ? Object.keys(SETUPS) : [];
239
- const AVG_SETUP_KEY = 'Average (all setups)';
240
  const HIDE_AVERAGE = !!cfg.hideAverage;
241
- const defaultSetupCfg = cfg.defaultSetup || (setupNames.length >= 2 && !HIDE_AVERAGE ? 'average' : null);
242
- let currentSetup = SETUPS ? (defaultSetupCfg === 'average' ? AVG_SETUP_KEY : (defaultSetupCfg && setupNames.includes(defaultSetupCfg) ? defaultSetupCfg : setupNames[0])) : null;
 
243
  let DATASETS = SETUPS ? (currentSetup === AVG_SETUP_KEY ? {} : normalizeDatasets(SETUPS[currentSetup].datasets)) : normalizeDatasets(cfg.datasets);
244
  let avgDatasets = {};
245
  let parsedData = [];
@@ -344,9 +377,10 @@
344
  }
345
 
346
  function computeAverageData(rawData) {
347
- if (!SETUPS || setupNames.length < 2) return { data: [], datasets: {} };
 
348
  const displayToRaws = {};
349
- for (const sName of setupNames) {
350
  const ds = normalizeDatasets(SETUPS[sName].datasets);
351
  for (const [raw, opts] of Object.entries(ds)) {
352
  if (!displayToRaws[opts.display]) displayToRaws[opts.display] = [];
@@ -354,7 +388,7 @@
354
  }
355
  }
356
  const fullDisplay = Object.entries(displayToRaws)
357
- .filter(([, raws]) => raws.length >= setupNames.length);
358
  const byRunStep = {};
359
  for (const row of rawData) byRunStep[row[RUN_COL] + '|' + row[STEP_COL]] = row;
360
  const steps = Array.from(new Set(rawData.map(r => +r[STEP_COL]))).sort((a, b) => a - b);
@@ -364,7 +398,7 @@
364
  for (const [display, raws] of fullDisplay) {
365
  const avgRaw = '__avg__' + display.replace(/[^a-zA-Z0-9]/g, '_');
366
  // Merge options from first setup that has this display name
367
- const firstOpts = Object.values(normalizeDatasets(SETUPS[setupNames[0]].datasets)).find(o => o.display === display) || {};
368
  dsMap[avgRaw] = { display, ...firstOpts };
369
  for (const step of steps) {
370
  const rows = raws.map(r => byRunStep[r + '|' + step]).filter(Boolean);
@@ -381,6 +415,14 @@
381
  return { data: result, datasets: dsMap };
382
  }
383
 
384
  function switchSetup(name) {
385
  currentSetup = name;
386
  if (name === AVG_SETUP_KEY) {
@@ -388,7 +430,6 @@
388
  } else {
389
  DATASETS = normalizeDatasets(SETUPS[name].datasets);
390
  }
391
- // Re-add baselines from any setup
392
  for (const sName of setupNames) {
393
  const ds = normalizeDatasets(SETUPS[sName].datasets);
394
  for (const [raw, opts] of Object.entries(ds)) {
@@ -404,6 +445,33 @@
404
  buildLegend();
405
  }
406
 
407
  function showTip(html, x, y) {
408
  tipInner.innerHTML = html;
409
  const tipW = tip.offsetWidth || 180;
@@ -870,16 +938,32 @@
870
  function buildUI() {
871
  const controls = document.createElement('div'); controls.className = 'controls';
872
 
873
  if (SETUPS && setupNames.length > 0) {
874
  const setupGroup = document.createElement('div'); setupGroup.className = 'control-group';
875
  const setupLabel = document.createElement('label'); setupLabel.setAttribute('for', 'setup-' + uid); setupLabel.textContent = 'Setup';
876
  const setupSelect = document.createElement('select'); setupSelect.id = 'setup-' + uid;
877
- setupNames.forEach(name => {
878
- const opt = document.createElement('option'); opt.value = name; opt.textContent = name;
 
879
  if (name === currentSetup) opt.selected = true;
880
  setupSelect.appendChild(opt);
881
  });
882
- if (setupNames.length >= 2 && !HIDE_AVERAGE) {
883
  const avgOpt = document.createElement('option'); avgOpt.value = AVG_SETUP_KEY; avgOpt.textContent = AVG_SETUP_KEY;
884
  if (currentSetup === AVG_SETUP_KEY) avgOpt.selected = true;
885
  setupSelect.appendChild(avgOpt);
@@ -989,7 +1073,7 @@
989
  const text = await fetchFirstAvailable(csvPaths);
990
  const parsed = d3.csvParse(text);
991
  parsedData = parsed;
992
- if (SETUPS && setupNames.length >= 2 && !HIDE_AVERAGE) {
993
  const avg = computeAverageData(parsed);
994
  avgDatasets = avg.datasets;
995
  const hasAvgData = Object.values(avgDatasets).some(o => !o.baseline);
@@ -1000,7 +1084,7 @@
1000
  } else {
1001
  const sel = container.querySelector('#setup-' + uid);
1002
  if (sel) { const o = sel.querySelector(`option[value="${AVG_SETUP_KEY}"]`); if (o) o.remove(); }
1003
- if (currentSetup === AVG_SETUP_KEY) { currentSetup = setupNames[0]; DATASETS = normalizeDatasets(SETUPS[currentSetup].datasets); if (sel) sel.value = currentSetup; }
1004
  }
1005
  }
1006
  filterData();
 
9
  // full form: display is required, rest optional
10
  },
11
  "setups": { "Setup Label": { "datasets": {...} }, ... }, // optional, multi-setup mode with dropdown + average
12
+ "setupDimension": { // optional, adds a second dropdown that groups setups
13
+ "label": "Student", // dropdown label
14
+ "groups": { // maps dimension values β†’ lists of setup keys
15
+ "0.5B student": ["Math (0.5B)", "Tutorial (0.5B)"], // only these setups are shown when this group is active
16
+ "1.7B student": ["Math (1.7B)", "Tutorial (1.7B)"]
17
+ }
18
+ },
19
  "defaultMetric": "agg_score_macro", // optional, default: "agg_score_macro"
20
  "defaultView": "bar", // optional, "bar" | "line", default: "bar"
21
  "defaultSetup": "average", // optional, setup name or "average", default: "average" when β‰₯2 setups
 
  // ─── SETUP SUPPORT ───
  const SETUPS = cfg.setups || null;
  const setupNames = SETUPS ? Object.keys(SETUPS) : [];

  const HIDE_AVERAGE = !!cfg.hideAverage;
+
+ // ─── SETUP DIMENSION (optional second dropdown that groups setups) ───
+ const SETUP_DIM = cfg.setupDimension || null;
+ const dimLabel = SETUP_DIM ? (SETUP_DIM.label || 'Group') : '';
+ const dimGroups = SETUP_DIM ? (SETUP_DIM.groups || {}) : {};
+ const dimensions = Object.keys(dimGroups);
+ let currentDimension = dimensions.length >= 2 ? dimensions[0] : null;
+ // Reverse lookup: setup name → dimension value
+ const setupToDim = {};
+ for (const [dim, names] of Object.entries(dimGroups)) {
+ for (const n of names) setupToDim[n] = dim;
+ }
+ function filteredSetupNames() {
+ if (!currentDimension) return setupNames;
+ return dimGroups[currentDimension] || [];
+ }
+ function setupDisplayName(name) {
+ if (!currentDimension) return name;
+ // Strip " (dimensionValue)" suffix from display name when a dimension is active
+ for (const dim of dimensions) {
+ const suffix = ' (' + dim + ')';
+ if (name.endsWith(suffix)) return name.slice(0, -suffix.length);
+ }
+ return name;
+ }
+
+ const AVG_SETUP_KEY = 'Average (all setups)';
+ const defaultSetupCfg = cfg.defaultSetup || (filteredSetupNames().length >= 2 && !HIDE_AVERAGE ? 'average' : null);
+ let currentSetup = SETUPS ? (defaultSetupCfg === 'average' ? AVG_SETUP_KEY : (defaultSetupCfg && setupNames.includes(defaultSetupCfg) ? defaultSetupCfg : filteredSetupNames()[0])) : null;
  let DATASETS = SETUPS ? (currentSetup === AVG_SETUP_KEY ? {} : normalizeDatasets(SETUPS[currentSetup].datasets)) : normalizeDatasets(cfg.datasets);
  let avgDatasets = {};
  let parsedData = [];
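In isolation, the new dimension helpers reduce to the standalone sketch below (group and setup names are made up). Note that the suffix strip in `setupDisplayName` only fires when a setup name literally ends with `" (<dimension value>)"`:

```javascript
// Standalone sketch of the dimension helpers added above,
// using made-up group and setup names.
const dimGroups = {
  "small": ["Math (small)", "Tutorial (small)"],
  "large": ["Math (large)", "Tutorial (large)"],
};
const dimensions = Object.keys(dimGroups);
let currentDimension = "small";

// Reverse lookup: setup name -> dimension value.
const setupToDim = {};
for (const [dim, names] of Object.entries(dimGroups)) {
  for (const n of names) setupToDim[n] = dim;
}

// Only the setups of the active group are visible.
function filteredSetupNames() {
  return currentDimension ? (dimGroups[currentDimension] || []) : Object.keys(setupToDim);
}

// Strip the " (<dimension value>)" suffix for display.
function setupDisplayName(name) {
  if (!currentDimension) return name;
  for (const dim of dimensions) {
    const suffix = " (" + dim + ")";
    if (name.endsWith(suffix)) return name.slice(0, -suffix.length);
  }
  return name;
}
```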
 
  }

  function computeAverageData(rawData) {
+ const activeSetups = filteredSetupNames();
+ if (!SETUPS || activeSetups.length < 2) return { data: [], datasets: {} };
  const displayToRaws = {};
+ for (const sName of activeSetups) {
  const ds = normalizeDatasets(SETUPS[sName].datasets);
  for (const [raw, opts] of Object.entries(ds)) {
  if (!displayToRaws[opts.display]) displayToRaws[opts.display] = [];
 
  }
  }
  const fullDisplay = Object.entries(displayToRaws)
+ .filter(([, raws]) => raws.length >= activeSetups.length);
  const byRunStep = {};
  for (const row of rawData) byRunStep[row[RUN_COL] + '|' + row[STEP_COL]] = row;
  const steps = Array.from(new Set(rawData.map(r => +r[STEP_COL]))).sort((a, b) => a - b);
 
  for (const [display, raws] of fullDisplay) {
  const avgRaw = '__avg__' + display.replace(/[^a-zA-Z0-9]/g, '_');
  // Merge options from first setup that has this display name
+ const firstOpts = Object.values(normalizeDatasets(SETUPS[activeSetups[0]].datasets)).find(o => o.display === display) || {};
  dsMap[avgRaw] = { display, ...firstOpts };
  for (const step of steps) {
  const rows = raws.map(r => byRunStep[r + '|' + step]).filter(Boolean);
 
  return { data: result, datasets: dsMap };
  }

+ function recomputeAverage() {
+ const avg = computeAverageData(parsedData.filter(r => !r[RUN_COL].startsWith('__avg__')));
+ avgDatasets = avg.datasets;
+ const origData = parsedData.filter(r => !r[RUN_COL].startsWith('__avg__'));
+ parsedData = origData.concat(avg.data);
+ parsedData.columns = origData.columns || (origData[0] ? Object.keys(origData[0]) : []);
+ }
+
  function switchSetup(name) {
  currentSetup = name;
  if (name === AVG_SETUP_KEY) {
 
  } else {
  DATASETS = normalizeDatasets(SETUPS[name].datasets);
  }

  for (const sName of setupNames) {
  const ds = normalizeDatasets(SETUPS[sName].datasets);
  for (const [raw, opts] of Object.entries(ds)) {
 
  buildLegend();
  }

+ function switchDimension(dim) {
+ currentDimension = dim;
+ recomputeAverage();
+ // Rebuild setup dropdown options
+ const sel = container.querySelector('#setup-' + uid);
+ if (sel) {
+ sel.innerHTML = '';
+ const active = filteredSetupNames();
+ active.forEach(name => {
+ const opt = document.createElement('option'); opt.value = name; opt.textContent = setupDisplayName(name);
+ sel.appendChild(opt);
+ });
+ if (active.length >= 2 && !HIDE_AVERAGE) {
+ const avgOpt = document.createElement('option'); avgOpt.value = AVG_SETUP_KEY; avgOpt.textContent = AVG_SETUP_KEY;
+ sel.appendChild(avgOpt);
+ }
+ // Default to average if available, else first setup
+ if (active.length >= 2 && !HIDE_AVERAGE) {
+ sel.value = AVG_SETUP_KEY;
+ switchSetup(AVG_SETUP_KEY);
+ } else if (active.length) {
+ sel.value = active[0];
+ switchSetup(active[0]);
+ }
+ }
+ }
+
  function showTip(html, x, y) {
  tipInner.innerHTML = html;
  const tipW = tip.offsetWidth || 180;
 
  function buildUI() {
  const controls = document.createElement('div'); controls.className = 'controls';

+ // Dimension dropdown — shown before Setup when setupDimension is configured
+ if (dimensions.length >= 2) {
+ const dimGroup = document.createElement('div'); dimGroup.className = 'control-group';
+ const dimLabelEl = document.createElement('label'); dimLabelEl.setAttribute('for', 'dim-' + uid); dimLabelEl.textContent = dimLabel;
+ const dimSelect = document.createElement('select'); dimSelect.id = 'dim-' + uid;
+ dimensions.forEach(dim => {
+ const opt = document.createElement('option'); opt.value = dim; opt.textContent = dim;
+ if (dim === currentDimension) opt.selected = true;
+ dimSelect.appendChild(opt);
+ });
+ dimSelect.addEventListener('change', () => { switchDimension(dimSelect.value); });
+ dimGroup.appendChild(dimLabelEl); dimGroup.appendChild(dimSelect);
+ controls.appendChild(dimGroup);
+ }
+
  if (SETUPS && setupNames.length > 0) {
  const setupGroup = document.createElement('div'); setupGroup.className = 'control-group';
  const setupLabel = document.createElement('label'); setupLabel.setAttribute('for', 'setup-' + uid); setupLabel.textContent = 'Setup';
  const setupSelect = document.createElement('select'); setupSelect.id = 'setup-' + uid;
+ const active = filteredSetupNames();
+ active.forEach(name => {
+ const opt = document.createElement('option'); opt.value = name; opt.textContent = setupDisplayName(name);
  if (name === currentSetup) opt.selected = true;
  setupSelect.appendChild(opt);
  });
+ if (active.length >= 2 && !HIDE_AVERAGE) {
  const avgOpt = document.createElement('option'); avgOpt.value = AVG_SETUP_KEY; avgOpt.textContent = AVG_SETUP_KEY;
  if (currentSetup === AVG_SETUP_KEY) avgOpt.selected = true;
  setupSelect.appendChild(avgOpt);
 
  const text = await fetchFirstAvailable(csvPaths);
  const parsed = d3.csvParse(text);
  parsedData = parsed;
+ if (SETUPS && filteredSetupNames().length >= 2 && !HIDE_AVERAGE) {
  const avg = computeAverageData(parsed);
  avgDatasets = avg.datasets;
  const hasAvgData = Object.values(avgDatasets).some(o => !o.baseline);
 
  } else {
  const sel = container.querySelector('#setup-' + uid);
  if (sel) { const o = sel.querySelector(`option[value="${AVG_SETUP_KEY}"]`); if (o) o.remove(); }
+ if (currentSetup === AVG_SETUP_KEY) { currentSetup = filteredSetupNames()[0]; DATASETS = normalizeDatasets(SETUPS[currentSetup].datasets); if (sel) sel.value = currentSetup; }
  }
  }
  filterData();