joelniklaus HF Staff committed on
Commit
9894e4e
·
1 Parent(s): 8be4608

added analysis with differently sized student models

app/src/content/assets/data/benchmark-results.csv CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:dc7f86e2cd5b311eb1fec66972254890034acd866037b5096f1551ef877fe72e
3
- size 1598658
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:531d5bf89cf3ca78b3813c791ae0096d816c9f50b16e851f4c9e9ed03b576462
3
+ size 2000388
app/src/content/chapters/2-setup.mdx CHANGED
@@ -62,7 +62,7 @@ With the datasets defined, we need a consistent way to tell whether one configur
62
 
63
  ### How We Measure Success
64
 
65
- To evaluate each configuration, we follow the ablation methodology from FineWeb [@fineweb]: train a 1.2B parameter language model with a Qwen2-style architecture [@qwen2] (details in the [Appendix](#details-on-the-experiments)) on 20B tokens and evaluate on 12 benchmarks across six categories using 3-shot prompting with a single seed:
66
  <Sidenote>
67
  Since our model is small and trained on only 20B tokens, we use the **cloze format** (CF) for most tasks rather than standard multiple-choice. CF frames evaluation as next-token prediction, which gives more reliable signal for smaller models that may struggle with instruction following or multiple-choice formatting.
68
  </Sidenote>
 
62
 
63
  ### How We Measure Success
64
 
65
+ To evaluate each configuration, we follow the ablation methodology from FineWeb [@fineweb]: train a 1.7B parameter language model with a Qwen2-style architecture [@qwen2] (details in the [Appendix](#details-on-the-experiments)) on 20B tokens and evaluate on 12 benchmarks across six categories using 3-shot prompting with a single seed:
66
  <Sidenote>
67
  Since our model is small and trained on only 20B tokens, we use the **cloze format** (CF) for most tasks rather than standard multiple-choice. CF frames evaluation as next-token prediction, which gives more reliable signal for smaller models that may struggle with instruction following or multiple-choice formatting.
68
  </Sidenote>
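The cloze-format scoring described in the sidenote can be sketched in a few lines: score each answer continuation by its (length-normalized) log-likelihood and take the argmax, instead of asking the model to emit a choice letter. This is a minimal illustration with a toy stand-in for the model's log-probability function, not our evaluation harness:

```javascript
// Cloze-format (CF) scoring sketch: pick the continuation with the highest
// length-normalized log-likelihood. logprobFn is a stand-in for a real model.
function clozeScore(context, choices, logprobFn) {
  let best = null, bestScore = -Infinity;
  for (const choice of choices) {
    const score = logprobFn(context, choice) / choice.length; // normalize by length
    if (score > bestScore) { bestScore = score; best = choice; }
  }
  return best;
}

// Toy stand-in: pretend the model strongly prefers "Paris".
const toyLogprob = (ctx, choice) => (choice === "Paris" ? -2.0 : -8.0);
console.log(clozeScore("The capital of France is", ["Paris", "Lyon", "Nice"], toyLogprob)); // "Paris"
```

Length normalization matters here: without it, shorter continuations would be systematically favored.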
app/src/content/chapters/3-experiments.mdx CHANGED
@@ -186,7 +186,9 @@ The one exception is [guided_rewrite](#guided_rewrite_original), where the 4B mo
186
  This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
187
  The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), bigger models don't buy you better synthetic data. This aligns with findings from @demystifyingsynth, who showed that scaling generators from 8B to 70B parameters did not yield superior pretraining data, and with SwallowMath-v2 [@swallowmathv2], which reports no downstream gains on math data from scaling the rewriter from Qwen3-30B-A3B to Qwen3-235B-A22B. This is great news for cost: you can use cheap, fast models for most rephrasing tasks.
188
 
189
- That raises an interesting follow-up. REWIRE claims that you specifically need large models to salvage low-quality data. Does that hold up?
 
 
190
 
191
  #### Do we need better models for rephrasing low-quality data?
192
 
 
186
  This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
187
  The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), bigger models don't buy you better synthetic data. This aligns with findings from @demystifyingsynth, who showed that scaling generators from 8B to 70B parameters did not yield superior pretraining data, and with SwallowMath-v2 [@swallowmathv2], which reports no downstream gains on math data from scaling the rewriter from Qwen3-30B-A3B to Qwen3-235B-A22B. This is great news for cost: you can use cheap, fast models for most rephrasing tasks.
188
 
189
+ That said, those conclusions all come from a single 1.7B student model. We revisit whether a larger student would change the picture in the [Analyses](#is-our-proxy-model-too-small) section.
190
+
191
+ REWIRE claims that you specifically need large models to salvage low-quality data. Does that hold up?
192
 
193
  #### Do we need better models for rephrasing low-quality data?
194
 
app/src/content/chapters/4-analyses.mdx CHANGED
@@ -1,11 +1,13 @@
1
  import HtmlEmbed from "../../components/HtmlEmbed.astro";
2
  import Note from "../../components/Note.astro";
3
  import Wide from "../../components/Wide.astro";
 
 
4
 
5
 
6
  ## Analyses
7
 
8
- The experiments tell us *what* works. Now let's zoom out and ask *why*. We look at the cost of running these experiments, whether cheap proxy metrics can replace expensive training runs, what the rephrased outputs actually look like, and why a messier model sometimes wins.
9
 
10
  ### Is More Compute Worth It?
11
 
@@ -72,7 +74,188 @@ The correlation matrix tells us that quality scores are weak predictors, but not
72
 
73
  **Edu-scores tell the opposite story.** Most experiments *decrease* the edu-score through rephrasing, particularly those starting from high-quality sources (FineWeb-Edu-HQ has high baseline edu-scores). The edu-score classifier penalizes format changes like tables, FAQs, and math notation that our best prompts produce. This is a case where the proxy metric actively misleads: the "quality degradation" measured by edu-score corresponds to format transformations that *improve* downstream performance.
74
 
75
- So quality scores designed for filtering web data don't transfer to synthetic data. Maybe looking at the outputs more directly helps. For instance, does the length of the rephrased output tell us anything?
 
76
 
77
  ### Do Chatty Models Make Better Data?
78
 
@@ -160,6 +343,7 @@ The lesson: for pretraining data, diversity beats consistency. A model that does
160
  <Note title="Summary: Analyses" variant="info">
161
  **Cost**: Small models with simple prompts dominate the Pareto frontier. Invest in prompt design, not model size.<br/>
162
  **Quality scores**: Neither edu-score nor DCLM-score reliably predicts downstream performance for synthetic data. There is no shortcut to training and evaluating.<br/>
 
163
  **Verbosity**: Output length has no meaningful relationship with performance. What matters is content, not compression ratio.<br/>
164
  **Diversity**: Template collapse hurts more than noisy outputs. A messier model that produces varied text can outperform a polished one that repeats the same template.
165
  </Note>
 
1
  import HtmlEmbed from "../../components/HtmlEmbed.astro";
2
  import Note from "../../components/Note.astro";
3
  import Wide from "../../components/Wide.astro";
4
+ import Accordion from "../../components/Accordion.astro";
5
+ import Sidenote from "../../components/Sidenote.astro";
6
 
7
 
8
  ## Analyses
9
 
10
+ The experiments tell us *what* works. Now let's zoom out and ask *why*. We look at the cost of running these experiments, whether cheap proxy metrics can replace expensive training runs, whether our proxy model is too small to reveal quality differences, what the rephrased outputs actually look like, and why a messier model sometimes wins.
11
 
12
  ### Is More Compute Worth It?
13
 
 
74
 
75
  **Edu-scores tell the opposite story.** Most experiments *decrease* the edu-score through rephrasing, particularly those starting from high-quality sources (FineWeb-Edu-HQ has high baseline edu-scores). The edu-score classifier penalizes format changes like tables, FAQs, and math notation that our best prompts produce. This is a case where the proxy metric actively misleads: the "quality degradation" measured by edu-score corresponds to format transformations that *improve* downstream performance.
76
 
77
+ So quality scores designed for filtering web data don't transfer to synthetic data. If we can't shortcut the evaluation, we should at least make sure the evaluation itself is trustworthy. One obvious concern: all our model-size experiments used a 1.7B student. What if that student is simply too small to tell good data from great data?
78
+
79
+ ### Is Our Proxy Model Too Small?
80
+
81
+ In the [model size experiment](#does-the-model-size-matter) we found that generator size barely matters past 1B, but every one of those runs used a single 1.7B student. A student that small might cap out on all the data mixes equally, making 1B and 27B generator data look the same when a bigger student could tell them apart. To check, we trained students at four sizes on identical data mixes.
82
+
83
+ <Accordion title="Student preset sizes and training settings">
84
+
85
+ | Preset | Parameters | hidden | intermediate | tp | recompute layer | micro batch | eval batch |
86
+ |--------|---------------|--------|--------------|----|-----------------|-------------|------------|
87
+ | 0.5B | 483,714,048 | 1024 | 3072 | 1 | off | 4 | 32 |
88
+ | 1.7B | 1,672,071,168 | 2048 | 6144 | 1 | off | 2 | 16 |
89
+ | 2.9B | 2,860,792,320 | 2560 | 9216 | 1 | on | 1 | 8 |
90
+ | 6.2B | 6,162,714,624 | 4096 | 12288 | 2 | on | 1 | 4 |
91
+
92
+ `tp` is tensor-parallel width. Recompute layer toggles activation checkpointing, which trades extra compute for lower memory during training. Micro batch and eval batch are the micro-batch size during training and the batch size for evaluation runs.
93
+
94
+ </Accordion>
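The parameter counts in the table line up with the appendix architecture (28 layers, 16 query heads, 8 KV heads). A minimal sketch that reproduces the 1.7B figure, assuming tied input/output embeddings and bias-free attention projections (my inference from the numbers, not stated in the table):

```javascript
// Parameter-count sketch for a Qwen2-style decoder, assuming tied embeddings
// and no attention biases (these assumptions make the 1.7B preset match exactly).
function qwen2StyleParams({ layers, hidden, intermediate, vocab, nHeads, nKvHeads }) {
  const headDim = hidden / nHeads;
  const kvDim = nKvHeads * headDim;          // grouped-query attention: fewer KV heads
  const attn = hidden * hidden               // q_proj
             + hidden * kvDim                // k_proj
             + hidden * kvDim                // v_proj
             + hidden * hidden;              // o_proj
  const mlp = 3 * hidden * intermediate;     // gate, up, down projections
  const norms = 2 * hidden;                  // two RMSNorms per layer
  return layers * (attn + mlp + norms)
       + vocab * hidden                      // tied input/output embedding
       + hidden;                             // final RMSNorm
}

const params = qwen2StyleParams({
  layers: 28, hidden: 2048, intermediate: 6144,
  vocab: 128256, nHeads: 16, nKvHeads: 8,
});
console.log(params); // 1672071168, the 1.7B preset in the table
```

The other presets presumably vary layer and head counts as well, so the same function needs their full configs to reproduce those rows.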
95
+
96
+ {/*
97
+ <Sidenote>
98
+ Why stop at 6.2B? Flash Attention 2 [@flashattention2] in our stack can't go past hidden size 4096 with 16 attention heads without running out of memory. And as we'll see, the 6.2B student doesn't separate generators any better than the 2.9B, so there's little reason to push further.
99
+ </Sidenote>
100
+ */}
101
+
102
+ We swept Gemma-3 generators (270M through 27B) on three prompts ([guided_rewrite](#guided_rewrite_original), [math](#math), [tutorial](#tutorial)), always mixing with FineWeb-Edu-HQ. Pick a student size and prompt below:
103
+
104
+ <HtmlEmbed
105
+ id="student-capacity-generator-sweep"
106
+ src="d3-benchmark-comparison.html"
107
+ desc="Gemma-3 generator sweep at different student capacities. Use Student to switch capacity and Setup to switch prompt."
108
+ config={{
109
+ setupDimension: {
110
+ label: "Student",
111
+ groups: {
112
+ "0.5B": ["Guided rewrite (0.5B)", "Math (0.5B)", "Tutorial (0.5B)"],
113
+ "1.7B": ["Guided rewrite (1.7B)", "Math (1.7B)", "Tutorial (1.7B)"],
114
+ "2.9B": ["Guided rewrite (2.9B)", "Math (2.9B)", "Tutorial (2.9B)"],
115
+ "6.2B": ["Guided rewrite (6.2B)", "Math (6.2B)", "Tutorial (6.2B)"]
116
+ }
117
+ },
118
+ setups: {
119
+ "Guided rewrite (0.5B)": {
120
+ datasets: {
121
+ "mix-fw_edu_hq-guided_rewrite_original_270m_hq-0.5b": "Gemma-3 270M",
122
+ "mix-fw_edu_hq-guided_rewrite_original_1b_hq-0.5b": "Gemma-3 1B",
123
+ "mix-fw_edu_hq-guided_rewrite_original_4b_hq-0.5b": "Gemma-3 4B",
124
+ "mix-fw_edu_hq-guided_rewrite_original_12b_hq-0.5b": "Gemma-3 12B",
125
+ "mix-fw_edu_hq-guided_rewrite_original_27b_hq-0.5b": "Gemma-3 27B",
126
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
127
+ }
128
+ },
129
+ "Guided rewrite (1.7B)": {
130
+ datasets: {
131
+ "mix-fw_edu_hq-guided_rewrite_original_270m_hq": "Gemma-3 270M",
132
+ "mix-fw_edu_hq-guided_rewrite_original_1b_hq": "Gemma-3 1B",
133
+ "mix-fw_edu_hq-guided_rewrite_original_4b_hq": "Gemma-3 4B",
134
+ "mix-fw_edu_hq-guided_rewrite_original_12b_hq": "Gemma-3 12B",
135
+ "mix-fw_edu_hq-guided_rewrite_original_27b_hq": "Gemma-3 27B",
136
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
137
+ }
138
+ },
139
+ "Guided rewrite (2.9B)": {
140
+ datasets: {
141
+ "mix-fw_edu_hq-guided_rewrite_original_270m_hq-2.9b": "Gemma-3 270M",
142
+ "mix-fw_edu_hq-guided_rewrite_original_1b_hq-2.9b": "Gemma-3 1B",
143
+ "mix-fw_edu_hq-guided_rewrite_original_4b_hq-2.9b": "Gemma-3 4B",
144
+ "mix-fw_edu_hq-guided_rewrite_original_12b_hq-2.9b": "Gemma-3 12B",
145
+ "mix-fw_edu_hq-guided_rewrite_original_27b_hq-2.9b": "Gemma-3 27B",
146
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
147
+ }
148
+ },
149
+ "Guided rewrite (6.2B)": {
150
+ datasets: {
151
+ "mix-fw_edu_hq-guided_rewrite_original_270m_hq-6.2b": "Gemma-3 270M",
152
+ "mix-fw_edu_hq-guided_rewrite_original_1b_hq-6.2b": "Gemma-3 1B",
153
+ "mix-fw_edu_hq-guided_rewrite_original_4b_hq-6.2b": "Gemma-3 4B",
154
+ "mix-fw_edu_hq-guided_rewrite_original_12b_hq-6.2b": "Gemma-3 12B",
155
+ "mix-fw_edu_hq-guided_rewrite_original_27b_hq-6.2b": "Gemma-3 27B",
156
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
157
+ }
158
+ },
159
+ "Math (0.5B)": {
160
+ datasets: {
161
+ "mix-fw_edu_hq-math_270m_hq-0.5b": "Gemma-3 270M",
162
+ "mix-fw_edu_hq-math_1b_hq-0.5b": "Gemma-3 1B",
163
+ "mix-fw_edu_hq-math_4b_hq-0.5b": "Gemma-3 4B",
164
+ "mix-fw_edu_hq-math_12b_hq-0.5b": "Gemma-3 12B",
165
+ "mix-fw_edu_hq-math_27b_hq-0.5b": "Gemma-3 27B",
166
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
167
+ }
168
+ },
169
+ "Math (1.7B)": {
170
+ datasets: {
171
+ "mix-fw_edu_hq-math_270m_hq": "Gemma-3 270M",
172
+ "mix-fw_edu_hq-math_1b_hq": "Gemma-3 1B",
173
+ "mix-fw_edu_hq-math_4b_hq": "Gemma-3 4B",
174
+ "mix-fw_edu_hq-math_12b_hq": "Gemma-3 12B",
175
+ "mix-fw_edu_hq-math_27b_hq": "Gemma-3 27B",
176
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
177
+ }
178
+ },
179
+ "Math (2.9B)": {
180
+ datasets: {
181
+ "mix-fw_edu_hq-math_270m_hq-2.9b": "Gemma-3 270M",
182
+ "mix-fw_edu_hq-math_1b_hq-2.9b": "Gemma-3 1B",
183
+ "mix-fw_edu_hq-math_4b_hq-2.9b": "Gemma-3 4B",
184
+ "mix-fw_edu_hq-math_12b_hq-2.9b": "Gemma-3 12B",
185
+ "mix-fw_edu_hq-math_27b_hq-2.9b": "Gemma-3 27B",
186
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
187
+ }
188
+ },
189
+ "Math (6.2B)": {
190
+ datasets: {
191
+ "mix-fw_edu_hq-math_270m_hq-6.2b": "Gemma-3 270M",
192
+ "mix-fw_edu_hq-math_1b_hq-6.2b": "Gemma-3 1B",
193
+ "mix-fw_edu_hq-math_4b_hq-6.2b": "Gemma-3 4B",
194
+ "mix-fw_edu_hq-math_12b_hq-6.2b": "Gemma-3 12B",
195
+ "mix-fw_edu_hq-math_27b_hq-6.2b": "Gemma-3 27B",
196
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
197
+ }
198
+ },
199
+ "Tutorial (0.5B)": {
200
+ datasets: {
201
+ "mix-fw_edu_hq-tutorial_270m_hq-0.5b": "Gemma-3 270M",
202
+ "mix-fw_edu_hq-tutorial_1b_hq-0.5b": "Gemma-3 1B",
203
+ "mix-fw_edu_hq-tutorial_4b_hq-0.5b": "Gemma-3 4B",
204
+ "mix-fw_edu_hq-tutorial_12b_hq-0.5b": "Gemma-3 12B",
205
+ "mix-fw_edu_hq-tutorial_27b_hq-0.5b": "Gemma-3 27B",
206
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
207
+ }
208
+ },
209
+ "Tutorial (1.7B)": {
210
+ datasets: {
211
+ "mix-fw_edu_hq-tutorial_270m_hq": "Gemma-3 270M",
212
+ "mix-fw_edu_hq-tutorial_1b_hq": "Gemma-3 1B",
213
+ "mix-fw_edu_hq-tutorial_4b_hq": "Gemma-3 4B",
214
+ "mix-fw_edu_hq-tutorial_12b_hq": "Gemma-3 12B",
215
+ "mix-fw_edu_hq-tutorial_27b_hq": "Gemma-3 27B",
216
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
217
+ }
218
+ },
219
+ "Tutorial (2.9B)": {
220
+ datasets: {
221
+ "mix-fw_edu_hq-tutorial_270m_hq-2.9b": "Gemma-3 270M",
222
+ "mix-fw_edu_hq-tutorial_1b_hq-2.9b": "Gemma-3 1B",
223
+ "mix-fw_edu_hq-tutorial_4b_hq-2.9b": "Gemma-3 4B",
224
+ "mix-fw_edu_hq-tutorial_12b_hq-2.9b": "Gemma-3 12B",
225
+ "mix-fw_edu_hq-tutorial_27b_hq-2.9b": "Gemma-3 27B",
226
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
227
+ }
228
+ },
229
+ "Tutorial (6.2B)": {
230
+ datasets: {
231
+ "mix-fw_edu_hq-tutorial_270m_hq-6.2b": "Gemma-3 270M",
232
+ "mix-fw_edu_hq-tutorial_1b_hq-6.2b": "Gemma-3 1B",
233
+ "mix-fw_edu_hq-tutorial_4b_hq-6.2b": "Gemma-3 4B",
234
+ "mix-fw_edu_hq-tutorial_12b_hq-6.2b": "Gemma-3 12B",
235
+ "mix-fw_edu_hq-tutorial_27b_hq-6.2b": "Gemma-3 27B",
236
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
237
+ }
238
+ }
239
+ }
240
+ }}
241
+ />
242
+
243
+ **A small student squashes differences between generators.** At 0.5B, the spread across generator sizes is about half of what we see at 2.9B (on [guided_rewrite](#guided_rewrite_original): 0.012 vs 0.024 macro spread at 10k steps). Bump the student to 2.9B and a clear ranking appears: 270M lowest, 1B in the middle, larger generators on top (12B often wins on [guided_rewrite](#guided_rewrite_original) and [tutorial](#tutorial)). Going further to 6.2B doesn't help much: the total spread stays about the same, but the ordering among large generators gets noisier (4B can edge out 12B).
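For reference, the macro spread quoted here is simply the gap between the best and worst generator at a fixed training step. A minimal sketch (the scores below are hypothetical placeholders, not the actual sweep results):

```javascript
// Macro spread: gap between best and worst generator at one training step.
// The scores here are hypothetical placeholders, not real sweep numbers.
function macroSpread(scoresByGenerator) {
  const vals = Object.values(scoresByGenerator);
  return Math.max(...vals) - Math.min(...vals);
}

const hypothetical = { "270M": 0.131, "1B": 0.140, "4B": 0.143, "12B": 0.145, "27B": 0.143 };
console.log(macroSpread(hypothetical).toFixed(3)); // "0.014"
```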
244
+
245
+ **Math is the exception.** At 0.5B, the 1B generator is actually the *best* in the sweep, beating 12B and 27B. That lead fades at larger student sizes (12B takes over at 2.9B and 6.2B), but 1B stays surprisingly competitive. Switch the chart to GSM8K and the pattern gets even sharper: [math](#math) data from the small Gemma is unusually strong for how cheap it is.
246
+
247
+ **2.9B is the sweet spot.** It cleanly separates 270M, 1B, and larger generators without the extra cost of 6.2B, which barely widens the spread.
248
+
249
+ <Sidenote>
250
+ We ran this student sweep later in the project. The earlier experiments still use the 1.7B student because those runs were planned or launched before these results existed.
251
+ </Sidenote>
252
+
253
+
254
+ **The 1.7B student hid differences above 1B.** On [guided_rewrite](#guided_rewrite_original), the gap between 1B and the best 4B+ generator is just +0.009 at the 1.7B student, easy to write off as noise. At 2.9B that same gap jumps to +0.017, at 6.2B it's +0.013. [Tutorial](#tutorial) tells the same story (+0.004 β†’ +0.006 β†’ +0.014). So "bigger generators don't help" was partly the 1.7B student squashing those differences. **With a bigger student, three tiers show up:** 270M is clearly worst, 1B sits in the middle, and 4B+ generators form a top group. The gap from 1B to the top is real, just smaller than the jump from 270M to 1B.
255
+
256
+ **Bigger students get more out of the same data.** @rewire report the same pattern: their rewritten data adds +1.0pp at 1B, +1.3pp at 3B, and +2.5pp at 7B over filtered web data alone. We see it too: average macro score climbs from 0.109 (0.5B student) to 0.143 (1.7B) to 0.150 (2.9B) to 0.157 (6.2B), and the generator spread roughly doubles from 0.5B to 2.9B.
257
+
258
+ All of the above is about *who* generates the data and *how big* the student is. But what about the data itself? Let's start with the simplest property: how long is the output?
259
 
260
  ### Do Chatty Models Make Better Data?
261
 
 
343
  <Note title="Summary: Analyses" variant="info">
344
  **Cost**: Small models with simple prompts dominate the Pareto frontier. Invest in prompt design, not model size.<br/>
345
  **Quality scores**: Neither edu-score nor DCLM-score reliably predicts downstream performance for synthetic data. There is no shortcut to training and evaluating.<br/>
346
+ **Proxy model size**: A 2.9B student reveals three tiers (270M {'<'} 1B {'<'} 4B+) that the 1.7B student compressed. Generator gains above 1B are real but smaller than student-side gains. Student scale is the bigger lever.<br/>
347
  **Verbosity**: Output length has no meaningful relationship with performance. What matters is content, not compression ratio.<br/>
348
  **Diversity**: Template collapse hurts more than noisy outputs. A messier model that produces varied text can outperform a polished one that repeats the same template.
349
  </Note>
app/src/content/chapters/7-conclusions.mdx CHANGED
@@ -13,7 +13,7 @@ Beyond faster generation, we answered several questions about best practices but
13
  - **Generation parameters**: What influence do temperature, `top_p`, and other sampling settings have on rephrasing quality?
14
  - **Context extension**: Does chunked rollouts context extension during mid-training improve downstream performance?
15
  - **Best-of-N filtering**: Can we generate multiple rollouts per example and score them to keep only the best one?
16
- - **Scaling to larger models**: REWIRE [@rewire] reports larger gains for bigger models trained on their data. Can we reproduce this?
17
  - **Automatic prompt optimization**: Does prompt optimization with tools like DSPy [@dspy] improve rephrasing performance?
18
  - **Longer pretraining**: Our ablations trained for 21B tokens. Do the same findings hold at 100B+ token scales, and do prompt rankings shift with longer training?
19
  - **Source filtering**: Should we filter documents before or after rephrasing? For instance, applying a math prompt to non-mathematical documents likely wastes compute and adds noise.
 
13
  - **Generation parameters**: What influence do temperature, `top_p`, and other sampling settings have on rephrasing quality?
14
  - **Context extension**: Does chunked rollouts context extension during mid-training improve downstream performance?
15
  - **Best-of-N filtering**: Can we generate multiple rollouts per example and score them to keep only the best one?
16
+ - **Scaling to larger models**: Our student sweep (0.5B to 6.2B) confirms that larger students extract more value from synthetic data, consistent with @rewire, and reveals generator differences above 1B that smaller students hide. Training at even larger scales (10B+) could amplify these gaps further.
17
  - **Automatic prompt optimization**: Does prompt optimization with tools like DSPy [@dspy] improve rephrasing performance?
18
  - **Longer pretraining**: Our ablations trained for 21B tokens. Do the same findings hold at 100B+ token scales, and do prompt rankings shift with longer training?
19
  - **Source filtering**: Should we filter documents before or after rephrasing? For instance, applying a math prompt to non-mathematical documents likely wastes compute and adds noise.
app/src/content/chapters/8-appendix.mdx CHANGED
@@ -2,7 +2,7 @@
2
 
3
  ### Details on the experiments
4
 
5
- For our ablations we train a 1.2B parameter language model using a Qwen2-style [@qwen2] architecture with 28 layers, a hidden dimension of 2048, 16 attention heads with 8 key-value heads (grouped-query attention [@gqa]), and an intermediate size of 6144. The model uses the Llama 3.2 [@llama3] tokenizer (`hynky/Llama-3.2-1B-no-bos`) with a vocabulary size of 128,256 tokens. We train on 64 NVIDIA H100 80GB GPUs across 8 nodes with pure data parallelism (DP=64), a global batch size of 512, and a sequence length of 4,096 tokens, which accumulates to approximately 21 billion tokens over 10,000 steps. We use the AdamW [@adamw] optimizer with a learning rate of 5×10⁻⁴, β₁=0.9, β₂=0.95, weight decay of 0.1, and gradient clipping at 1.0. All training uses bfloat16 precision with Flash Attention 2 [@flashattention2], fused operations (RMS normalization and rotary embeddings [@rope]), and document masking to prevent cross-document attention. We aim to rephrase at least 10B tokens per experiment, but because completion lengths vary wildly across prompts we sometimes fall short; in those cases we train on some of the data twice.
6
 
7
  ### Prompts
8
 
 
2
 
3
  ### Details on the experiments
4
 
5
+ For our ablations we train a 1.7B parameter language model using a Qwen2-style [@qwen2] architecture with 28 layers, a hidden dimension of 2048, 16 attention heads with 8 key-value heads (grouped-query attention [@gqa]), and an intermediate size of 6144. The model uses the Llama 3.2 [@llama3] tokenizer (`hynky/Llama-3.2-1B-no-bos`) with a vocabulary size of 128,256 tokens. We train on 64 NVIDIA H100 80GB GPUs across 8 nodes with pure data parallelism (DP=64), a global batch size of 512, and a sequence length of 4,096 tokens, which accumulates to approximately 21 billion tokens over 10,000 steps. We use the AdamW [@adamw] optimizer with a learning rate of 5×10⁻⁴, β₁=0.9, β₂=0.95, weight decay of 0.1, and gradient clipping at 1.0. All training uses bfloat16 precision with Flash Attention 2 [@flashattention2], fused operations (RMS normalization and rotary embeddings [@rope]), and document masking to prevent cross-document attention. We aim to rephrase at least 10B tokens per experiment, but because completion lengths vary wildly across prompts we sometimes fall short; in those cases we train on some of the data twice.
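The ~21 billion token figure follows directly from the stated batch size, sequence length, and step count; a quick sanity check:

```javascript
// Token budget implied by the training settings quoted above.
const globalBatch = 512;   // sequences per optimizer step
const seqLen = 4096;       // tokens per sequence
const steps = 10000;       // total optimizer steps
const totalTokens = globalBatch * seqLen * steps;
console.log(totalTokens);  // 20971520000, i.e. ≈ 21 billion tokens
```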
6
 
7
  ### Prompts
8
 
app/src/content/embeds/d3-benchmark-comparison.html CHANGED
@@ -9,6 +9,13 @@
9
  // full form: display is required, rest optional
10
  },
11
  "setups": { "Setup Label": { "datasets": {...} }, ... }, // optional, multi-setup mode with dropdown + average
 
12
  "defaultMetric": "agg_score_macro", // optional, default: "agg_score_macro"
13
  "defaultView": "bar", // optional, "bar" | "line", default: "bar"
14
  "defaultSetup": "average", // optional, setup name or "average", default: "average" when β‰₯2 setups
@@ -236,10 +243,36 @@
236
  // ─── SETUP SUPPORT ───
237
  const SETUPS = cfg.setups || null;
238
  const setupNames = SETUPS ? Object.keys(SETUPS) : [];
239
- const AVG_SETUP_KEY = 'Average (all setups)';
240
  const HIDE_AVERAGE = !!cfg.hideAverage;
241
- const defaultSetupCfg = cfg.defaultSetup || (setupNames.length >= 2 && !HIDE_AVERAGE ? 'average' : null);
242
- let currentSetup = SETUPS ? (defaultSetupCfg === 'average' ? AVG_SETUP_KEY : (defaultSetupCfg && setupNames.includes(defaultSetupCfg) ? defaultSetupCfg : setupNames[0])) : null;
 
243
  let DATASETS = SETUPS ? (currentSetup === AVG_SETUP_KEY ? {} : normalizeDatasets(SETUPS[currentSetup].datasets)) : normalizeDatasets(cfg.datasets);
244
  let avgDatasets = {};
245
  let parsedData = [];
@@ -344,9 +377,10 @@
344
  }
345
 
346
  function computeAverageData(rawData) {
347
- if (!SETUPS || setupNames.length < 2) return { data: [], datasets: {} };
 
348
  const displayToRaws = {};
349
- for (const sName of setupNames) {
350
  const ds = normalizeDatasets(SETUPS[sName].datasets);
351
  for (const [raw, opts] of Object.entries(ds)) {
352
  if (!displayToRaws[opts.display]) displayToRaws[opts.display] = [];
@@ -354,7 +388,7 @@
354
  }
355
  }
356
  const fullDisplay = Object.entries(displayToRaws)
357
- .filter(([, raws]) => raws.length >= setupNames.length);
358
  const byRunStep = {};
359
  for (const row of rawData) byRunStep[row[RUN_COL] + '|' + row[STEP_COL]] = row;
360
  const steps = Array.from(new Set(rawData.map(r => +r[STEP_COL]))).sort((a, b) => a - b);
@@ -364,7 +398,7 @@
364
  for (const [display, raws] of fullDisplay) {
365
  const avgRaw = '__avg__' + display.replace(/[^a-zA-Z0-9]/g, '_');
366
  // Merge options from first setup that has this display name
367
- const firstOpts = Object.values(normalizeDatasets(SETUPS[setupNames[0]].datasets)).find(o => o.display === display) || {};
368
  dsMap[avgRaw] = { display, ...firstOpts };
369
  for (const step of steps) {
370
  const rows = raws.map(r => byRunStep[r + '|' + step]).filter(Boolean);
@@ -381,6 +415,14 @@
381
  return { data: result, datasets: dsMap };
382
  }
383
 
384
  function switchSetup(name) {
385
  currentSetup = name;
386
  if (name === AVG_SETUP_KEY) {
@@ -388,7 +430,6 @@
388
  } else {
389
  DATASETS = normalizeDatasets(SETUPS[name].datasets);
390
  }
391
- // Re-add baselines from any setup
392
  for (const sName of setupNames) {
393
  const ds = normalizeDatasets(SETUPS[sName].datasets);
394
  for (const [raw, opts] of Object.entries(ds)) {
@@ -404,6 +445,33 @@
404
  buildLegend();
405
  }
406
 
407
  function showTip(html, x, y) {
408
  tipInner.innerHTML = html;
409
  const tipW = tip.offsetWidth || 180;
@@ -870,16 +938,32 @@
870
  function buildUI() {
871
  const controls = document.createElement('div'); controls.className = 'controls';
872
 
873
  if (SETUPS && setupNames.length > 0) {
874
  const setupGroup = document.createElement('div'); setupGroup.className = 'control-group';
875
  const setupLabel = document.createElement('label'); setupLabel.setAttribute('for', 'setup-' + uid); setupLabel.textContent = 'Setup';
876
  const setupSelect = document.createElement('select'); setupSelect.id = 'setup-' + uid;
877
- setupNames.forEach(name => {
878
- const opt = document.createElement('option'); opt.value = name; opt.textContent = name;
 
879
  if (name === currentSetup) opt.selected = true;
880
  setupSelect.appendChild(opt);
881
  });
882
- if (setupNames.length >= 2 && !HIDE_AVERAGE) {
883
  const avgOpt = document.createElement('option'); avgOpt.value = AVG_SETUP_KEY; avgOpt.textContent = AVG_SETUP_KEY;
884
  if (currentSetup === AVG_SETUP_KEY) avgOpt.selected = true;
885
  setupSelect.appendChild(avgOpt);
@@ -989,7 +1073,7 @@
989
  const text = await fetchFirstAvailable(csvPaths);
990
  const parsed = d3.csvParse(text);
991
  parsedData = parsed;
992
- if (SETUPS && setupNames.length >= 2 && !HIDE_AVERAGE) {
993
  const avg = computeAverageData(parsed);
994
  avgDatasets = avg.datasets;
995
  const hasAvgData = Object.values(avgDatasets).some(o => !o.baseline);
@@ -1000,7 +1084,7 @@
1000
  } else {
1001
  const sel = container.querySelector('#setup-' + uid);
1002
  if (sel) { const o = sel.querySelector(`option[value="${AVG_SETUP_KEY}"]`); if (o) o.remove(); }
1003
- if (currentSetup === AVG_SETUP_KEY) { currentSetup = setupNames[0]; DATASETS = normalizeDatasets(SETUPS[currentSetup].datasets); if (sel) sel.value = currentSetup; }
1004
  }
1005
  }
1006
  filterData();
 
9
  // full form: display is required, rest optional
10
  },
11
  "setups": { "Setup Label": { "datasets": {...} }, ... }, // optional, multi-setup mode with dropdown + average
12
+ "setupDimension": { // optional, adds a second dropdown that groups setups
13
+ "label": "Student", // dropdown label
14
+ "groups": { // maps dimension values β†’ lists of setup keys
15
+ "0.5B student": ["Math (0.5B)", "Tutorial (0.5B)"], // only these setups are shown when this group is active
16
+ "1.7B student": ["Math (1.7B)", "Tutorial (1.7B)"]
17
+ }
18
+ },
19
  "defaultMetric": "agg_score_macro", // optional, default: "agg_score_macro"
20
  "defaultView": "bar", // optional, "bar" | "line", default: "bar"
21
  "defaultSetup": "average", // optional, setup name or "average", default: "average" when β‰₯2 setups
 
  // ─── SETUP SUPPORT ───
  const SETUPS = cfg.setups || null;
  const setupNames = SETUPS ? Object.keys(SETUPS) : [];

  const HIDE_AVERAGE = !!cfg.hideAverage;
+
+ // ─── SETUP DIMENSION (optional second dropdown that groups setups) ───
+ const SETUP_DIM = cfg.setupDimension || null;
+ const dimLabel = SETUP_DIM ? (SETUP_DIM.label || 'Group') : '';
+ const dimGroups = SETUP_DIM ? (SETUP_DIM.groups || {}) : {};
+ const dimensions = Object.keys(dimGroups);
+ let currentDimension = dimensions.length >= 2 ? dimensions[0] : null;
+ // Reverse lookup: setup name → dimension value
+ const setupToDim = {};
+ for (const [dim, names] of Object.entries(dimGroups)) {
+ for (const n of names) setupToDim[n] = dim;
+ }
+ function filteredSetupNames() {
+ if (!currentDimension) return setupNames;
+ return dimGroups[currentDimension] || [];
+ }
+ function setupDisplayName(name) {
+ if (!currentDimension) return name;
+ // Strip " (dimensionValue)" suffix from display name when a dimension is active
+ for (const dim of dimensions) {
+ const suffix = ' (' + dim + ')';
+ if (name.endsWith(suffix)) return name.slice(0, -suffix.length);
+ }
+ return name;
+ }
+
+ const AVG_SETUP_KEY = 'Average (all setups)';
+ const defaultSetupCfg = cfg.defaultSetup || (filteredSetupNames().length >= 2 && !HIDE_AVERAGE ? 'average' : null);
+ let currentSetup = SETUPS ? (defaultSetupCfg === 'average' ? AVG_SETUP_KEY : (defaultSetupCfg && setupNames.includes(defaultSetupCfg) ? defaultSetupCfg : filteredSetupNames()[0])) : null;
  let DATASETS = SETUPS ? (currentSetup === AVG_SETUP_KEY ? {} : normalizeDatasets(SETUPS[currentSetup].datasets)) : normalizeDatasets(cfg.datasets);
  let avgDatasets = {};
  let parsedData = [];
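In isolation, the new dimension helpers reduce to the standalone sketch below (group and setup names are made up). Note that the suffix strip in `setupDisplayName` only fires when a setup name literally ends with `" (<dimension value>)"`:

```javascript
// Standalone sketch of the dimension helpers added above,
// using made-up group and setup names.
const dimGroups = {
  "small": ["Math (small)", "Tutorial (small)"],
  "large": ["Math (large)", "Tutorial (large)"],
};
const dimensions = Object.keys(dimGroups);
let currentDimension = "small";

// Reverse lookup: setup name -> dimension value.
const setupToDim = {};
for (const [dim, names] of Object.entries(dimGroups)) {
  for (const n of names) setupToDim[n] = dim;
}

// Only the setups of the active group are visible.
function filteredSetupNames() {
  return currentDimension ? (dimGroups[currentDimension] || []) : Object.keys(setupToDim);
}

// Strip the " (<dimension value>)" suffix for display.
function setupDisplayName(name) {
  if (!currentDimension) return name;
  for (const dim of dimensions) {
    const suffix = " (" + dim + ")";
    if (name.endsWith(suffix)) return name.slice(0, -suffix.length);
  }
  return name;
}
```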
 
  }

  function computeAverageData(rawData) {
+ const activeSetups = filteredSetupNames();
+ if (!SETUPS || activeSetups.length < 2) return { data: [], datasets: {} };
  const displayToRaws = {};
+ for (const sName of activeSetups) {
  const ds = normalizeDatasets(SETUPS[sName].datasets);
  for (const [raw, opts] of Object.entries(ds)) {
  if (!displayToRaws[opts.display]) displayToRaws[opts.display] = [];
 
  }
  }
  const fullDisplay = Object.entries(displayToRaws)
+ .filter(([, raws]) => raws.length >= activeSetups.length);
  const byRunStep = {};
  for (const row of rawData) byRunStep[row[RUN_COL] + '|' + row[STEP_COL]] = row;
  const steps = Array.from(new Set(rawData.map(r => +r[STEP_COL]))).sort((a, b) => a - b);
 
  for (const [display, raws] of fullDisplay) {
  const avgRaw = '__avg__' + display.replace(/[^a-zA-Z0-9]/g, '_');
  // Merge options from first setup that has this display name
+ const firstOpts = Object.values(normalizeDatasets(SETUPS[activeSetups[0]].datasets)).find(o => o.display === display) || {};
  dsMap[avgRaw] = { display, ...firstOpts };
  for (const step of steps) {
  const rows = raws.map(r => byRunStep[r + '|' + step]).filter(Boolean);
 
  return { data: result, datasets: dsMap };
  }

+ function recomputeAverage() {
+ const avg = computeAverageData(parsedData.filter(r => !r[RUN_COL].startsWith('__avg__')));
+ avgDatasets = avg.datasets;
+ const origData = parsedData.filter(r => !r[RUN_COL].startsWith('__avg__'));
+ parsedData = origData.concat(avg.data);
+ parsedData.columns = origData.columns || (origData[0] ? Object.keys(origData[0]) : []);
+ }
+
  function switchSetup(name) {
  currentSetup = name;
  if (name === AVG_SETUP_KEY) {
 
  } else {
  DATASETS = normalizeDatasets(SETUPS[name].datasets);
  }

  for (const sName of setupNames) {
  const ds = normalizeDatasets(SETUPS[sName].datasets);
  for (const [raw, opts] of Object.entries(ds)) {
 
  buildLegend();
  }

+ function switchDimension(dim) {
+ currentDimension = dim;
+ recomputeAverage();
+ // Rebuild setup dropdown options
+ const sel = container.querySelector('#setup-' + uid);
+ if (sel) {
+ sel.innerHTML = '';
+ const active = filteredSetupNames();
+ active.forEach(name => {
+ const opt = document.createElement('option'); opt.value = name; opt.textContent = setupDisplayName(name);
+ sel.appendChild(opt);
+ });
+ if (active.length >= 2 && !HIDE_AVERAGE) {
+ const avgOpt = document.createElement('option'); avgOpt.value = AVG_SETUP_KEY; avgOpt.textContent = AVG_SETUP_KEY;
+ sel.appendChild(avgOpt);
+ }
+ // Default to average if available, else first setup
+ if (active.length >= 2 && !HIDE_AVERAGE) {
+ sel.value = AVG_SETUP_KEY;
+ switchSetup(AVG_SETUP_KEY);
+ } else if (active.length) {
+ sel.value = active[0];
+ switchSetup(active[0]);
+ }
+ }
+ }
+
  function showTip(html, x, y) {
  tipInner.innerHTML = html;
  const tipW = tip.offsetWidth || 180;
 
  function buildUI() {
  const controls = document.createElement('div'); controls.className = 'controls';

+ // Dimension dropdown — shown before Setup when setupDimension is configured
+ if (dimensions.length >= 2) {
+ const dimGroup = document.createElement('div'); dimGroup.className = 'control-group';
+ const dimLabelEl = document.createElement('label'); dimLabelEl.setAttribute('for', 'dim-' + uid); dimLabelEl.textContent = dimLabel;
+ const dimSelect = document.createElement('select'); dimSelect.id = 'dim-' + uid;
+ dimensions.forEach(dim => {
+ const opt = document.createElement('option'); opt.value = dim; opt.textContent = dim;
+ if (dim === currentDimension) opt.selected = true;
+ dimSelect.appendChild(opt);
+ });
+ dimSelect.addEventListener('change', () => { switchDimension(dimSelect.value); });
+ dimGroup.appendChild(dimLabelEl); dimGroup.appendChild(dimSelect);
+ controls.appendChild(dimGroup);
+ }
+
  if (SETUPS && setupNames.length > 0) {
  const setupGroup = document.createElement('div'); setupGroup.className = 'control-group';
  const setupLabel = document.createElement('label'); setupLabel.setAttribute('for', 'setup-' + uid); setupLabel.textContent = 'Setup';
  const setupSelect = document.createElement('select'); setupSelect.id = 'setup-' + uid;
+ const active = filteredSetupNames();
+ active.forEach(name => {
+ const opt = document.createElement('option'); opt.value = name; opt.textContent = setupDisplayName(name);
  if (name === currentSetup) opt.selected = true;
  setupSelect.appendChild(opt);
  });
+ if (active.length >= 2 && !HIDE_AVERAGE) {
  const avgOpt = document.createElement('option'); avgOpt.value = AVG_SETUP_KEY; avgOpt.textContent = AVG_SETUP_KEY;
  if (currentSetup === AVG_SETUP_KEY) avgOpt.selected = true;
  setupSelect.appendChild(avgOpt);
 
  const text = await fetchFirstAvailable(csvPaths);
  const parsed = d3.csvParse(text);
  parsedData = parsed;
+ if (SETUPS && filteredSetupNames().length >= 2 && !HIDE_AVERAGE) {
  const avg = computeAverageData(parsed);
  avgDatasets = avg.datasets;
  const hasAvgData = Object.values(avgDatasets).some(o => !o.baseline);
 
  } else {
  const sel = container.querySelector('#setup-' + uid);
  if (sel) { const o = sel.querySelector(`option[value="${AVG_SETUP_KEY}"]`); if (o) o.remove(); }
+ if (currentSetup === AVG_SETUP_KEY) { currentSetup = filteredSetupNames()[0]; DATASETS = normalizeDatasets(SETUPS[currentSetup].datasets); if (sel) sel.value = currentSetup; }
  }
  }
  filterData();