thomwolf (HF staff) committed
Commit 938cee6 · 1 parent: d91f1f2
Files changed (2):
  1. dist/index.html +2 -1
  2. src/index.html +2 -1
dist/index.html CHANGED
@@ -149,10 +149,11 @@
  training data. We then evaluated each model on the same set of tasks and compared average
  scores.</p>
  <p>Our ablation models were trained using <a
- href="https://github.com/huggingface/nanotron"><code>nanotron</code></a><aside>We'll make the configuration to reproduce these ablation models available soon.</aside>. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
+ href="https://github.com/huggingface/nanotron"><code>nanotron</code></a>. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
  architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most
  ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
  model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.</p>
+ <aside>We'll make the configuration to reproduce these ablation models available soon in Nanotron.</aside>
  <p>We evaluated the models using <a
  href="https://github.com/huggingface/lighteval/"><code>lighteval</code></a>. We carefully selected a set of benchmark for ablations by selecting
  benchmarks that would provide good signal at a relatively small scale ("small" models trained on only "a few
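The hyperparameters quoted in the hunk above (a ~2M-token global batch, ~28B training tokens for most ablations, 350B tokens for the longer confirmation runs) imply the rough step counts sketched below. This is a back-of-the-envelope illustration, not part of the commit and not the yet-to-be-released nanotron configuration:

```python
# Back-of-the-envelope sketch (illustrative only) of the ablation training
# budget described in the diff above: ~2M tokens per global batch, ~28B tokens
# for most ablations, 350B tokens for the longer confirmation runs.

GLOBAL_BATCH_TOKENS = 2_000_000      # ~2M tokens per optimizer step
SHORT_RUN_TOKENS = 28_000_000_000    # ~28B tokens (roughly Chinchilla-optimal for 1.82B params)
LONG_RUN_TOKENS = 350_000_000_000    # 350B tokens for the confirmation runs

def steps_for(total_tokens: int, batch_tokens: int = GLOBAL_BATCH_TOKENS) -> int:
    """Number of optimizer steps needed to consume `total_tokens`."""
    return total_tokens // batch_tokens

print(steps_for(SHORT_RUN_TOKENS))  # ~14,000 steps for a ~28B-token ablation
print(steps_for(LONG_RUN_TOKENS))   # ~175,000 steps for a 350B-token run
```

At ~2M tokens per step, the ~28B-token ablations therefore take on the order of 14k optimizer steps, while the 350B-token runs take roughly 175k.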
src/index.html CHANGED
@@ -149,10 +149,11 @@
  training data. We then evaluated each model on the same set of tasks and compared average
  scores.</p>
  <p>Our ablation models were trained using <a
- href="https://github.com/huggingface/nanotron"><code>nanotron</code></a><aside>We'll make the configuration to reproduce these ablation models available soon.</aside>. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
+ href="https://github.com/huggingface/nanotron"><code>nanotron</code></a>. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
  architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most
  ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
  model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.</p>
+ <aside>We'll make the configuration to reproduce these ablation models available soon in Nanotron.</aside>
  <p>We evaluated the models using <a
  href="https://github.com/huggingface/lighteval/"><code>lighteval</code></a>. We carefully selected a set of benchmark for ablations by selecting
  benchmarks that would provide good signal at a relatively small scale ("small" models trained on only "a few
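For the evaluation paragraph appearing in both hunks, "compared average scores" amounts to taking an unweighted mean over the same fixed task set for every ablation model. The sketch below shows only that averaging step; the model names, task names, and numbers are invented placeholders, and this is not the lighteval API:

```python
# Illustrative only (not part of this commit, not the lighteval API): comparing
# ablation models by their average score over a shared, fixed benchmark set.

from statistics import mean

# Hypothetical per-task scores for two ablation models (placeholder numbers).
scores = {
    "baseline-filtering": {"task_a": 0.41, "task_b": 0.55, "task_c": 0.38},
    "stricter-filtering": {"task_a": 0.44, "task_b": 0.57, "task_c": 0.40},
}

for model, per_task in scores.items():
    # Unweighted mean over the same task set for every model.
    print(f"{model}: average score = {mean(per_task.values()):.3f}")
```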