thomwolf (HF staff) committed on
Commit d91f1f2
1 Parent(s): 1efa1a5
Files changed (2)
  1. dist/index.html +1 -2
  2. src/index.html +1 -2
dist/index.html CHANGED
@@ -149,8 +149,7 @@
   training data. We then evaluated each model on the same set of tasks and compared average
   scores.</p>
   <p>Our ablation models were trained using <a
-  href="https://github.com/huggingface/nanotron"><code>nanotron</code></a> with this config [<strong>TODO:
-  INSERT SIMPLIFIED NANOTRON CONFIG HERE</strong>]. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
+  href="https://github.com/huggingface/nanotron"><code>nanotron</code></a><aside>We'll make the configuration to reproduce these ablation models available soon.</aside>. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
   architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most
   ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
   model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.</p>
src/index.html CHANGED
@@ -149,8 +149,7 @@
   training data. We then evaluated each model on the same set of tasks and compared average
   scores.</p>
   <p>Our ablation models were trained using <a
-  href="https://github.com/huggingface/nanotron"><code>nanotron</code></a> with this config [<strong>TODO:
-  INSERT SIMPLIFIED NANOTRON CONFIG HERE</strong>]. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
+  href="https://github.com/huggingface/nanotron"><code>nanotron</code></a><aside>We'll make the configuration to reproduce these ablation models available soon.</aside>. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
   architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most
   ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
   model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.</p>
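
For context, the hyperparameters quoted in the edited paragraph imply the batch and step counts below. This is only a back-of-the-envelope sketch in Python, not the nanotron configuration itself (the commit now states that configuration will be released separately); the variable names are ours.

# Back-of-the-envelope arithmetic for the ablation setup described in the diff.
# All input numbers come from the paragraph above; this sketch is illustrative
# and is not part of the authors' training configuration.

sequence_length = 2048              # tokens per sequence (Llama architecture)
global_batch_tokens = 2_000_000     # "global batch size of ~2 million tokens"
total_tokens = 28_000_000_000       # "~28B tokens" for most ablations
long_run_tokens = 350_000_000_000   # 350B-token runs used to confirm results

sequences_per_batch = global_batch_tokens // sequence_length
steps_short = total_tokens // global_batch_tokens
steps_long = long_run_tokens // global_batch_tokens

print(f"sequences per global batch : ~{sequences_per_batch}")   # ~976
print(f"optimizer steps for 28B    : ~{steps_short:,}")         # ~14,000
print(f"optimizer steps for 350B   : ~{steps_long:,}")          # ~175,000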