thomwolf (HF staff) committed on
Commit d91f1f2
1 Parent(s): 1efa1a5
Files changed (2)
  1. dist/index.html +1 -2
  2. src/index.html +1 -2
dist/index.html CHANGED
@@ -149,8 +149,7 @@
   training data. We then evaluated each model on the same set of tasks and compared average
   scores.</p>
   <p>Our ablation models were trained using <a
-  href="https://github.com/huggingface/nanotron"><code>nanotron</code></a> with this config [<strong>TODO:
-  INSERT SIMPLIFIED NANOTRON CONFIG HERE</strong>]. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
+  href="https://github.com/huggingface/nanotron"><code>nanotron</code></a><aside>We'll make the configuration to reproduce these ablation models available soon.</aside>. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
   architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most
   ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
   model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.</p>
src/index.html CHANGED
@@ -149,8 +149,7 @@
   training data. We then evaluated each model on the same set of tasks and compared average
   scores.</p>
   <p>Our ablation models were trained using <a
-  href="https://github.com/huggingface/nanotron"><code>nanotron</code></a> with this config [<strong>TODO:
-  INSERT SIMPLIFIED NANOTRON CONFIG HERE</strong>]. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
+  href="https://github.com/huggingface/nanotron"><code>nanotron</code></a><aside>We'll make the configuration to reproduce these ablation models available soon.</aside>. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
   architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most
   ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
   model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.</p>
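
For context, the hyperparameters quoted in the edited paragraph imply the batch and step counts below. This is only a back-of-the-envelope sketch in Python, not the nanotron configuration itself (the commit now states that configuration will be released separately); the variable names are ours.

# Back-of-the-envelope arithmetic for the ablation setup described in the diff.
# All input numbers come from the paragraph above; this sketch is illustrative
# and is not part of the authors' training configuration.

sequence_length = 2048              # tokens per sequence (Llama architecture)
global_batch_tokens = 2_000_000     # "global batch size of ~2 million tokens"
total_tokens = 28_000_000_000       # "~28B tokens" for most ablations
long_run_tokens = 350_000_000_000   # 350B-token runs used to confirm results

sequences_per_batch = global_batch_tokens // sequence_length
steps_short = total_tokens // global_batch_tokens
steps_long = long_run_tokens // global_batch_tokens

print(f"sequences per global batch : ~{sequences_per_batch}")   # ~976
print(f"optimizer steps for 28B    : ~{steps_short:,}")         # ~14,000
print(f"optimizer steps for 350B   : ~{steps_long:,}")          # ~175,000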