thomwolf (HF staff) committed
Commit 938cee6 · 1 parent: d91f1f2
Files changed (2):
  1. dist/index.html +2 -1
  2. src/index.html +2 -1
dist/index.html CHANGED
@@ -149,10 +149,11 @@
  training data. We then evaluated each model on the same set of tasks and compared average
  scores.</p>
  <p>Our ablation models were trained using <a
- href="https://github.com/huggingface/nanotron"><code>nanotron</code></a><aside>We'll make the configuration to reproduce these ablation models available soon.</aside>. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
+ href="https://github.com/huggingface/nanotron"><code>nanotron</code></a>. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
  architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most
  ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
  model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.</p>
+ <aside>We'll make the configuration to reproduce these ablation models available soon in Nanotron.</aside>
  <p>We evaluated the models using <a
  href="https://github.com/huggingface/lighteval/"><code>lighteval</code></a>. We carefully selected a set of benchmark for ablations by selecting
  benchmarks that would provide good signal at a relatively small scale ("small" models trained on only "a few
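The hyperparameters quoted in the hunk above (a ~2M-token global batch, ~28B training tokens for most ablations, 350B tokens for the longer confirmation runs) imply the rough step counts sketched below. This is a back-of-the-envelope illustration, not part of the commit and not the yet-to-be-released nanotron configuration:

```python
# Back-of-the-envelope sketch (illustrative only) of the ablation training
# budget described in the diff above: ~2M tokens per global batch, ~28B tokens
# for most ablations, 350B tokens for the longer confirmation runs.

GLOBAL_BATCH_TOKENS = 2_000_000      # ~2M tokens per optimizer step
SHORT_RUN_TOKENS = 28_000_000_000    # ~28B tokens (roughly Chinchilla-optimal for 1.82B params)
LONG_RUN_TOKENS = 350_000_000_000    # 350B tokens for the confirmation runs

def steps_for(total_tokens: int, batch_tokens: int = GLOBAL_BATCH_TOKENS) -> int:
    """Number of optimizer steps needed to consume `total_tokens`."""
    return total_tokens // batch_tokens

print(steps_for(SHORT_RUN_TOKENS))  # ~14,000 steps for a ~28B-token ablation
print(steps_for(LONG_RUN_TOKENS))   # ~175,000 steps for a 350B-token run
```

At ~2M tokens per step, the ~28B-token ablations therefore take on the order of 14k optimizer steps, while the 350B-token runs take roughly 175k.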
src/index.html CHANGED
@@ -149,10 +149,11 @@
  training data. We then evaluated each model on the same set of tasks and compared average
  scores.</p>
  <p>Our ablation models were trained using <a
- href="https://github.com/huggingface/nanotron"><code>nanotron</code></a><aside>We'll make the configuration to reproduce these ablation models available soon.</aside>. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
+ href="https://github.com/huggingface/nanotron"><code>nanotron</code></a>. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
  architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most
  ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
  model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.</p>
+ <aside>We'll make the configuration to reproduce these ablation models available soon in Nanotron.</aside>
  <p>We evaluated the models using <a
  href="https://github.com/huggingface/lighteval/"><code>lighteval</code></a>. We carefully selected a set of benchmark for ablations by selecting
  benchmarks that would provide good signal at a relatively small scale ("small" models trained on only "a few
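For the evaluation paragraph appearing in both hunks, "compared average scores" amounts to taking an unweighted mean over the same fixed task set for every ablation model. The sketch below shows only that averaging step; the model names, task names, and numbers are invented placeholders, and this is not the lighteval API:

```python
# Illustrative only (not part of this commit, not the lighteval API): comparing
# ablation models by their average score over a shared, fixed benchmark set.

from statistics import mean

# Hypothetical per-task scores for two ablation models (placeholder numbers).
scores = {
    "baseline-filtering": {"task_a": 0.41, "task_b": 0.55, "task_c": 0.38},
    "stricter-filtering": {"task_a": 0.44, "task_b": 0.57, "task_c": 0.40},
}

for model, per_task in scores.items():
    # Unweighted mean over the same task set for every model.
    print(f"{model}: average score = {mean(per_task.values()):.3f}")
```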