hop
- dist/index.html (+2 -1)
- src/index.html (+2 -1)
dist/index.html
CHANGED
@@ -149,10 +149,11 @@
 training data. We then evaluated each model on the same set of tasks and compared average
 scores.</p>
 <p>Our ablation models were trained using <a
-href="https://github.com/huggingface/nanotron"><code>nanotron</code></a
+href="https://github.com/huggingface/nanotron"><code>nanotron</code></a>. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
 architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most
 ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
 model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.</p>
+<aside>We'll make the configuration to reproduce these ablation models available soon in Nanotron.</aside>
 <p>We evaluated the models using <a
 href="https://github.com/huggingface/lighteval/"><code>lighteval</code></a>. We carefully selected a set of benchmark for ablations by selecting
 benchmarks that would provide good signal at a relatively small scale ("small" models trained on only "a few
src/index.html
CHANGED
@@ -149,10 +149,11 @@
 training data. We then evaluated each model on the same set of tasks and compared average
 scores.</p>
 <p>Our ablation models were trained using <a
-href="https://github.com/huggingface/nanotron"><code>nanotron</code></a
+href="https://github.com/huggingface/nanotron"><code>nanotron</code></a>. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
 architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most
 ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
 model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.</p>
+<aside>We'll make the configuration to reproduce these ablation models available soon in Nanotron.</aside>
 <p>We evaluated the models using <a
 href="https://github.com/huggingface/lighteval/"><code>lighteval</code></a>. We carefully selected a set of benchmark for ablations by selecting
 benchmarks that would provide good signal at a relatively small scale ("small" models trained on only "a few
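For context on the training setup the added text describes, here is a quick back-of-the-envelope sketch (plain Python, not taken from the repository or from nanotron's config schema) of the schedule implied by the stated numbers: a ~2M-token global batch at a 2048 sequence length works out to roughly a thousand sequences per optimizer step, about 14k steps for the ~28B-token ablations, and about 175k steps for the 350B-token confirmation runs.

```python
# Back-of-the-envelope schedule implied by the ablation setup described in the patch.
# The constants are the figures quoted in the diff; the function name and rounding
# choices are illustrative assumptions, not the authors' actual training code.

SEQ_LEN = 2048                    # sequence length
GLOBAL_BATCH_TOKENS = 2_000_000   # ~2M tokens per optimizer step
ABLATION_TOKENS = 28e9            # ~28B tokens for most ablations
LONG_RUN_TOKENS = 350e9           # 350B tokens for the longer confirmation runs


def schedule(total_tokens: float) -> tuple[int, int]:
    """Return (sequences per global batch, training steps) for a given token budget."""
    seqs_per_batch = GLOBAL_BATCH_TOKENS // SEQ_LEN        # ~976 sequences per step
    steps = round(total_tokens / GLOBAL_BATCH_TOKENS)      # token budget / tokens per step
    return seqs_per_batch, steps


print(schedule(ABLATION_TOKENS))   # -> (976, 14000): ~14k steps for a 28B-token ablation
print(schedule(LONG_RUN_TOKENS))   # -> (976, 175000): ~175k steps for a 350B-token run
```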