update

- dist/index.html +1 -2
- src/index.html +1 -2
dist/index.html
CHANGED
@@ -149,8 +149,7 @@
 training data. We then evaluated each model on the same set of tasks and compared average
 scores.</p>
 <p>Our ablation models were trained using <a
-href="https://github.com/huggingface/nanotron"><code>nanotron</code></a>
-INSERT SIMPLIFIED NANOTRON CONFIG HERE</strong>]. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
+href="https://github.com/huggingface/nanotron"><code>nanotron</code></a><aside>We'll make the configuration to reproduce these ablation models available soon.</aside>. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
 architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most
 ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
 model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.</p>
src/index.html
CHANGED
@@ -149,8 +149,7 @@
 training data. We then evaluated each model on the same set of tasks and compared average
 scores.</p>
 <p>Our ablation models were trained using <a
-href="https://github.com/huggingface/nanotron"><code>nanotron</code></a>
-INSERT SIMPLIFIED NANOTRON CONFIG HERE</strong>]. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
+href="https://github.com/huggingface/nanotron"><code>nanotron</code></a><aside>We'll make the configuration to reproduce these ablation models available soon.</aside>. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
 architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most
 ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
 model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.</p>
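
The changed paragraph replaces the placeholder and keeps the stated ablation scale: 1.82B parameters, Llama architecture, 2048-token sequences, a ~2M-token global batch, ~28B tokens for most ablations and 350B-token confirmation runs. As a rough sanity check of how those numbers fit together, here is a minimal arithmetic sketch in Python (illustrative only, derived from the prose in the diff; it is not the nanotron configuration, which the new text says will be released separately):

# Rough scale of the ablation setup described in the diff above.
# All inputs come from the prose; this is NOT the deferred nanotron config.

SEQ_LEN = 2048                      # sequence length
GLOBAL_BATCH_TOKENS = 2_000_000     # ~2 million tokens per global batch (approximate)
ABLATION_TOKENS = 28_000_000_000    # ~28B tokens, roughly Chinchilla-optimal at this size
LONG_RUN_TOKENS = 350_000_000_000   # 350B-token confirmation runs

sequences_per_batch = GLOBAL_BATCH_TOKENS // SEQ_LEN       # ~976 sequences per batch
ablation_steps = ABLATION_TOKENS // GLOBAL_BATCH_TOKENS    # ~14,000 optimizer steps
long_run_steps = LONG_RUN_TOKENS // GLOBAL_BATCH_TOKENS    # ~175,000 optimizer steps

print(f"~{sequences_per_batch} sequences per global batch")
print(f"~{ablation_steps:,} steps for a ~28B-token ablation")
print(f"~{long_run_steps:,} steps for a 350B-token run")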