benchmark-viz (#6)
benchmark viz (f8c76bdea8a217d4bba71a8943018c39c03965c8)
assets/data/benchmarks/benchmarks_interactive.html
ADDED
The diff for this file is too large to render.

dist/assets/data/benchmarks/benchmarks_interactive.html
ADDED
The diff for this file is too large to render.
dist/index.html
CHANGED
@@ -52,7 +52,7 @@
 <d-contents>
 </d-contents>
 
-<p>Fueled by the scaling laws<d-cite bibtex-key="kaplan2020scalinglaws"></d-cite><d-cite bibtex-key="hoffmann2022chinchilla"></d-cite>, the trend of training ever larger language models on vaster amounts of data has been driving progress in AI for the past couple years. Initially, the development of the largest models happened exclusively behind closed doors of a handful of research labs but recently opened up more with the release of models such as Llama 3.1 405B<d-cite bibtex-key="grattafiori2024llama3herdmodels"></d-cite> and DeepSeek R1<d-cite bibtex-key="deepseekai2024deepseekv3technicalreport"></d-cite>. While these models have <a href="https://huggingface.co/meta-llama">openly shared</a> <a href="https://huggingface.co/deepseek-ai">weights</a> and their training recipes are described in <a href="https://ai.meta.com/research/publications/the-llama-3-herd-of-models/">technical</a> <a href="https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf">reports</a>, the challenging engineering involved in training at the necessary infrastructure scale is still hidden between the lines of a handful of papers and complex training frameworks. This
+<p>Fueled by the scaling laws<d-cite bibtex-key="kaplan2020scalinglaws"></d-cite><d-cite bibtex-key="hoffmann2022chinchilla"></d-cite>, the trend of training ever larger language models on vaster amounts of data has been driving progress in AI for the past couple years. Initially, the development of the largest models happened exclusively behind closed doors of a handful of research labs but recently opened up more with the release of models such as Llama 3.1 405B<d-cite bibtex-key="grattafiori2024llama3herdmodels"></d-cite> and DeepSeek R1<d-cite bibtex-key="deepseekai2024deepseekv3technicalreport"></d-cite>. While these models have <a href="https://huggingface.co/meta-llama">openly shared</a> <a href="https://huggingface.co/deepseek-ai">weights</a> and their training recipes are described in <a href="https://ai.meta.com/research/publications/the-llama-3-herd-of-models/">technical</a> <a href="https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf">reports</a>, the challenging engineering involved in training at the necessary infrastructure scale is still hidden between the lines of a handful of papers and complex training frameworks. This <s>long blog post</s> open-source book is here to open this black box!</p>
 
 <aside>Reading time: 7 days. For the best reading experience, we recommend not using a mobile phone.</aside>
 
@@ -178,8 +178,15 @@
 <p><img alt="Picotron implements each key concept in a self-contained way, such that the method can be studied separately and in isolation." src="assets/images/placeholder.png" /></p>
 
 <p><strong>Real training efficiency benchmarks:</strong> Finally, how to <em>actually</em> scale your LLM training depends on your infrastructure, such as the kind of chips, interconnect, etc., and we can’t give a single unified recipe. What we will give, though, is a way to benchmark several setups, and that is what we have done on our cluster! We ran over 4100 distributed experiments with up to 512 GPUs to scan many possible distributed training layouts and model sizes. TODO: link to dataset too </p>
-
-<
+
+<iframe id="plotFrame" src="assets/data/benchmarks/benchmarks_interactive.html" width="90%" scrolling="no" frameborder="0"></iframe>
+<script>
+  window.addEventListener('load', function() {
+    const frame = document.getElementById('plotFrame');
+    frame.style.height = frame.contentWindow.document.documentElement.scrollHeight + 'px';
+    frame.style.width = frame.contentWindow.document.documentElement.scrollWidth + 'px';
+  });
+</script>
 
 <p>As you can see, there’s a lot of ground to be covered. Before getting into the trenches of distributed training, let’s take a quick high-level look at what we’ll cover in the post.</p>
src/index.html
CHANGED
@@ -52,7 +52,7 @@
 <d-contents>
 </d-contents>
 
-<p>Fueled by the scaling laws<d-cite bibtex-key="kaplan2020scalinglaws"></d-cite><d-cite bibtex-key="hoffmann2022chinchilla"></d-cite>, the trend of training ever larger language models on vaster amounts of data has been driving progress in AI for the past couple years. Initially, the development of the largest models happened exclusively behind closed doors of a handful of research labs but recently opened up more with the release of models such as Llama 3.1 405B<d-cite bibtex-key="grattafiori2024llama3herdmodels"></d-cite> and DeepSeek R1<d-cite bibtex-key="deepseekai2024deepseekv3technicalreport"></d-cite>. While these models have <a href="https://huggingface.co/meta-llama">openly shared</a> <a href="https://huggingface.co/deepseek-ai">weights</a> and their training recipes are described in <a href="https://ai.meta.com/research/publications/the-llama-3-herd-of-models/">technical</a> <a href="https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf">reports</a>, the challenging engineering involved in training at the necessary infrastructure scale is still hidden between the lines of a handful of papers and complex training frameworks. This
+<p>Fueled by the scaling laws<d-cite bibtex-key="kaplan2020scalinglaws"></d-cite><d-cite bibtex-key="hoffmann2022chinchilla"></d-cite>, the trend of training ever larger language models on vaster amounts of data has been driving progress in AI for the past couple years. Initially, the development of the largest models happened exclusively behind closed doors of a handful of research labs but recently opened up more with the release of models such as Llama 3.1 405B<d-cite bibtex-key="grattafiori2024llama3herdmodels"></d-cite> and DeepSeek R1<d-cite bibtex-key="deepseekai2024deepseekv3technicalreport"></d-cite>. While these models have <a href="https://huggingface.co/meta-llama">openly shared</a> <a href="https://huggingface.co/deepseek-ai">weights</a> and their training recipes are described in <a href="https://ai.meta.com/research/publications/the-llama-3-herd-of-models/">technical</a> <a href="https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf">reports</a>, the challenging engineering involved in training at the necessary infrastructure scale is still hidden between the lines of a handful of papers and complex training frameworks. This <s>long blog post</s> open-source book is here to open this black box!</p>
 
 <aside>Reading time: 7 days. For the best reading experience, we recommend not using a mobile phone.</aside>
 
@@ -178,8 +178,15 @@
 <p><img alt="Picotron implements each key concept in a self-contained way, such that the method can be studied separately and in isolation." src="assets/images/placeholder.png" /></p>
 
 <p><strong>Real training efficiency benchmarks:</strong> Finally, how to <em>actually</em> scale your LLM training depends on your infrastructure, such as the kind of chips, interconnect, etc., and we can’t give a single unified recipe. What we will give, though, is a way to benchmark several setups, and that is what we have done on our cluster! We ran over 4100 distributed experiments with up to 512 GPUs to scan many possible distributed training layouts and model sizes. TODO: link to dataset too </p>
-
-<
+
+<iframe id="plotFrame" src="assets/data/benchmarks/benchmarks_interactive.html" width="90%" scrolling="no" frameborder="0"></iframe>
+<script>
+  window.addEventListener('load', function() {
+    const frame = document.getElementById('plotFrame');
+    frame.style.height = frame.contentWindow.document.documentElement.scrollHeight + 'px';
+    frame.style.width = frame.contentWindow.document.documentElement.scrollWidth + 'px';
+  });
+</script>
 
 <p>As you can see, there’s a lot of ground to be covered. Before getting into the trenches of distributed training, let’s take a quick high-level look at what we’ll cover in the post.</p>
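A note on the snippet added in both files: it measures the embedded benchmark page once, on the parent window's load event, and copies its scrollHeight/scrollWidth onto the iframe. This only works because the plot HTML is served from the same origin as the post; for a cross-origin iframe, contentWindow.document would not be accessible. Below is a minimal sketch of a variant that also re-fits the frame when the surrounding page is resized. It is not part of this PR, and the fitFrameToContent helper name is made up for illustration.

<iframe id="plotFrame" src="assets/data/benchmarks/benchmarks_interactive.html"
        width="90%" scrolling="no" frameborder="0"></iframe>
<script>
  // Sketch only, assuming the embedded plot is same-origin with the page.
  const frame = document.getElementById('plotFrame');

  // Copy the embedded document's full size onto the iframe element.
  function fitFrameToContent() {
    const doc = frame.contentWindow.document.documentElement;
    frame.style.height = doc.scrollHeight + 'px';
    frame.style.width = doc.scrollWidth + 'px';
  }

  // Fit once the plot has loaded, then keep it fitted on window resizes.
  frame.addEventListener('load', function() {
    fitFrameToContent();
    window.addEventListener('resize', fitFrameToContent);
  });
</script>

The width/height copy is the same as in the PR; only the re-fit-on-resize wiring is new.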