conclusion
- dist/index.html +4 -4
- src/index.html +4 -4
dist/index.html
@@ -646,10 +646,10 @@
       <p>Given that a threshold of 2 also demonstrated strong performance while retaining more data, we are releasing an additional dataset filtered with this threshold, containing 5.4 trillion tokens under <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2">HuggingFaceFW/fineweb-edu-score-2</a>.</p>
       <p>You can find the two datasets along with the classifier used for the filtering in this <a href="https://huggingface.co/collections/HuggingFaceFW/fineweb-edu-6659c3f3d399d0e1d648adfd">collection</a>.</p>

-      <h2>
-      <p>Through our open science efforts we hope to open more and more the
-      <p>
-      <p>In a nutshell: the future is bright and exciting for studying the science of creating datasets at scale in the open
+      <h2>Conclusion and looking forward</h2>
+      <p>Through our open science efforts we hope to progressively open up the black box around training high-performance large language models, and to give every model trainer the ability to create state-of-the-art LLMs. We're excited to continue iterating on FineWeb and on increasingly better-filtered subsets of web data, in a fully open and reproducible manner.</p>
+      <p>In particular, in the short term, while English currently dominates the large language model landscape, we look forward to applying the lessons learned in this project to make high-quality training data as available and as accessible as possible in other languages as well.</p>
+      <p>In a nutshell: the future is bright and exciting for studying the science of creating datasets at scale and in the open 🤗.</p>
       </d-article>

       <d-appendix>
src/index.html
@@ -646,10 +646,10 @@ (identical change to dist/index.html above)
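For reference, a minimal sketch of consuming the threshold-2 release named in this change, using the `datasets` library in streaming mode. The `int_score` and `url` column names are assumptions taken from the dataset card rather than from this diff, and the stricter cut-off of 3 simply mirrors the threshold used for the main FineWeb-Edu dataset per the post.

    from datasets import load_dataset

    # Stream the threshold-2 release so the 5.4 trillion tokens are not
    # downloaded up front.
    fw_edu_2 = load_dataset(
        "HuggingFaceFW/fineweb-edu-score-2",
        split="train",
        streaming=True,
    )

    # Optionally re-apply a stricter educational-score cut-off; `int_score`
    # is the classifier's rounded per-document score (assumed column name).
    stricter = fw_edu_2.filter(lambda doc: doc["int_score"] >= 3)

    # Peek at a few documents.
    for doc in stricter.take(3):
        print(doc["url"], doc["int_score"])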