thomwolf (HF staff) committed
Commit 1efa1a5
Parent: d8919a0

conclusion

Files changed (2)
  1. dist/index.html +4 -4
  2. src/index.html +4 -4
dist/index.html CHANGED
@@ -646,10 +646,10 @@
  <p>Given that a threshold of 2 also demonstrated strong performance while retaining more data, we are releasing an additional dataset filtered with this threshold, containing 5.4 trillion tokens under <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2">HuggingFaceFW/fineweb-edu-score-2</a>.</p>
  <p>You can find the two datasets along with the classifier used for the filtering in this <a href="https://huggingface.co/collections/HuggingFaceFW/fineweb-edu-6659c3f3d399d0e1d648adfd">collection</a>.</p>
 
- <h2>Next steps</h2>
- <p>Through our open science efforts we hope to open more and more the block box aronud training good quality large language models as well as give every model trainer the ability to create state-of-the-art LLMs. We're excited to continue iterating on FineWeb and increasingly better filtered subsets of web data, in a fully open and reproducible manner.</p>
- <p>Moreover, while English currently dominates the large language model landscape, we're looking forward to applying the learnings we make in project like this one to make high quality training data available as well in other languages and more easily accessible.</p>
- <p>In a nutshell: the future is bright and exciting for studying the science of creating datasets at scale in the open.</p>
+ <h2>Conclusion and looking forward</h2>
+ <p>Through our open science efforts we hope to open more and more the black box around training high performance large language models as well as give every model trainer the ability to create state-of-the-art LLMs. We're excited to continue iterating on FineWeb and increasingly better filtered subsets of web data, in a fully open and reproducible manner.</p>
+ <p>In particular in the short term, while English currently dominates the large language model landscape, we're looking forward to applying the learnings we make in this project to make high quality training data available in other languages as well and as accessible as possible.</p>
+ <p>In a nutshell: the future is bright and exciting for studying the science of creating datasets at scale and in the open 🤗.</p>
  </d-article>
 
  <d-appendix>
src/index.html CHANGED
@@ -646,10 +646,10 @@
  <p>Given that a threshold of 2 also demonstrated strong performance while retaining more data, we are releasing an additional dataset filtered with this threshold, containing 5.4 trillion tokens under <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2">HuggingFaceFW/fineweb-edu-score-2</a>.</p>
  <p>You can find the two datasets along with the classifier used for the filtering in this <a href="https://huggingface.co/collections/HuggingFaceFW/fineweb-edu-6659c3f3d399d0e1d648adfd">collection</a>.</p>
 
- <h2>Next steps</h2>
- <p>Through our open science efforts we hope to open more and more the block box aronud training good quality large language models as well as give every model trainer the ability to create state-of-the-art LLMs. We're excited to continue iterating on FineWeb and increasingly better filtered subsets of web data, in a fully open and reproducible manner.</p>
- <p>Moreover, while English currently dominates the large language model landscape, we're looking forward to applying the learnings we make in project like this one to make high quality training data available as well in other languages and more easily accessible.</p>
- <p>In a nutshell: the future is bright and exciting for studying the science of creating datasets at scale in the open.</p>
+ <h2>Conclusion and looking forward</h2>
+ <p>Through our open science efforts we hope to open more and more the black box around training high performance large language models as well as give every model trainer the ability to create state-of-the-art LLMs. We're excited to continue iterating on FineWeb and increasingly better filtered subsets of web data, in a fully open and reproducible manner.</p>
+ <p>In particular in the short term, while English currently dominates the large language model landscape, we're looking forward to applying the learnings we make in this project to make high quality training data available in other languages as well and as accessible as possible.</p>
+ <p>In a nutshell: the future is bright and exciting for studying the science of creating datasets at scale and in the open 🤗.</p>
  </d-article>
 
  <d-appendix>
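
For readers who want to try the data mentioned in the committed text, here is a minimal sketch (not part of this commit) of streaming a few documents from the HuggingFaceFW/fineweb-edu-score-2 dataset with the Hugging Face datasets library; the "train" split and the "text" column are assumptions based on the usual FineWeb layout.

from datasets import load_dataset

# Minimal usage sketch: stream the score-2 release instead of downloading
# the full 5.4-trillion-token dump locally.
ds = load_dataset("HuggingFaceFW/fineweb-edu-score-2", split="train", streaming=True)

# Peek at the first few documents.
for i, doc in enumerate(ds):
    print(doc["text"][:200])  # "text" column assumed, as in FineWeb
    if i == 2:
        break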