Spaces:

HuggingFaceFW
/

blogpost-fineweb-v1

Running

App Files Files Community

anton-l HF staff commited on May 31, 2024

Commit

ad718b5

1 Parent(s): 2c9e5db

Update index.html

Browse files

Files changed (1) hide show

index.html +2 -3

index.html CHANGED Viewed

@@ -692,10 +692,9 @@
     </div>
     <p>We also experimented with  <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x-7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a> and a jury of all three models following <a href="https://arxiv.org/abs/2404.18796">Verga et al.</a>, but found that Llama3 alone gave the most reliable results.</p>
     <h3>Classifier Training</h3>
-    <p>We added a classification head with a single regression output to <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> and trained it on 450,000 Llama3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers.  We saved the checkpoint with the highest F1 score on our held-out validation set of 50,000 samples, treating Llama3 annotations as ground-truth. After training, we rounded the scores to integers from 0 to 5.</p>
     <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of 3, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
-    <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg">https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg</a>. The training and inference code is available on  <a href="https://github.com/huggingface/cosmopedia/tree/edu-classifier/classification">GitHub</a>.</p>
-    <p><strong>TODO: fill model card and move the github code to another folder</strong></p>
     <h3>Filtering and results</h3>
     <p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of using different thresholds for the filtering and found that threshold 3 gave the best results. The plot below shows the performance of each threshold compared to FineWeb on six different benchmarks; it uses a 1.82B model trained on 8B tokens.</p>
     <div class="main-plot-container">

     </div>
     <p>We also experimented with  <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x-7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a> and a jury of all three models following <a href="https://arxiv.org/abs/2404.18796">Verga et al.</a>, but found that Llama3 alone gave the most reliable results.</p>
     <h3>Classifier Training</h3>
+    <p>We added a classification head with a single regression output to <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> and trained it on 450,000 Llama3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers.  We saved the checkpoint with the highest F1 score on our held-out validation set of ~47k samples, treating Llama3 annotations as ground-truth. After training, we rounded the scores to integers from 0 to 5.</p>
     <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of 3, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
+    <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier">https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier</a>. The training and inference code is available on  <a href="https://github.com/huggingface/cosmopedia/tree/main/classification">GitHub</a>.</p>
     <h3>Filtering and results</h3>
     <p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of using different thresholds for the filtering and found that threshold 3 gave the best results. The plot below shows the performance of each threshold compared to FineWeb on six different benchmarks; it uses a 1.82B model trained on 8B tokens.</p>
     <div class="main-plot-container">