anton-l HF staff committed on
Commit ad718b5 · 1 Parent(s): 2c9e5db

Update index.html

Files changed (1)
  1. index.html +2 -3
index.html CHANGED
@@ -692,10 +692,9 @@
  </div>
  <p>We also experimented with <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x-7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a> and a jury of all three models following <a href="https://arxiv.org/abs/2404.18796">Verga et al.</a>, but found that Llama3 alone gave the most reliable results.</p>
  <h3>Classifier Training</h3>
- <p>We added a classification head with a single regression output to <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> and trained it on 450,000 Llama3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 50,000 samples, treating Llama3 annotations as ground-truth. After training, we rounded the scores to integers from 0 to 5.</p>
+ <p>We added a classification head with a single regression output to <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> and trained it on 450,000 Llama3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of ~47k samples, treating Llama3 annotations as ground-truth. After training, we rounded the scores to integers from 0 to 5.</p>
  <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of 3, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
- <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg">https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/edu-classifier/classification">GitHub</a>.</p>
- <p><strong>TODO: fill model card and move the github code to another folder</strong></p>
+ <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier">https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/main/classification">GitHub</a>.</p>
  <h3>Filtering and results</h3>
  <p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of using different thresholds for the filtering and found that threshold 3 gave the best results. The plot below shows the performance of each threshold compared to FineWeb on six different benchmarks; it uses a 1.82B model trained on 8B tokens.</p>
  <div class="main-plot-container">
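
Note on the "Classifier Training" paragraphs in this hunk: the text describes rounding the regression output to an integer from 0 to 5, binarizing at a threshold of 3, and scoring with F1 against the Llama3 annotations. The following is a minimal Python sketch of that post-processing only, not the repository's actual code. The function names, the sample values, the inclusive comparison (`>= 3`), and the choice of binary F1 on the "educational" class are all assumptions made for illustration.

```python
from typing import Sequence


def to_int_score(raw: float) -> int:
    """Round a raw regression output and clamp it to the 0-5 scale.

    Note: Python's round() rounds exact halves to the nearest even integer;
    the real pipeline may round differently (assumption).
    """
    return int(min(5, max(0, round(raw))))


def binarize(scores: Sequence[int], threshold: int = 3) -> list[int]:
    """Label a sample 1 (educational) if its score is >= threshold.

    Whether the threshold is inclusive is an assumption of this sketch.
    """
    return [int(s >= threshold) for s in scores]


def f1_binary(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 of the positive class, computed from true/false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical values: raw classifier outputs and Llama3 annotations (0-5).
raw_predictions = [0.2, 2.7, 3.4, 4.9, 1.1, 2.2]
llama3_annotations = [0, 3, 3, 5, 2, 3]

predicted_scores = [to_int_score(r) for r in raw_predictions]
f1 = f1_binary(binarize(llama3_annotations), binarize(predicted_scores))
print(predicted_scores, round(f1, 3))
```

The sketch treats the Llama3 annotations as ground truth, as the paragraph states; the 82% figure in the diff comes from the authors' validation set and is not reproduced by these toy values.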