---
license: apache-2.0
---
**Model Summary**
This fastText model is used as part of the ensemble filter in GneissWeb to detect and remove low-quality documents.
Please refer to the [GneissWeb](https://huggingface.co/datasets/ibm-granite/GneissWeb) dataset page for more details.
- **Developers**: IBM Research
- **Release Date**: Feb 10th, 2025
- **License**: Apache 2.0
**Training Data**
The model is trained on 400k documents, equally split between positive (i.e., high-quality) and negative (i.e., low-quality) classes. Please refer to the [fastText text classification tutorial](https://fasttext.cc/docs/en/python-module.html) for details.
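fastText's supervised mode expects one example per line, prefixed with a `__label__` tag. The sketch below shows that format; the label names (`__label__hq`, `__label__lq`), the example documents, and the file name are illustrative assumptions, not the actual GneissWeb training setup.

```python
# Sketch only: label names, documents, and file name are assumptions.

def to_fasttext_line(text: str, is_positive: bool) -> str:
    """Render one training example in fastText's supervised format."""
    # fastText reads one example per line ("__label__<name> <text>"),
    # so internal newlines in a document must be collapsed.
    label = "__label__hq" if is_positive else "__label__lq"
    return f"{label} {' '.join(text.split())}"

docs = [
    ("A clear, step-by-step explanation of photosynthesis.", True),
    ("click here!!! buy now limited offer", False),
]

with open("quality_train.txt", "w", encoding="utf-8") as f:
    for text, is_positive in docs:
        f.write(to_fasttext_line(text, is_positive) + "\n")

# Training is then a single call (requires the `fasttext` package):
# import fasttext
# model = fasttext.train_supervised(input="quality_train.txt")
```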
Training data is selected as follows.
- *Positive documents*: 190k synthetic documents randomly sampled from the [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) dataset, plus
10k documents with high educational value, selected as follows: first, 600k random documents from [FineWeb-V1.1.0](https://huggingface.co/datasets/HuggingFaceFW/fineweb)
are annotated by asking Mixtral-8x22B-Instruct to score each document from 1 to 5 for its educational quality (5 being the highest), using a prompt similar to the one used by [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).
Then, 10k documents are randomly selected from those with scores greater than or equal to 4.
- *Negative documents*: 200k documents randomly sampled from the 600k Mixtral-annotated documents with scores less than or equal to 2.
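The score-threshold selection above can be sketched as follows. This is a toy-scale illustration: a thousand synthetic score annotations stand in for the 600k Mixtral-annotated documents, and the sample sizes are illustrative rather than the actual 10k/200k split.

```python
import random

random.seed(0)  # reproducibility of this toy run

# Toy stand-in for the 600k Mixtral-annotated documents: (doc_id, score in 1..5).
annotated = [(f"doc-{i}", random.randint(1, 5)) for i in range(1000)]

# Thresholds mirror the description above: scores >= 4 count as high
# educational value, scores <= 2 as low quality; documents scoring 3
# are used in neither class.
positives = [doc for doc, score in annotated if score >= 4]
negatives = [doc for doc, score in annotated if score <= 2]

# Random subsampling, analogous to drawing 10k positives and 200k negatives.
pos_sample = random.sample(positives, 50)
neg_sample = random.sample(negatives, 50)
```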