Ultra-FineWeb-Classifier

πŸ“š Introduction

Ultra-FineWeb is a large-scale, high-quality, and efficiently-filtered dataset. We apply our efficient verification-based high-quality filtering pipeline to the FineWeb and Chinese FineWeb datasets (source data from Chinese FineWeb-edu-v2, which includes IndustryCorpus2, MiChao, WuDao, SkyPile, WanJuan, ChineseWebText, TeleChat, and CCI3), resulting in the higher-quality Ultra-FineWeb-en dataset with approximately 1T tokens and the Ultra-FineWeb-zh dataset with approximately 120B tokens, collectively referred to as Ultra-FineWeb. Ultra-FineWeb serves as a core pre-training web dataset for the MiniCPM4 Series models.

  • Ultra-FineWeb: Ultra-FineWeb, a large-scale, high-quality, and efficiently-filtered dataset, with 1T English tokens and 120B Chinese tokens.
  • Ultra-FineWeb-classifier: Ultra-FineWeb classifier, for filtering high-quality data from web corpora. (<-- you are here)

πŸ“’ What's New

  • [2025.05.09] Ultra-FineWeb technical report is available on arXiv. πŸ”₯πŸ”₯πŸ”₯
  • [2025.05.15] Ultra-FineWeb tops the Hugging Face Datasets Trending list, reaching the #1 spot! ⭐️⭐️⭐️
  • [2025.06.06] Ultra-FineWeb-en and Ultra-FineWeb-zh datasets are now available on Hugging Face, released alongside the MiniCPM4 Series models.
  • [2025.06.16] The Ultra-FineWeb-classifier is now available on Hugging Face: openbmb/Ultra-FineWeb-classifier. πŸš€πŸš€πŸš€

πŸ’‘ Highlights

Abstract: Data quality has become a key factor in enhancing model performance with the rapid development of large language models (LLMs). Model-driven data filtering has increasingly become a primary approach for acquiring high-quality data. However, it still faces two main challenges: (1) the lack of an efficient data verification strategy makes it difficult to provide timely feedback on data quality; and (2) the selection of seed data for training classifiers lacks clear criteria and relies heavily on human expertise, introducing a degree of subjectivity. To address the first challenge, we introduce an efficient verification strategy that enables rapid evaluation of the impact of data on LLM training with minimal computational cost. To tackle the second challenge, we build upon the assumption that high-quality seed data is beneficial for LLM training, and by integrating the proposed verification strategy, we optimize the selection of positive and negative samples and propose an efficient data filtering pipeline. This pipeline not only improves filtering efficiency, classifier quality, and robustness, but also significantly reduces experimental and inference costs. In addition, to efficiently filter high-quality data, we employ a lightweight classifier based on fastText, and successfully apply the filtering pipeline to two widely-used pre-training corpora, FineWeb and Chinese FineWeb datasets, resulting in the creation of the higher-quality Ultra-FineWeb dataset. Ultra-FineWeb contains approximately 1 trillion (T) English tokens and 120 billion (B) Chinese tokens. Empirical results demonstrate that the LLMs trained on Ultra-FineWeb exhibit significant performance improvements across multiple benchmark tasks, validating the effectiveness of our pipeline in enhancing both data quality and training efficiency.

  • Efficient Verification Strategy: We propose a computationally efficient verification strategy that enables rapid evaluation of the impact of data on LLM training performance with minimal computational cost, significantly improving the efficiency of high-quality data filtering experiments.
  • Large-Scale High-Quality Pre-training Datasets: We design and implement an efficient high-quality data filtering pipeline, applied to the FineWeb and Chinese FineWeb datasets, resulting in the creation of higher-quality datasets, which can facilitate high-quality LLM training.
  • Lightweight Classifier: The Ultra-FineWeb classifier significantly reduces inference costs, achieving superior performance on extracted text from the same data source, thus validating the effectiveness of our proposed data filtering pipeline in enhancing data quality and training efficiency.

πŸš€ Usage of Ultra-FineWeb Classifier

Single-content inference

  1. Put the content you want to classify into the scripts/local_scripts/single_content.txt file.
  2. Run the scripts/local_scripts/infer_single_content.py script to score the content:
# set the language you want to infer, support: en, zh
LANGUAGE=en
# set the tokenizer path, default: local_tokenizer
# user can also directly use "deepseek-ai/DeepSeek-V2"
TOKENIZER_PATH=local_tokenizer
# set the content file path, default: scripts/local_scripts/single_content.txt
CONTENT_FILE=scripts/local_scripts/single_content.txt

python scripts/local_scripts/infer_single_content.py --language ${LANGUAGE} --tokenizer-path ${TOKENIZER_PATH} --content-file ${CONTENT_FILE}

The result is then printed to the terminal, for example:

Content: {User's input content}

Normalized content: {Normalized content}

  - Pred label: {Pred label}
  - Pred score: {Pred score}
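Before scoring, the script normalizes the raw content into the single-line form a fastText classifier expects. A minimal sketch of that kind of normalization, assuming whitespace collapsing and lowercasing (the exact rules in infer_single_content.py may differ):

```python
import re

def normalize_content(text: str) -> str:
    """Collapse newlines, tabs, and repeated spaces into single spaces,
    and lowercase, producing the one-line input fastText expects."""
    text = text.replace("\n", " ").replace("\r", " ").replace("\t", " ")
    return re.sub(r" +", " ", text).strip().lower()

# Scoring the normalized text would then look like (not run here;
# the model path is an assumption, not taken from the repository):
#   import fasttext
#   model = fasttext.load_model("path/to/ultra_fineweb_classifier.bin")
#   labels, scores = model.predict(normalize_content(raw_text))

raw = "Data quality has become\na key factor\tin LLM training."
print(normalize_content(raw))
# β†’ data quality has become a key factor in llm training.
```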

Folder inference

Assume the input folder is data/input, the content key is content, and the output folder is data/output. Users can run the scripts/local_scripts/infer_folder.py script to process the folder:

# set the language you want to infer, support: en, zh
LANGUAGE=en
# set the data path
DATA_PATH=data/input
# set the save path
SAVE_PATH=data/output
# set the content key
CONTENT_KEY=content
# below are optional arguments
# set the tokenizer path, default: local_tokenizer
TOKENIZER_PATH=local_tokenizer
# set the processes number, default: 64
PROCESSES_NUM=64
# set the write batch size, default: 100
WRITE_BATCH_SIZE=100

python scripts/local_scripts/infer_folder.py \
    --language ${LANGUAGE} \
    --data-path ${DATA_PATH} \
    --save-path ${SAVE_PATH} \
    --content-key ${CONTENT_KEY} \
    --tokenizer-path ${TOKENIZER_PATH} \
    --processes-num ${PROCESSES_NUM} \
    --write-batch-size ${WRITE_BATCH_SIZE} \
    [--inplace]  # optional: delete previously processed data and re-process it

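Conceptually, the folder script streams JSONL records, pulls the text under --content-key, scores it, and writes results in batches of --write-batch-size. A simplified single-process sketch of that loop, with a stub in place of the fastText model (an assumed structure for illustration; the real script also parallelizes over --processes-num workers):

```python
import json
import os
import tempfile

def score_stub(text: str) -> tuple[str, float]:
    # Stand-in for the fastText classifier's (label, score) prediction.
    return ("__label__hq", 0.9) if len(text) > 20 else ("__label__lq", 0.1)

def infer_file(in_path, out_path, content_key="content", write_batch_size=100):
    """Score each JSONL record and flush results in batches."""
    batch = []
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            rec = json.loads(line)
            label, score = score_stub(rec[content_key])
            rec["pred_label"], rec["pred_score"] = label, score
            batch.append(json.dumps(rec))
            if len(batch) >= write_batch_size:
                fout.write("\n".join(batch) + "\n")
                batch = []
        if batch:  # flush the final partial batch
            fout.write("\n".join(batch) + "\n")

# demo on a temporary file
with tempfile.TemporaryDirectory() as d:
    src, dst = os.path.join(d, "in.jsonl"), os.path.join(d, "out.jsonl")
    with open(src, "w") as f:
        f.write(json.dumps({"content": "short"}) + "\n")
        f.write(json.dumps({"content": "a much longer piece of web text"}) + "\n")
    infer_file(src, dst, write_batch_size=1)
    with open(dst) as f:
        print([json.loads(l)["pred_label"] for l in f])
# β†’ ['__label__lq', '__label__hq']
```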
For Spark-based inference, we also provide scripts/spark_scripts/spark_infer.py, a demo script that users can adapt to run on a Spark cluster.
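On Spark, a natural pattern for this kind of job is to load the classifier once per partition and score records via mapPartitions. A hedged sketch of such a partition function, with the model loading stubbed out (this is an illustration of the pattern, not the contents of spark_infer.py):

```python
def load_model_stub():
    # Stand-in for loading the fastText classifier,
    # e.g. fasttext.load_model(...) in a real job.
    return lambda text: ("__label__hq", 0.9)

def score_partition(records, content_key="content"):
    """Intended for rdd.mapPartitions: load the model once per
    partition, then score every record in that partition."""
    model = load_model_stub()
    for rec in records:
        label, score = model(rec[content_key])
        rec["pred_label"], rec["pred_score"] = label, score
        yield rec

# Usage on a Spark RDD would look like (not run here):
#   scored = rdd.mapPartitions(score_partition)

print(list(score_partition([{"content": "example"}])))
```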

NOTE:

  • The installed numpy version must be lower than 2.0, as required by the fasttext package.
  • The config.json file is a placeholder; its parameters record the fastText training configuration.
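One way to satisfy the numpy constraint is to pin it in a requirements file; the entries below are illustrative, not taken from the repository:

```text
fasttext
numpy<2.0
```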

❀️ Acknowledgements

Thanks to the open-source community for their awesome work! Open-source contributions make Ultra-FineWeb possible! πŸ™Œ

🌟 Citation

If you find our work useful, please consider citing:

@misc{wang2025ultrafineweb,
  title={{Ultra-FineWeb}: Efficient Data Filtering and Verification for High-Quality LLM Training Data},
  author={Yudong Wang and Zixuan Fu and Jie Cai and Peijun Tang and Hongya Lyu and Yewei Fang and Zhi Zheng and Jie Zhou and Guoyang Zeng and Chaojun Xiao and Xu Han and Zhiyuan Liu},
  year={2025},
  eprint={2505.05427},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
}

πŸ’³ License

This project is released under the Apache 2.0 license. Please note that since Ultra-FineWeb is built from multiple datasets, users should check the LICENSE of each dataset individually to ensure proper usage and compliance.
