FineFineWeb: A Comprehensive Study on Fine-Grained Domain Web Corpus
arXiv: Coming Soon
Project Page: Coming Soon
Blog: Coming Soon
Data Statistics
Domain (#tokens/#samples) | Iteration 1 Tokens | Iteration 2 Tokens | Iteration 3 Tokens | Total Tokens | Iteration 1 Count | Iteration 2 Count | Iteration 3 Count | Total Count |
---|---|---|---|---|---|---|---|---|
aerospace | 5.77B | 261.63M | 309.33M | 6.34B | 9100000 | 688505 | 611034 | 10399539 |
agronomy | 13.08B | 947.41M | 229.04M | 14.26B | 15752828 | 2711790 | 649404 | 19114022 |
artistic | 178.25B | 5.79B | 3.75B | 187.80B | 314279703 | 16113512 | 9957104 | 340350319 |
astronomy | 5.20B | 134.39M | 54.66M | 5.38B | 7596521 | 357647 | 145832 | 8100000 |
atmospheric_science | 2.80B | 102.04M | 259.25M | 3.16B | 5709537 | 267789 | 525969 | 6503295 |
automotive | 36.72B | 436.34M | 911.65M | 38.07B | 60239679 | 1166729 | 1535882 | 62942290 |
beauty | 19.10B | 671.88M | 1.01B | 20.78B | 34787376 | 1808382 | 2201810 | 38797568 |
biology | 85.84B | 371.29M | 776.99M | 86.99B | 81413569 | 995384 | 1350348 | 83759301 |
celebrity | 9.63B | 706.41M | 4.22B | 14.56B | 19831188 | 1803788 | 7949240 | 29584216 |
chemistry | 27.80B | 588.92M | 131.46M | 28.52B | 31188189 | 1499085 | 328038 | 33015312 |
christianity | 47.72B | 403.68M | 732.55M | 48.86B | 55013147 | 1349874 | 2021458 | 58384479 |
civil_engineering | 8.85B | 1.27B | 402.91M | 10.52B | 13591632 | 2683940 | 940742 | 17216314 |
communication_engineering | 9.21B | 3.60B | 327.66M | 13.14B | 13001767 | 5959526 | 746495 | 19707788 |
computer_science_and_technology | 194.46B | 3.95B | 4.76B | 203.16B | 278420434 | 10263521 | 8654255 | 297338210 |
design | 96.58B | 3.80B | 450.00M | 100.82B | 190275603 | 16653588 | 2090515 | 209019706 |
drama_and_film | 19.12B | 10.86B | 206.27M | 30.19B | 33117478 | 18443259 | 564251 | 52124988 |
economics | 205.01B | 1.23B | 2.63B | 208.87B | 263965085 | 3874091 | 5505880 | 273345056 |
electronic_science | 30.19B | 7.76B | 482.62M | 38.43B | 42745767 | 12572747 | 1115605 | 56434119 |
entertainment | 152.92B | 1.67B | 5.06B | 159.65B | 256935144 | 5801081 | 9648023 | 272384248 |
environmental_science | 56.98B | 1.48B | 920.77M | 59.37B | 84500393 | 3557056 | 1966731 | 90024180 |
fashion | 18.72B | 977.27M | 264.01M | 19.96B | 53465628 | 3926500 | 1346988 | 58739116 |
finance | 146.39B | 327.45M | 1.13B | 147.85B | 187797764 | 1295893 | 3058801 | 192152458 |
food | 56.10B | 136.32M | 978.91M | 57.22B | 96485838 | 613875 | 3051981 | 100151694 |
gamble | 30.12B | 696.52M | 158.48M | 30.98B | 24909037 | 770540 | 164168 | 25843745 |
game | 43.47B | 2.36B | 2.68B | 48.51B | 65680699 | 4670033 | 3720700 | 74071432 |
geography | 110.18B | 1.16B | 192.67M | 111.53B | 161677214 | 3835932 | 559447 | 166072593 |
health | 191.20B | 427.93M | 18.43B | 210.06B | 215747152 | 1291215 | 23975955 | 241014322 |
history | 45.27B | 1.56B | 1.69B | 48.52B | 55710432 | 4167508 | 3463033 | 63340973 |
hobby | 150.23B | 42.78B | 44.05B | 237.06B | 276636362 | 81360893 | 71407735 | 429404990 |
hydraulic_engineering | 57.36M | 75.40M | 3.65M | 136.41M | 135079 | 163299 | 13453 | 311831 |
instrument_science | 5.35B | 2.02B | 165.43M | 7.54B | 8307736 | 2904274 | 462256 | 11674266 |
journalism_and_media_communication | 440.98B | 21.00B | 1.55B | 463.53B | 645801807 | 50657668 | 4909008 | 701368483 |
landscape_architecture | 3.07B | 557.66M | 64.76M | 3.70B | 5613141 | 1138409 | 166526 | 6918076 |
law | 128.58B | 455.19M | 2.38B | 131.42B | 166473205 | 1660944 | 6145032 | 174279181 |
library | 57.16B | 5.01B | 36.56M | 62.21B | 86592305 | 10440991 | 153014 | 97186310 |
literature | 71.07B | 7.01B | 67.53B | 145.61B | 71191075 | 13247806 | 54760578 | 139199459 |
materials_science | 17.79B | 1.11B | 303.66M | 19.20B | 22136519 | 1663376 | 708384 | 24508279 |
mathematics | 5.87B | 50.33M | 261.65M | 6.18B | 10131933 | 179592 | 653050 | 10964575 |
mechanical_engineering | 86.13B | 1.24B | 129.96M | 87.49B | 111778813 | 3201605 | 428714 | 115409132 |
medical | 140.03B | 813.46M | 4.97B | 145.81B | 149594634 | 2266477 | 8527901 | 160389012 |
mining_engineering | 7.26B | 206.05M | 529.02M | 8.00B | 5540631 | 236145 | 468458 | 6245234 |
movie | 13.09B | 639.20M | 124.67M | 13.86B | 22938808 | 1577576 | 511882 | 25028266 |
music_and_dance | 15.42B | 10.38B | 618.46M | 26.42B | 29566554 | 20233446 | 1998272 | 51798272 |
news | 328.47B | 12.37B | 11.34B | 352.18B | 508567768 | 33206709 | 23482422 | 565256899 |
nuclear_science | 559.05M | 79.89M | 78.79M | 717.72M | 784847 | 170282 | 133598 | 1088727 |
ocean_science | 2.36B | 537.82M | 229.43M | 3.13B | 3700000 | 853052 | 425792 | 4978844 |
optical_engineering | 2.33B | 253.06M | 263.99M | 2.85B | 3510836 | 535026 | 400371 | 4446233 |
painting | 374.41M | 429.63M | 96.57M | 900.61M | 875783 | 824217 | 336203 | 2036203 |
pet | 12.12B | 154.14M | 307.28M | 12.58B | 19624688 | 457635 | 778970 | 20861293 |
petroleum_and_natural_gas_engineering | 950.08M | 515.05M | 121.56M | 1.59B | 1669447 | 899860 | 237843 | 2807150 |
philosophy | 47.99B | 121.26M | 335.77M | 48.44B | 50396964 | 505275 | 1030405 | 51932644 |
photo | 6.56B | 1.74B | 41.44M | 8.34B | 16194329 | 3901598 | 179607 | 20275534 |
physics | 21.56B | 372.21M | 191.17M | 22.12B | 24640373 | 843508 | 473758 | 25957639 |
politics | 79.52B | 253.26M | 930.96M | 80.70B | 97403603 | 1026315 | 2504127 | 100934045 |
psychology | 51.53B | 688.50M | 2.56B | 54.78B | 58829917 | 1881452 | 4066667 | 64778036 |
public_administration | 100.13B | 5.54B | 716.81M | 106.39B | 160247751 | 10657768 | 1785347 | 172690866 |
relationship | 21.87B | 3.69B | 129.60M | 25.69B | 28153321 | 6794774 | 321268 | 35269363 |
sociology | 76.34B | 3.59B | 8.88B | 88.82B | 106447186 | 7836896 | 13040695 | 127324777 |
sports | 118.64B | 379.18M | 1.79B | 120.80B | 173243631 | 1286718 | 4212540 | 178742889 |
statistics | 19.59B | 1.15B | 1.75B | 22.49B | 29958726 | 2746797 | 3390606 | 36096129 |
systems_science | 24.58B | 11.30B | 163.99M | 36.05B | 32879249 | 15120751 | 470001 | 48470001 |
textile_science | 2.59B | 2.89B | 94.56M | 5.57B | 8018141 | 8022001 | 456668 | 16496810 |
topicality | 34.87M | 5.22M | 0 | 40.09M | 137789 | 13506 | 0 | 151295 |
transportation_engineering | 12.80B | 6.61B | 972.50M | 20.38B | 23595624 | 11005933 | 2027812 | 36629369 |
travel | 78.87B | 584.78M | 957.26M | 80.41B | 127250195 | 1851342 | 2430704 | 131532241 |
urban_planning | 12.13B | 2.93B | 53.24M | 15.12B | 20040937 | 6176104 | 201963 | 26419004 |
weapons_science | 80.62M | 3.32B | 140.89M | 3.54B | 215544 | 5695154 | 369541 | 6280239 |
Grand Total | 4010.76B | 206.51B | 208.02B | 4425.30B | 5781764055 | 442387964 | 311920860 | 6536072879 |
Data Construction Workflow
The data construction workflow can be summarized as follows:
Deduplicate: The FineWeb dataset is deduplicated using exact deduplication and MinHash techniques to remove redundant data.
URL Labeling: Root URLs from FineWeb are counted, and the top 1 million URLs are labeled using GPT-4. This step generates DoI (Domain-of-Interest) Coarse-Grained URLs and DoNI (Domain-of-Non-Interest) Coarse-Grained URLs as seed data sources.
Coarse Recall:
a. Based on the labeled root URLs, data is sampled for each domain.
b. The sampled data is labeled using Qwen2-7B-Instruct, producing 500K DoI Positive Data and 500K DoI Negative Data. (For every iteration after the first, each set of 500K samples consists of 250K samples drawn from the original seed data and 250K refined samples obtained from Fine Recall.)
c. A binary FastText model is trained per domain using the labeled data.
d. The FastText model performs coarse recall on FineWeb, generating Coarse DoI Data.
Fine Recall:
a. The Coarse DoI Data is labeled using Qwen2-72B-Instruct to produce 100K DoI Positive Data and 50K DoI Negative Data, with the latter further augmented with 50K negative samples from earlier FastText training.
b. A BERT model is trained using this labeled data.
c. The BERT model performs fine recall on the Coarse DoI Data, producing a refined dataset, which is the DoI subset of FineFineWeb.
Coarse-Fine Recall Iteration: The workflow of coarse and fine recall iterates for 3 rounds with the following adjustments:
a. FastText is re-trained using updated seed data, which combines BERT-recalled samples, BERT-dropped samples, and previously labeled seed data.
b. The BERT model remains frozen during subsequent iterations.
c. Steps for training FastText, coarse recall, and fine recall are repeated without re-labeling data with Qwen2-Instruct models.
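The deduplication step at the start of the workflow combines exact matching with MinHash. The following is a minimal sketch of that idea, not the pipeline's actual implementation: the function names, the word 3-gram shingles, the 64 hash functions, the 0.8 similarity threshold, and the pairwise comparison loop are all illustrative assumptions. A production system would typically use locality-sensitive hashing instead of comparing every pair of signatures.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items, num_perm=64):
    """Keep the minimum 64-bit hash of the set under each salted hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Drop exact duplicates by content hash, then near-duplicates by MinHash."""
    seen_hashes = set()
    kept, kept_sigs = [], []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:  # exact duplicate
            continue
        seen_hashes.add(digest)
        sig = minhash_signature(shingles(doc))
        # Near duplicate of a document that was already kept (O(n^2) comparison).
        if any(estimated_jaccard(sig, other) >= threshold for other in kept_sigs):
            continue
        kept.append(doc)
        kept_sigs.append(sig)
    return kept
```

Calling `deduplicate(list_of_texts)` returns the documents in their original order with exact and near duplicates removed; lowering `threshold` makes the near-duplicate filter more aggressive.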
Domain-Domain Similarity Analysis
- Perform proportional weighted sampling of the domain subsets based on the sample size of each domain, with a total of 1 billion tokens sampled from the domain subsets.
- Use the BGE-M3 model to compute the embeddings of the samples in each domain subset, referred to as domain embeddings.
- Use the BGE-M3 model to compute the embeddings of the samples in each benchmark, referred to as benchmark embeddings (bench embeddings).
- Calculate the MMD distance and the Wasserstein distance between the domain embeddings and the benchmark embeddings.
The results above reveal the following observations:
- The two code-related benchmarks, MBPP and HumanEval, exhibit relatively large distances from nearly all domains, indicating that the proportion of code data in the training set is relatively small. Notably, their distance to the mathematics domain is comparatively smaller, suggesting a certain degree of overlap between mathematics data and code data.
- Benchmarks such as Hellaswag, ARC, MMLU, and BoolQ have distances that are close to almost all domains, except for the gamble domain. This indicates that the samples in these benchmarks involve synergetic effects across multiple domains of knowledge, with a wide distribution.
- GSM8K and TriviaQA show significant discrepancies with a small number of domains, suggesting that the distribution differences between domains are more pronounced for samples involving grade-school mathematics and fact-based question answering. Some domains contain a substantial amount of this type of data, while others do not.
- The gamble domain exhibits substantial differences from other domains and has large distances from all benchmarks, indicating that pretraining data related to gambling provides limited benefits for these benchmarks.
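The MMD distance used in this analysis compares two sets of embeddings through kernel averages. Below is a minimal sketch of a biased squared-MMD estimate, assuming an RBF kernel with a hand-picked `gamma` and embeddings given as plain lists of floats. The kernel choice, the bandwidth, and the function names are assumptions made for illustration. In the analysis itself the embeddings come from BGE-M3, and the Wasserstein distance is not covered here.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two samples of vectors.

    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 * E[k(x, y)]
    """
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, gamma) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy stand-ins for domain embeddings and benchmark embeddings.
domain_emb = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
bench_near = [(0.05, 0.05), (0.1, 0.1), (0.0, 0.05)]
bench_far = [(3.0, 3.0), (3.1, 3.0), (3.0, 3.1)]

near_score = mmd_squared(domain_emb, bench_near)
far_score = mmd_squared(domain_emb, bench_far)
```

In this toy setup `far_score` is larger than `near_score`, mirroring how a benchmark whose embeddings lie far from a domain's embeddings receives a larger distance.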
Domain-Domain Duplication
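Let $D_1, D_2, \dots, D_N$ represent $N$ distinct domains, where we select the top-20 URLs for each domain $D_i$, denoted as $\{U_{i1}, U_{i2}, \dots, U_{i20}\}$. The total set of URLs across all domains is represented as $\mathcal{U}$, and the total number of URLs is $M = |\mathcal{U}|$.
For each URL $U_k \in \mathcal{U}$, the term frequency (TF) is defined as the proportion of $U_k$ in the total set of URLs:
$$\text{TF}(U_k) = \frac{\text{count}(U_k)}{M}$$
where $\text{count}(U_k)$ is the number of times $U_k$ appears in $\mathcal{U}$. Additionally, the document frequency $K_k$ of $U_k$ is the number of domains in which $U_k$ appears. Based on this, the inverse document frequency (IDF) is calculated as:
$$\text{IDF}(U_k) = \log\left(\frac{N}{K_k}\right)$$
The TF-IDF value for each URL $U_{ij}$ in a specific domain $D_i$ is then computed as:
$$\text{TF-IDF}(U_{ij}) = \text{TF}(U_{ij}) \times \text{IDF}(U_{ij})$$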
Using the TF-IDF values of all URLs within a domain, the domain-domain duplicate rate can be analyzed by comparing the distribution of TF-IDF values across domains. If a domain has many URLs with high TF-IDF values, it indicates that the domain's URLs are relatively unique and significant within the entire set of URLs. Conversely, if a domain has many URLs with low TF-IDF values, it suggests that the domain's URLs are more common across other domains. Analyzing these values helps assess how similar or redundant a domain's content is in relation to others based on its URL composition.
As shown in the figure, most domains have low duplication rates, except for topicality, pet, and atmospheric science.
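The URL-level TF-IDF computation described in this section can be sketched as follows. This is an illustrative sketch only: the function name, the input layout (a dict from domain name to its list of top URLs), and the `log(N / document frequency)` form of IDF are assumptions, and the toy URLs are made up.

```python
import math
from collections import Counter


def url_tfidf(domain_urls):
    """Compute TF-IDF for each top URL of each domain.

    domain_urls: dict mapping domain name -> list of that domain's top URLs.
    Returns: dict mapping domain name -> {url: tf-idf value}.
    """
    n_domains = len(domain_urls)
    all_urls = [url for urls in domain_urls.values() for url in urls]
    total = len(all_urls)

    # TF: share of a URL among all selected URLs across domains.
    counts = Counter(all_urls)
    # Document frequency: number of domains whose top URLs include the URL.
    doc_freq = Counter(url for urls in domain_urls.values() for url in set(urls))

    return {
        domain: {
            url: (counts[url] / total) * math.log(n_domains / doc_freq[url])
            for url in urls
        }
        for domain, urls in domain_urls.items()
    }


# Hypothetical toy input: two domains sharing one root URL.
toy = {
    "physics": ["physics.example", "shared.example"],
    "chemistry": ["chemistry.example", "shared.example"],
}
scores = url_tfidf(toy)
```

In the toy input, the URL that appears in both domains receives a TF-IDF of zero, while each domain-specific URL receives a positive value, which matches the interpretation that low values signal URLs shared across domains.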
Domain-Benchmark BPC-Acc Correlation
Experimental method: Using 28 models (see the paper), we first calculate BPC for all domains to obtain a model ranking $R_D$. Similarly, we compute scores across all benchmarks to obtain a model ranking $R_M$. We then calculate the Spearman correlation between $R_D$ and $R_M$.
- For benchmarks like ARC, MMLU, GSM8K, HumanEval, and MBPP, STEM-related domains show higher correlation rankings, particularly mathematics, physics, and systems science.
- For TriviaQA, which emphasizes factual knowledge over reasoning, domains rich in world knowledge such as literature, history, and library science demonstrate higher correlation rankings.
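The rank-correlation step of the experimental method can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names and the toy numbers are hypothetical, and BPC values are negated before ranking on the assumption that lower BPC is better while higher benchmark accuracy is better.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy values for four hypothetical models: per-domain BPC and benchmark accuracy.
bpc = [0.90, 0.80, 0.70, 0.60]
accuracy = [0.30, 0.40, 0.50, 0.60]
correlation = spearman([-b for b in bpc], accuracy)
```

Repeating this for every domain and benchmark pair yields a correlation matrix, from which the highest-correlated domains per benchmark can be read off.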
Bibtex
@misc{
title={FineFineWeb: A Comprehensive Study on Fine-grained Domain Web Corpus},
url={https://huggingface.co/datasets/m-a-p/FineFineWeb},
author = {M-A-P, Ge Zhang*, Xinrun Du*, Zhimiao Yu*, Zili Wang*, Zekun Wang, Shuyue Guo, Tianyu Zheng, Kang Zhu, Jerry Liu, Shawn Yue, Binbin Liu, Zhongyuan Peng, Yifan Yao, Jack Yang, Ziming Li, Bingni Zhang, Minghao Liu, Tianyu Liu, Yang Gao, Wenhu Chen, Xiaohuan Zhou, Qian Liu, Taifeng Wang+, Wenhao Huang+},
publisher={huggingface},
version={v0.1.0},
month={December},
year={2024}
}