
FineFineWeb: A Comprehensive Study on Fine-Grained Domain Web Corpus

arXiv: Coming Soon

Project Page: Coming Soon

Blog: Coming Soon

Data Statistics

| Domain | Iteration 1 Tokens | Iteration 2 Tokens | Iteration 3 Tokens | Total Tokens | Iteration 1 Count | Iteration 2 Count | Iteration 3 Count | Total Count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| aerospace | 5.77B | 261.63M | 309.33M | 6.34B | 9100000 | 688505 | 611034 | 10399539 |
| agronomy | 13.08B | 947.41M | 229.04M | 14.26B | 15752828 | 2711790 | 649404 | 19114022 |
| artistic | 178.25B | 5.79B | 3.75B | 187.80B | 314279703 | 16113512 | 9957104 | 340350319 |
| astronomy | 5.20B | 134.39M | 54.66M | 5.38B | 7596521 | 357647 | 145832 | 8100000 |
| atmospheric_science | 2.80B | 102.04M | 259.25M | 3.16B | 5709537 | 267789 | 525969 | 6503295 |
| automotive | 36.72B | 436.34M | 911.65M | 38.07B | 60239679 | 1166729 | 1535882 | 62942290 |
| beauty | 19.10B | 671.88M | 1.01B | 20.78B | 34787376 | 1808382 | 2201810 | 38797568 |
| biology | 85.84B | 371.29M | 776.99M | 86.99B | 81413569 | 995384 | 1350348 | 83759301 |
| celebrity | 9.63B | 706.41M | 4.22B | 14.56B | 19831188 | 1803788 | 7949240 | 29584216 |
| chemistry | 27.80B | 588.92M | 131.46M | 28.52B | 31188189 | 1499085 | 328038 | 33015312 |
| christianity | 47.72B | 403.68M | 732.55M | 48.86B | 55013147 | 1349874 | 2021458 | 58384479 |
| civil_engineering | 8.85B | 1.27B | 402.91M | 10.52B | 13591632 | 2683940 | 940742 | 17216314 |
| communication_engineering | 9.21B | 3.60B | 327.66M | 13.14B | 13001767 | 5959526 | 746495 | 19707788 |
| computer_science_and_technology | 194.46B | 3.95B | 4.76B | 203.16B | 278420434 | 10263521 | 8654255 | 297338210 |
| design | 96.58B | 3.80B | 450.00M | 100.82B | 190275603 | 16653588 | 2090515 | 209019706 |
| drama_and_film | 19.12B | 10.86B | 206.27M | 30.19B | 33117478 | 18443259 | 564251 | 52124988 |
| economics | 205.01B | 1.23B | 2.63B | 208.87B | 263965085 | 3874091 | 5505880 | 273345056 |
| electronic_science | 30.19B | 7.76B | 482.62M | 38.43B | 42745767 | 12572747 | 1115605 | 56434119 |
| entertainment | 152.92B | 1.67B | 5.06B | 159.65B | 256935144 | 5801081 | 9648023 | 272384248 |
| environmental_science | 56.98B | 1.48B | 920.77M | 59.37B | 84500393 | 3557056 | 1966731 | 90024180 |
| fashion | 18.72B | 977.27M | 264.01M | 19.96B | 53465628 | 3926500 | 1346988 | 58739116 |
| finance | 146.39B | 327.45M | 1.13B | 147.85B | 187797764 | 1295893 | 3058801 | 192152458 |
| food | 56.10B | 136.32M | 978.91M | 57.22B | 96485838 | 613875 | 3051981 | 100151694 |
| gamble | 30.12B | 696.52M | 158.48M | 30.98B | 24909037 | 770540 | 164168 | 25843745 |
| game | 43.47B | 2.36B | 2.68B | 48.51B | 65680699 | 4670033 | 3720700 | 74071432 |
| geography | 110.18B | 1.16B | 192.67M | 111.53B | 161677214 | 3835932 | 559447 | 166072593 |
| health | 191.20B | 427.93M | 18.43B | 210.06B | 215747152 | 1291215 | 23975955 | 241014322 |
| history | 45.27B | 1.56B | 1.69B | 48.52B | 55710432 | 4167508 | 3463033 | 63340973 |
| hobby | 150.23B | 42.78B | 44.05B | 237.06B | 276636362 | 81360893 | 71407735 | 429404990 |
| hydraulic_engineering | 57.36M | 75.40M | 3.65M | 136.41M | 135079 | 163299 | 13453 | 311831 |
| instrument_science | 5.35B | 2.02B | 165.43M | 7.54B | 8307736 | 2904274 | 462256 | 11674266 |
| journalism_and_media_communication | 440.98B | 21.00B | 1.55B | 463.53B | 645801807 | 50657668 | 4909008 | 701368483 |
| landscape_architecture | 3.07B | 557.66M | 64.76M | 3.70B | 5613141 | 1138409 | 166526 | 6918076 |
| law | 128.58B | 455.19M | 2.38B | 131.42B | 166473205 | 1660944 | 6145032 | 174279181 |
| library | 57.16B | 5.01B | 36.56M | 62.21B | 86592305 | 10440991 | 153014 | 97186310 |
| literature | 71.07B | 7.01B | 67.53B | 145.61B | 71191075 | 13247806 | 54760578 | 139199459 |
| materials_science | 17.79B | 1.11B | 303.66M | 19.20B | 22136519 | 1663376 | 708384 | 24508279 |
| mathematics | 5.87B | 50.33M | 261.65M | 6.18B | 10131933 | 179592 | 653050 | 10964575 |
| mechanical_engineering | 86.13B | 1.24B | 129.96M | 87.49B | 111778813 | 3201605 | 428714 | 115409132 |
| medical | 140.03B | 813.46M | 4.97B | 145.81B | 149594634 | 2266477 | 8527901 | 160389012 |
| mining_engineering | 7.26B | 206.05M | 529.02M | 8.00B | 5540631 | 236145 | 468458 | 6245234 |
| movie | 13.09B | 639.20M | 124.67M | 13.86B | 22938808 | 1577576 | 511882 | 25028266 |
| music_and_dance | 15.42B | 10.38B | 618.46M | 26.42B | 29566554 | 20233446 | 1998272 | 51798272 |
| news | 328.47B | 12.37B | 11.34B | 352.18B | 508567768 | 33206709 | 23482422 | 565256899 |
| nuclear_science | 559.05M | 79.89M | 78.79M | 717.72M | 784847 | 170282 | 133598 | 1088727 |
| ocean_science | 2.36B | 537.82M | 229.43M | 3.13B | 3700000 | 853052 | 425792 | 4978844 |
| optical_engineering | 2.33B | 253.06M | 263.99M | 2.85B | 3510836 | 535026 | 400371 | 4446233 |
| painting | 374.41M | 429.63M | 96.57M | 900.61M | 875783 | 824217 | 336203 | 2036203 |
| pet | 12.12B | 154.14M | 307.28M | 12.58B | 19624688 | 457635 | 778970 | 20861293 |
| petroleum_and_natural_gas_engineering | 950.08M | 515.05M | 121.56M | 1.59B | 1669447 | 899860 | 237843 | 2807150 |
| philosophy | 47.99B | 121.26M | 335.77M | 48.44B | 50396964 | 505275 | 1030405 | 51932644 |
| photo | 6.56B | 1.74B | 41.44M | 8.34B | 16194329 | 3901598 | 179607 | 20275534 |
| physics | 21.56B | 372.21M | 191.17M | 22.12B | 24640373 | 843508 | 473758 | 25957639 |
| politics | 79.52B | 253.26M | 930.96M | 80.70B | 97403603 | 1026315 | 2504127 | 100934045 |
| psychology | 51.53B | 688.50M | 2.56B | 54.78B | 58829917 | 1881452 | 4066667 | 64778036 |
| public_administration | 100.13B | 5.54B | 716.81M | 106.39B | 160247751 | 10657768 | 1785347 | 172690866 |
| relationship | 21.87B | 3.69B | 129.60M | 25.69B | 28153321 | 6794774 | 321268 | 35269363 |
| sociology | 76.34B | 3.59B | 8.88B | 88.82B | 106447186 | 7836896 | 13040695 | 127324777 |
| sports | 118.64B | 379.18M | 1.79B | 120.80B | 173243631 | 1286718 | 4212540 | 178742889 |
| statistics | 19.59B | 1.15B | 1.75B | 22.49B | 29958726 | 2746797 | 3390606 | 36096129 |
| systems_science | 24.58B | 11.30B | 163.99M | 36.05B | 32879249 | 15120751 | 470001 | 48470001 |
| textile_science | 2.59B | 2.89B | 94.56M | 5.57B | 8018141 | 8022001 | 456668 | 16496810 |
| topicality | 34.87M | 5.22M | 0 | 40.09M | 137789 | 13506 | 0 | 151295 |
| transportation_engineering | 12.80B | 6.61B | 972.50M | 20.38B | 23595624 | 11005933 | 2027812 | 36629369 |
| travel | 78.87B | 584.78M | 957.26M | 80.41B | 127250195 | 1851342 | 2430704 | 131532241 |
| urban_planning | 12.13B | 2.93B | 53.24M | 15.12B | 20040937 | 6176104 | 201963 | 26419004 |
| weapons_science | 80.62M | 3.32B | 140.89M | 3.54B | 215544 | 5695154 | 369541 | 6280239 |
| Grand Total | 4010.76B | 206.51B | 208.02B | 4425.30B | 5781764055 | 442387964 | 311920860 | 6536072879 |

Data Construction Workflow

*Figure: FineFineWeb data construction workflow*

The data construction workflow can be summarized as follows:

  1. Deduplication: The FineWeb dataset is deduplicated using exact deduplication and MinHash techniques to remove redundant data.

  2. URL Labeling: Root URLs from FineWeb are counted, and the top 1 million URLs are labeled using GPT-4. This step generates DoI (Domain-of-Interest) Coarse-Grained URLs and DoNI (Domain-of-Non-Interest) Coarse-Grained URLs as seed data sources.

  3. Coarse Recall:

    a. Based on the labeled root URLs, data is sampled for each domain.

    b. The sampled data is labeled using Qwen2-7B-Instruct, producing 500K DoI Positive Data and 500K DoI Negative Data (for iterations after the first, each 500K set consists of 250K newly sampled seed data and 250K refined data from the previous Fine Recall).

    c. A binary FastText model is trained per domain using the labeled data (a minimal sketch follows this list).

    d. The FastText model performs coarse recall on FineWeb, generating Coarse DoI Data.

  4. Fine Recall:

    a. The Coarse DoI Data is labeled using Qwen2-72B-Instruct to produce 100K DoI Positive Data and 50K DoI Negative Data, with the latter further augmented with 50K negative samples from earlier FastText training.

    b. A BERT model is trained using this labeled data.

    c. The BERT model performs fine recall on the Coarse DoI Data, producing a refined dataset, which is the DoI subset of FineFineWeb.

  5. Coarse-Fine Recall Iteration: The workflow of coarse and fine recall iterates for 3 rounds with the following adjustments:

    a. FastText is re-trained using updated seed data, which combines BERT-recalled samples, BERT-dropped samples, and previously labeled seed data.

    b. The BERT model remains frozen during subsequent iterations.

    c. The FastText training, coarse recall, and fine recall steps are repeated without re-labeling data with the Qwen2-Instruct models.
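The per-domain classifier in step 3 can be sketched with the `fasttext` library as below. This is a minimal illustration, not the paper's exact configuration: the hyperparameters, file names, and the `__label__doi` / `__label__doni` label scheme are assumptions.

```python
# Minimal sketch of steps 3c-3d: train a binary DoI/DoNI FastText
# classifier from Qwen2-labeled seed data, then use it for coarse recall.
# Hyperparameters and file names are illustrative assumptions.
import fasttext

def train_domain_classifier(train_file: str, model_path: str):
    # train_file holds one document per line in fastText format, e.g.
    # "__label__doi <document text>" / "__label__doni <document text>"
    model = fasttext.train_supervised(
        input=train_file,
        epoch=5,
        lr=0.1,
        wordNgrams=2,
    )
    model.save_model(model_path)
    return model

def coarse_recall(model, documents):
    # Keep documents the classifier predicts as domain-of-interest.
    kept = []
    for doc in documents:
        labels, probs = model.predict(doc.replace("\n", " "))
        if labels[0] == "__label__doi":
            kept.append((doc, float(probs[0])))
    return kept
```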

Domain-Domain Similarity Analysis

  1. Perform proportional weighted sampling over the domain subsets based on each domain's sample size, drawing a total of 1 billion tokens.
  2. Use the BGE-M3 model to compute the embeddings of the samples in each domain subset, referred to as domain embeddings.
  3. Use the BGE-M3 model to compute the embeddings of the samples in each benchmark, referred to as benchmark embeddings (bench embeddings).
  4. Calculate the MMD distance and the Wasserstein distance between the domain embeddings and the benchmark embeddings (a minimal sketch follows this list).
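A minimal sketch of steps 2-4 follows. Using FlagEmbedding's `BGEM3FlagModel` is one common way to obtain BGE-M3 embeddings; the RBF kernel bandwidth for MMD and the per-dimension 1-D Wasserstein averaging are illustrative assumptions rather than the paper's exact estimators.

```python
# Sketch: distances between domain embeddings and benchmark embeddings.
import numpy as np
from scipy.stats import wasserstein_distance

# Embeddings (steps 2-3) can be computed with BGE-M3, e.g.:
#   from FlagEmbedding import BGEM3FlagModel
#   model = BGEM3FlagModel("BAAI/bge-m3")
#   domain_emb = model.encode(domain_texts)["dense_vecs"]
#   bench_emb = model.encode(bench_texts)["dense_vecs"]

def mmd_rbf(X: np.ndarray, Y: np.ndarray, gamma: float = 1.0) -> float:
    """Squared maximum mean discrepancy with an RBF kernel
    (gamma is an assumed bandwidth, not a value from the paper)."""
    def k(A, B):
        sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
        return np.exp(-gamma * sq)
    return float(k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean())

def avg_wasserstein(X: np.ndarray, Y: np.ndarray) -> float:
    """1-D Wasserstein distance averaged over embedding dimensions,
    a cheap proxy for the distance between two embedding distributions."""
    return float(np.mean([wasserstein_distance(X[:, d], Y[:, d])
                          for d in range(X.shape[1])]))
```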

*Figure: Domain-benchmark similarity (MMD and Wasserstein distances between domain and benchmark embeddings)*

The results above reveal the following observations:

  1. The two code-related benchmarks, MBPP and HumanEval, exhibit relatively large distances from nearly all domains, indicating that the proportion of code data in the training set is relatively small. Notably, their distance to the mathematics domain is comparatively smaller, suggesting a certain degree of overlap between mathematics data and code data.
  2. Benchmarks such as HellaSwag, ARC, MMLU, and BoolQ are relatively close to almost all domains, except for the gamble domain. This indicates that the samples in these benchmarks draw on knowledge from multiple domains at once and are widely distributed.
  3. GSM8K and TriviaQA show significant discrepancies with a small number of domains, suggesting that the distribution differences between domains are more pronounced for samples involving grade-school mathematics and fact-based question answering. Some domains contain a substantial amount of this type of data, while others do not.
  4. The gamble domain exhibits substantial differences from other domains and has large distances from all benchmarks, indicating that pretraining data related to gambling provides limited benefits for these benchmarks.

Domain-Domain Duplication

Let $D_1, D_2, \dots, D_N$ represent $N$ distinct domains. For each domain $D_i$, we select the top-20 URLs, denoted as $\{U_{i1}, U_{i2}, \dots, U_{i20}\}$. The total set of URLs across all domains is represented as $\mathcal{U}$, and the total number of URLs is $M = |\mathcal{U}|$.

For each URL $U_k \in \mathcal{U}$, the term frequency (TF) is defined as the proportion of $U_k$ in the total set of URLs:

$$\text{TF}(U_k) = \frac{\text{count}(U_k)}{M}$$

where $\text{count}(U_k)$ is the number of times $U_k$ appears in $\mathcal{U}$. Additionally, the document frequency $K_k$ of $U_k$ is the number of domains in which $U_k$ appears. Based on this, the inverse document frequency (IDF) is calculated as:

$$\text{IDF}(U_k) = \log\left(\frac{N}{K_k}\right)$$

The TF-IDF value for each URL $U_{ij}$ in a specific domain $D_i$ is then computed as:

$$\text{TF-IDF}(U_{ij}) = \text{TF}(U_{ij}) \times \text{IDF}(U_{ij})$$
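The definition above translates directly into code. A minimal sketch, where `domain_urls` is a hypothetical mapping from each domain to its top-20 root URLs:

```python
# Sketch of the URL TF-IDF computation defined above.
import math
from collections import Counter

def url_tfidf(domain_urls: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    all_urls = [u for urls in domain_urls.values() for u in urls]
    M = len(all_urls)                  # M = |U|, total number of URLs
    count = Counter(all_urls)          # count(U_k) over all domains
    N = len(domain_urls)               # number of domains
    # Document frequency K_k: number of domains in which U_k appears.
    df = Counter(u for urls in domain_urls.values() for u in set(urls))
    return {
        domain: {u: (count[u] / M) * math.log(N / df[u]) for u in urls}
        for domain, urls in domain_urls.items()
    }
```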

*Figure: Domain-domain URL duplication (TF-IDF distributions across domains)*

Using the TF-IDF values of all URLs within a domain, the domain-domain duplicate rate can be analyzed by comparing the distribution of TF-IDF values across domains. If a domain has many URLs with high TF-IDF values, it indicates that the domain’s URLs are relatively unique and significant within the entire set of URLs. Conversely, if a domain has many URLs with low TF-IDF values, it suggests that the domain's URLs are more common across other domains. Analyzing these values helps assess how similar or redundant a domain's content is in relation to others based on its URL composition.

As shown in the figure, most domains have low duplication rates, except for topicality, pet, and atmospheric science.

Domain-Benchmark BPC-Acc Correlation

Experimental method: Using 28 models (see the paper), we first calculate BPC for all domains to obtain a model ranking $R_D$. Similarly, we compute scores across all benchmarks to obtain a model ranking $R_M$. We then calculate the Spearman correlation between $R_D$ and $R_M$.
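A minimal sketch of this computation with `scipy.stats.spearmanr`, assuming the per-model BPC values for one domain and the per-model scores for one benchmark are already available (the variable names are hypothetical):

```python
# Sketch: Spearman correlation between a domain's BPC-based model
# ranking and a benchmark's score-based model ranking.
from scipy.stats import spearmanr

def rank_correlation(domain_bpc, bench_scores):
    # domain_bpc[i]: bits-per-character of model i on the domain (lower
    # is better); bench_scores[i]: the same model's benchmark score
    # (higher is better). Negate BPC so both rankings run in the same
    # direction before correlating.
    rho, _ = spearmanr([-b for b in domain_bpc], bench_scores)
    return rho
```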

*Figure: Domain-benchmark BPC-Acc correlation*

  • For benchmarks like ARC, MMLU, GSM8K, HumanEval, and MBPP, STEM-related domains show higher correlation rankings, particularly mathematics, physics, and systems science.
  • For TriviaQA, which emphasizes factual knowledge over reasoning, domains rich in world knowledge such as literature, history, and library science demonstrate higher correlation rankings.

Bibtex

@misc{finefineweb,
title={FineFineWeb: A Comprehensive Study on Fine-grained Domain Web Corpus},
url={https://huggingface.co/datasets/m-a-p/FineFineWeb},
author = {M-A-P, Ge Zhang*, Xinrun Du*, Zhimiao Yu*, Zili Wang*, Zekun Wang, Shuyue Guo, Tianyu Zheng, Kang Zhu, Jerry Liu, Shawn Yue, Binbin Liu, Zhongyuan Peng, Yifan Yao, Jack Yang, Ziming Li, Bingni Zhang, Minghao Liu, Tianyu Liu, Yang Gao, Wenhu Chen, Xiaohuan Zhou, Qian Liu, Taifeng Wang+, Wenhao Huang+},
publisher={huggingface},
version={v0.1.0},
month={December},
year={2024}
}