
FineFineWeb: A Comprehensive Study on Fine-Grained Domain Web Corpus

arXiv: Coming Soon

Project Page: Coming Soon

Blog: Coming Soon

Data Statistics

| Domain | Iteration 1 Tokens | Iteration 2 Tokens | Iteration 3 Tokens | Total Tokens | Iteration 1 Count | Iteration 2 Count | Iteration 3 Count | Total Count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| aerospace | 5.77B | 261.63M | 309.33M | 6.34B | 9100000 | 688505 | 611034 | 10399539 |
| agronomy | 13.08B | 947.41M | 229.04M | 14.26B | 15752828 | 2711790 | 649404 | 19114022 |
| artistic | 178.25B | 5.79B | 3.75B | 187.80B | 314279703 | 16113512 | 9957104 | 340350319 |
| astronomy | 5.20B | 134.39M | 54.66M | 5.38B | 7596521 | 357647 | 145832 | 8100000 |
| atmospheric_science | 2.80B | 102.04M | 259.25M | 3.16B | 5709537 | 267789 | 525969 | 6503295 |
| automotive | 36.72B | 436.34M | 911.65M | 38.07B | 60239679 | 1166729 | 1535882 | 62942290 |
| beauty | 19.10B | 671.88M | 1.01B | 20.78B | 34787376 | 1808382 | 2201810 | 38797568 |
| biology | 85.84B | 371.29M | 776.99M | 86.99B | 81413569 | 995384 | 1350348 | 83759301 |
| celebrity | 9.63B | 706.41M | 4.22B | 14.56B | 19831188 | 1803788 | 7949240 | 29584216 |
| chemistry | 27.80B | 588.92M | 131.46M | 28.52B | 31188189 | 1499085 | 328038 | 33015312 |
| christianity | 47.72B | 403.68M | 732.55M | 48.86B | 55013147 | 1349874 | 2021458 | 58384479 |
| civil_engineering | 8.85B | 1.27B | 402.91M | 10.52B | 13591632 | 2683940 | 940742 | 17216314 |
| communication_engineering | 9.21B | 3.60B | 327.66M | 13.14B | 13001767 | 5959526 | 746495 | 19707788 |
| computer_science_and_technology | 194.46B | 3.95B | 4.76B | 203.16B | 278420434 | 10263521 | 8654255 | 297338210 |
| design | 96.58B | 3.80B | 450.00M | 100.82B | 190275603 | 16653588 | 2090515 | 209019706 |
| drama_and_film | 19.12B | 10.86B | 206.27M | 30.19B | 33117478 | 18443259 | 564251 | 52124988 |
| economics | 205.01B | 1.23B | 2.63B | 208.87B | 263965085 | 3874091 | 5505880 | 273345056 |
| electronic_science | 30.19B | 7.76B | 482.62M | 38.43B | 42745767 | 12572747 | 1115605 | 56434119 |
| entertainment | 152.92B | 1.67B | 5.06B | 159.65B | 256935144 | 5801081 | 9648023 | 272384248 |
| environmental_science | 56.98B | 1.48B | 920.77M | 59.37B | 84500393 | 3557056 | 1966731 | 90024180 |
| fashion | 18.72B | 977.27M | 264.01M | 19.96B | 53465628 | 3926500 | 1346988 | 58739116 |
| finance | 146.39B | 327.45M | 1.13B | 147.85B | 187797764 | 1295893 | 3058801 | 192152458 |
| food | 56.10B | 136.32M | 978.91M | 57.22B | 96485838 | 613875 | 3051981 | 100151694 |
| gamble | 30.12B | 696.52M | 158.48M | 30.98B | 24909037 | 770540 | 164168 | 25843745 |
| game | 43.47B | 2.36B | 2.68B | 48.51B | 65680699 | 4670033 | 3720700 | 74071432 |
| geography | 110.18B | 1.16B | 192.67M | 111.53B | 161677214 | 3835932 | 559447 | 166072593 |
| health | 191.20B | 427.93M | 18.43B | 210.06B | 215747152 | 1291215 | 23975955 | 241014322 |
| history | 45.27B | 1.56B | 1.69B | 48.52B | 55710432 | 4167508 | 3463033 | 63340973 |
| hobby | 150.23B | 42.78B | 44.05B | 237.06B | 276636362 | 81360893 | 71407735 | 429404990 |
| hydraulic_engineering | 57.36M | 75.40M | 3.65M | 136.41M | 135079 | 163299 | 13453 | 311831 |
| instrument_science | 5.35B | 2.02B | 165.43M | 7.54B | 8307736 | 2904274 | 462256 | 11674266 |
| journalism_and_media_communication | 440.98B | 21.00B | 1.55B | 463.53B | 645801807 | 50657668 | 4909008 | 701368483 |
| landscape_architecture | 3.07B | 557.66M | 64.76M | 3.70B | 5613141 | 1138409 | 166526 | 6918076 |
| law | 128.58B | 455.19M | 2.38B | 131.42B | 166473205 | 1660944 | 6145032 | 174279181 |
| library | 57.16B | 5.01B | 36.56M | 62.21B | 86592305 | 10440991 | 153014 | 97186310 |
| literature | 71.07B | 7.01B | 67.53B | 145.61B | 71191075 | 13247806 | 54760578 | 139199459 |
| materials_science | 17.79B | 1.11B | 303.66M | 19.20B | 22136519 | 1663376 | 708384 | 24508279 |
| mathematics | 5.87B | 50.33M | 261.65M | 6.18B | 10131933 | 179592 | 653050 | 10964575 |
| mechanical_engineering | 86.13B | 1.24B | 129.96M | 87.49B | 111778813 | 3201605 | 428714 | 115409132 |
| medical | 140.03B | 813.46M | 4.97B | 145.81B | 149594634 | 2266477 | 8527901 | 160389012 |
| mining_engineering | 7.26B | 206.05M | 529.02M | 8.00B | 5540631 | 236145 | 468458 | 6245234 |
| movie | 13.09B | 639.20M | 124.67M | 13.86B | 22938808 | 1577576 | 511882 | 25028266 |
| music_and_dance | 15.42B | 10.38B | 618.46M | 26.42B | 29566554 | 20233446 | 1998272 | 51798272 |
| news | 328.47B | 12.37B | 11.34B | 352.18B | 508567768 | 33206709 | 23482422 | 565256899 |
| nuclear_science | 559.05M | 79.89M | 78.79M | 717.72M | 784847 | 170282 | 133598 | 1088727 |
| ocean_science | 2.36B | 537.82M | 229.43M | 3.13B | 3700000 | 853052 | 425792 | 4978844 |
| optical_engineering | 2.33B | 253.06M | 263.99M | 2.85B | 3510836 | 535026 | 400371 | 4446233 |
| painting | 374.41M | 429.63M | 96.57M | 900.61M | 875783 | 824217 | 336203 | 2036203 |
| pet | 12.12B | 154.14M | 307.28M | 12.58B | 19624688 | 457635 | 778970 | 20861293 |
| petroleum_and_natural_gas_engineering | 950.08M | 515.05M | 121.56M | 1.59B | 1669447 | 899860 | 237843 | 2807150 |
| philosophy | 47.99B | 121.26M | 335.77M | 48.44B | 50396964 | 505275 | 1030405 | 51932644 |
| photo | 6.56B | 1.74B | 41.44M | 8.34B | 16194329 | 3901598 | 179607 | 20275534 |
| physics | 21.56B | 372.21M | 191.17M | 22.12B | 24640373 | 843508 | 473758 | 25957639 |
| politics | 79.52B | 253.26M | 930.96M | 80.70B | 97403603 | 1026315 | 2504127 | 100934045 |
| psychology | 51.53B | 688.50M | 2.56B | 54.78B | 58829917 | 1881452 | 4066667 | 64778036 |
| public_administration | 100.13B | 5.54B | 716.81M | 106.39B | 160247751 | 10657768 | 1785347 | 172690866 |
| relationship | 21.87B | 3.69B | 129.60M | 25.69B | 28153321 | 6794774 | 321268 | 35269363 |
| sociology | 76.34B | 3.59B | 8.88B | 88.82B | 106447186 | 7836896 | 13040695 | 127324777 |
| sports | 118.64B | 379.18M | 1.79B | 120.80B | 173243631 | 1286718 | 4212540 | 178742889 |
| statistics | 19.59B | 1.15B | 1.75B | 22.49B | 29958726 | 2746797 | 3390606 | 36096129 |
| systems_science | 24.58B | 11.30B | 163.99M | 36.05B | 32879249 | 15120751 | 470001 | 48470001 |
| textile_science | 2.59B | 2.89B | 94.56M | 5.57B | 8018141 | 8022001 | 456668 | 16496810 |
| topicality | 34.87M | 5.22M | 0 | 40.09M | 137789 | 13506 | 0 | 151295 |
| transportation_engineering | 12.80B | 6.61B | 972.50M | 20.38B | 23595624 | 11005933 | 2027812 | 36629369 |
| travel | 78.87B | 584.78M | 957.26M | 80.41B | 127250195 | 1851342 | 2430704 | 131532241 |
| urban_planning | 12.13B | 2.93B | 53.24M | 15.12B | 20040937 | 6176104 | 201963 | 26419004 |
| weapons_science | 80.62M | 3.32B | 140.89M | 3.54B | 215544 | 5695154 | 369541 | 6280239 |
| Grand Total | 4010.76B | 206.51B | 208.02B | 4425.30B | 5781764055 | 442387964 | 311920860 | 6536072879 |

Data Construction Workflow

*Figure: FineFineWeb data construction workflow*

The data construction workflow can be summarized as follows:

  1. Deduplication: The FineWeb dataset is deduplicated using exact deduplication and MinHash techniques to remove redundant data.

  2. URL Labeling: Root URLs from FineWeb are counted, and the top 1 million URLs are labeled using GPT-4. This step generates DoI (Domain-of-Interest) Coarse-Grained URLs and DoNI (Domain-of-Non-Interest) Coarse-Grained URLs as seed data sources.

  3. Coarse Recall:

    a. Based on the labeled root URLs, data is sampled for each domain.

    b. The sampled data is labeled using Qwen2-7B-Instruct, producing 500K DoI Positive Data and 500K DoI Negative Data (for iterations after the first, each 500K set consists of 250K newly sampled seed data and 250K refined data from the previous Fine Recall).

    c. A binary FastText model is trained per domain using the labeled data (a minimal sketch follows this list).

    d. The FastText model performs coarse recall on FineWeb, generating Coarse DoI Data.

  4. Fine Recall:

    a. The Coarse DoI Data is labeled using Qwen2-72B-Instruct to produce 100K DoI Positive Data and 50K DoI Negative Data, with the latter further augmented with 50K negative samples from earlier FastText training.

    b. A BERT model is trained using this labeled data.

    c. The BERT model performs fine recall on the Coarse DoI Data, producing a refined dataset, which is the DoI subset of FineFineWeb.

  5. Coarse-Fine Recall Iteration: The workflow of coarse and fine recall iterates for 3 rounds with the following adjustments:

    a. FastText is re-trained using updated seed data, which combines BERT-recalled samples, BERT-dropped samples, and previously labeled seed data.

    b. The BERT model remains frozen during subsequent iterations.

    c. The FastText training, coarse recall, and fine recall steps are repeated without re-labeling data with the Qwen2-Instruct models.
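The per-domain classifier in step 3 can be sketched with the `fasttext` library as below. This is a minimal illustration, not the paper's exact configuration: the hyperparameters, file names, and the `__label__doi` / `__label__doni` label scheme are assumptions.

```python
# Minimal sketch of steps 3c-3d: train a binary DoI/DoNI FastText
# classifier from Qwen2-labeled seed data, then use it for coarse recall.
# Hyperparameters and file names are illustrative assumptions.
import fasttext

def train_domain_classifier(train_file: str, model_path: str):
    # train_file holds one document per line in fastText format, e.g.
    # "__label__doi <document text>" / "__label__doni <document text>"
    model = fasttext.train_supervised(
        input=train_file,
        epoch=5,
        lr=0.1,
        wordNgrams=2,
    )
    model.save_model(model_path)
    return model

def coarse_recall(model, documents):
    # Keep documents the classifier predicts as domain-of-interest.
    kept = []
    for doc in documents:
        labels, probs = model.predict(doc.replace("\n", " "))
        if labels[0] == "__label__doi":
            kept.append((doc, float(probs[0])))
    return kept
```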

Domain-Domain Similarity Analysis

  1. Perform proportional weighted sampling over the domain subsets based on each domain's sample size, drawing a total of 1 billion tokens.
  2. Use the BGE-M3 model to compute the embeddings of the samples in each domain subset, referred to as domain embeddings.
  3. Use the BGE-M3 model to compute the embeddings of the samples in each benchmark, referred to as benchmark embeddings (bench embeddings).
  4. Calculate the MMD distance and the Wasserstein distance between the domain embeddings and the benchmark embeddings (a minimal sketch follows this list).
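A minimal sketch of steps 2-4 follows. Using FlagEmbedding's `BGEM3FlagModel` is one common way to obtain BGE-M3 embeddings; the RBF kernel bandwidth for MMD and the per-dimension 1-D Wasserstein averaging are illustrative assumptions rather than the paper's exact estimators.

```python
# Sketch: distances between domain embeddings and benchmark embeddings.
import numpy as np
from scipy.stats import wasserstein_distance

# Embeddings (steps 2-3) can be computed with BGE-M3, e.g.:
#   from FlagEmbedding import BGEM3FlagModel
#   model = BGEM3FlagModel("BAAI/bge-m3")
#   domain_emb = model.encode(domain_texts)["dense_vecs"]
#   bench_emb = model.encode(bench_texts)["dense_vecs"]

def mmd_rbf(X: np.ndarray, Y: np.ndarray, gamma: float = 1.0) -> float:
    """Squared maximum mean discrepancy with an RBF kernel
    (gamma is an assumed bandwidth, not a value from the paper)."""
    def k(A, B):
        sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
        return np.exp(-gamma * sq)
    return float(k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean())

def avg_wasserstein(X: np.ndarray, Y: np.ndarray) -> float:
    """1-D Wasserstein distance averaged over embedding dimensions,
    a cheap proxy for the distance between two embedding distributions."""
    return float(np.mean([wasserstein_distance(X[:, d], Y[:, d])
                          for d in range(X.shape[1])]))
```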

*Figure: Domain-benchmark similarity (MMD and Wasserstein distances between domain and benchmark embeddings)*

The results above reveal the following observations:

  1. The two code-related benchmarks, MBPP and HumanEval, exhibit relatively large distances from nearly all domains, indicating that the proportion of code data in the training set is relatively small. Notably, their distance to the mathematics domain is comparatively smaller, suggesting a certain degree of overlap between mathematics data and code data.
  2. Benchmarks such as HellaSwag, ARC, MMLU, and BoolQ are relatively close to almost all domains, except for the gamble domain. This indicates that the samples in these benchmarks draw on knowledge from multiple domains at once and are widely distributed.
  3. GSM8K and TriviaQA show significant discrepancies with a small number of domains, suggesting that the distribution differences between domains are more pronounced for samples involving grade-school mathematics and fact-based question answering. Some domains contain a substantial amount of this type of data, while others do not.
  4. The gamble domain exhibits substantial differences from other domains and has large distances from all benchmarks, indicating that pretraining data related to gambling provides limited benefits for these benchmarks.

Domain-Domain Duplication

Let $D_1, D_2, \dots, D_N$ represent $N$ distinct domains. For each domain $D_i$, we select the top-20 URLs, denoted as $\{U_{i1}, U_{i2}, \dots, U_{i20}\}$. The total set of URLs across all domains is represented as $\mathcal{U}$, and the total number of URLs is $M = |\mathcal{U}|$.

For each URL $U_k \in \mathcal{U}$, the term frequency (TF) is defined as the proportion of $U_k$ in the total set of URLs:

$$\text{TF}(U_k) = \frac{\text{count}(U_k)}{M}$$

where $\text{count}(U_k)$ is the number of times $U_k$ appears in $\mathcal{U}$. Additionally, the document frequency $K_k$ of $U_k$ is the number of domains in which $U_k$ appears. Based on this, the inverse document frequency (IDF) is calculated as:

$$\text{IDF}(U_k) = \log\left(\frac{N}{K_k}\right)$$

The TF-IDF value for each URL $U_{ij}$ in a specific domain $D_i$ is then computed as:

$$\text{TF-IDF}(U_{ij}) = \text{TF}(U_{ij}) \times \text{IDF}(U_{ij})$$
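The definition above translates directly into code. A minimal sketch, where `domain_urls` is a hypothetical mapping from each domain to its top-20 root URLs:

```python
# Sketch of the URL TF-IDF computation defined above.
import math
from collections import Counter

def url_tfidf(domain_urls: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    all_urls = [u for urls in domain_urls.values() for u in urls]
    M = len(all_urls)                  # M = |U|, total number of URLs
    count = Counter(all_urls)          # count(U_k) over all domains
    N = len(domain_urls)               # number of domains
    # Document frequency K_k: number of domains in which U_k appears.
    df = Counter(u for urls in domain_urls.values() for u in set(urls))
    return {
        domain: {u: (count[u] / M) * math.log(N / df[u]) for u in urls}
        for domain, urls in domain_urls.items()
    }
```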

*Figure: Domain-domain URL duplication (TF-IDF distributions across domains)*

Using the TF-IDF values of all URLs within a domain, the domain-domain duplicate rate can be analyzed by comparing the distribution of TF-IDF values across domains. If a domain has many URLs with high TF-IDF values, it indicates that the domain’s URLs are relatively unique and significant within the entire set of URLs. Conversely, if a domain has many URLs with low TF-IDF values, it suggests that the domain's URLs are more common across other domains. Analyzing these values helps assess how similar or redundant a domain's content is in relation to others based on its URL composition.

As shown in the figure, most domains have low duplication rates, except for topicality, pet, and atmospheric science.

Domain-Benchmark BPC-Acc Correlation

Experimental method: Using 28 models (see the paper), we first calculate BPC for all domains to obtain a model ranking $R_D$. Similarly, we compute scores across all benchmarks to obtain a model ranking $R_M$. We then calculate the Spearman correlation between $R_D$ and $R_M$.
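A minimal sketch of this computation with `scipy.stats.spearmanr`, assuming the per-model BPC values for one domain and the per-model scores for one benchmark are already available (the variable names are hypothetical):

```python
# Sketch: Spearman correlation between a domain's BPC-based model
# ranking and a benchmark's score-based model ranking.
from scipy.stats import spearmanr

def rank_correlation(domain_bpc, bench_scores):
    # domain_bpc[i]: bits-per-character of model i on the domain (lower
    # is better); bench_scores[i]: the same model's benchmark score
    # (higher is better). Negate BPC so both rankings run in the same
    # direction before correlating.
    rho, _ = spearmanr([-b for b in domain_bpc], bench_scores)
    return rho
```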

*Figure: Domain-benchmark BPC-Acc correlation*

  • For benchmarks like ARC, MMLU, GSM8K, HumanEval, and MBPP, STEM-related domains show higher correlation rankings, particularly mathematics, physics, and systems science.
  • For TriviaQA, which emphasizes factual knowledge over reasoning, domains rich in world knowledge such as literature, history, and library science demonstrate higher correlation rankings.

Bibtex

@misc{finefineweb,
title={FineFineWeb: A Comprehensive Study on Fine-grained Domain Web Corpus},
url={https://huggingface.co/datasets/m-a-p/FineFineWeb},
author = {M-A-P, Ge Zhang*, Xinrun Du*, Zhimiao Yu*, Zili Wang*, Zekun Wang, Shuyue Guo, Tianyu Zheng, Kang Zhu, Jerry Liu, Shawn Yue, Binbin Liu, Zhongyuan Peng, Yifan Yao, Jack Yang, Ziming Li, Bingni Zhang, Minghao Liu, Tianyu Liu, Yang Gao, Wenhu Chen, Xiaohuan Zhou, Qian Liu, Taifeng Wang+, Wenhao Huang+},
publisher={huggingface},
version={v0.1.0},
month={December},
year={2024}
}