victormiller
committed on
Update web.py
web.py
CHANGED
@@ -352,6 +352,28 @@ attrs.fraction_of_characters_in_duplicate_lines = sum(
|
|
352 |
|
353 |
def web_data():
|
354 |
return Div(
|
355 |
Div(
|
356 |
Ul(
|
357 |
Li(
|
@@ -374,23 +396,17 @@ def web_data():
|
|
374 |
padding: 15px 15px 0px 15px;
|
375 |
""",
|
376 |
),
|
377 |
-
|
378 |
-
P(
|
379 |
-
"To generate a high-quality dataset from large-scale webpages, we have investigated the processing steps used by the community and made our choices based on careful manual inspection. Starting from ",
|
380 |
-
A("Common Crawl", href="https://commoncrawl.org/"),
|
381 |
-
", our process can be summarized as five main steps: document preparation, line-level removal, document-level filtering, deduplication and PII removal.",
|
382 |
-
),
|
383 |
-
style="margin-top: 20px;",
|
384 |
-
),
|
385 |
-
H2("Web Data Processing Summary"),
|
386 |
P("The following section provides explicit details covering the reasoning and decisions behind each of the filters we applied. The table below provides a high-level comparison between TxT360's filtering and that of other commonly used pretraining datasets."),
|
387 |
table_div_filter_data,
|
388 |
-
P("
|
389 |
table_div_qf_filter_data,
|
390 |
P("Our filtering rate is illustrated below. Before deduplication, our filtering rate is comparable to RefinedWeb. During global deduplication, we removed approximately 85.89% of the data, significantly higher than previous works, indicating a large number of duplicates across dumps. "),
|
391 |
Img(src="images/filter_rate.jpg", height = "300", width = "600" ),
|
392 |
P("Note: All percentages are based on the number of documents. The gray bars represent the relative percentages of removed documents at each step, while the colorful bars represent the percentages of retained documents relative to the total number of documents in the raw Common Crawl."),
|
393 |
-
|
394 |
Ul(
|
395 |
Li("the line is only composed of uppercase characters", style = "margin-bottom: 5px"),
|
396 |
Li("the line is only composed of numerical characters", style = "margin-bottom: 5px"),
|
@@ -419,9 +435,9 @@ def web_data():
|
|
419 |
P("Following C4, we remove any page where the phrase “lorem ipsum” appears since some pages have placeholder “lorem ipsum” text."),
|
420 |
|
421 |
|
422 |
-
|
423 |
|
424 |
-
|
425 |
P("""
|
426 |
Common Crawl provides webpage texts via two formats: WARC (Web ARChive format) and WET (WARC Encapsulated Text).
|
427 |
WARC files contain the raw data from the crawl, which store the full HTTP response and request metadata.
|
@@ -442,7 +458,7 @@ def web_data():
|
|
442 |
),
|
443 |
#DV2("data/sample_wet.json", "data/sample_warc.json", 3),
|
444 |
|
445 |
-
|
446 |
P("""
|
447 |
After text extraction, the non-English texts are then filtered out by fastText language identifier with a threshold of 0.65.
|
448 |
This step removes over 60% of the whole data.
|
@@ -461,16 +477,16 @@ def web_data():
|
|
461 |
DV("data/sample_en_low.json", 3, "Sample documents that are classified as English but with score less than 0.65"),
|
462 |
),
|
463 |
|
464 |
-
|
465 |
P("""
|
466 |
-
|
467 |
-
We also
|
468 |
"""),
|
469 |
-
|
470 |
P("""
|
471 |
-
Following RefinedWeb [3], we
|
472 |
articles, sex education, technical blogs, etc. Specifically, we randomly took 903M URLs and matched them with
|
473 |
-
4.6M domain names in the UT1 blocklist. 24
|
474 |
"""),
|
475 |
|
476 |
Details(
|
@@ -495,7 +511,7 @@ def web_data():
|
|
495 |
),
|
496 |
),
|
497 |
|
498 |
-
|
499 |
P("""
|
500 |
To avoid duplication with our high-quality curated datasets, we exclude the following domains from our dataset.
|
501 |
"""),
|
@@ -514,20 +530,23 @@ def web_data():
|
|
514 |
),
|
515 |
|
516 |
|
517 |
-
|
518 |
P("""
|
519 |
-
Before
|
520 |
-
|
521 |
"""),
|
522 |
-
|
523 |
P("""
|
524 |
The terminal punctuation has been used in C4 [5] and Dolma [6] to remove lines that do not end with a terminal
|
525 |
punctuation mark (i.e., “.”, “?”, “!”, or “"”). However, we found it could be too aggressive to remove these
|
526 |
-
lines, especially when
|
527 |
CC-MAIN-20230126210844-20230127000844-00000.warc.jsonl, the terminal punctuation rule led to the removal
|
528 |
of 56,292 additional lines, resulting in the complete exclusion of 2,203 documents from a total of 13,560
|
529 |
documents (16.25%). Accordingly, we choose to not use terminal punctuation as a signal to remove lines.
|
530 |
-
""")
|
531 |
|
532 |
Details(
|
533 |
Summary("Sample documents with lines that are removed by the rule of terminal punctuation"),
|
@@ -539,14 +558,17 @@ def web_data():
|
|
539 |
),
|
540 |
|
541 |
|
542 |
-
|
543 |
P("""
|
544 |
In C4 [5], the authors remove any line with the word "Javascript" since they found that many of the scraped
|
545 |
pages contained warnings stating that Javascript should be enabled. However, this filtering strategy is too
|
546 |
-
strict, which will filter out many lines that are really talking about “Javascript”.
|
547 |
propose to refine the strategy by adding one more keyword to the word "javascript" to avoid false positives.
|
548 |
The additional keyword could be any one of “enable” / “disable” / “require” / “activate” / “browser”.
|
549 |
-
""")
|
550 |
Details(
|
551 |
Summary("Sample documents that are removed by original C4 javascript rule but are kept after our refinement"),
|
552 |
DV(
|
@@ -555,14 +577,16 @@ def web_data():
|
|
555 |
"Sample documents that are removed by original C4 javascript rule but are kept after our refinement",
|
556 |
),
|
557 |
),
|
558 |
-
|
559 |
P("""
|
560 |
We also adopt rules from RefinedWeb [3] to remove lines if they satisfy any of the following criteria:
|
561 |
-
- The line is only composed of uppercase characters,
|
562 |
-
- The line is only composed of numerical characters,
|
563 |
-
- The line matches the pattern “r'^\\d+\\s+likes$'”,
|
564 |
-
- The line contains only one word.
|
565 |
"""),
|
566 |
Details(
|
567 |
Summary("Sample documents with lines that are removed by the RefinedWeb rules"),
|
568 |
DV(
|
@@ -571,7 +595,7 @@ def web_data():
|
|
571 |
"Sample documents with lines that are removed by the RefinedWeb rules",
|
572 |
),
|
573 |
),
|
574 |
-
|
575 |
P("""
|
576 |
When doing manual inspection on the data, we found that there are some adult ads in the beginning or end of the
|
577 |
document (with a sample shown below), which are hard to remove via document-level filtering strategies. Inspired
|
@@ -587,10 +611,10 @@ def web_data():
|
|
587 |
),
|
588 |
),
|
589 |
|
590 |
-
|
591 |
P("""
|
592 |
-
In this section, we introduce
|
593 |
-
|
594 |
Details(
|
595 |
Summary("Overview of all the quality signals that are used for filtering"),
|
596 |
DVS(
|
@@ -599,21 +623,21 @@ def web_data():
|
|
599 |
),
|
600 |
),
|
601 |
P("""Similar to previous sections, we will present sample documents filtered out by the given quality signals.
|
602 |
-
Most
|
603 |
studies ([3], [6], [4]). However, we observed that, despite following the same descriptions, the implementation
|
604 |
of each quality signal can vary significantly among different dataset pipelines, resulting in disparate
|
605 |
outcomes for the same quality signals.
|
606 |
In our pipeline, we referenced earlier implementations that were publicly available such as Dolma [6], DataTrove [4],
|
607 |
-
and RedPajama V2 [7],
|
608 |
"""),
|
609 |
-
|
610 |
P("""
|
611 |
-
|
612 |
work ([2], [3], [6]), we choose to remove any document with excessive line, paragraph, or n-gram repetitions.
|
613 |
"""),
|
614 |
-
|
615 |
P("""
|
616 |
-
Following Gopher [2], we remove documents containing
|
617 |
but longer duplicate passages. To achieve this goal, we calculate over the document both the fraction of passages
|
618 |
that are duplicates, and the fraction of characters contained within those duplicated passages.
|
619 |
"""),
|
@@ -674,24 +698,24 @@ def web_data():
|
|
674 |
After evaluating the implementations of Dolma and DataTrove (note: RedPajama V2 does not implement these two quality
|
675 |
signals), we have made the following decisions:
|
676 |
"""),
|
677 |
-
|
678 |
P("""
|
679 |
Our manual review of the data revealed that documents extracted using trafilatura do not feature more than one newline
|
680 |
symbol separating passages. Testing the splitting pattern "\\n{2,}" on 10,000 sample documents resulted in no more than
|
681 |
one split. Consequently, we decided to disregard the distinction between lines and paragraphs in our implementation,
|
682 |
opting instead to use a single newline symbol to segment the text into passages.
|
683 |
"""),
|
684 |
-
|
685 |
P("""
|
686 |
In line with DataTrove's implementation, we chose to exclude the first occurrence. This more conservative strategy
|
687 |
helps retain a larger number of documents.
|
688 |
"""),
|
689 |
-
|
690 |
P("""
|
691 |
We adjusted the method in Dolma for counting characters within lines by excluding whitespace. This modification
|
692 |
ensures consistency with the overall document character count calculation.
|
693 |
"""),
|
694 |
-
|
695 |
Details(
|
696 |
Summary("TxT360 Implementation"),
|
697 |
D_code("""
|
@@ -719,7 +743,7 @@ def web_data():
|
|
719 |
"Sample documents filtered by excessive line repetitions / characters in repeated lines",
|
720 |
),
|
721 |
),
|
722 |
-
|
723 |
P("""
|
724 |
Following Gopher [2], we remove documents with a high portion of repeated n-grams. For each n ∈ (2, 3, 4), we calculate the
|
725 |
fraction of characters contained within the most frequently-occurring n-gram.
|
@@ -804,7 +828,7 @@ def web_data():
|
|
804 |
""", block="block", language="python"),
|
805 |
),
|
806 |
P("""
|
807 |
-
There are almost no contradictions between
|
808 |
n-gram. The main process involves counting the occurrences of each n-gram and selecting the most common one. The
|
809 |
fraction is then determined by dividing the number of characters in the most common n-gram by the total number of
|
810 |
characters. One minor difference is that Dolma and DataTrove calculate the fraction of the most common n-gram even
|
@@ -838,7 +862,7 @@ def web_data():
|
|
838 |
"Sample documents filtered by the fraction of characters in the most common n-grams (n=2,3,4)",
|
839 |
),
|
840 |
),
|
841 |
-
|
842 |
P("""
|
843 |
Following Gopher [2], we remove documents with a high portion of repeated n-grams. For each n ∈ (5, ..., 10), we calculate the
|
844 |
fraction of characters contained within all duplicate n-grams, taking care not to count characters that occur in
|
@@ -1020,7 +1044,7 @@ def web_data():
|
|
1020 |
"Sample documents filtered by the fraction of characters in duplicated n-grams (n=5,...,10)",
|
1021 |
),
|
1022 |
),
|
1023 |
-
|
1024 |
P("""
|
1025 |
Some line-wise information could also be helpful to distinguish low-quality and high-quality documents. Following
|
1026 |
RefinedWeb [3], we remove the document if the corrected lines represent more than 5% of words. In line with previous
|
@@ -1101,7 +1125,7 @@ def web_data():
|
|
1101 |
),
|
1102 |
),
|
1103 |
|
1104 |
-
|
1105 |
P("We summarize other statistics-based rules originating from Gopher [2] in this section. The statistics that can be used include:"),
|
1106 |
Ul(
|
1107 |
Li("the word count in the document", style = "margin-bottom: 5px"),
|
@@ -1120,7 +1144,7 @@ def web_data():
|
|
1120 |
Li("the words that contain at least one alphabetic character are less than 80% of the whole words", style = "margin-bottom: 5px"),
|
1121 |
Li("it contains fewer than two of the stop words (the, be, to, of, and, that, have, with)", style = "margin-bottom: 5px"),
|
1122 |
),
|
1123 |
-
|
1124 |
Details(
|
1125 |
Summary("Implementations from Dolma"),
|
1126 |
D_code("""
|
@@ -1178,7 +1202,7 @@ def web_data():
|
|
1178 |
We decided to use simple `len(text.split())` to compute the word count.
|
1179 |
"""),
|
1180 |
|
1181 |
-
|
1182 |
P("""
|
1183 |
There is minimal variation among existing pipeline implementations. We simply compute the mean word length as follows:
|
1184 |
"""),
|
@@ -1189,13 +1213,13 @@ def web_data():
|
|
1189 |
mean_word_length = character_count / word_count
|
1190 |
""", block="block", language="python"),
|
1191 |
P("""
|
1192 |
-
It's worth noting that Dolma used the median word length instead of the mean
|
1193 |
"""),
|
1194 |
D_code("""
|
1195 |
from statistics import median
|
1196 |
median_word_length = median(len(word) for word in words)
|
1197 |
""", block="block", language="python"),
|
1198 |
-
|
1199 |
P("""
|
1200 |
The only publicly available implementation of this quality signal is from RedPajama V2, which uses regular expressions
|
1201 |
to split text into sentences.
|
@@ -1232,7 +1256,7 @@ def web_data():
|
|
1232 |
""", block="block", language="python"),
|
1233 |
),
|
1234 |
|
1235 |
-
|
1236 |
P("""
|
1237 |
Following RedPajama-V2 and DataTrove, we use the symbols of ("#", "...", "…").
|
1238 |
We calculate the ratio as the number of symbols divided by the total number of words.
|
@@ -1294,7 +1318,7 @@ def web_data():
|
|
1294 |
""", block="block", language="python"),
|
1295 |
),
|
1296 |
|
1297 |
-
|
1298 |
Details(
|
1299 |
Summary("Implementations from Dolma"),
|
1300 |
D_code("""
|
@@ -1355,7 +1379,7 @@ def web_data():
|
|
1355 |
attrs.num_of_stop_words = sum(1 for word in words if stop_words_pattern.search(word))
|
1356 |
|
1357 |
""", block="block", language="python"),
|
1358 |
-
|
1359 |
Details(
|
1360 |
Summary("Sample documents that are filtered out by statistics-based heuristics"),
|
1361 |
DV(
|
@@ -1364,7 +1388,7 @@ def web_data():
|
|
1364 |
"Sample documents that are filtered out by statistics-based heuristics",
|
1365 |
),
|
1366 |
),
|
1367 |
-
|
1368 |
P("""
|
1369 |
Following C4, we remove any page where the phrase “lorem ipsum” appeared since some pages had placeholder “lorem ipsum”
|
1370 |
text.
|
@@ -1374,7 +1398,7 @@ def web_data():
|
|
1374 |
Summary("Sample documents containing 'lorem ipsum'"),
|
1375 |
DV("data/lorem_ipsum.json", 0, "Sample documents containing 'lorem ipsum'"),
|
1376 |
),
|
1377 |
-
|
1378 |
P("""
|
1379 |
After careful filtering, although data quality has improved, a large fraction of the content is repeated across documents. This may be due to the crawler indirectly hitting the same page multiple times, to boilerplate content being repeated (e.g., licences), or even to plagiarism. These duplicates can strongly impact models, favoring memorization instead of generalization.
|
1380 |
"""), # Add detailed content and images as needed
|
@@ -1383,6 +1407,6 @@ def web_data():
|
|
1383 |
P("To reduce the expensive cost of global deduplication, we apply a local exact deduplication before it. Specifically, each dump is split into 70 splits. A bloom filter is applied within each split."),
|
1384 |
P(B("Global Fuzzy Deduplication")),
|
1385 |
P("NEED TO UPDATE"),
|
1386 |
-
|
1387 |
P("..."), # Add detailed content and images as needed
|
1388 |
)
|
|
|
352 |
|
353 |
def web_data():
|
354 |
return Div(
|
355 |
+
Div(
|
356 |
+
H2("Common Crawl Snapshot Processing"),
|
357 |
+
H3("What This Section Contains"),
|
358 |
+
P("This section provides a complete discussion of the filtering applied to the 99 Common Crawl snapshots that comprise the web data section of TxT360. The section is split into the following topic areas:"),
|
359 |
+
Ul(
|
360 |
+
Li("Web Data Processing Summary", style = "margin-bottom: 5px"),
|
361 |
+
Li("Document Preparation", style = "margin-bottom: 5px"),
|
362 |
+
Li("Line-Level Filtering", style = "margin-bottom: 5px"),
|
363 |
+
Li("Local Deduplication", style = "margin-bottom: 5px"),
|
364 |
+
Li("Each section is complete with code and comparisons to Dolma, DataTrove, and/or RedPajama-V2", style = "margin-bottom: 5px"),
|
365 |
+
),
|
366 |
+
),
|
367 |
+
Div
|
368 |
+
H2("Common Crawl Data Processing Summary"),
|
369 |
+
Div(
|
370 |
+
P(
|
371 |
+
"To generate a high-quality dataset from large-scale webpages, we have investigated the processing steps used by the community and made our choices based on careful manual inspection. Starting from ",
|
372 |
+
A("Common Crawl", href="https://commoncrawl.org/"),
|
373 |
+
", our process can be summarized as five main steps: document preparation, line-level removal, document-level filtering, deduplication and PII removal.",
|
374 |
+
),
|
375 |
+
style="margin-top: 20px;",
|
376 |
+
),
|
377 |
Div(
|
378 |
Ul(
|
379 |
Li(
|
|
|
396 |
padding: 15px 15px 0px 15px;
|
397 |
""",
|
398 |
),
|
399 |
+
H3("TxT360 CommonCrawl Filtering vs Other Pretraining Datasets"),
|
400 |
P("The following section provides explicit details covering the reasoning and decisions behind each of the filters we applied. The table below provides a high-level comparison between TxT360's filtering and that of other commonly used pretraining datasets."),
|
401 |
table_div_filter_data,
|
402 |
+
P("The table below provides a comparison of the quality filters that have been applied to each dataset."),
|
403 |
table_div_qf_filter_data,
|
404 |
P("Our filtering rate is illustrated below. Before deduplication, our filtering rate is comparable to RefinedWeb. During global deduplication, we removed approximately 85.89% of the data, significantly higher than previous works, indicating a large number of duplicates across dumps. "),
|
405 |
Img(src="images/filter_rate.jpg", height = "300", width = "600" ),
|
406 |
P("Note: All percentages are based on the number of documents. The gray bars represent the relative percentages of removed documents at each step, while the colorful bars represent the percentages of retained documents relative to the total number of documents in the raw Common Crawl."),
|
407 |
+
H3("TxT360 Filter Summary"),
|
408 |
+
P("This section provides high-level details of the filtering that is applied to CommonCrawl in TxT360. Each decision listed is discussed in detail later in this section."),
|
409 |
+
P("We adopt rules from RefinedWeb [3] to remove lines if they satisfy any of the following criteria:"),
|
410 |
Ul(
|
411 |
Li("the line is only composed of uppercase characters", style = "margin-bottom: 5px"),
|
412 |
Li("the line is only composed of numerical characters", style = "margin-bottom: 5px"),
|
|
|
435 |
P("Following C4, we remove any page where the phrase “lorem ipsum” appears since some pages have placeholder “lorem ipsum” text."),
|
436 |
|
437 |
|
438 |
+
H2("1. Document Preparation"),
|
439 |
|
440 |
+
H3("1.1 Text Extraction"),
|
441 |
P("""
|
442 |
Common Crawl provides webpage texts via two formats: WARC (Web ARChive format) and WET (WARC Encapsulated Text).
|
443 |
WARC files contain the raw data from the crawl, which store the full HTTP response and request metadata.
|
|
|
458 |
),
|
459 |
#DV2("data/sample_wet.json", "data/sample_warc.json", 3),
|
460 |
|
461 |
+
H3("1.2 Language Identification"),
|
462 |
P("""
|
463 |
After text extraction, the non-English texts are then filtered out by fastText language identifier with a threshold of 0.65.
|
464 |
This step removes over 60% of the whole data.
|
|
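Reviewer note on the language-identification hunk above: a minimal sketch of the thresholding step, assuming the classifier returns a fastText-style label such as `__label__en` together with a confidence score. The names `keep_english` and `toy_predict` are hypothetical stand-ins so the example runs without a model file; only the 0.65 threshold comes from the text, and whether the comparison is inclusive is an assumption.

```python
from typing import Callable, Tuple

EN_LABEL = "__label__en"  # fastText-style English label (assumed format)
THRESHOLD = 0.65          # threshold quoted in the text above


def keep_english(text: str,
                 predict: Callable[[str], Tuple[str, float]],
                 threshold: float = THRESHOLD) -> bool:
    """Return True if the top label is English with a score at or above the threshold."""
    # Newlines are flattened so the predictor sees the document as one line.
    label, score = predict(text.replace("\n", " "))
    return label == EN_LABEL and score >= threshold


def toy_predict(text: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a real language identifier."""
    if "the" in text.lower().split():
        return EN_LABEL, 0.9
    return "__label__fr", 0.8
```

In a real pipeline the `predict` argument would wrap the language identifier; documents labelled English with a score under the threshold are dropped along with non-English ones.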
|
477 |
DV("data/sample_en_low.json", 3, "Sample documents that are classified as English but with score less than 0.65"),
|
478 |
),
|
479 |
|
480 |
+
H3("1.3 URL Filtering"),
|
481 |
P("""
|
482 |
+
The following section details the decisions behind utilizing the UT1 blocklist. We chose to use the UT1 blocklist as a simple method for filtering
|
483 |
+
out potentially harmful content such as adult content. We also excluded URLs that contained the digital version of the curated data (e.g., wikipedia.org) to avoid duplication.
|
484 |
"""),
|
485 |
+
H3("1.3.1 URL Blocklist"),
|
486 |
P("""
|
487 |
+
Following RefinedWeb [3], we manually inspected the UT1 blocklist to reduce false positives like news
|
488 |
articles, sex education, technical blogs, etc. Specifically, we randomly took 903M URLs and matched them with
|
489 |
+
4.6M domain names in the UT1 blocklist. Of note, 24 URLs were detected with more than 4k matches and are shown below.
|
490 |
"""),
|
491 |
|
492 |
Details(
|
|
|
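Reviewer note on the URL blocklist hunk above: a rough sketch of how sampled URLs could be matched against blocklisted domain names and how heavily matched domains could be surfaced for manual inspection. The function names, the `www.` stripping, and exact host matching are assumptions made for illustration; the diff does not show the actual matching code.

```python
from collections import Counter
from urllib.parse import urlparse


def blocked_domain_counts(urls, blocklist):
    """Count how many URLs fall on each blocklisted domain."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname or ""
        if host.startswith("www."):
            host = host[4:]  # treat www.example.org like example.org (assumption)
        if host in blocklist:
            counts[host] += 1
    return counts


def frequently_matched(counts, min_matches=4000):
    """Domains with more than min_matches hits, i.e. candidates for manual review."""
    return sorted(domain for domain, hits in counts.items() if hits > min_matches)
```

Domains that pass the `min_matches` cut are the ones worth inspecting by hand, since a false positive there removes many documents at once.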
511 |
),
|
512 |
),
|
513 |
|
514 |
+
H3("1.3.2 Excluded High Quality Sources"),
|
515 |
P("""
|
516 |
To avoid duplication with our high-quality curated datasets, we exclude the following domains from our dataset.
|
517 |
"""),
|
|
|
530 |
),
|
531 |
|
532 |
|
533 |
+
H2("2. Line-Level Removal"),
|
534 |
P("""
|
535 |
+
Before filtering low-quality documents, we perform line-level removal to remove low-quality lines.
|
536 |
+
This ensures that the quality signals are computed on the text that is ultimately kept.
|
537 |
"""),
|
538 |
+
H3("Terminal Punctuation"),
|
539 |
P("""
|
540 |
The terminal punctuation has been used in C4 [5] and Dolma [6] to remove lines that do not end with a terminal
|
541 |
punctuation mark (i.e., “.”, “?”, “!”, or “"”). However, we found it could be too aggressive to remove these
|
542 |
+
lines, especially when using the text extraction tool “trafilatura”.
|
543 |
+
"""),
|
544 |
+
P("""
|
545 |
+
For instance, in the CommonCrawl file
|
546 |
CC-MAIN-20230126210844-20230127000844-00000.warc.jsonl, the terminal punctuation rule led to the removal
|
547 |
of 56,292 additional lines, resulting in the complete exclusion of 2,203 documents from a total of 13,560
|
548 |
documents (16.25%). Accordingly, we choose to not use terminal punctuation as a signal to remove lines.
|
549 |
+
"""),
|
550 |
|
551 |
Details(
|
552 |
Summary("Sample documents with lines that are removed by the rule of terminal punctuation"),
|
|
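Reviewer note on the terminal punctuation hunk above: a small sketch of the C4-style rule that was evaluated and rejected, together with the bookkeeping needed to reproduce statistics like those quoted (lines removed, documents emptied). The function name and the treatment of blank lines are assumptions; the punctuation set is the one listed in the text.

```python
TERMINAL_PUNCTUATION = (".", "?", "!", '"')


def apply_terminal_punctuation_rule(documents):
    """Drop lines without terminal punctuation; report what the rule removed."""
    kept_docs = []
    removed_lines = 0
    emptied_docs = 0
    for doc in documents:
        lines = [line for line in doc.split("\n") if line.strip()]
        kept = [line for line in lines if line.rstrip().endswith(TERMINAL_PUNCTUATION)]
        removed_lines += len(lines) - len(kept)
        if kept:
            kept_docs.append("\n".join(kept))
        else:
            emptied_docs += 1  # every line was removed, so the document disappears
    return kept_docs, removed_lines, emptied_docs
```

The quoted exclusion rate follows directly from the counts in the text: 2,203 emptied documents out of 13,560 is about 16.25%.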
|
558 |
),
|
559 |
|
560 |
|
561 |
+
H3('2.1 Word "Javascript"'),
|
562 |
P("""
|
563 |
In C4 [5], the authors remove any line with the word "Javascript" since they found that many of the scraped
|
564 |
pages contained warnings stating that Javascript should be enabled. However, this filtering strategy is too
|
565 |
+
strict and filters out many lines that genuinely discuss “Javascript”.
|
566 |
+
"""),
|
567 |
+
P("""
|
568 |
+
In our pipeline, we
|
569 |
propose to refine the strategy by adding one more keyword to the word "javascript" to avoid false positives.
|
570 |
The additional keyword could be any one of “enable” / “disable” / “require” / “activate” / “browser”.
|
571 |
+
"""),
|
572 |
Details(
|
573 |
Summary("Sample documents that are removed by original C4 javascript rule but are kept after our refinement"),
|
574 |
DV(
|
|
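Reviewer note on the "Javascript" hunk above: a minimal sketch of the refined rule, in which a line is removed only when it mentions "javascript" together with one of the listed companion keywords. Case-insensitive substring matching (so "enabled" or "required" also count) is an assumption, as are the function names.

```python
JS_COMPANION_KEYWORDS = ("enable", "disable", "require", "activate", "browser")


def is_javascript_warning(line: str) -> bool:
    """True if the line mentions javascript plus at least one companion keyword."""
    lowered = line.lower()
    return "javascript" in lowered and any(k in lowered for k in JS_COMPANION_KEYWORDS)


def remove_javascript_warnings(text: str) -> str:
    """Drop warning lines while keeping lines that merely discuss JavaScript."""
    return "\n".join(line for line in text.split("\n") if not is_javascript_warning(line))
```

Under the original C4 rule every line containing the word would be dropped; here a line such as "JavaScript closures capture variables." survives.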
|
577 |
"Sample documents that are removed by original C4 javascript rule but are kept after our refinement",
|
578 |
),
|
579 |
),
|
580 |
+
H3("2.2 Other Rules from RefinedWeb"),
|
581 |
P("""
|
582 |
We also adopt rules from RefinedWeb [3] to remove lines if they satisfy any of the following criteria:
|
583 |
"""),
|
584 |
+
Ul(
|
585 |
+
Li("the line is only composed of uppercase characters", style = "margin-bottom: 5px"),
|
586 |
+
Li("the line is only composed of numerical characters", style = "margin-bottom: 5px"),
|
587 |
+
Li("the line matches the pattern “r'^\\d+\\s+likes$'”", style = "margin-bottom: 5px"),
|
588 |
+
Li("the line contains only one word", style = "margin-bottom: 5px"),
|
589 |
+
),
|
590 |
Details(
|
591 |
Summary("Sample documents with lines that are removed by the RefinedWeb rules"),
|
592 |
DV(
|
|
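Reviewer note on the RefinedWeb rules hunk above: a sketch of the four line-level criteria as a single predicate. Using `str.isupper()` and `str.isdigit()` for the first two criteria, and leaving blank lines untouched, are assumptions about details the diff does not show; the regular expression is the one quoted in the list.

```python
import re

LIKES_PATTERN = re.compile(r"^\d+\s+likes$")


def should_remove_line(line: str) -> bool:
    """Return True if the line matches any of the four criteria listed above."""
    stripped = line.strip()
    if not stripped:
        return False  # blank lines are left alone in this sketch
    if stripped.isupper():          # only uppercase cased characters
        return True
    if stripped.isdigit():          # only numerical characters
        return True
    if LIKES_PATTERN.match(stripped):
        return True
    return len(stripped.split()) == 1  # single-word line
```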
|
595 |
"Sample documents with lines that are removed by the RefinedWeb rules",
|
596 |
),
|
597 |
),
|
598 |
+
H3("2.3 Toxic Lines"),
|
599 |
P("""
|
600 |
When doing manual inspection on the data, we found that there are some adult ads in the beginning or end of the
|
601 |
document (with a sample shown below), which are hard to remove via document-level filtering strategies. Inspired
|
|
|
611 |
),
|
612 |
),
|
613 |
|
614 |
+
H2("3. Document-Level Filtering"),
|
615 |
P("""
|
616 |
+
In this section, we introduce each quality signal used to filter out low-quality documents.
|
617 |
+
"""),
|
618 |
Details(
|
619 |
Summary("Overview of all the quality signals that are used for filtering"),
|
620 |
DVS(
|
|
|
623 |
),
|
624 |
),
|
625 |
P("""Similar to previous sections, we will present sample documents filtered out by the given quality signals.
|
626 |
+
Most quality signals were initially introduced by Gopher [2] and subsequently adopted by later
|
627 |
studies ([3], [6], [4]). However, we observed that, despite following the same descriptions, the implementation
|
628 |
of each quality signal can vary significantly among different dataset pipelines, resulting in disparate
|
629 |
outcomes for the same quality signals.
|
630 |
In our pipeline, we referenced earlier implementations that were publicly available such as Dolma [6], DataTrove [4],
|
631 |
+
and RedPajama V2 [7], and selected the most suitable method based on manual inspections.
|
632 |
"""),
|
633 |
+
H3("3.1 Repetition-based Heuristics"),
|
634 |
P("""
|
635 |
+
Many documents contain repeated sequences, potentially due to crawling errors or low-quality sources. In line with previous
|
636 |
work ([2], [3], [6]), we choose to remove any document with excessive line, paragraph, or n-gram repetitions.
|
637 |
"""),
|
638 |
+
H3("3.1.1 Fraction of (Characters in) Repeated Lines"),
|
639 |
P("""
|
640 |
+
Following Gopher [2], we remove documents containing multiple short duplicate passages, as well as those with few,
|
641 |
but longer duplicate passages. To achieve this goal, we calculate over the document both the fraction of passages
|
642 |
that are duplicates, and the fraction of characters contained within those duplicated passages.
|
643 |
"""),
|
|
|
698 |
After evaluating the implementations of Dolma and DataTrove (note: RedPajama V2 does not implement these two quality
|
699 |
signals), we have made the following decisions:
|
700 |
"""),
|
701 |
+
H3("Passage Separation"),
|
702 |
P("""
|
703 |
Our manual review of the data revealed that documents extracted using trafilatura do not feature more than one newline
|
704 |
symbol separating passages. Testing the splitting pattern "\\n{2,}" on 10,000 sample documents resulted in no more than
|
705 |
one split. Consequently, we decided to disregard the distinction between lines and paragraphs in our implementation,
|
706 |
opting instead to use a single newline symbol to segment the text into passages.
|
707 |
"""),
|
708 |
+
H3("First Occurrence"),
|
709 |
P("""
|
710 |
In line with DataTrove's implementation, we chose to exclude the first occurrence. This more conservative strategy
|
711 |
helps retain a larger number of documents.
|
712 |
"""),
|
713 |
+
H3("Character Count"),
|
714 |
P("""
|
715 |
We adjusted the method in Dolma for counting characters within lines by excluding whitespace. This modification
|
716 |
ensures consistency with the overall document character count calculation.
|
717 |
"""),
|
718 |
+
H3("TxT360 Implementation"),
|
719 |
Details(
|
720 |
Summary("TxT360 Implementation"),
|
721 |
D_code("""
|
|
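Reviewer note on the three decisions above (single-newline passage splitting, excluding the first occurrence, and counting characters without whitespace): the sketch below shows one way those choices fit together. It is not the TxT360 implementation, which is elided from this diff; the function names and the handling of blank lines are assumptions.

```python
from collections import Counter


def _non_whitespace_len(s: str) -> int:
    """Number of characters in s, ignoring whitespace."""
    return sum(1 for ch in s if not ch.isspace())


def duplicate_line_fractions(text: str):
    """Return (fraction of duplicate lines, fraction of characters in duplicate lines)."""
    lines = [line for line in text.split("\n") if line.strip()]
    if not lines:
        return 0.0, 0.0
    counts = Counter(lines)
    # The first occurrence of each line is not counted as a duplicate.
    dup_lines = sum(c - 1 for c in counts.values() if c > 1)
    total_chars = sum(_non_whitespace_len(line) for line in lines)
    dup_chars = sum(_non_whitespace_len(line) * (c - 1)
                    for line, c in counts.items() if c > 1)
    char_fraction = dup_chars / total_chars if total_chars else 0.0
    return dup_lines / len(lines), char_fraction
```

For the text `"a b\nx\na b\na b"`, two of the four lines are repeats and four of the seven non-whitespace characters sit in those repeats.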
|
743 |
"Sample documents filtered by excessive line repetitions / characters in repeated lines",
|
744 |
),
|
745 |
),
|
746 |
+
H3("3.1.2 Fraction of Characters in the Most Common N-grams (n=2,3,4)"),
|
747 |
P("""
|
748 |
Following Gopher [2], we remove documents with a high portion of repeated n-grams. For each n ∈ (2, 3, 4), we calculate the
|
749 |
fraction of characters contained within the most frequently-occurring n-gram.
|
|
|
828 |
""", block="block", language="python"),
|
829 |
),
|
830 |
P("""
|
831 |
+
There are almost no differences between the implementations of the fraction of characters in the most common
|
832 |
n-gram. The main process involves counting the occurrences of each n-gram and selecting the most common one. The
|
833 |
fraction is then determined by dividing the number of characters in the most common n-gram by the total number of
|
834 |
characters. One minor difference is that Dolma and DataTrove calculate the fraction of the most common n-gram even
|
|
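Reviewer note on the most-common n-gram hunk above: a minimal sketch of the process the paragraph describes (count n-grams, take the most common one, divide its characters by the total). Whitespace tokenization, counting only word characters, and returning 0.0 for documents shorter than n words are assumptions of this sketch, not a statement about any of the compared pipelines.

```python
from collections import Counter


def top_ngram_char_fraction(text: str, n: int) -> float:
    """Fraction of characters covered by occurrences of the most common word n-gram."""
    words = text.split()
    total_chars = sum(len(w) for w in words)
    if len(words) < n or total_chars == 0:
        return 0.0
    ngrams = zip(*(words[i:] for i in range(n)))
    ngram, count = Counter(ngrams).most_common(1)[0]
    return sum(len(w) for w in ngram) * count / total_chars
```

For "the cat the cat sat" with n=2, the bigram ("the", "cat") occurs twice and covers 12 of 15 characters.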
|
862 |
"Sample documents filtered by the fraction of characters in the most common n-grams (n=2,3,4)",
|
863 |
),
|
864 |
),
|
865 |
+
H3("3.1.3 Fraction of Characters in Duplicated N-grams (n=5,...,10)"),
|
866 |
P("""
|
867 |
Following Gopher [2], we remove documents with a high portion of repeated n-grams. For each n ∈ (5, ..., 10), we calculate the
|
868 |
fraction of characters contained within all duplicate n-grams, taking care not to count characters that occur in
|
|
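Reviewer note on the duplicated n-gram hunk above: a sketch of one way to compute the fraction of characters inside duplicated n-grams while counting each character at most once. Marking word positions in a boolean mask is an illustrative choice, and counting every occurrence of a duplicated n-gram (including the first) is an assumption; the implementation in the diff is elided.

```python
from collections import Counter


def duplicate_ngram_char_fraction(text: str, n: int) -> float:
    """Fraction of characters that lie inside any word n-gram occurring more than once."""
    words = text.split()
    total_chars = sum(len(w) for w in words)
    if len(words) < n or total_chars == 0:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    covered = [False] * len(words)
    for start, ngram in enumerate(ngrams):
        if counts[ngram] > 1:
            # Mark positions rather than summing lengths, so overlapping
            # duplicate n-grams never count a word twice.
            for j in range(start, start + n):
                covered[j] = True
    return sum(len(w) for w, hit in zip(words, covered) if hit) / total_chars
```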
|
1044 |
"Sample documents filtered by the fraction of characters in duplicated n-grams (n=5,...,10)",
|
1045 |
),
|
1046 |
),
|
1047 |
+
H3("3.2 Line-wise Heuristics"),
|
1048 |
P("""
|
1049 |
Some line-wise information could also be helpful to distinguish low-quality and high-quality documents. Following
|
1050 |
RefinedWeb [3], we remove the document if the corrected lines represent more than 5% of words. In line with previous
|
|
|
1125 |
),
|
1126 |
),
|
1127 |
|
1128 |
+
H3("3.3 Statistics-based Heuristics"),
|
1129 |
P("We summarize other statistics-based rules originating from Gopher [2] in this section. The statistics that can be used include:"),
|
1130 |
Ul(
|
1131 |
Li("the word count in the document", style = "margin-bottom: 5px"),
|
|
|
1144 |
Li("the words that contain at least one alphabetic character are less than 80% of the whole words", style = "margin-bottom: 5px"),
|
1145 |
Li("it contains fewer than two of the stop words (the, be, to, of, and, that, have, with)", style = "margin-bottom: 5px"),
|
1146 |
),
|
1147 |
+
H3("Word Count"),
|
1148 |
Details(
|
1149 |
Summary("Implementations from Dolma"),
|
1150 |
D_code("""
|
|
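Reviewer note on the last two list items above (alphabetic-word fraction and stop words): a small sketch of both checks. Whitespace tokenization, lower-casing, exact-match stop-word counting by occurrence, and the function names are assumptions; the 80% and "two stop words" thresholds and the word list come from the list itself.

```python
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}


def alphabetic_word_fraction(text: str) -> float:
    """Fraction of words containing at least one alphabetic character."""
    words = text.split()
    if not words:
        return 0.0
    return sum(1 for w in words if any(ch.isalpha() for ch in w)) / len(words)


def stop_word_count(text: str) -> int:
    """Number of word occurrences that are one of the listed stop words."""
    return sum(1 for w in text.lower().split() if w in STOP_WORDS)


def fails_word_checks(text: str) -> bool:
    """True if the document would be filtered by either of the two rules."""
    return alphabetic_word_fraction(text) < 0.8 or stop_word_count(text) < 2
```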
|
1202 |
We decided to use simple `len(text.split())` to compute the word count.
|
1203 |
"""),
|
1204 |
|
1205 |
+
H3("Mean Word Length"),
|
1206 |
P("""
|
1207 |
There is minimal variation among existing pipeline implementations. We simply compute the mean word length as follows:
|
1208 |
"""),
|
|
|
1213 |
mean_word_length = character_count / word_count
|
1214 |
""", block="block", language="python"),
|
1215 |
P("""
|
1216 |
+
It's worth noting that Dolma used the median word length instead of the mean:
|
1217 |
"""),
|
1218 |
D_code("""
|
1219 |
from statistics import median
|
1220 |
median_word_length = median(len(word) for word in words)
|
1221 |
""", block="block", language="python"),
|
1222 |
+
H3("Number of Sentences"),
|
1223 |
P("""
|
1224 |
The only publicly available implementation of this quality signal is from RedPajama V2, which uses regular expressions
|
1225 |
to split text into sentences.
|
|
|
1256 |
""", block="block", language="python"),
|
1257 |
),
|
1258 |
|
1259 |
+
H3("Symbol to Word Ratio"),
|
1260 |
P("""
|
1261 |
Following RedPajama-V2 and DataTrove, we use the symbols of ("#", "...", "…").
|
1262 |
We calculate the ratio as the number of symbols divided by the total number of words.
|
|
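Reviewer note on the symbol-to-word ratio hunk above: a minimal sketch of the ratio as described, using the three symbols named in the text. Counting non-overlapping substring occurrences with `str.count` and splitting words on whitespace are assumptions of this sketch.

```python
SYMBOLS = ("#", "...", "…")


def symbol_to_word_ratio(text: str) -> float:
    """Number of symbol occurrences divided by the number of whitespace-separated words."""
    words = text.split()
    if not words:
        return 0.0
    return sum(text.count(symbol) for symbol in SYMBOLS) / len(words)
```

For "Hello # world ... yes…" there are three symbol occurrences across five words.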
|
1318 |
""", block="block", language="python"),
|
1319 |
),
|
1320 |
|
1321 |
+
H3("Fraction of Alphabetic Words"),
|
1322 |
Details(
|
1323 |
Summary("Implementations from Dolma"),
|
1324 |
D_code("""
|
|
|
1379 |
attrs.num_of_stop_words = sum(1 for word in words if stop_words_pattern.search(word))
|
1380 |
|
1381 |
""", block="block", language="python"),
|
1382 |
+
H3("TxT360 Implementation"),
|
1383 |
Details(
|
1384 |
Summary("Sample documents that are filtered out by statistics-based heuristics"),
|
1385 |
DV(
|
|
|
1388 |
"Sample documents that are filtered out by statistics-based heuristics",
|
1389 |
),
|
1390 |
),
|
1391 |
+
H3("3.4 Others"),
|
1392 |
P("""
|
1393 |
Following C4, we remove any page where the phrase “lorem ipsum” appeared since some pages had placeholder “lorem ipsum”
|
1394 |
text.
|
|
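Reviewer note on the "lorem ipsum" hunk above: the rule reduces to a substring check. Making it case-insensitive is an assumption, and the function name is hypothetical.

```python
def contains_lorem_ipsum(text: str) -> bool:
    """True if the placeholder phrase appears anywhere in the document."""
    return "lorem ipsum" in text.lower()
```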
|
1398 |
Summary("Sample documents containing 'lorem ipsum'"),
|
1399 |
DV("data/lorem_ipsum.json", 0, "Sample documents containing 'lorem ipsum'"),
|
1400 |
),
|
1401 |
+
H2("4. Deduplication"),
|
1402 |
P("""
|
1403 |
After careful filtering, although data quality has improved, a large fraction of the content is repeated across documents. This may be due to the crawler indirectly hitting the same page multiple times, to boilerplate content being repeated (e.g., licences), or even to plagiarism. These duplicates can strongly impact models, favoring memorization instead of generalization.
|
1404 |
"""), # Add detailed content and images as needed
|
|
|
1407 |
P("To reduce the expensive cost of global deduplication, we apply a local exact deduplication before it. Specifically, each dump is split into 70 splits. A bloom filter is applied within each split."),
|
1408 |
P(B("Global Fuzzy Deduplication")),
|
1409 |
P("NEED TO UPDATE"),
|
1410 |
+
H2("5. PII Removal"),
|
1411 |
P("..."), # Add detailed content and images as needed
|
1412 |
)
|
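Reviewer note on the local exact deduplication paragraph in the final hunk: a self-contained sketch of exact deduplication within one split using a Bloom filter. The `BloomFilter` class, its bit-array size, the number of hash functions, and the use of SHA-256 are illustrative assumptions; the diff only states that a Bloom filter is applied within each of the 70 splits of a dump.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, a small chance of false positives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def dedup_split(documents, bloom=None):
    """Keep the first copy of each exact-duplicate document within one split."""
    if bloom is None:
        bloom = BloomFilter()
    kept = []
    for doc in documents:
        if doc in bloom:
            continue  # seen before (or, rarely, a false positive)
        bloom.add(doc)
        kept.append(doc)
    return kept
```

Because a Bloom filter can report false positives, a small number of unique documents may be dropped; sizing the bit array to the split keeps that rate low while avoiding the memory cost of an exact set.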