victormiller
committed on
Update web.py
web.py
CHANGED
@@ -310,8 +310,8 @@ def web_data():
|
|
310 |
),
|
311 |
#DV2("data/sample_wet.json", "data/sample_warc.json", 3),
|
312 |
|
313 |
-
|
314 |
-
P("""
|
315 |
After text extraction, non-English texts are filtered out using the fastText language identifier with a threshold of 0.65.
|
316 |
This step removes over 60% of the whole data.
|
317 |
"""),
|
@@ -347,13 +347,13 @@ def web_data():
|
|
347 |
""",
|
348 |
),
|
349 |
|
350 |
-
|
351 |
-
P("""
|
352 |
The following section details the decisions behind utilizing the UT1 blocklist. We chose to use the UT1 blocklist as a simple method for filtering
|
353 |
out potentially harmful content such as adult content. We also excluded URLs that contained the digital version of the curated data (e.g. wikipedia.org) to avoid duplication.
|
354 |
"""),
|
355 |
-
|
356 |
-
P("""
|
357 |
Following RefinedWeb [3], we manually inspected the UT1 blocklist to reduce false positives like news
|
358 |
articles, sex education, technical blogs, etc. Specifically, we randomly took 903M URLs and matched them with
|
359 |
4.6M domain names in the UT1 blocklist. Of note, 24 URLs were detected with more than 4k matches and are shown below.
|
@@ -407,8 +407,7 @@ def web_data():
|
|
407 |
""",
|
408 |
),
|
409 |
|
410 |
-
|
411 |
-
P("""
|
412 |
To avoid duplication with our high-quality curated datasets, we exclude the following domains from our dataset.
|
413 |
"""),
|
414 |
|
@@ -449,8 +448,7 @@ def web_data():
|
|
449 |
Before filtering low-quality documents, we perform line-level removal of low-quality lines.
|
450 |
This ensures that the computed quality signals align with the final kept text.
|
451 |
"""),
|
452 |
-
|
453 |
-
P("""
|
454 |
The terminal punctuation has been used in C4 [5] and Dolma [6] to remove lines that do not end with a terminal
|
455 |
punctuation mark (i.e., “.”, “?”, “!”, or “"”). However, we found it could be too aggressive to remove these
|
456 |
lines, especially when using the text extraction tool “trafilatura”.
|
@@ -481,8 +479,7 @@ def web_data():
|
|
481 |
),
|
482 |
|
483 |
|
484 |
-
|
485 |
-
P("""
|
486 |
In C4 [5], the authors remove any line with the word "Javascript" since they found that many of the scraped
|
487 |
pages contained warnings stating that Javascript should be enabled. However, this filtering strategy is too
|
488 |
strict, which will filter out many lines that are really talking about “Javascript”.
|
@@ -509,8 +506,7 @@ def web_data():
|
|
509 |
margin-bottom: 15px
|
510 |
""",
|
511 |
),
|
512 |
-
|
513 |
-
P("""
|
514 |
We also adopt rules from RefinedWeb [3] to remove lines if they satisfy any of the following criteria:
|
515 |
"""),
|
516 |
Ul(
|
@@ -536,8 +532,7 @@ def web_data():
|
|
536 |
margin-bottom: 15px
|
537 |
""",
|
538 |
),
|
539 |
-
|
540 |
-
P("""
|
541 |
When doing manual inspection on the data, we found that there are some adult ads in the beginning or end of the
|
542 |
document (with a sample shown below), which are hard to remove via document-level filtering strategies. Inspired
|
543 |
by this, we develop line-level detoxification using a bad word list from LDNOOBW (+ rule: word length < 10 + the
|
@@ -589,13 +584,11 @@ def web_data():
|
|
589 |
In our pipeline, we referenced earlier implementations that were publicly available such as Dolma [6], DataTrove [4],
|
590 |
and RedPajama V2 [7], and selected the most suitable method based on manual inspections.
|
591 |
"""),
|
592 |
-
|
593 |
-
P("""
|
594 |
Many documents contain repeated sequences, potentially due to crawling errors or low-quality sources. In line with previous
|
595 |
work ([2], [3], [6]), we choose to remove any document with excessive line, paragraph, or n-gram repetitions.
|
596 |
"""),
|
597 |
-
|
598 |
-
P("""
|
599 |
Following Gopher [2], we remove documents containing multiple, short duplicate passages, as well as those with few,
|
600 |
but longer duplicate passages. To achieve this goal, we calculate over the document both the fraction of passages
|
601 |
that are duplicates, and the fraction of characters contained within those duplicated passages.
|
@@ -675,20 +668,17 @@ def web_data():
|
|
675 |
After evaluating the implementations of Dolma and DataTrove (note: RedPajama V2 does not implement these two quality
|
676 |
signals), we have made the following decisions:
|
677 |
"""),
|
678 |
-
|
679 |
-
P("""
|
680 |
Our manual review of the data revealed that documents extracted using trafilatura do not feature more than one newline
|
681 |
symbol separating passages. Testing the splitting pattern "\\n(2,)" on 10,000 sample documents resulted in no more than
|
682 |
one split. Consequently, we decided to disregard the distinction between lines and paragraphs in our implementation,
|
683 |
opting instead to use a single newline symbol to segment the text into passages.
|
684 |
"""),
|
685 |
-
|
686 |
-
P("""
|
687 |
In line with DataTrove's implementation, we chose to exclude the first occurrence. This more conservative strategy
|
688 |
helps retain a larger number of documents.
|
689 |
"""),
|
690 |
-
|
691 |
-
P("""
|
692 |
We adjusted the method in Dolma for counting characters within lines by excluding whitespace. This modification
|
693 |
ensures consistency with the overall document character count calculation.
|
694 |
"""),
|
@@ -738,8 +728,7 @@ def web_data():
|
|
738 |
margin-bottom: 15px
|
739 |
""",
|
740 |
),
|
741 |
-
|
742 |
-
P("""
|
743 |
Following Gopher [2], we remove documents with a high proportion of repeated n-grams. For each n ∈ (2, 3, 4), we calculate the
|
744 |
fraction of characters contained within the most frequently-occurring n-gram.
|
745 |
"""),
|
@@ -902,8 +891,7 @@ def web_data():
|
|
902 |
margin-bottom: 15px
|
903 |
""",
|
904 |
),
|
905 |
-
|
906 |
-
P("""
|
907 |
Following Gopher [2], we remove documents with a high proportion of repeated n-grams. For each n ∈ (5, ..., 10), we calculate the
|
908 |
fraction of characters contained within all duplicate n-grams, taking care not to count characters that occur in
|
909 |
overlapping n-grams more than once.
|
@@ -1135,8 +1123,7 @@ def web_data():
|
|
1135 |
margin-bottom: 15px
|
1136 |
""",
|
1137 |
),
|
1138 |
-
|
1139 |
-
P("""
|
1140 |
Some line-wise information could also be helpful to distinguish low-quality and high-quality documents. Following
|
1141 |
RefinedWeb [3], we remove the document if the corrected lines represent more than 5% of words. In line with previous
|
1142 |
works ([2], [3], [6]), we remove the documents if more than 30% of the lines end with an ellipsis or more than
|
@@ -1243,8 +1230,9 @@ def web_data():
|
|
1243 |
""",
|
1244 |
),
|
1245 |
|
1246 |
-
|
1247 |
-
|
|
|
1248 |
Ul(
|
1249 |
Li("the word count in the document", style = "margin-bottom: 5px"),
|
1250 |
Li("the mean word length", style = "margin-bottom: 5px"),
|
@@ -1338,8 +1326,7 @@ def web_data():
|
|
1338 |
We decided to use simple `len(text.split())` to compute the word count.
|
1339 |
"""),
|
1340 |
|
1341 |
-
|
1342 |
-
P("""
|
1343 |
There is minimal variation among existing pipeline implementations. We simply compute the mean word length as follows:
|
1344 |
"""),
|
1345 |
D_code("""
|
@@ -1355,8 +1342,7 @@ def web_data():
|
|
1355 |
from statistics import median
|
1356 |
median_word_length = median(len(word) for word in words)
|
1357 |
""", block="block", language="python"),
|
1358 |
-
|
1359 |
-
P("""
|
1360 |
The only publicly available implementation of this quality signal is from RedPajama V2, which uses regular expressions
|
1361 |
to split text into sentences.
|
1362 |
"""),
|
@@ -1410,8 +1396,7 @@ def web_data():
|
|
1410 |
""",
|
1411 |
),
|
1412 |
|
1413 |
-
|
1414 |
-
P("""
|
1415 |
Following RedPajama-V2 and DataTrove, we use the symbols of ("#", "...", "…").
|
1416 |
We calculate the ratio as the number of symbols divided by the total number of words.
|
1417 |
"""),
|
@@ -1584,8 +1569,7 @@ def web_data():
|
|
1584 |
RedPajama-V2 employs regular expressions for this purpose. We opt to use regular expressions since `char.isalpha()`
|
1585 |
can also match words in other languages as long as they are not punctuations.
|
1586 |
"""),
|
1587 |
-
|
1588 |
-
P("""
|
1589 |
The implementations across existing pipelines are largely identical. We adopt them and apply them to our pipeline.
|
1590 |
"""),
|
1591 |
D_code("""
|
@@ -1614,8 +1598,7 @@ def web_data():
|
|
1614 |
margin-bottom: 15px
|
1615 |
""",
|
1616 |
),
|
1617 |
-
|
1618 |
-
P("""
|
1619 |
Following C4, we remove any page where the phrase “lorem ipsum” appeared since some pages had placeholder “lorem ipsum”
|
1620 |
text.
|
1621 |
"""),
|
@@ -1633,15 +1616,4 @@ def web_data():
|
|
1633 |
margin-bottom: 15px
|
1634 |
""",
|
1635 |
),
|
1636 |
-
H2("4. Deduplication"),
|
1637 |
-
P("""
|
1638 |
-
After careful filtering, although data quality has improved, a large fraction of the content is repeated across documents. This may be due to the crawler indirectly hitting the same page multiple times, to boilerplate content being repeated (e.g., licences), or even to plagiarism. These duplicates can strongly impact models, favoring memorization instead of generalization.
|
1639 |
-
"""), # Add detailed content and images as needed
|
1640 |
-
P("We perform two-level deduplication: local exact deduplication and global fuzzy deduplication"),
|
1641 |
-
P(B("Local Exact Deduplication")),
|
1642 |
-
P("To reduce the expensive cost of global deduplication, we apply a local exact deduplication before it. Specifically, each dump is split into 70 splits. A bloom filter is applied within each split."),
|
1643 |
-
P(B("Global Fuzzy Deduplication")),
|
1644 |
-
P("NEED TO UPDATE"),
|
1645 |
-
H2("5. PII Removal"),
|
1646 |
-
P("..."), # Add detailed content and images as needed
|
1647 |
)
|
|
|
310 |
),
|
311 |
#DV2("data/sample_wet.json", "data/sample_warc.json", 3),
|
312 |
|
313 |
+
|
314 |
+
P(B("Language Identification: "), """
|
315 |
After text extraction, non-English texts are filtered out using the fastText language identifier with a threshold of 0.65.
|
316 |
This step removes over 60% of the whole data.
|
317 |
"""),
|
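The thresholding step described in the Language Identification paragraph could be sketched as follows. This is a minimal illustration, not the pipeline's code: `keep_english` and the `predict` callable are hypothetical names, the callable stands in for a fastText model wrapper that returns a language code and a confidence, and treating the 0.65 threshold as inclusive is an assumption.

```python
from typing import Callable, Tuple

# Hypothetical interface: maps a text to (language_code, confidence).
# In practice this would wrap a fastText language-identification model.
Predictor = Callable[[str], Tuple[str, float]]


def keep_english(text: str, predict: Predictor, threshold: float = 0.65) -> bool:
    """Return True if the text is predicted English with enough confidence."""
    # fastText predicts one line at a time, so newlines are flattened first.
    lang, score = predict(text.replace("\n", " "))
    # Assumption: a score exactly at the threshold is kept.
    return lang == "en" and score >= threshold
```

Passing the predictor as an argument keeps the sketch independent of any particular model file or label format.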
|
|
347 |
""",
|
348 |
),
|
349 |
|
350 |
+
|
351 |
+
P(B("URL Filtering: "), """
|
352 |
The following section details the decisions behind utilizing the UT1 blocklist. We chose to use the UT1 blocklist as a simple method for filtering
|
353 |
out potentially harmful content such as adult content. We also excluded URLs that contained the digital version of the curated data (e.g. wikipedia.org) to avoid duplication.
|
354 |
"""),
|
355 |
+
|
356 |
+
P(B("URL Blocklist: "), """
|
357 |
Following RefinedWeb [3], we manually inspected the UT1 blocklist to reduce false positives like news
|
358 |
articles, sex education, technical blogs, etc. Specifically, we randomly took 903M URLs and matched them with
|
359 |
4.6M domain names in the UT1 blocklist. Of note, 24 URLs were detected with more than 4k matches and are shown below.
|
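Matching URLs against a domain blocklist, as described in the URL Blocklist paragraph, might look like the sketch below. The function name `is_blocked` is hypothetical, and matching subdomains against their parent domains is an assumption rather than something the diff confirms.

```python
from urllib.parse import urlsplit


def is_blocked(url: str, blocked_domains: set[str]) -> bool:
    """Return True if the URL's host, or any parent domain of it, is blocklisted."""
    # urlsplit only finds a hostname when the URL includes a scheme such as https://.
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    # Assumption: a.b.example.com is also checked as b.example.com and example.com.
    candidates = (".".join(parts[i:]) for i in range(len(parts)))
    return any(candidate in blocked_domains for candidate in candidates)
```

Storing the blocklist in a set keeps each lookup constant-time, which matters when matching hundreds of millions of URLs against millions of domains.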
|
|
407 |
""",
|
408 |
),
|
409 |
|
410 |
+
P(B("Excluded High Quality Sources: "), """
|
|
|
411 |
To avoid duplication with our high-quality curated datasets, we exclude the following domains from our dataset.
|
412 |
"""),
|
413 |
|
|
|
448 |
Before filtering low-quality documents, we perform line-level removal of low-quality lines.
|
449 |
This ensures that the computed quality signals align with the final kept text.
|
450 |
"""),
|
451 |
+
P(B("Terminal Punctuation: "), """
|
|
|
452 |
The terminal punctuation has been used in C4 [5] and Dolma [6] to remove lines that do not end with a terminal
|
453 |
punctuation mark (i.e., “.”, “?”, “!”, or “"”). However, we found it could be too aggressive to remove these
|
454 |
lines, especially when using the text extraction tool “trafilatura”.
|
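For reference, the C4-style terminal punctuation rule that the paragraph calls too aggressive can be sketched as below. This shows only the baseline rule with the four marks listed in the text; the function names are hypothetical, and the relaxed variant the authors adopt is not shown in this hunk.

```python
# The four terminal marks listed in the paragraph.
TERMINAL_PUNCTUATION = (".", "?", "!", '"')


def ends_with_terminal_punctuation(line: str) -> bool:
    """Return True if the line, ignoring trailing whitespace, ends with a terminal mark."""
    return line.rstrip().endswith(TERMINAL_PUNCTUATION)


def drop_unterminated_lines(text: str) -> str:
    """Keep only the lines that end with terminal punctuation."""
    kept = [line for line in text.split("\n") if ends_with_terminal_punctuation(line)]
    return "\n".join(kept)
```

Applied to extracted web text, this drops headings, list items, and navigation fragments along with genuinely low-quality lines, which illustrates why the rule can remove too much.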
|
|
479 |
),
|
480 |
|
481 |
|
482 |
+
P(B('Word "Javascript": '), """
|
|
|
483 |
In C4 [5], the authors remove any line with the word "Javascript" since they found that many of the scraped
|
484 |
pages contained warnings stating that Javascript should be enabled. However, this filtering strategy is too
|
485 |
strict, which will filter out many lines that are really talking about “Javascript”.
|
|
|
506 |
margin-bottom: 15px
|
507 |
""",
|
508 |
),
|
509 |
+
P(B("Other Rules from RefinedWeb: "), """
|
|
|
510 |
We also adopt rules from RefinedWeb [3] to remove lines if they satisfy any of the following criteria:
|
511 |
"""),
|
512 |
Ul(
|
|
|
532 |
margin-bottom: 15px
|
533 |
""",
|
534 |
),
|
535 |
+
P(B("Toxic Lines: "), """
|
|
|
536 |
When doing manual inspection on the data, we found that there are some adult ads in the beginning or end of the
|
537 |
document (with a sample shown below), which are hard to remove via document-level filtering strategies. Inspired
|
538 |
by this, we develop line-level detoxification using a bad word list from LDNOOBW (+ rule: word length < 10 + the
|
|
|
584 |
In our pipeline, we referenced earlier implementations that were publicly available such as Dolma [6], DataTrove [4],
|
585 |
and RedPajama V2 [7], and selected the most suitable method based on manual inspections.
|
586 |
"""),
|
587 |
+
P(B("Repetition-based Heuristics: "), """
|
|
|
588 |
Many documents contain repeated sequences, potentially due to crawling errors or low-quality sources. In line with previous
|
589 |
work ([2], [3], [6]), we choose to remove any document with excessive line, paragraph, or n-gram repetitions.
|
590 |
"""),
|
591 |
+
P(B("Fraction of Characters in Repeated Lines: "), """
|
|
|
592 |
Following Gopher [2], we remove documents containing multiple, short duplicate passages, as well as those with few,
|
593 |
but longer duplicate passages. To achieve this goal, we calculate over the document both the fraction of passages
|
594 |
that are duplicates, and the fraction of characters contained within those duplicated passages.
|
|
|
668 |
After evaluating the implementations of Dolma and DataTrove (note: RedPajama V2 does not implement these two quality
|
669 |
signals), we have made the following decisions:
|
670 |
"""),
|
671 |
+
P(B("Passage Separation: "), """
|
|
|
672 |
Our manual review of the data revealed that documents extracted using trafilatura do not feature more than one newline
|
673 |
symbol separating passages. Testing the splitting pattern "\\n(2,)" on 10,000 sample documents resulted in no more than
|
674 |
one split. Consequently, we decided to disregard the distinction between lines and paragraphs in our implementation,
|
675 |
opting instead to use a single newline symbol to segment the text into passages.
|
676 |
"""),
|
677 |
+
P(B("First Occurrence: "), """
|
|
|
678 |
In line with DataTrove's implementation, we chose to exclude the first occurrence. This more conservative strategy
|
679 |
helps retain a larger number of documents.
|
680 |
"""),
|
681 |
+
P(B("Character Count: "), """
|
|
|
682 |
We adjusted the method in Dolma for counting characters within lines by excluding whitespace. This modification
|
683 |
ensures consistency with the overall document character count calculation.
|
684 |
"""),
|
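The three decisions above (single-newline passages, excluding the first occurrence, and counting non-whitespace characters) can be combined in a short sketch. The function name `duplicate_line_fractions` is hypothetical, and skipping empty lines and comparing lines after stripping surrounding whitespace are assumptions.

```python
def duplicate_line_fractions(text: str) -> tuple[float, float]:
    """Return (fraction of duplicate lines, fraction of characters in duplicate lines)."""
    # Passages are split on a single newline; empty lines are skipped (assumption).
    lines = [line for line in text.split("\n") if line.strip()]
    if not lines:
        return 0.0, 0.0

    def char_count(line: str) -> int:
        # Characters are counted with all whitespace excluded.
        return len("".join(line.split()))

    seen = set()
    duplicate_lines = 0
    duplicate_chars = 0
    for line in lines:
        key = line.strip()
        if key in seen:
            # Only repeats count; the first occurrence is excluded.
            duplicate_lines += 1
            duplicate_chars += char_count(line)
        else:
            seen.add(key)

    total_chars = sum(char_count(line) for line in lines)
    return duplicate_lines / len(lines), duplicate_chars / total_chars
```

Excluding the first occurrence means a line repeated twice contributes one duplicate rather than two, which lowers both fractions and retains more documents.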
|
|
728 |
margin-bottom: 15px
|
729 |
""",
|
730 |
),
|
731 |
+
P(B("Fraction of Characters in the Most Common N-grams (n=2,3,4): "), """
|
|
|
732 |
Following Gopher [2], we remove documents with a high proportion of repeated n-grams. For each n ∈ (2, 3, 4), we calculate the
|
733 |
fraction of characters contained within the most frequently-occurring n-gram.
|
734 |
"""),
|
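A minimal sketch of the most-common n-gram signal follows. It assumes whitespace tokenization and non-whitespace character counts, and it treats an n-gram that occurs only once as contributing nothing; these choices, and the name `top_ngram_char_fraction`, are illustrative rather than taken from the pipeline.

```python
from collections import Counter


def top_ngram_char_fraction(text: str, n: int) -> float:
    """Fraction of characters covered by the most frequent word n-gram."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    top_ngram, count = ngrams.most_common(1)[0]
    if count < 2:
        # Assumption: an n-gram seen only once is not a repetition.
        return 0.0
    top_chars = sum(len(word) for word in top_ngram) * count
    total_chars = sum(len(word) for word in words)
    # Overlapping occurrences are counted in full, so the value can exceed 1.
    return top_chars / total_chars
```

For the text "the cat the cat sat" with n=2, the bigram ("the", "cat") occurs twice and covers 12 of 15 characters, giving 0.8.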
|
|
891 |
margin-bottom: 15px
|
892 |
""",
|
893 |
),
|
894 |
+
P(B("Fraction of Characters in Duplicated N-grams (n=5,...,10): "), """
|
|
|
895 |
Following Gopher [2], we remove documents with a high proportion of repeated n-grams. For each n ∈ (5, ..., 10), we calculate the
|
896 |
fraction of characters contained within all duplicate n-grams, taking care not to count characters that occur in
|
897 |
overlapping n-grams more than once.
|
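One way to avoid counting overlapping characters twice is to mark word positions covered by any repeated n-gram and sum their lengths once. The sketch below assumes whitespace tokenization, non-whitespace character counts, and that only repeat occurrences are marked; the name `duplicate_ngram_char_fraction` is hypothetical.

```python
def duplicate_ngram_char_fraction(text: str, n: int) -> float:
    """Fraction of characters inside repeated word n-grams, each character counted once."""
    words = text.split()
    if len(words) < n:
        return 0.0
    seen = set()
    covered = [False] * len(words)
    for i in range(len(words) - n + 1):
        ngram = tuple(words[i:i + n])
        if ngram in seen:
            # Mark positions instead of adding lengths, so overlaps are not double-counted.
            for j in range(i, i + n):
                covered[j] = True
        else:
            seen.add(ngram)
    duplicate_chars = sum(len(word) for word, hit in zip(words, covered) if hit)
    total_chars = sum(len(word) for word in words)
    return duplicate_chars / total_chars
```

Because coverage is tracked per word position, the result always stays between 0 and 1, however heavily the repeated n-grams overlap.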
|
|
1123 |
margin-bottom: 15px
|
1124 |
""",
|
1125 |
),
|
1126 |
+
P(B("Line-wise Heuristics: "), """
|
|
|
1127 |
Some line-wise information could also be helpful to distinguish low-quality and high-quality documents. Following
|
1128 |
RefinedWeb [3], we remove the document if the corrected lines represent more than 5% of words. In line with previous
|
1129 |
works ([2], [3], [6]), we remove the documents if more than 30% of the lines end with an ellipsis or more than
|
|
|
1230 |
""",
|
1231 |
),
|
1232 |
|
1233 |
+
P(B("Statistics-based Heuristics: "), """
|
1234 |
+
We summarize other statistics-based rules originating from Gopher [2] in this section. The statistics that can be used include:
|
1235 |
+
"""),
|
1236 |
Ul(
|
1237 |
Li("the word count in the document", style = "margin-bottom: 5px"),
|
1238 |
Li("the mean word length", style = "margin-bottom: 5px"),
|
|
|
1326 |
We decided to use simple `len(text.split())` to compute the word count.
|
1327 |
"""),
|
1328 |
|
1329 |
+
P(B("Mean Word Length: "), """
|
|
|
1330 |
There is minimal variation among existing pipeline implementations. We simply compute the mean word length as follows:
|
1331 |
"""),
|
1332 |
D_code("""
|
|
|
1342 |
from statistics import median
|
1343 |
median_word_length = median(len(word) for word in words)
|
1344 |
""", block="block", language="python"),
|
1345 |
+
P(B("Number of Sentences: "), """
|
|
|
1346 |
The only publicly available implementation of this quality signal is from RedPajama V2, which uses regular expressions
|
1347 |
to split text into sentences.
|
1348 |
"""),
|
|
|
1396 |
""",
|
1397 |
),
|
1398 |
|
1399 |
+
P(B("Symbol to Word Ratio: "), """
|
|
|
1400 |
Following RedPajama-V2 and DataTrove, we use the symbols of ("#", "...", "…").
|
1401 |
We calculate the ratio as the number of symbols divided by the total number of words.
|
1402 |
"""),
|
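The ratio described above can be sketched in a few lines. Whitespace tokenization for the word count and non-overlapping substring counting for the symbols are assumptions, and `symbol_to_word_ratio` is a hypothetical name.

```python
# Symbols listed in the paragraph: hash, three-dot ellipsis, Unicode ellipsis.
SYMBOLS = ("#", "...", "…")


def symbol_to_word_ratio(text: str) -> float:
    """Number of symbol occurrences divided by the number of whitespace-separated words."""
    words = text.split()
    if not words:
        return 0.0
    # str.count counts non-overlapping occurrences of each symbol.
    symbol_count = sum(text.count(symbol) for symbol in SYMBOLS)
    return symbol_count / len(words)
```

For "Click here... #sale #now" this gives 3 symbols over 4 words, or 0.75.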
|
|
1569 |
RedPajama-V2 employs regular expressions for this purpose. We opt to use regular expressions since `char.isalpha()`
|
1570 |
can also match words in other languages as long as they are not punctuations.
|
1571 |
"""),
|
1572 |
+
P(B("Number of Stop Words: "), """
|
|
|
1573 |
The implementations across existing pipelines are largely identical. We adopt them and apply them to our pipeline.
|
1574 |
"""),
|
1575 |
D_code("""
|
|
|
1598 |
margin-bottom: 15px
|
1599 |
""",
|
1600 |
),
|
1601 |
+
P(B("Additional Filters: "), """
|
|
|
1602 |
Following C4, we remove any page where the phrase “lorem ipsum” appeared since some pages had placeholder “lorem ipsum”
|
1603 |
text.
|
1604 |
"""),
|
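The “lorem ipsum” rule amounts to a substring check. In the sketch below, the case-insensitive comparison is an assumption and `contains_lorem_ipsum` is a hypothetical name.

```python
def contains_lorem_ipsum(text: str) -> bool:
    """Return True if the placeholder phrase appears anywhere in the text."""
    # Assumption: matching ignores case, so "Lorem Ipsum" is also caught.
    return "lorem ipsum" in text.lower()
```

A document for which this returns True would be removed entirely, in line with the page-level rule from C4.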
|
|
1616 |
margin-bottom: 15px
|
1617 |
""",
|
1618 |
),
|
1619 |
)
|