victormiller
committed on
Update web.py
web.py
CHANGED
@@ -432,7 +432,7 @@ def web_data():
|
|
432 |
P("We directly read WARC files instead of WET files and extracted text using Trafilatura. Similar to RefinedWeb, we avoid using Machine Learning (ML)-based metrics for filtering documents to prevent bias introduced by ML models. Importantly, we apply global deduplication across the entire dataset, whereas previous works only use local deduplication. Note that although The Pile also employed global deduplication on its web data (Pile-CC), this accounted for just 0.6\% of 74 snapshots."),
|
433 |
|
434 |
Details(
|
435 |
-
Summary("
|
436 |
DV2("data/sample_wet.json", "data/sample_warc.json", 3),
|
437 |
),
|
438 |
#DV2("data/sample_wet.json", "data/sample_warc.json", 3),
|
@@ -442,16 +442,16 @@ def web_data():
|
|
442 |
After text extraction, the non-English texts are then filtered out by fastText language identifier with a threshold of 0.65.
|
443 |
This step removes over 60% of the whole data.
|
444 |
"""),
|
445 |
-
|
446 |
-
|
447 |
-
|
448 |
-
|
449 |
-
|
450 |
-
|
451 |
-
|
452 |
-
|
453 |
-
|
454 |
-
|
455 |
H4("1.3 URL Filtering"),
|
456 |
P("""
|
457 |
Following RefinedWeb [3], we use a manually inspected URL blocklist to filter fraudulent and/or adult websites.
|
@@ -463,6 +463,7 @@ def web_data():
|
|
463 |
articles, sex education, technical blogs, etc. Specifically, we randomly took 903M URLs and matched them with
|
464 |
4.6M domain names in the UT1 blocklist. 24 URL domains were detected with more than 4k matches, which are shown below.
|
465 |
"""),
|
|
|
466 |
DVS(urls_high_matches, "24 URL domains with more than 4k matches"),
|
467 |
P("""
|
468 |
We manually removed the following 6 domains from the UT1 blocklist so that they will not be removed from our dataset.
|
@@ -481,11 +482,11 @@ def web_data():
|
|
481 |
non_web_urls,
|
482 |
"curated url domains that are excluded from our dataset",
|
483 |
),
|
484 |
-
|
485 |
-
|
486 |
-
|
487 |
-
|
488 |
-
|
489 |
H3("2. Line-Level Removal"),
|
490 |
P("""
|
491 |
Before computing the quality signals that can be used for filtering low-quality documents, we perform the line-level
|
|
|
432 |
P("We directly read WARC files instead of WET files and extracted text using Trafilatura. Similar to RefinedWeb, we avoid using Machine Learning (ML)-based metrics for filtering documents to prevent bias introduced by ML models. Importantly, we apply global deduplication across the entire dataset, whereas previous works only use local deduplication. Note that although The Pile also employed global deduplication on its web data (Pile-CC), this accounted for just 0.6\% of 74 snapshots."),
|
433 |
|
434 |
Details(
|
435 |
+
Summary("Text Extraction Examples"),
|
436 |
DV2("data/sample_wet.json", "data/sample_warc.json", 3),
|
437 |
),
|
438 |
#DV2("data/sample_wet.json", "data/sample_warc.json", 3),
|
|
|
442 |
After text extraction, the non-English texts are then filtered out by fastText language identifier with a threshold of 0.65.
|
443 |
This step removes over 60% of the whole data.
|
444 |
"""),
|
445 |
+
Details(
|
446 |
+
Summary("Sample documents that are classified as non-English"),
|
447 |
+
DV("data/sample_non_en.json", 3),
|
448 |
+
),
|
449 |
+
|
450 |
+
Details(
|
451 |
+
Summary("Sample documents that are classified as English but with score less than 0.65"),
|
452 |
+
DV("data/sample_en_low.json",3),
|
453 |
+
),
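The language-identification step described above amounts to a threshold filter, which can be sketched as follows. This is a minimal sketch under stated assumptions, not the pipeline's actual code: `keep_english`, `Predict`, and `toy_predict` are hypothetical names, the classifier is passed in as a callable so the example runs without a model file, and treating the 0.65 threshold as inclusive is an assumption. With fastText, the callable would typically wrap `model.predict(text)` and strip the `__label__` prefix from the returned label.

```python
from typing import Callable, Iterable, Iterator, Tuple

# A classifier returns (language_code, confidence) for one document.
Predict = Callable[[str], Tuple[str, float]]

EN_THRESHOLD = 0.65  # threshold quoted in the text above


def keep_english(docs: Iterable[str], predict: Predict,
                 threshold: float = EN_THRESHOLD) -> Iterator[str]:
    """Yield documents classified as English with score >= threshold."""
    for doc in docs:
        # fastText's Python predict handles one line at a time,
        # so classify a copy with newlines flattened to spaces.
        lang, score = predict(doc.replace("\n", " "))
        if lang == "en" and score >= threshold:
            yield doc


def toy_predict(text: str) -> Tuple[str, float]:
    """Stand-in classifier (assumption): scores by share of ASCII letters."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return "unknown", 0.0
    ascii_share = sum(c.isascii() for c in letters) / len(letters)
    if ascii_share >= 0.5:
        return "en", ascii_share
    return "other", 1 - ascii_share


# Made-up sample documents: clear English, non-English, low-score English.
docs = ["Hello world,\nthis is English.", "こんにちは世界", "abc テキ"]
kept = list(keep_english(docs, toy_predict))
```

Here the second document is dropped as non-English and the third is dropped because its English score falls below the threshold, mirroring the two sample groups shown in the `Details` blocks above.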
|
454 |
+
|
455 |
H4("1.3 URL Filtering"),
|
456 |
P("""
|
457 |
Following RefinedWeb [3], we use a manually inspected URL blocklist to filter fraudulent and/or adult websites.
|
|
|
463 |
articles, sex education, technical blogs, etc. Specifically, we randomly took 903M URLs and matched them with
|
464 |
4.6M domain names in the UT1 blocklist. 24 URL domains were detected with more than 4k matches, which are shown below.
|
465 |
"""),
|
466 |
+
|
467 |
DVS(urls_high_matches, "24 URL domains with more than 4k matches"),
|
468 |
P("""
|
469 |
We manually removed the following 6 domains from the UT1 blocklist so that they will not be removed from our dataset.
|
|
|
482 |
non_web_urls,
|
483 |
"curated url domains that are excluded from our dataset",
|
484 |
),
|
485 |
+
|
486 |
+
Details(
|
487 |
+
Summary("Sample documents whose urls are in our curated url domain list"),
|
488 |
+
DV("data/sample_url_exclusion.json", 0,),
|
489 |
+
),
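The URL-filtering procedure in section 1.3 (match URLs against a domain blocklist, inspect the domains with many matches, then exempt some of them) can be sketched as follows. This is an illustrative sketch only: `blocked_domain` and `high_match_domains` are hypothetical helpers, the sample domains are made up, and matching parent domains of a host is an assumption about how the blocklist is applied.

```python
from collections import Counter
from typing import Iterable, Optional, Set
from urllib.parse import urlsplit


def blocked_domain(url: str, blocklist: Set[str]) -> Optional[str]:
    """Return the blocklisted domain matching the URL's host, or None."""
    # URLs without a scheme have no parsed hostname and never match.
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    # Try the full host first, then each parent domain (a.b.c -> b.c -> c).
    for i in range(len(parts)):
        candidate = ".".join(parts[i:])
        if candidate in blocklist:
            return candidate
    return None


def high_match_domains(urls: Iterable[str], blocklist: Set[str],
                       min_matches: int) -> Counter:
    """Count matches per blocklisted domain; keep those above min_matches."""
    counts = Counter()
    for url in urls:
        domain = blocked_domain(url, blocklist)
        if domain is not None:
            counts[domain] += 1
    return Counter({d: n for d, n in counts.items() if n > min_matches})


# Toy data (made up for illustration, not taken from the UT1 blocklist).
blocklist = {"bad.example", "spam.example"}
allowlist = {"spam.example"}  # domains manually removed from the blocklist
urls = [
    "https://bad.example/a",
    "http://cdn.bad.example/b",
    "https://spam.example/c",
    "https://good.example/d",
]
frequent = high_match_domains(urls, blocklist, min_matches=1)
kept = [u for u in urls if blocked_domain(u, blocklist - allowlist) is None]
```

In the text, the corresponding threshold is 4k matches over a 903M-URL sample; the frequent domains are the ones worth inspecting by hand before deciding which to exempt.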
|
490 |
H3("2. Line-Level Removal"),
|
491 |
P("""
|
492 |
Before computing the quality signals that can be used for filtering low-quality documents, we perform the line-level
|