Commit 370837e
Parent(s): 2783986
updates

web.py CHANGED
@@ -272,9 +272,10 @@ def web_data():
 of 56,292 additional lines, resulting in the complete exclusion of 2,203 documents from a total of 13,560
 documents (16.25%). Accordingly, we choose to not use terminal punctuation as a signal to remove lines.
 """),
-
-
-
+view_data(
+    "data/sample_terminal_punc.json",
+    0,
+    "Sample documents with lines that are removed by the rule of terminal punctuation",
 ),
 H4('2.1 Word "Javascript"'),
 P("""
@@ -284,9 +285,10 @@ def web_data():
 propose to refine the strategy by adding one more keyword to the word "javascript" to avoid false positives.
 The additional keyword could be any one of “enable” / “disable” / “require” / “activate” / “browser”.
 """),
-
-
-
+view_data(
+    "data/sample_java.jsonl",
+    0,
+    "Sample documents that are removed by original C4 javascript rule but are kept after our refinement",
 ),
 H4("2.2 Other Rules from RefinedWeb"),
 P("""
@@ -296,9 +298,10 @@ def web_data():
 - The line matches the pattern “r'^\\d+\\s+likes$'”,
 - The line contains only one word.
 """),
-
-
-
+view_data(
+    "data/sample_refinedweb_line.json",
+    0,
+    "Sample documents with lines that are removed by the RefinedWeb rules",
 ),
 H4("2.3 Toxic Lines"),
 P("""
@@ -308,15 +311,19 @@ def web_data():
 line is in the first 3 lines or in the last 3 lines) to remove toxic lines. Specifically, we do not only consider
 the bad words from English but also consider the bad words from other languages.
 """),
-
-
-
+view_data_static(
+    json.load(open("data/toxic_lines.json")),
+    "Sample documents with toxic lines",
 ),
 H3("3. Document-Level Filtering"),
 P("""
 In this section, we introduce all the quality signals that we have used to filter out low-quality documents.
-Overview of all the quality signals that are used for filtering.
-
+Overview of all the quality signals that are used for filtering."""),
+view_data_static(
+    json.load(open("data/all_signals.json")),
+    "Overview of all the quality signals that are used for filtering",
+),
+P("""Similar to previous sections, we will present sample documents filtered out by the given quality signals.
 Most of these quality signals were initially introduced by Gopher [2] and subsequently adopted by later
 studies ([3], [6], [4]). However, we observed that, despite following the same descriptions, the implementation
 of each quality signal can vary significantly among different dataset pipelines, resulting in disparate
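
Note on the line-level rules described in the text around these hunks: the refined "javascript" rule (section 2.1) and the two RefinedWeb conditions visible in the diff context (section 2.2) could be sketched roughly as below. This is an illustrative sketch only, not code from web.py or from the filtering pipeline. The function names are hypothetical, and the case-insensitive substring matching and whitespace stripping are assumptions that the text does not confirm.

```python
import re

# Companion keywords listed in section 2.1 of the diff context.
JS_COMPANION_KEYWORDS = ("enable", "disable", "require", "activate", "browser")

# Pattern quoted in section 2.2, e.g. a like counter such as "12 likes".
LIKES_PATTERN = re.compile(r"^\d+\s+likes$")


def is_javascript_notice(line: str) -> bool:
    """Refined rule: 'javascript' must co-occur with a companion keyword.

    Assumption: matching is case-insensitive and substring-based.
    """
    lowered = line.lower()
    return "javascript" in lowered and any(k in lowered for k in JS_COMPANION_KEYWORDS)


def is_refinedweb_noise(line: str) -> bool:
    """Only the two conditions visible in the diff: like counter or single word.

    Assumption: surrounding whitespace is stripped before testing.
    """
    stripped = line.strip()
    if LIKES_PATTERN.match(stripped):
        return True
    return len(stripped.split()) == 1


def filter_lines(text: str) -> str:
    """Drop every line flagged by either rule and rejoin the rest."""
    kept = [
        line
        for line in text.splitlines()
        if not is_javascript_notice(line) and not is_refinedweb_noise(line)
    ]
    return "\n".join(kept)
```

Under these assumptions, a line such as "Please enable JavaScript to continue." is dropped, while a line that merely mentions JavaScript as a topic is kept. That is the false-positive case the refinement is meant to avoid.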