omkarenator committed
Commit 09bef6a • Parent(s): 0698fac

move to spa, smooth srolling

Files changed:
- common.py +8 -8
- curated.py +4 -24
- main.py +34 -92
- results.py +4 -4
- style.css +4 -0
- web.py +5 -5
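
This commit converts the report from a set of HTMX-swapped pages into a single-page app: main() now renders intro(), curated.curated(), web.web_data(), common.common_steps(), and results.results() together; every Section receives a stable, globally unique id (section11 through section54); the sidebar links drop their hx_get/hx_target attributes in favor of plain #section anchors; the per-page routes are removed; and style.css turns on smooth scrolling so the anchor jumps animate.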
common.py CHANGED

@@ -299,7 +299,7 @@ global_div = Div(
             ),
             Li("Normalization Form C Discussion", style="margin-bottom: 5px"),
         ),
-        id="
+        id="section41",
     ),
     Section(
         H2("Motivation Behind Global Deduplication"),
@@ -331,7 +331,7 @@ global_div = Div(
         P(
             "Additionally, we maintained statistics about each matching cluster as it was formed during the final stage of deduplication. Below are the details of all four stages of our deduplication pipeline. We use Dask extensively throughout all stages of deduplication. We have included the on-disk size of each stage's results to give a sense of the scale:"
         ),
-        id="
+        id="section42",
     ),
     Section(
         H3("MinHash Generation"),
@@ -339,7 +339,7 @@ global_div = Div(
             "We use the datasketch library to generate MinHash signatures with the number of permutations set to 128. Each signature is represented as a MinHash object, one per document. Before calculating the signature, the text is cleaned by stripping whitespace, converting to lowercase, and removing punctuation, consecutive spaces, newlines, and tabs. Next, a list of 13-grams is generated to use as features for creating the document signature. The globally unique document IDs and signatures are then saved to disk. The document ID is produced by an encoding scheme that converts file names and line numbers (there is one document per line) to unique IDs, which also saved substantial disk space and memory at this stage."
         ),
         P(B("This step produced 20 TB of hashes.")),
-        id="
+        id="section43",
     ),
     Section(
         H3("Matching Pairs Generation"),
@@ -351,7 +351,7 @@ global_div = Div(
         ),
         D_code(dask_algo, block="block", language="python"),
         P(B("This step produced 9.2 TB of matching pairs from all bands.")),
-        id="
+        id="section44",
     ),
     Section(
         H3("Finding Duplicate Pairs"),
@@ -369,7 +369,7 @@ global_div = Div(
             "The resulting unique pairs are then used to identify clusters of near-duplicates by finding connected components in a graph, where the vertices represent documents and the edges represent matches."
         ),
         P(B("This step produced 1.9 TB of unique pairs.")),
-        id="
+        id="section45",
     ),
     Section(
         H3("Finding Connected Components using MapReduce"),
@@ -389,7 +389,7 @@ global_div = Div(
             "Below is the distribution of duplicate documents found across different snapshots of CommonCrawl. The distribution is skewed to the right because documents are bucketed by the dump ID of the retained document, and we prefer documents from higher dump IDs."
         ),
         plotly2fasthtml(dup_docs_count_graph()),
-        id="
+        id="section46",
     ),
     Section(
         H3("Analysis of Near-Duplicate Clusters"),
@@ -434,7 +434,7 @@ global_div = Div(
                 style="list-style-type: none",
             ),
         ),
-        id="
+        id="section47",
     ),
     Section(
         H2("Normalization Form C"),
@@ -454,7 +454,7 @@ global_div = Div(
                 style="list-style-type: none",
            )
        ),  # "background-color= gray" "color= blue" maybe add this later
-        id="
+        id="section48",
     ),
     Section(
         H3("NFC Examples"),
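
The MinHash Generation paragraph above is concrete enough to sketch. The snippet below is a hypothetical reconstruction, not the repository's code: it assumes the datasketch library the text names, 128 permutations, and word-level 13-grams (the post does not say whether the n-grams are over words or characters).

    import re
    import string

    from datasketch import MinHash  # the library named in the section


    def clean(text: str) -> str:
        # Strip, lowercase, drop punctuation, and collapse spaces/newlines/tabs,
        # mirroring the cleaning steps the paragraph lists.
        text = text.strip().lower()
        text = text.translate(str.maketrans("", "", string.punctuation))
        return re.sub(r"\s+", " ", text)


    def minhash_signature(text: str, num_perm: int = 128, n: int = 13) -> MinHash:
        # Shingle the cleaned text into 13-grams and feed them to MinHash.
        words = clean(text).split()
        shingles = {" ".join(words[i : i + n]) for i in range(max(1, len(words) - n + 1))}
        m = MinHash(num_perm=num_perm)
        for s in shingles:
            m.update(s.encode("utf8"))
        return m

Likewise, the final stage treats documents as vertices and matching pairs as edges, then extracts connected components. The pipeline does this with MapReduce over 1.9 TB of pairs; as a single-machine stand-in (a deliberate simplification, not the section's MapReduce algorithm), a union-find over an iterable of pairs yields the same clusters:

    def clusters_from_pairs(pairs):
        # Union-find with path halving: each connected component of the
        # document-pair graph is one near-duplicate cluster.
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        for a, b in pairs:
            parent[find(a)] = find(b)

        clusters = {}
        for doc in list(parent):
            clusters.setdefault(find(doc), []).append(doc)
        return list(clusters.values())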
curated.py CHANGED

@@ -1554,27 +1554,7 @@ table_html_data_pipe = data_pipeline_table.to_html(index=False, border=0)
 table_div_data_pipe = Div(NotStr(table_html_data_pipe), style="margin: 40px;")
 
 
-def
-    params = request.query_params
-    if data_source := params.get(f"data_source_{target}"):
-        return get_data(data_source, params.get(f"doc_id_{target}", 3), target)
-    if doc_id := params.get(f"doc_id_{target}"):
-        return get_data(params.get(f"data_source_{target}"), doc_id, target)
-
-
-def curated(request):
-    # Partial Updates
-    params = dict(request.query_params)
-    if target := params.get("target"):
-        if data_source := params.get(f"data_source_{target}"):
-            return get_data(
-                data_source, params.get(f"doc_id_{target}", 3), params.get("target")
-            )
-        if doc_id := params.get(f"doc_id_{target}"):
-            return get_data(
-                params.get(f"data_source_{target}"), doc_id, params.get("target")
-            )
-
+def curated():
     data_preparation_steps = pd.DataFrame(
         {
             "Method": [
@@ -1623,15 +1603,15 @@ def curated(request):
         Section(
             curated_sources_intro,
             plotly2fasthtml(treemap_chart),
-            id="
+            id="section31",
         ),
         Section(
             data_preprocessing_div,
-            id="
+            id="section32",
        ),
         Section(
             filtering_process,
-            id="
+            id="section33",
         ),
         id="inner-text",
     )
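
With the move to a single page, curated() no longer takes the request object: the partial-update dispatch deleted above (including the helper whose signature is truncated in the diff view) is redundant, because main.py keeps routing data-viewer fragment updates through rt("/update/{target}")(data_viewer.update). The page function now simply builds and returns its sections.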
main.py CHANGED

@@ -175,9 +175,7 @@ def main():
             Div(
                 A(
                     "TxT360",
-                    href="
-                    hx_get="/intro#section1",
-                    hx_target="#inner-text",
+                    href="#section1",
                 )
             ),
             Div(
@@ -185,25 +183,19 @@ def main():
                 Li(
                     A(
                         "About TxT360",
-                        href="
-                        hx_get="/intro#section1",
-                        hx_target="#inner-text",
+                        href="#section11",
                     )
                 ),
                 Li(
                     A(
                         "Motivation Behind TxT360",
-                        href="
-                        hx_get="/intro#section2",
-                        hx_target="#inner-text",
+                        href="#section12",
                     )
                 ),
                 Li(
                     A(
                         "Generalizable Approach to Data Processing",
-                        href="
-                        hx_get="/intro#section3",
-                        hx_target="#inner-text",
+                        href="#section13",
                     )
                 ),
             ),
@@ -211,9 +203,7 @@ def main():
             Div(
                 A(
                     "Web Data Processing",
-                    href="
-                    hx_get="/webdata#section1",
-                    hx_target="#inner-text",
+                    href="#section21",
                 )
             ),
             Div(
@@ -221,41 +211,31 @@ def main():
                 Li(
                     A(
                         "Common Crawl Snapshot Processing",
-                        href="
-                        hx_get="/webdata#section1",
-                        hx_target="#inner-text",
+                        href="#section21",
                     )
                 ),
                 Li(
                     A(
                         "Common Crawl Data Processing Summary",
-                        href="
-                        hx_get="/webdata#section2",
-                        hx_target="#inner-text",
+                        href="#section22",
                     )
                 ),
                 Li(
                     A(
                         "Document Preparation",
-                        href="
-                        hx_get="/webdata#section3",
-                        hx_target="#inner-text",
+                        href="#section23",
                     )
                 ),
                 Li(
                     A(
                         "Line-Level Removal",
-                        href="
-                        hx_get="/webdata#section4",
-                        hx_target="#inner-text",
+                        href="#section24",
                     )
                 ),
                 Li(
                     A(
                         "Document-Level Filtering",
-                        href="
-                        hx_get="/webdata#section5",
-                        hx_target="#inner-text",
+                        href="#section25",
                     )
                 ),
             ),
@@ -263,9 +243,7 @@ def main():
             Div(
                 A(
                     "Curated Sources Processing",
-                    href="
-                    hx_get="/curated#section1",
-                    hx_target="#inner-text",
+                    href="#section31",
                 )
             ),
             Div(
@@ -273,25 +251,19 @@ def main():
                 Li(
                     A(
                         "Curated Sources in TxT360",
-                        href="
-                        hx_get="/curated#section1",
-                        hx_target="#inner-text",
+                        href="#section31",
                     )
                 ),
                 Li(
                     A(
                         "Filtering Steps and Definitions",
-                        href="
-                        hx_get="/curated#section2",
-                        hx_target="#inner-text",
+                        href="#section32",
                     )
                 ),
                 Li(
                     A(
                         "Filtering Discussion on All Curated Sources",
-                        href="
-                        hx_get="/curated#section3",
-                        hx_target="#inner-text",
+                        href="#section33",
                     )
                 ),
             ),
@@ -299,9 +271,7 @@ def main():
             Div(
                 A(
                     "Shared Processing Steps",
-                    href="
-                    hx_get="/common#section1",
-                    hx_target="#inner-text",
+                    href="#section41",
                 )
             ),
             Div(
@@ -309,65 +279,49 @@ def main():
                 Li(
                     A(
                         "Overview",
-                        href="
-                        hx_get="/common#section1",
-                        hx_target="#inner-text",
+                        href="#section41",
                     )
                 ),
                 Li(
                     A(
                         "Motivation Behind Global Deduplication",
-                        href="
-                        hx_get="/common#section2",
-                        hx_target="#inner-text",
+                        href="#section42",
                     )
                 ),
                 Li(
                     A(
                         "MinHash Generation",
-                        href="
-                        hx_get="/common#section3",
-                        hx_target="#inner-text",
+                        href="#section43",
                     )
                 ),
                 Li(
                     A(
                         "Matching Pairs Generation",
-                        href="
-                        hx_get="/common#section4",
-                        hx_target="#inner-text",
+                        href="#section44",
                     )
                 ),
                 Li(
                     A(
                         "Finding Duplicate Pairs",
-                        href="
-                        hx_get="/common#section5",
-                        hx_target="#inner-text",
+                        href="#section45",
                     )
                 ),
                 Li(
                     A(
                         "Finding Connected Components using MapReduce",
-                        href="
-                        hx_get="/common#section6",
-                        hx_target="#inner-text",
+                        href="#section46",
                     )
                 ),
                 Li(
                     A(
                         "Personally Identifiable Information Removal",
-                        href="
-                        hx_get="/common#section7",
-                        hx_target="#inner-text",
+                        href="#section47",
                     )
                 ),
                 Li(
                     A(
                         "Normalization Form C",
-                        href="
-                        hx_get="/common#section8",
-                        hx_target="#inner-text",
+                        href="#section48",
                     )
                 ),
             ),
@@ -375,9 +329,7 @@ def main():
             Div(
                 A(
                     "TxT360 Studies",
-                    href="
-                    hx_get="/results#section1",
-                    hx_target="#inner-text",
+                    href="#section51",
                 ),
             ),
             Div(
@@ -385,25 +337,19 @@ def main():
                 Li(
                     A(
                         "Overview",
-                        href="
-                        hx_get="/results#section1",
-                        hx_target="#inner-text",
+                        href="#section51",
                     )
                 ),
                 Li(
                     A(
                         "Upsampling Experiment",
-                        href="
-                        hx_get="/results#section2",
-                        hx_target="#inner-text",
+                        href="#section52",
                     )
                 ),
                 Li(
                     A(
                         "Perplexity Analysis",
-                        href="
-                        hx_get="/results#section3",
-                        hx_target="#inner-text",
+                        href="#section53",
                    )
                 ),
             ),
@@ -413,6 +359,10 @@ def main():
             ),
         ),
         intro(),
+        curated.curated(),
+        web.web_data(),
+        common.common_steps(),
+        results.results(),
     ),
     D_appendix(
         D_bibliography(src="bibliography.bib"),
@@ -905,7 +855,7 @@ def intro():
         P(
             "We documented all implementation details in this blog post and are open sourcing the code. Examples of each filter and rationale supporting each decision are included."
         ),
-        id="
+        id="section11",
     ),
     Section(
         H2("Motivation Behind TxT360"),
@@ -923,7 +873,7 @@ def intro():
         ),
         # P("Table 2: Basic TxT360 Statistics."),
         # table_div_data,
-        id="
+        id="section12",
     ),
     Section(
         H2("Our Generalizable Approach to Data Processing"),
@@ -944,7 +894,7 @@ def intro():
         # P(
         #     "Figure 1: Data processing pipeline. All the steps are adopted for processing web data while the yellow blocks are adopted for processing curated sources."
         # ),
-        id="
+        id="section13",
     ),
     id="inner-text",
 )
@@ -952,12 +902,4 @@ def intro():
 
 rt("/update/{target}")(data_viewer.update)
 
-rt("/curated")(curated.curated)
-
-rt("/webdata")(web.web_data)
-
-rt("/common")(common.common_steps)
-
-rt("/results")(results.results)
-
 serve()
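
Every navigation hunk above makes the same substitution; here it is in isolation. This is an illustrative before/after sketch, not a line from the repository, and the deleted href values are truncated in the diff view, so the old href is omitted:

    from fasthtml.common import A, Li  # assuming the FastHTML helpers main.py uses

    # Before: the link issued an HTMX GET and swapped the fetched fragment
    # into the #inner-text container.
    before = Li(
        A(
            "About TxT360",
            hx_get="/intro#section1",
            hx_target="#inner-text",
        )
    )

    # After: every section is already on the page, so the link is a plain
    # in-page anchor; section ids are now globally unique across pages.
    after = Li(A("About TxT360", href="#section11"))

The renumbering matters because the merged page cannot reuse section1 five times: the intro takes section11–13, web data section21–25, curated sources section31–33, shared steps section41–48, and the studies section51–53. One loose end worth flagging: the top-level TxT360 link still points at #section1, and no section in this diff defines that id.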
results.py CHANGED

@@ -979,19 +979,19 @@ def results():
     return Div(
         Section(
             intro_div,
-            id="
+            id="section51"
         ),
         Section(
             upsampling_exp,
-            id="
+            id="section52"
         ),
         Section(
             preplexity_intro_div,
-            id="
+            id="section53"
         ),
         Section(
             perp1_div,
-            id="
+            id="section54"
         ),
         Section(
             llama_div,
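
results.py gets section51 through section54 for its first four sections. Two small observations from this hunk: perp1_div's section54 has no matching sidebar link in main.py, and these four id values are the only ones written without the trailing comma used in the other files (harmless inside the call, but inconsistent).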
style.css CHANGED

@@ -288,3 +288,7 @@ d-appendix .citation {
   white-space: pre-wrap;
   word-wrap: break-word;
 }
+
+html {
+  scroll-behavior: smooth;
+}
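
This one-rule change is what keeps the new anchor navigation from feeling jumpy: with scroll-behavior: smooth on the html element, clicks on the #section links scroll to their targets instead of snapping. A scroll-padding-top offset is a common companion when a fixed header overlaps the scroll target, but none is added here.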
web.py CHANGED

@@ -390,7 +390,7 @@ def web_data():
         ),
         P("To generate a high-quality dataset from large-scale webpages, we have investigated the processing steps used by the community and made our choices based on careful manual inspection. Below is a comprehensive list of the datasets we reviewed and a comparison of the filters we have applied."),
         ),
-        id="
+        id="section21"),
     Section(
         H3("TxT360 CommonCrawl Filtering vs Other Pretraining Datasets"),
         P("The following section provides explicit details covering the reasoning and decisions behind each of the filters we applied. The table below provides a high-level comparison of TxT360's filtering compared to other commonly used pretraining datasets."),
@@ -402,7 +402,7 @@ def web_data():
         # The sankey diagram of the filtering percentage
         plotly2fasthtml(filtering_sankey_fig),
         P("A significant portion of the documents is filtered out over the course of the whole process. This figure illustrates the percentage of documents filtered at each step. The grey bars represent the filtered documents. The statistics are largely consistent with prior work (e.g., RefinedWeb) across most steps, though we have incorporated some custom filtering steps."),
-        id="
+        id="section22",),
     Section(
         H2("Document Preparation"),
 
@@ -563,7 +563,7 @@ def web_data():
         """,
         ),
 
-        id="
+        id="section23",),
     Section(
         H2("Line-Level Removal"),
         P("""
@@ -677,7 +677,7 @@ def web_data():
             margin-bottom: 15px
         """,
         ),
-        id="
+        id="section24",),
     Section(
         H2("Document-Level Filtering"),
         P("""
@@ -1748,5 +1748,5 @@ def web_data():
             margin-bottom: 15px
         """,
         ),
-        id="
+        id="section25",)
 )