Commit b488013 by victormiller
1 parent: 146aa07
Update web.py

web.py CHANGED
@@ -217,7 +217,6 @@ def web_data():
 ),
 H3("1. Document Preparation"),
 
-button( Div(
 H4("1.1 Text Extraction"),
 P("""
 Common Crawl provides webpage texts via two formats: WARC (Web ARChive format) and WET (WARC Encapsulated Text).
@@ -226,7 +225,7 @@ def web_data():
 we found WET files to include boilerplate content like navigation menus, ads, and other irrelevant texts.
 Accordingly, our pipeline starts from raw WARC files, reads with the warcio library, and extracts texts using trafilatura.
 """),
-DV2("data/sample_wet.json", "data/sample_warc.json", 3),
+DV2("data/sample_wet.json", "data/sample_warc.json", 3),
 
 H4("1.2 Language Identification"),
 P("""
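
The paragraph in section 1.1 describes reading raw WARC files with warcio and extracting text with trafilatura. A minimal sketch of that step might look like the following. It is an illustration under stated assumptions, not the code behind this commit: the function names `is_html_response` and `extract_texts`, the HTML-only filter, and the UTF-8 decoding with replacement are all hypothetical choices.

```python
from typing import Iterator, Optional, Tuple


def is_html_response(rec_type: str, content_type: Optional[str]) -> bool:
    """Hypothetical filter: keep only HTTP responses whose Content-Type looks like HTML."""
    return rec_type == "response" and "html" in (content_type or "").lower()


def extract_texts(warc_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (url, text) pairs for the HTML responses in one WARC file."""
    # Third-party imports are local so the filter above stays usable without them.
    from warcio.archiveiterator import ArchiveIterator
    import trafilatura

    with open(warc_path, "rb") as stream:
        # ArchiveIterator handles both plain and gzip-compressed WARC files.
        for record in ArchiveIterator(stream):
            # Non-HTTP records (for example warcinfo) carry no HTTP headers.
            content_type = (
                record.http_headers.get_header("Content-Type")
                if record.http_headers is not None
                else None
            )
            if not is_html_response(record.rec_type, content_type):
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            # Assumption: decode as UTF-8 and replace undecodable bytes.
            html = record.content_stream().read().decode("utf-8", errors="replace")
            text = trafilatura.extract(html)  # None when no main text is found
            if text:
                yield url, text
```

Calling `extract_texts` on a local WARC path (a placeholder here, not a file in this repository) would stream one page at a time rather than loading the whole archive into memory.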