omkarenator committed
Commit e58e006
1 Parent(s): 84a7120
Update arxiv examples (#3)
- Update arxiv examples (8c01f34cc02a8d64ce4c84c7911320ed878ffcab)
- curated.py +41 -34
- data/curated_samples/arxiv_markdown.json +0 -0
- data/curated_samples/arxiv_raw.json +0 -0
curated.py
CHANGED
@@ -24,7 +24,10 @@ overview = (
24 |   "Individual Filtering Discussion for Each Source",
25 |   style="margin-bottom: 5px",
26 |   ),
27 | - Li(
28 |   ),
29 |   ),
30 |   )
@@ -34,9 +37,10 @@ curated_sources_intro = Div(
34 |   P(
35 |   "While massive amount of data can be crawled and obtained from the Internet, there are certain sources contain data in additional formats (e.g. PDF documents), or organized and published as official dumps (e.g. Wikipedia). We refer to these sources as curated sources. These dataset often comprises high-quality data that contain domain-specificity, such as academic publications or domain specific discussions. TxT360 was strongly influenced by The Pile",
36 |   D_cite(bibtex_key="thepile"),
37 | - " regarding both inclusion of the dataset and filtering techniques.",
38 |   ),
39 | - P(
40 |   ),
41 |   P(
42 |   "TxT360 respects the copyright of the data sources and have not included the controversial data that was used in The Pile like YouTube and Opensubtitles, Reddit threads, and book3."
@@ -566,7 +570,7 @@ se_examples = DV2(
566 |   )
567 |   phil_examples = DV("data/curated_samples/philpapers_raw.json", 2, "PhilPapers")
568 |   arx_examples = DV2(
569 | - "data/curated_samples/arxiv_raw.json", "data/curated_samples/
570 |   )
571 |   s2o_examples = DV("data/curated_samples/s2orc_raw.json", 0, "S2ORC")
572 |   s2oa_examples = DV("data/curated_samples/s2orc_abstract_raw.json", 0, "S2ORC Abstract")
@@ -859,19 +863,19 @@ filtering_process = Div(
859 |   ),
860 |   ),
861 |   table_div_s2o,
862 | -
863 | -
864 | -
865 | -
866 | -
867 | -
868 | -
869 | -
870 | -
871 | -
872 | -
873 | -
874 | -
875 |   ),
876 |   ),
877 |   Section(
@@ -912,19 +916,19 @@ filtering_process = Div(
912 |   ),
913 |   ),
914 |   table_div_s2oa,
915 | - #Details(
916 |   # Summary("S2ORC Abstract Filtering Examples "),
917 | -
918 | -
919 | -
920 | -
921 | -
922 | -
923 | -
924 | -
925 | -
926 | -
927 | -
928 |   )
929 |   ),
930 |   Section(
@@ -1201,9 +1205,9 @@ filtering_process = Div(
1201 |   P(B("Unique Data Preparation Challenges: ")),
1202 |   Ul(
1203 |   Li(
1204 | - "The converesation and forum style structure can be a very helpful signal for language model training. During processing the dataset, we try to encode such structure but without introducing too much noise. We choose to use an",
1205 |   D_code("<AUTHOR>", language="html"),
1206 | - " tag to encode the main thread text by the original poster, and use a ",
1207 |   D_code("<COMMENT>", language="html"),
1208 |   " tag to encode the replies. We initially choose ",
1209 |   D_code("<P>", language="html"),
@@ -1289,7 +1293,9 @@ filtering_process = Div(
1289 |   "All content was downloaded leading to high number of documents filtered during local deduplication. Following The Pile, priority was given to plain_text first, followed by the columns in the table in reverse order."
1290 |   ),
1291 |   P(B("Unique Data Preparation Challenges: ")),
1292 | - P(
1293 |   Ul(
1294 |   Li(
1295 |   "Consecutive whitespaces and tabs were found. Consecutive Whitespaces and tabes were reduce to one, single whitespace.",
@@ -1309,8 +1315,9 @@ filtering_process = Div(
1309 |   ),
1310 |   Li(
1311 |   "All form feed (",
1312 | - D_code("\\f", language="bash"),
1313 | - ")characters were removed.",
1314 |   ),
1315 |   ),
1316 |   P(B("Filters Applied: ")),
@@ -24,7 +24,10 @@ overview = (
24 |   "Individual Filtering Discussion for Each Source",
25 |   style="margin-bottom: 5px",
26 |   ),
27 | + Li(
28 | + B("Estimated Reading Time: 25 minutes"),
29 | + style="margin-bottom: 5px",
30 | + ),
31 |   ),
32 |   ),
33 |   )
@@ -34,9 +37,10 @@ curated_sources_intro = Div(
37 |   P(
38 |   "While massive amount of data can be crawled and obtained from the Internet, there are certain sources contain data in additional formats (e.g. PDF documents), or organized and published as official dumps (e.g. Wikipedia). We refer to these sources as curated sources. These dataset often comprises high-quality data that contain domain-specificity, such as academic publications or domain specific discussions. TxT360 was strongly influenced by The Pile",
39 |   D_cite(bibtex_key="thepile"),
40 | + " regarding both inclusion of the dataset and filtering techniques.",
41 |   ),
42 | + P(
43 | + "These sources, such as Arxiv, Wikipedia, and Stack Exchange, provide high quality data. And as mentioned above, they are excluded from the web dataset via URL matching. Details about each of the sources are provided below. ",
44 |   ),
45 |   P(
46 |   "TxT360 respects the copyright of the data sources and have not included the controversial data that was used in The Pile like YouTube and Opensubtitles, Reddit threads, and book3."
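Note for readers of this hunk: the added paragraph says curated sources are excluded from the web dataset via URL matching, but the matching code is not part of this commit. A minimal sketch of one way such a filter could look is given here. The domain list and both function names are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlparse

# Hypothetical domain list for illustration only; the actual list used by
# the project is not shown in this diff.
CURATED_DOMAINS = {"arxiv.org", "wikipedia.org", "stackexchange.com"}


def is_curated_url(url: str) -> bool:
    """Return True if the URL's host is a curated domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in CURATED_DOMAINS)


def drop_curated(urls):
    """Keep only the URLs that do not point at a curated source."""
    return [u for u in urls if not is_curated_url(u)]
```

Matching on the parsed hostname rather than on the raw string avoids false positives such as a curated domain name appearing in a path or as a suffix of an unrelated host.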
@@ -566,7 +570,7 @@ se_examples = DV2(
570 |   )
571 |   phil_examples = DV("data/curated_samples/philpapers_raw.json", 2, "PhilPapers")
572 |   arx_examples = DV2(
573 | + "data/curated_samples/arxiv_raw.json", "data/curated_samples/arxiv_markdown.json", 3
574 |   )
575 |   s2o_examples = DV("data/curated_samples/s2orc_raw.json", 0, "S2ORC")
576 |   s2oa_examples = DV("data/curated_samples/s2orc_abstract_raw.json", 0, "S2ORC Abstract")
@@ -859,19 +863,19 @@ filtering_process = Div(
863 |   ),
864 |   ),
865 |   table_div_s2o,
866 | + # Details(
867 | + # Summary("S2ORC Filtering Examples -- need to update"),
868 | + # Div(
869 | + # P("examples are missing"),
870 | + # style="background-color: white; padding: 15px; margin-top: 10px; margin-bottom: 10px; border-radius: 8px; border: none; ", # Styling for the DV2 part
871 | + # ),
872 | + # style="""
873 | + # background-color: #FFFAEA; /* Light yellow background */
874 | + # padding: 15px;
875 | + # border-radius: 12px;
876 | + # margin-bottom: 15px
877 | + # """,
878 | + # ),
879 |   ),
880 |   ),
881 |   Section(
@@ -912,19 +916,19 @@ filtering_process = Div(
916 |   ),
917 |   ),
918 |   table_div_s2oa,
919 | + # Details(
920 |   # Summary("S2ORC Abstract Filtering Examples "),
921 | + # Div(
922 | + # P("examples are missing"),
923 | + # style="background-color: white; padding: 15px; margin-top: 10px; margin-bottom: 10px; border-radius: 8px; border: none; ", # Styling for the DV2 part
924 | + # ),
925 | + # style="""
926 | + # background-color: #FFFAEA; /* Light yellow background */
927 | + # padding: 15px;
928 | + # border-radius: 12px;
929 | + # margin-bottom: 15px
930 | + # """,
931 | + # ),
932 |   )
933 |   ),
934 |   Section(
@@ -1201,9 +1205,9 @@ filtering_process = Div(
1205 |   P(B("Unique Data Preparation Challenges: ")),
1206 |   Ul(
1207 |   Li(
1208 | + "The converesation and forum style structure can be a very helpful signal for language model training. During processing the dataset, we try to encode such structure but without introducing too much noise. We choose to use an",
1209 |   D_code("<AUTHOR>", language="html"),
1210 | + " tag to encode the main thread text by the original poster, and use a ",
1211 |   D_code("<COMMENT>", language="html"),
1212 |   " tag to encode the replies. We initially choose ",
1213 |   D_code("<P>", language="html"),
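Note for readers of this hunk: the text describes wrapping the original post in an `<AUTHOR>` tag and each reply in a `<COMMENT>` tag. The hunk is cut off before the full scheme is stated, so the following is only a rough sketch of the idea. The function name, the closing tags, and the one-block-per-line layout are assumptions, not the project's actual serialization.

```python
from typing import Sequence


def encode_thread(post: str, replies: Sequence[str]) -> str:
    """Serialize a forum thread with AUTHOR/COMMENT tags.

    Assumptions (not confirmed by the diff): each block gets a matching
    closing tag, blocks are separated by newlines, and empty replies are
    skipped.
    """
    parts = [f"<AUTHOR>{post.strip()}</AUTHOR>"]
    for reply in replies:
        text = reply.strip()
        if text:  # skip replies that are empty after stripping
            parts.append(f"<COMMENT>{text}</COMMENT>")
    return "\n".join(parts)
```

For example, `encode_thread("How do I X?", ["Try Y.", "Or Z."])` yields one `<AUTHOR>` block followed by two `<COMMENT>` blocks, which keeps the post-and-reply structure visible without adding much markup.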
@@ -1289,7 +1293,9 @@ filtering_process = Div(
1293 |   "All content was downloaded leading to high number of documents filtered during local deduplication. Following The Pile, priority was given to plain_text first, followed by the columns in the table in reverse order."
1294 |   ),
1295 |   P(B("Unique Data Preparation Challenges: ")),
1296 | + P(
1297 | + "The Freelaw text uses a lot of whitespaces and newlines to format the document visually. These lines are not necessary for language model learning and sometimes have confusing semantic meanings. We attempt to unify how whitespaces appear in this dataset with the following heuristics."
1298 | + ),
1299 |   Ul(
1300 |   Li(
1301 |   "Consecutive whitespaces and tabs were found. Consecutive Whitespaces and tabes were reduce to one, single whitespace.",
@@ -1309,8 +1315,9 @@ filtering_process = Div(
1315 |   ),
1316 |   Li(
1317 |   "All form feed (",
1318 | + D_code("\\f", language="bash"),
1319 | + ")characters were removed.",
1320 | + style="margin-bottom: -3px",
1321 |   ),
1322 |   ),
1323 |   P(B("Filters Applied: ")),
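Note for readers of the two Freelaw hunks: the list items visible in this diff describe two whitespace heuristics, collapsing runs of spaces and tabs into a single space and removing form feed characters. A minimal sketch of just those two steps follows. The function name and the order of the steps are assumptions, and the list items that fall between the two hunks are not shown here, so they are not covered.

```python
import re

# Matches one or more consecutive spaces or tabs (newlines are left alone).
_SPACES_AND_TABS = re.compile(r"[ \t]+")


def normalize_whitespace(text: str) -> str:
    """Apply the two heuristics visible in the diff.

    Assumed order: form feeds are removed first, so that spaces on either
    side of a removed form feed are then collapsed together.
    """
    text = text.replace("\f", "")          # drop all form feed characters
    return _SPACES_AND_TABS.sub(" ", text)  # collapse space/tab runs to one space
```

Newlines are deliberately left untouched in this sketch, since the visible list items only mention spaces, tabs, and form feeds.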
data/curated_samples/arxiv_markdown.json
ADDED
The diff for this file is too large to render.
data/curated_samples/arxiv_raw.json
CHANGED
The diff for this file is too large to render.