victormiller committed · Commit 79041d0 · Parent(s): 316946b

Non-web revision (#2)
- Update the non-web part. (0017b35903ea001bbe897adb930e8baaf73e558e)
- Ensure format (34cb4c41ce3dff5b182aea673bfd5d95628e0a9d)
- Merge branch 'main' into pr/2 (076f23717c366794b26f39e44ef1d1cca5220c58)
- Merge conflicts (0863523047d15f2c2d61e52657fb146d06ec3762)

curated.py CHANGED (+47 -13)
@@ -34,7 +34,9 @@ curated_sources_intro = Div(
     P(
         "Curated sources comprise high-quality datasets that contain domain-specificity.",
         B(
-            " TxT360 was strongly influenced by The Pile",
+            " TxT360 was strongly influenced by The Pile",
+            D_cite(bibtex_key="thepile"),
+            " regarding both inclusion of the dataset and filtering techniques.",
         ),
         " These sources, such as Arxiv, Wikipedia, and Stack Exchange, provide valuable data that is excluded from the web dataset mentioned above. Analyzing and processing non-web data can yield insights and opportunities for various applications. Details about each of the sources are provided below. ",
     ),
@@ -682,24 +684,51 @@ filtering_process = Div(
     P(
         B("Download and Extraction: "),
         "All the data was downloaded in original latex format from ArXiv official S3 repo: ",
-        A("s3://
-        ". We
+        A("s3://arxiv/src", href="s3://arxiv/src"),
+        ". We aim to encode the downloaded data in UTF-8 format, and when necessary, utilize the chardet library to infer the appropriate encoding. After that, we use ",
+        A("Pandoc", href="https://pandoc.org/"),
+        " to extract information from the latex files into markdown format. The command we use is",
         D_code(
-            "pandoc -s
-            language="
+            "pandoc <raw_tex_path> -s -o <output_markdown_path> -f latex+raw_tex -t markdown_mmd [--lua-filter <lua_filter_path>]",
+            language="bash",
         ),
-        ".
+        ". Finally, all markdowns were combined to create jsonl files.",
     ),
     P(B("Unique Data Preparation Challenges: ")),
+    P(
+        "When converting LaTeX files into Markdown using Pandoc, it is crucial to account for different data formats to minimize information loss while also filtering out noisy content in LaTeX. Below, we outline our considerations and methods for handling various data types during this conversion process:"
+    ),
     Ul(
         Li(
-            "
+            B("Tables: "),
+            "The process for handling tables follows three main approaches. First, tables compatible with Pandoc’s built-in formats are directly converted into standard Markdown tables. Notably, LaTeX’s '\\multicolumn' and '\\multirow' commands can be successfully translated into valid Markdown tables. Second, tables unsupported by Pandoc’s native functionality, such as deluxetable or other complex LaTeX types, are preserved in their original LaTeX format to maintain the integrity of complex structures. Third, only a few remaining tables have been converted to HTML web tables.",
+            style="margin-bottom: -3px",
+        ),
+        Li(
+            B("Mathematical Expressions: "),
+            "Inline mathematical expressions are rendered in Markdown. More complex equations remain unchanged, e.g., presented as '\\begin{aligned}' blocks, to ensure accuracy and readability.",
+            style="margin-bottom: -3px",
+        ),
+        Li(
+            B("Figures: "),
+            "All figures are removed during the conversion process. Placeholder figures might not contribute to the paper’s data quality and, as such, have been omitted to streamline the output.",
+            style="margin-bottom: -3px",
+        ),
+        Li(
+            B("Section Headers: "),
+            "Section headers are converted into markdown format, using leading '#' symbols to represent the heading levels.",
+            style="margin-bottom: -3px",
+        ),
+        Li(
+            B("References: "),
+            "References are removed. Although they may be informative, references often introduce formatting inconsistencies or add little value compared to the core content of the paper.",
             style="margin-bottom: -3px",
         ),
     ),
     P(
         B(" Filters Applied: "),
-        "multiple filters are used here after manually verifying output of all the filters as suggested by peS2o dataset",
+        "multiple filters are used here after manually verifying output of all the filters as suggested by peS2o dataset",
+        D_cite(bibtex_key="peS2o"),
     ),
     Ul(
         Li(
@@ -852,13 +881,16 @@ filtering_process = Div(
             href="ttps://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/",
         ),
         ". PubMed Central (PMC) files are downloaded in an xml.tar format. The tar files are opened and converted to markdown format using pandoc",
-        D_code(
-
+        D_code(
+            "pandoc <raw_xml_path> -s -o <output_markdown_path> -f jats -t markdown_mmd [--lua-filter <lua_filter_path>]",
+            language="bash",
+        ),
+        ". The markdown files are combined to create jsonl files. PubMed Abstract (PMA) files were downloaded in xml. The BeautifulSoup library was used to extract the abstract, title, and PMID. All files were stored in jsonl format.",
     ),
     P(B("Unique Data Preparation Challenges: ")),
     Ul(
         Li(
-            "
+            "We tried similar attempts on PMC as we did on ArXiv. The resulted markdown might have slight difference due to the different structure of the XML files.",
             style="margin-bottom: -3px",
         ),
     ),
@@ -1585,7 +1617,8 @@ def curated():
     table_html = data_preparation_steps.to_html(index=False, border=0)
     table_div = Div(NotStr(table_html), style="margin: 40px;")

-    text = P(
+    text = P(
+        """This initial stage serves as the foundation for the entire
     process. Here, we focus on acquiring and extracting the raw data, which can
     come from various sources such as crawling websites, using HTTP/FTP dumps,
     or working with archive dumps. For instance, to download and prepare a
@@ -1595,7 +1628,8 @@ def curated():
     preparation process: It is worth noting that some pipelines might require
     invoking additional functions or scripts to handle specific data sources or
     formats. These helper scripts can be located within specific directories
-    or modules dedicated to the dataset."""
+    or modules dedicated to the dataset."""
+    )

     return Div(
         Section(