omkarenator committed on
Commit
7bde7c0
1 Parent(s): 149e56a

more fixes

Files changed (3)
  1. curated.py +1 -1
  2. main.py +2 -2
  3. web.py +2 -2
curated.py CHANGED
@@ -609,7 +609,7 @@ data_preprocessing_div = Div(
  B("Unigram Log Probability Filter"),
  " calculates the log probability of each unigram to measure the significance of individual words. This step quantifies the importance of individual words but may not capture the semantic meaning of words. To calculate the average log word probability, we use word frequencies extracted from the ",
  A("1T Web-gram corpus", href="https://catalog.ldc.upenn.edu/LDC2006T13"),
- ". Specifically, we use the list available created by ",
+ ". Specifically, we use the available list created by ",
  A(
  "Rachel Tatman",
  href="https://www.kaggle.com/datasets/rtatman/english-word-frequency",
main.py CHANGED
@@ -864,7 +864,7 @@ def intro():
  A(B("TxT360 (Trillion eXtracted Text),"), href="https://huggingface.co/datasets/LLM360/TxT360"),
  " the first dataset to globally deduplicate 99 CommonCrawl snapshots and 14 high-quality data sources from diverse domains (e.g., FreeLaw, PG-19, etc.). The large-scale deduplication process and rich metadata stored enables precise control over data distribution. We demonstrate a simple but effective upsampling recipe that creates a 15+ trillion-token corpus, outperforming FineWeb 15T on several key metrics. With the information, TxT360 empowers pre-trainers to explore more advanced weighting techniques, a feature not commonly available in previous pre-training datasets. Our findings highlight the importance of both high-quality data sources and appropriate weighting for optimal blending in LLM training."
  ),
- P("In line with our 360° open source spirit, we document all detailed steps, reasons of our decisions, detailed statistics, our code (stay tuned!), analysis results and more, in additional to the dataset itself. We hope this can serve as a useful resource for future developers."
+ P("In line with our 360° open source spirit, we document all detailed steps, reasons of our decisions, detailed statistics, our code (stay tuned!), analysis results and more, in addition to the dataset itself. We hope this can serve as a useful resource for future developers."
  ),
  plotly2fasthtml(all_eval_res_figs["MMLU"]),
  P(
@@ -899,7 +899,7 @@ def intro():
  "In LLM pretraining, it is common to combine all possible text sources due to the Scaling Law. Crawled web pages are included to provide a vast quantity of data which can cover long tail and diverse information, while curated datasets such as Wikipedia are also used, which often provide the 'deep-dive' domain information. By integrating the reach of web data with the quality of curated sources, TxT360 meets and surpasses the rigorous standards required for state-of-the-art LLM pre-training."
  ),
  P(
- "** TxT360 does not include very specific domains such as code and math. This decision was made due to the perceived low duplication code with other sources, and the different logic requiring to build those datasets. We leave those work to future work and recommend users refer to existing projects such as Stack V2",
+ "** TxT360 does not include very specific domains such as code and math. This decision was made due to the perceived low duplication code with other sources, and the different logic required to build those datasets. We leave that to future work and recommend users refer to existing projects such as Stack V2",
  D_cite(bibtex_key="lozhkov2024starcoder2stackv2"),
  ".",
  ),
web.py CHANGED
@@ -657,7 +657,7 @@ def web_data():
  """,
  ),
  P(B("Toxic Lines: "), """
- When doing manual inspection on the data, we found that there are some adult ads in the beginning or end of the
+ When manually inspecting the data, we found that there are some adult ads in the beginning or end of the
  document (with a sample shown below), which are hard to remove via document-level filtering strategies. Inspired
  by this, we develop line-level detoxification using a bad word list from LDNOOBW (+ rule: word length < 10 + the
  line is in the first 3 lines or in the last 3 lines) to remove toxic lines. Specifically, we do not only consider
 
@@ -1455,7 +1455,7 @@ def web_data():
  ),
  P("""
  Both Dolma and RedPajama V2 split texts into words using white spaces and newline symbols. However,
- DataTrove employs a tokenizer to split texts into words and ignore punctuations, resulting in a higher
+ DataTrove employs a tokenizer to split texts into words and ignore punctuation, resulting in a higher
  word count compared to simple `text.split()`.
  We decided to use simple `len(text.split())` to compute the word count.
  """),