omkarenator committed on
Commit
7bde7c0
1 Parent(s): 149e56a

more fixes

Files changed (3)
  1. curated.py +1 -1
  2. main.py +2 -2
  3. web.py +2 -2
curated.py CHANGED
@@ -609,7 +609,7 @@ data_preprocessing_div = Div(
  B("Unigram Log Probability Filter"),
  " calculates the log probability of each unigram to measure the significance of individual words. This step quantifies the importance of individual words but may not capture the semantic meaning of words. To calculate the average log word probability, we use word frequencies extracted from the ",
  A("1T Web-gram corpus", href="https://catalog.ldc.upenn.edu/LDC2006T13"),
- ". Specifically, we use the list available created by ",
+ ". Specifically, we use the available list created by ",
  A(
  "Rachel Tatman",
  href="https://www.kaggle.com/datasets/rtatman/english-word-frequency",
main.py CHANGED
@@ -864,7 +864,7 @@ def intro():
  A(B("TxT360 (Trillion eXtracted Text),"), href="https://huggingface.co/datasets/LLM360/TxT360"),
  " the first dataset to globally deduplicate 99 CommonCrawl snapshots and 14 high-quality data sources from diverse domains (e.g., FreeLaw, PG-19, etc.). The large-scale deduplication process and rich metadata stored enables precise control over data distribution. We demonstrate a simple but effective upsampling recipe that creates a 15+ trillion-token corpus, outperforming FineWeb 15T on several key metrics. With the information, TxT360 empowers pre-trainers to explore more advanced weighting techniques, a feature not commonly available in previous pre-training datasets. Our findings highlight the importance of both high-quality data sources and appropriate weighting for optimal blending in LLM training."
  ),
- P("In line with our 360° open source spirit, we document all detailed steps, reasons of our decisions, detailed statistics, our code (stay tuned!), analysis results and more, in additional to the dataset itself. We hope this can serve as a useful resource for future developers."
+ P("In line with our 360° open source spirit, we document all detailed steps, reasons of our decisions, detailed statistics, our code (stay tuned!), analysis results and more, in addition to the dataset itself. We hope this can serve as a useful resource for future developers."
  ),
  plotly2fasthtml(all_eval_res_figs["MMLU"]),
  P(
@@ -899,7 +899,7 @@ def intro():
  "In LLM pretraining, it is common to combine all possible text sources due to the Scaling Law. Crawled web pages are included to provide a vast quantity of data which can cover long tail and diverse information, while curated datasets such as Wikipedia are also used, which often provide the 'deep-dive' domain information. By integrating the reach of web data with the quality of curated sources, TxT360 meets and surpasses the rigorous standards required for state-of-the-art LLM pre-training."
  ),
  P(
- "** TxT360 does not include very specific domains such as code and math. This decision was made due to the perceived low duplication code with other sources, and the different logic requiring to build those datasets. We leave those work to future work and recommend users refer to existing projects such as Stack V2",
+ "** TxT360 does not include very specific domains such as code and math. This decision was made due to the perceived low duplication code with other sources, and the different logic required to build those datasets. We leave that to future work and recommend users refer to existing projects such as Stack V2",
  D_cite(bibtex_key="lozhkov2024starcoder2stackv2"),
  ".",
  ),
web.py CHANGED
@@ -657,7 +657,7 @@ def web_data():
  """,
  ),
  P(B("Toxic Lines: "), """
- When doing manual inspection on the data, we found that there are some adult ads in the beginning or end of the
+ When manually inspecting the data, we found that there are some adult ads in the beginning or end of the
  document (with a sample shown below), which are hard to remove via document-level filtering strategies. Inspired
  by this, we develop line-level detoxification using a bad word list from LDNOOBW (+ rule: word length < 10 + the
  line is in the first 3 lines or in the last 3 lines) to remove toxic lines. Specifically, we do not only consider
 
@@ -1455,7 +1455,7 @@ def web_data():
  ),
  P("""
  Both Dolma and RedPajama V2 split texts into words using white spaces and newline symbols. However,
- DataTrove employs a tokenizer to split texts into words and ignore punctuations, resulting in a higher
+ DataTrove employs a tokenizer to split texts into words and ignore punctuation, resulting in a higher
  word count compared to simple `text.split()`.
  We decided to use simple `len(text.split())` to compute the word count.
  """),