victormiller committed
Commit aa13e37 • 1 Parent(s): ae1d7f9

Update web.py

Files changed (1): web.py (+478 -25)
web.py CHANGED
@@ -612,11 +612,59 @@ def web_data():
 but longer duplicate passages. To achieve this goal, we calculate over the document both the fraction of passages
 that are duplicates, and the fraction of characters contained within those duplicated passages.
 """),
-H6("Implementations from Dolma"),
-D_code(dolma311, block="block", language="python"),
-P("..."), # Add specific implementation details if available
-H6("Implementations from DataTrove"),
-P("..."), # Add specific implementation details if available
+Details(
+Summary("Implementations from Dolma"),
+D_code("""
+words = text.split()
+word_count = len(words)
+character_count = sum(len(word) for word in words)
+...
+lines = text.split("\n")
+line_count = len(lines)
+...
+line_counts = Counter(lines)
+attrs.fraction_of_duplicate_lines = sum(count for line, count in line_counts.items() if count > 1) / max(
+    line_count, 1
+)
+attrs.fraction_of_characters_in_duplicate_lines = sum(
+    len(line) * count for line, count in line_counts.items() if count > 1
+) / max(character_count, 1)
+""", block="block", language="python"),
+),
+Details(
+Summary("Implementations from DataTrove"),
+D_code("""
+def find_duplicates(x: list[str]) -> tuple[int, int]:
+    unique_x = set()
+    duplicate_chars = 0
+    duplicate_elements = 0
+    for element in x:
+        if element in unique_x:
+            duplicate_chars += len(element)
+            duplicate_elements += 1
+
+        else:
+            unique_x.add(element)
+    return duplicate_elements, duplicate_chars
+...
+self.paragraph_exp = re.compile(r"\n{2,}")
+self._line_splitter = re.compile("\n+")
+...
+paragraphs = self.paragraph_exp.split(text.strip())
+paragraphs_duplicates, char_duplicates = find_duplicates(paragraphs)
+if self.dup_para_frac and paragraphs_duplicates / len(paragraphs) > self.dup_para_frac:
+    return False, "dup_para_frac"
+if self.dup_para_char_frac and char_duplicates / len(text) > self.dup_para_char_frac:
+    return False, "dup_para_char_frac"
+
+lines = self._line_splitter.split(text)
+line_duplicates, char_duplicates = find_duplicates(lines)
+if self.dup_line_frac and line_duplicates / len(lines) > self.dup_line_frac:
+    return False, "dup_line_frac"
+if self.dup_line_char_frac and char_duplicates / len(text) > self.dup_line_char_frac:
+    return False, "dup_line_char_frac"
+""", block="block", language="python"),
+),
 P("""
 After evaluating the implementations of Dolma and DataTrove (note: RedPajama V2 does not implement these two quality
 signals), we have made the following decisions:
 
@@ -639,6 +687,25 @@
 ensures consistency with the overall document character count calculation.
 """),
 H5("Our Implementation"),
+Details(
+Summary("TxT360 Implementation"),
+D_code("""
+words = text.split()
+word_count = len(words)
+character_count = sum(len(word) for word in words)
+...
+lines = text.split("\n")
+line_count = len(lines)
+
+line_counts = Counter(lines)
+attrs.fraction_of_duplicate_lines = (
+    sum((count - 1) for line, count in line_counts.items() if count > 1) / line_count
+)
+attrs.fraction_of_characters_in_duplicate_lines = (
+    sum(sum(len(w) for w in line.split()) * (count - 1) for line, count in
+        line_counts.items() if count > 1) / character_count)
+""", block="block", language="python"),
+),
 Details(
 Summary("Sample documents filtered by excessive line repetitions / characters in repeated lines"),
 DV(
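To make the count-minus-one convention above concrete, here is a minimal standalone sketch (a made-up three-line document; plain variables stand in for the attrs bookkeeping in web.py):

    from collections import Counter

    # Hypothetical toy document: the line "a b" appears twice.
    text = "a b\nc d e\na b"

    words = text.split()
    character_count = sum(len(w) for w in words)   # 7 letters, whitespace excluded
    lines = text.split("\n")
    line_counts = Counter(lines)

    # Count repeated occurrences only (count - 1), as in the TxT360 snippet above.
    dup_line_frac = sum(c - 1 for c in line_counts.values() if c > 1) / len(lines)
    dup_char_frac = sum(
        sum(len(w) for w in line.split()) * (c - 1)
        for line, c in line_counts.items() if c > 1
    ) / character_count

    print(dup_line_frac)  # 1/3: one of the three lines is a repeat
    print(dup_char_frac)  # 2/7: the repeat of "a b" contributes its 2 letters once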
 
@@ -652,12 +719,85 @@
 Following Gopher [2], we remove documents with a high portion of repeated n-grams. For each n ∈ (2, 3, 4), we calculate the
 fraction of characters contained within the most frequently-occurring n-gram.
 """),
-H6("Implementations from Dolma"),
-P("..."), # Add specific implementation details if available
-H6("Implementations from RedPajama-V2"),
-P("..."), # Add specific implementation details if available
-H6("Implementations from DataTrove"),
-P("..."), # Add specific implementation details if available
+Details(
+Summary("Implementations from Dolma"),
+D_code("""
+def all_ngram_counts(words) -> List[Tuple[int, CounterType[Tuple[str, ...]]]]:
+    return [(n, Counter(list(zip(*[words[i:] for i in range(n)])))) for n in range(2, 11)]
+...
+all_counts = all_ngram_counts(words)
+
+count_most_common_ngrams = (2, 3, 4)
+for n, ngram_counts in all_counts:
+    if not ngram_counts:
+        continue
+    if n in count_most_common_ngrams:
+        most_common_ngram, count = ngram_counts.most_common(1)[0]
+        value = count * sum(len(w) for w in most_common_ngram) / max(character_count, 1)
+        attrs.fraction_of_characters_in_most_common_ngram.append((n, value))
+""", block="block", language="python"),
+),
+Details(
+Summary("Implementations from RedPajama-V2"),
+D_code("""
+class Base_RPS_Frac_Chars_In_Top_NGram(RPSBase):  # noqa
+    ## Base class for calculating the fraction of characters in the top N-gram. This operates on the lower-cased, punctuation-removed content.
+    NGRAM_SIZE: int = None
+
+    __slots__ = []
+
+    def __call__(self, document: Document) -> SignalType:
+        if self.NGRAM_SIZE is None:
+            raise NotImplementedError(
+                "NGRAM_SIZE must be set in the subclass"
+            )
+
+        # get the most common ngram
+        most_common_ngram = Counter(
+            # fetch the ngrams from the document if they exist, otherwise
+            # compute them
+            getattr(document, f"norm_{self.NGRAM_SIZE}grams", None)
+            or
+            form_ngrams(iter(document.normalized_words), self.NGRAM_SIZE)
+        ).most_common(1)
+
+        if len(most_common_ngram) == 0:
+            return [(0, len(document), 0.0)]
+
+        ngram, count = most_common_ngram[0]
+
+        if count <= 1:
+            return [(0, len(document), 0.0)]
+
+        total_chars = sum(len(w) for w in document.normalized_words)
+        score = sum(len(w) for w in ngram) * count / total_chars
+        score = round(score, PRECISION)
+        return [(0, len(document), score)]
+""", block="block", language="python"),
+),
+
+Details(
+Summary("Implementations from DataTrove"),
+D_code("""
+def get_n_grams(words: list[str], n: int) -> list[str]:
+    return [" ".join(words[i : i + n]) for i in range(len(words) - n + 1)]
+
+def find_top_duplicate(x: list[str]) -> int:
+    counter = Counter()
+    for element in x:
+        counter[element] += 1
+    top_n_gram = counter.most_common(1)[0]
+    return len(top_n_gram[0]) * top_n_gram[1]
+...
+for n, n_frac in self.top_n_grams:
+    n_grams = get_n_grams(words, n)
+    if not n_grams:
+        continue
+    top_char_length = find_top_duplicate(n_grams)
+    if top_char_length / len(text) > n_frac:
+        return False, f"top_n_gram"
+""", block="block", language="python"),
+),
 P("""
 There are almost no contradictions between the above implementations of the fraction of characters in the most common
 n-gram. The main process involves counting the occurrences of each n-gram and selecting the most common one. The
 
@@ -668,7 +808,23 @@
 In practice, documents affected by this rule (where the most common n-gram exceeds a given threshold and occurs
 only once) tend to be short.
 """),
-H5("Our Implementations"),
+Details(
+Summary("TxT360 Implementation"),
+D_code("""
+def all_ngram_counts_new(words) -> List[Tuple[int, List[Tuple[str, ...]]]]:
+    return [(n, list(zip(*[words[i:] for i in range(n)]))) for n in range(2, 11)]
+...
+all_counts = all_ngram_counts_new(words)
+count_most_common_ngrams = (2, 3, 4)
+for n, ngram_counts in all_counts:
+    if not ngram_counts:
+        continue
+    if n in count_most_common_ngrams:
+        most_common_ngram, count = Counter(ngram_counts).most_common(1)[0]
+        value = count * sum(len(w) for w in most_common_ngram) / character_count
+        attrs.fraction_of_characters_in_most_common_ngram.append((n, value))
+""", block="block", language="python"),
+),
 Details(
 Summary("Sample documents filtered by the fraction of characters in the most common n-grams (n=2,3,4)"),
 DV(
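A quick sanity check of the most-common-n-gram signal, as a standalone sketch (toy sentence; variable names mirror the snippet above, but nothing here is taken from web.py):

    from collections import Counter

    # Hypothetical toy document: "the cat" is the most frequent 2-gram (appears twice).
    words = "the cat sat and the cat ran".split()
    n = 2

    ngrams = list(zip(*[words[i:] for i in range(n)]))
    most_common_ngram, count = Counter(ngrams).most_common(1)[0]

    character_count = sum(len(w) for w in words)   # 21 letters, whitespace excluded
    value = count * sum(len(w) for w in most_common_ngram) / character_count

    print(most_common_ngram, count)  # ('the', 'cat') 2
    print(value)                     # 2 * 6 / 21 ≈ 0.571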
 
@@ -683,27 +839,172 @@
 fraction of characters contained within all duplicate n-grams, taking care not to count characters that occur in
 overlapping n-grams more than once.
 """),
-H6("Implementations from Dolma"),
-P("..."), # Add specific implementation details if available
-H6("Implementations from RedPajama-V2"),
-P("..."), # Add specific implementation details if available
-H6("Implementations from DataTrove"),
-P("..."), # Add specific implementation details if available
+Details(
+Summary("Implementations from Dolma"),
+D_code("""
+def all_ngram_counts(words) -> List[Tuple[int, CounterType[Tuple[str, ...]]]]:
+    return [(n, Counter(list(zip(*[words[i:] for i in range(n)])))) for n in range(2, 11)]
+...
+all_counts = all_ngram_counts(words)
+for n, ngram_counts in all_counts:
+    if not ngram_counts:
+        continue
+    if n in count_most_common_ngrams:
+        ...
+    else:
+        ng_char_count = sum(count * sum(len(w) for w in ng) for ng, count in ngram_counts.items())
+        value = sum(
+            count * sum(len(w) for w in ng) for ng, count in ngram_counts.items() if count > 1
+        ) / max(ng_char_count, 1)
+        attrs.fraction_of_characters_in_duplicate_ngrams.append((n, value))
+""", block="block", language="python"),
+),
+Details(
+Summary("Implementations from RedPajama-V2"),
+D_code("""
+class Base_RPS_Frac_Chars_In_Dupe_NGrams(RPSBase):  # noqa
+    ## Base class for calculating the fraction of characters in duplicate word N-grams. This operates on the lower-cased, punctuation-removed content. The function also ensures that characters in overlapping ngrams are only counted once.
+    NGRAM_SIZE: int = None
+    __slots__ = []
+
+    def __call__(self, document: Document) -> SignalType:
+        if self.NGRAM_SIZE is None:
+            raise NotImplementedError(
+                "NGRAM_SIZE must be set in the subclass"
+            )
+
+        if len(document.normalized_words) < self.NGRAM_SIZE:
+            return [(0, len(document), 0.0)]
+
+        # fetch the ngrams from the document if they exist, otherwise
+        # compute them
+        doc_n_grams = (
+            getattr(document, f"norm_{self.NGRAM_SIZE}grams", None)
+            or
+            tuple(form_ngrams(
+                iter(document.normalized_words), self.NGRAM_SIZE
+            ))
+        )
+
+        # keep only ngrams which occur at least twice
+        ngram_dupes = {
+            ngram for ngram, count in Counter(doc_n_grams).items() if count > 1
+        }
+
+        duplicated_grams = np.zeros(len(document.normalized_words), dtype=int)
+
+        i = 0
+        for ngram in doc_n_grams:
+            if ngram in ngram_dupes:
+                duplicated_grams[i: i + self.NGRAM_SIZE] = 1
+
+            i += 1
+
+        word_lengths = np.array(list(map(len, document.normalized_words)))
+        chars_duped = np.sum(word_lengths * duplicated_grams)
+        total_chars = np.sum(word_lengths)
+
+        if total_chars == 0:
+            return [(0, len(document), 0.0)]
+
+        score = float(chars_duped / total_chars)
+        score = round(score, PRECISION)
+        return [(0, len(document), score)]
+""", block="block", language="python"),
+),
+
+Details(
+Summary("Implementations from DataTrove"),
+D_code("""
+def find_all_duplicate(words: list[str], n: int) -> int:
+    n_words = len(words)
+    unique = set()
+    repeated_chars, idx = 0, 0
+    while idx < n_words - n + 1:
+        n_gram = "".join(words[idx : idx + n])
+        if n_gram in unique:
+            repeated_chars += len(n_gram)
+            idx += n
+        else:
+            unique.add(n_gram)
+            idx += 1
+    assert repeated_chars <= len("".join(words))
+    return repeated_chars
+...
+for n, n_frac in self.dup_n_grams:
+    n_duplicates_char = find_all_duplicate(words, n)
+    if n_duplicates_char / len(text) > n_frac:
+        return False, f"duplicated_n_grams"
+""", block="block", language="python"),
+),
 P("""
 For the computation of the fraction of characters in duplicate n-grams, Dolma uses the number of characters in all
 n-grams (with overlapping) as the denominator, and uses the number of characters in all duplicated n-grams
-(with overlapping) as the numerator. RedPajama V2 uses the number of all characters in (the words of) the document
+(with overlapping) as the numerator."""),
+P("""RedPajama V2 uses the number of all characters in (the words of) the document
 (without overlapping) as the denominator, and uses the number of characters that are recognized as part of the
-duplicate n-gram as the numerator. Datatrove uses the number of all characters in the document (including white
+duplicate n-grams as the numerator."""),
+P("""DataTrove uses the number of all characters in the document (including white
 spaces, without overlapping) as the denominator, and uses the number of characters that are recognized as
 duplicate n-grams as the numerator. However, there is a mismatch in DataTrove's calculation, as the number of
 characters in the duplicated n-grams excludes white spaces, while the total character count of the document
-does not.
-We decided to use the RedPajama V2 implementation but skip the 1st occurrence of the duplicate n-gram.
+does not."""),
+
+P("""We decided to use the RedPajama V2 implementation but skip the first occurrence of each duplicate n-gram.
 """),
-H5("Our Implementations"),
-H5("An Example to Show the Difference Between Above Implementations"),
-P("..."), # Add specific examples if available
+Details(
+Summary("TxT360 Implementation"),
+D_code("""
+def get_dup_ngram_frac(n, doc_n_grams, text):
+    # fetch the ngrams from the document if they exist, otherwise compute them
+    # doc_n_grams = list(zip(*[words[i:] for i in range(n)]))
+
+    duplicated_grams = np.zeros(len(text.split()), dtype=int)
+
+    unique_ngrams = set()
+
+    for i, ngram in enumerate(doc_n_grams):
+        if ngram in unique_ngrams:
+            duplicated_grams[i: i + n] = 1
+        else:
+            unique_ngrams.add(ngram)
+
+    word_lengths = np.array(list(map(len, text.split())))
+    chars_duped = np.sum(word_lengths * duplicated_grams)
+    total_chars = np.sum(word_lengths)
+
+    return float(chars_duped / total_chars)
+
+def all_ngram_counts_new(words) -> List[Tuple[int, List[Tuple[str, ...]]]]:
+    return [(n, list(zip(*[words[i:] for i in range(n)]))) for n in range(2, 11)]
+...
+all_counts = all_ngram_counts_new(words)
+count_most_common_ngrams = (2, 3, 4)
+for n, ngram_counts in all_counts:
+    if not ngram_counts:
+        continue
+    if n in count_most_common_ngrams:
+        ...
+    else:
+        score = get_dup_ngram_frac(n, ngram_counts, text)
+        attrs.fraction_of_characters_in_duplicate_ngrams.append((n, score))
+""", block="block", language="python"),
+),
+Details(
+Summary("An example to show the difference between the above implementations"),
+P("""
+Considering n = 5 and the sample sentence:
+
+"word_a word_b word_c word_d word_e word_f word_g word_a word_b word_c word_d word_e word_f word_g word_a word_b word_c"
+
+In Dolma's implementation, there are 13 5-grams in total, 6 of which are duplicates; since every word here has the same length, the character weighting cancels and the resulting fraction of characters in duplicate 5-grams is 6/13.
+In RedPajama V2's implementation, there are 17*6 characters in total, and every word falls inside at least one 5-gram that occurs twice, so all 17*6 characters are marked as duplicated. The fraction is 17/17 = 1.
+In DataTrove's implementation, there are 17*6 + 16 (white space) characters in total, and 10*6 characters are counted as duplicated after skipping the first occurrence of each repeated 5-gram. The resulting fraction is 10*6/(17*6+16).
+
+In our implementation, there are 17*6 characters in total, with 10*6 characters counted as duplicated after excluding the first occurrence. This results in a fraction of 10/17.
+"""),
+),
+H4("
 H5(
 "Sample Documents Filtered by the Fraction of Characters in Duplicated N-grams (n=5,...,10)"
 ),
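The 10/17 figure in the example above can be reproduced with a short standalone sketch that mirrors get_dup_ngram_frac (plain lists instead of NumPy; the sentence is the one from the example):

    # TxT360 convention: mark only occurrences after the first of each duplicate 5-gram.
    text = ("word_a word_b word_c word_d word_e word_f word_g "
            "word_a word_b word_c word_d word_e word_f word_g "
            "word_a word_b word_c")
    words = text.split()
    n = 5

    ngrams = list(zip(*[words[i:] for i in range(n)]))
    duplicated = [0] * len(words)
    seen = set()
    for i, ngram in enumerate(ngrams):
        if ngram in seen:
            duplicated[i:i + n] = [1] * len(duplicated[i:i + n])
        else:
            seen.add(ngram)

    chars_duped = sum(len(w) for w, d in zip(words, duplicated) if d)
    total_chars = sum(len(w) for w in words)
    print(chars_duped / total_chars)  # 60/102 = 10/17 ≈ 0.588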
 
@@ -722,6 +1023,71 @@
 works ([2], [3], [6]), we remove the documents if more than 30% of the lines end with an ellipsis or more than
 90% of lines start with a bullet point.
 """),
+Details(
+Summary("Ellipsis Symbol Identification Implementations"),
+P("Dolma: "),
+D_code("""
+ELLIPSIS_SYMBOLS = ("…",)
+""", block="block", language="python"),
+P("RedPajamaV2: "),
+D_code("""
+ELLIPSIS_SYMBOLS = ("...", "…")
+""", block="block", language="python"),
+P("DataTrove: "),
+D_code("""
+ELLIPSIS_SYMBOLS = ("...", "…")
+""", block="block", language="python"),
+P("TxT360: "),
+D_code("""
+ELLIPSIS_SYMBOLS = ("...", "…", "[...]", "[…]")
+""", block="block", language="python"),
+),
+Details(
+Summary("Bullet Point Identification Implementations"),
+P("Dolma: "),
+D_code("""
+BULLET_POINTS = ("*", "-")
+""", block="block", language="python"),
+P("RedPajamaV2: "),
+D_code("""
+BULLET_POINT_SYMBOLS = (
+    "•",  # bullet point
+    "‣",  # triangular bullet point
+    "▶",  # black right pointing triangle
+    "◀",  # black left pointing triangle
+    "◦",  # white bullet point
+    "■",  # black square
+    "□",  # white square
+    "▪",  # black small square
+    "▫",  # white small square
+    "–",  # en dash
+)
+""", block="block", language="python"),
+P("DataTrove: "),
+D_code("""
+BULLET_POINT_SYMBOLS = ("•", "-")
+""", block="block", language="python"),
+P("TxT360: "),
+D_code("""
+BULLET_POINT_SYMBOLS = (
+    "•",  # bullet point
+    "‣",  # triangular bullet point
+    "▶",  # black right pointing triangle
+    "◀",  # black left pointing triangle
+    "◦",  # white bullet point
+    "■",  # black square
+    "□",  # white square
+    "▪",  # black small square
+    "▫",  # white small square
+    "-",  # hyphen
+    "–",  # en dash
+    "—",  # em dash
+    "*",  # star
+)
+""", block="block", language="python"),
+),
+
+
 Details(
 Summary("Sample documents that are filtered out by line-wise heuristics"),
 DV(
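A minimal sketch of the line-wise rule stated above (more than 30% of lines ending with an ellipsis, or more than 90% starting with a bullet point), assuming the TxT360 symbol tuples; keep_document is a hypothetical helper name, not code from web.py:

    # Sketch of the line-wise heuristics described above; symbol tuples as in TxT360.
    ELLIPSIS_SYMBOLS = ("...", "…", "[...]", "[…]")
    BULLET_POINT_SYMBOLS = ("•", "‣", "▶", "◀", "◦", "■", "□", "▪", "▫", "-", "–", "—", "*")

    def keep_document(text: str) -> bool:
        lines = [line.strip() for line in text.split("\n") if line.strip()]
        if not lines:
            return False
        ellipsis_frac = sum(line.endswith(ELLIPSIS_SYMBOLS) for line in lines) / len(lines)
        bullet_frac = sum(line.startswith(BULLET_POINT_SYMBOLS) for line in lines) / len(lines)
        # Thresholds from the text: >30% ellipsis endings or >90% bullet starts.
        return ellipsis_frac <= 0.3 and bullet_frac <= 0.9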
 
@@ -730,6 +1096,7 @@
 "Sample documents that are filtered out by line-wise heuristics",
 ),
 ),
+
 H4("3.3 Statistics-based Heuristics"),
 P("We summarize other statistics-based rules originating from Gopher [7] in this section. The statistics used include:"),
 Ul(
 
@@ -753,17 +1120,51 @@
 Details(
 Summary("Implementations from Dolma"),
 D_code("""
+words = text.split()
+word_count = len(words)
 """, block="block", language="python"),
 ),
 Details(
 Summary("Implementations from RedPajama-V2"),
 D_code("""
+# the normalized content: lowercased and punctuation removed
+self._normalized_content = normalize(content)
+self._normalized_words = tuple(self._normalized_content.split())
+self._num_normalized_words = len(self._normalized_words)
+
+...
+def normalize(
+    text: str,
+    remove_punct: bool = True,
+    lowercase: bool = True,
+    nfd_unicode: bool = True,
+    white_space: bool = True
+) -> str:
+    # Normalize the text by lowercasing and removing punctuation.
+    # remove punctuation
+    if remove_punct:
+        text = text.translate(TRANSLATION_TABLE_PUNCTUATION)
+    # lowercase
+    if lowercase:
+        text = text.lower()
+    if white_space:
+        text = text.strip()
+        text = re.sub(r"\s+", " ", text)
+    # NFD unicode normalization
+    if nfd_unicode:
+        text = unicodedata.normalize("NFD", text)
+    return text
 """, block="block", language="python"),
 ),

 Details(
 Summary("Implementations from DataTrove"),
 D_code("""
+words = self.tokenizer.word_tokenize(text)
+n_words = len(words)
+
+non_symbol_words = [w for w in words if any(ch not in PUNCTUATION_SET for ch in w)]
+n_non_symbol_words_words = len(non_symbol_words)
 """, block="block", language="python"),
 ),
 P("""
 
@@ -798,6 +1199,16 @@
 Details(
 Summary("Implementations from RedPajama-V2"),
 D_code("""
+class RPS_Doc_Num_Sentences(RPSBase):  # noqa
+    ## The number of sentences in the content. This is calculated using the regex r'[^.!?]+[.!?]*'
+    SENT_PATTERN = re.compile(r'[^.!?]+[.!?]*', flags=re.UNICODE)
+
+    __slots__ = ()
+
+    def __call__(self, document: Document) -> SignalType:
+        ## count the number of sentences in the content using regex
+        score = float(len(self.SENT_PATTERN.findall(document.raw_content)))
+        return [(0, len(document), score)]
 """, block="block", language="python"),
 ),
 P("""
 
@@ -807,6 +1218,13 @@
 Details(
 Summary("TxT360 Implementation"),
 D_code("""
+from nltk.tokenize import sent_tokenize
+...
+def count_sentences(text):
+    sentences = sent_tokenize(text)
+    return len(sentences)
+...
+attrs.num_of_sentences = count_sentences(text)
 """, block="block", language="python"),
 ),
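The regex and NLTK approaches above count sentences differently around abbreviations. A small comparison sketch (requires nltk with the punkt models downloaded; the sample sentence is made up):

    import re
    from nltk.tokenize import sent_tokenize  # needs the NLTK punkt models

    SENT_PATTERN = re.compile(r"[^.!?]+[.!?]*", flags=re.UNICODE)

    text = "Dr. Smith visited Washington D.C. last week. It rained."
    print(len(SENT_PATTERN.findall(text)))  # 5: the regex splits at every '.'
    print(len(sent_tokenize(text)))         # typically 2: punkt keeps the abbreviations intact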
 
 
@@ -818,22 +1236,57 @@
 Details(
 Summary("Implementations from Dolma"),
 D_code("""
+SYMBOLS = ("#", "…")
+...
+attrs.symbol_to_word_ratio = sum(1 for word in words if any(s in word for s in SYMBOLS)) / max(
+    word_count, 1
+)
 """, block="block", language="python"),
 ),
 Details(
 Summary("Implementations from RedPajama-V2"),
 D_code("""
+class RPS_Doc_Symbol_To_Word_Ratio(RPSBase):  # noqa
+    ## The ratio of symbols to words in the content. This is analogous to
+    ## the signal used in Gopher. Symbols are defined "#", "...", and "…".
+    SYMBOLS = ("#", "...", "…")
+
+    __slots__ = ()
+
+    def __call__(self, document: Document) -> SignalType:
+        num_words = document.num_raw_words
+
+        if num_words == 0:
+            return [(0, len(document), None)]
+
+        # count the number of symbols in the content
+        num_symbols = float(sum(
+            document.raw_content.count(x) for x in self.SYMBOLS
+        ))
+
+        score = num_symbols / num_words
+        score = round(score, PRECISION)
+        return [(0, len(document), score)]
 """, block="block", language="python"),
 ),

 Details(
 Summary("Implementations from DataTrove"),
 D_code("""
+if self.max_symbol_word_ratio and text.count("#") / n_words > self.max_symbol_word_ratio:
+    return False, "gopher_too_many_hashes"
+if self.max_symbol_word_ratio and (text.count("...") + text.count("…")) / n_words > self.max_symbol_word_ratio:
+    return False, "gopher_too_many_ellipsis"
 """, block="block", language="python"),
 ),
 Details(
 Summary("TxT360 Implementation"),
 D_code("""
+SYMBOLS = ("#", "...", "…")
+...
+symbol_pattern = re.compile("|".join(re.escape(symbol) for symbol in SYMBOLS))
+...
+attrs.symbol_to_word_ratio = sum(1 for word in words if symbol_pattern.search(word)) / word_count
 """, block="block", language="python"),
 ),
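The occurrence-based and word-based counting styles above yield different ratios on the same input; a standalone illustration with a made-up string:

    import re

    SYMBOLS = ("#", "...", "…")
    text = "check #tag1 #tag2 wait... more......"
    words = text.split()  # 5 words
    symbol_pattern = re.compile("|".join(re.escape(s) for s in SYMBOLS))

    # Occurrence-based (RedPajama-V2 style): every "#" and "..." counts.
    num_symbols = sum(text.count(s) for s in SYMBOLS)               # 2 + 3 = 5

    # Word-based (Dolma/TxT360 style): a word with several symbols counts once.
    num_symbol_words = sum(1 for w in words if symbol_pattern.search(w))  # 4

    print(num_symbols / len(words))        # 5/5 = 1.0
    print(num_symbol_words / len(words))   # 4/5 = 0.8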