HugoLaurencon committed
Commit
a3825e5
1 Parent(s): 5b8f851

first commit

.gitattributes CHANGED
@@ -25,3 +25,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
25
  *.zip filter=lfs diff=lfs merge=lfs -text
26
  *.zstandard filter=lfs diff=lfs merge=lfs -text
27
  *tfevents* filter=lfs diff=lfs merge=lfs -text
28
+ *.json filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,2 @@
1
+ *cpython-39.pyc
2
+ .DS_Store
LICENSE ADDED
@@ -0,0 +1,204 @@
1
+ ------------- LICENSE FOR Bigscience code --------------
2
+
3
+
4
+ Apache License
5
+ Version 2.0, January 2004
6
+ http://www.apache.org/licenses/
7
+
8
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
9
+
10
+ 1. Definitions.
11
+
12
+ "License" shall mean the terms and conditions for use, reproduction,
13
+ and distribution as defined by Sections 1 through 9 of this document.
14
+
15
+ "Licensor" shall mean the copyright owner or entity authorized by
16
+ the copyright owner that is granting the License.
17
+
18
+ "Legal Entity" shall mean the union of the acting entity and all
19
+ other entities that control, are controlled by, or are under common
20
+ control with that entity. For the purposes of this definition,
21
+ "control" means (i) the power, direct or indirect, to cause the
22
+ direction or management of such entity, whether by contract or
23
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
24
+ outstanding shares, or (iii) beneficial ownership of such entity.
25
+
26
+ "You" (or "Your") shall mean an individual or Legal Entity
27
+ exercising permissions granted by this License.
28
+
29
+ "Source" form shall mean the preferred form for making modifications,
30
+ including but not limited to software source code, documentation
31
+ source, and configuration files.
32
+
33
+ "Object" form shall mean any form resulting from mechanical
34
+ transformation or translation of a Source form, including but
35
+ not limited to compiled object code, generated documentation,
36
+ and conversions to other media types.
37
+
38
+ "Work" shall mean the work of authorship, whether in Source or
39
+ Object form, made available under the License, as indicated by a
40
+ copyright notice that is included in or attached to the work
41
+ (an example is provided in the Appendix below).
42
+
43
+ "Derivative Works" shall mean any work, whether in Source or Object
44
+ form, that is based on (or derived from) the Work and for which the
45
+ editorial revisions, annotations, elaborations, or other modifications
46
+ represent, as a whole, an original work of authorship. For the purposes
47
+ of this License, Derivative Works shall not include works that remain
48
+ separable from, or merely link (or bind by name) to the interfaces of,
49
+ the Work and Derivative Works thereof.
50
+
51
+ "Contribution" shall mean any work of authorship, including
52
+ the original version of the Work and any modifications or additions
53
+ to that Work or Derivative Works thereof, that is intentionally
54
+ submitted to Licensor for inclusion in the Work by the copyright owner
55
+ or by an individual or Legal Entity authorized to submit on behalf of
56
+ the copyright owner. For the purposes of this definition, "submitted"
57
+ means any form of electronic, verbal, or written communication sent
58
+ to the Licensor or its representatives, including but not limited to
59
+ communication on electronic mailing lists, source code control systems,
60
+ and issue tracking systems that are managed by, or on behalf of, the
61
+ Licensor for the purpose of discussing and improving the Work, but
62
+ excluding communication that is conspicuously marked or otherwise
63
+ designated in writing by the copyright owner as "Not a Contribution."
64
+
65
+ "Contributor" shall mean Licensor and any individual or Legal Entity
66
+ on behalf of whom a Contribution has been received by Licensor and
67
+ subsequently incorporated within the Work.
68
+
69
+ 2. Grant of Copyright License. Subject to the terms and conditions of
70
+ this License, each Contributor hereby grants to You a perpetual,
71
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
72
+ copyright license to reproduce, prepare Derivative Works of,
73
+ publicly display, publicly perform, sublicense, and distribute the
74
+ Work and such Derivative Works in Source or Object form.
75
+
76
+ 3. Grant of Patent License. Subject to the terms and conditions of
77
+ this License, each Contributor hereby grants to You a perpetual,
78
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
79
+ (except as stated in this section) patent license to make, have made,
80
+ use, offer to sell, sell, import, and otherwise transfer the Work,
81
+ where such license applies only to those patent claims licensable
82
+ by such Contributor that are necessarily infringed by their
83
+ Contribution(s) alone or by combination of their Contribution(s)
84
+ with the Work to which such Contribution(s) was submitted. If You
85
+ institute patent litigation against any entity (including a
86
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
87
+ or a Contribution incorporated within the Work constitutes direct
88
+ or contributory patent infringement, then any patent licenses
89
+ granted to You under this License for that Work shall terminate
90
+ as of the date such litigation is filed.
91
+
92
+ 4. Redistribution. You may reproduce and distribute copies of the
93
+ Work or Derivative Works thereof in any medium, with or without
94
+ modifications, and in Source or Object form, provided that You
95
+ meet the following conditions:
96
+
97
+ (a) You must give any other recipients of the Work or
98
+ Derivative Works a copy of this License; and
99
+
100
+ (b) You must cause any modified files to carry prominent notices
101
+ stating that You changed the files; and
102
+
103
+ (c) You must retain, in the Source form of any Derivative Works
104
+ that You distribute, all copyright, patent, trademark, and
105
+ attribution notices from the Source form of the Work,
106
+ excluding those notices that do not pertain to any part of
107
+ the Derivative Works; and
108
+
109
+ (d) If the Work includes a "NOTICE" text file as part of its
110
+ distribution, then any Derivative Works that You distribute must
111
+ include a readable copy of the attribution notices contained
112
+ within such NOTICE file, excluding those notices that do not
113
+ pertain to any part of the Derivative Works, in at least one
114
+ of the following places: within a NOTICE text file distributed
115
+ as part of the Derivative Works; within the Source form or
116
+ documentation, if provided along with the Derivative Works; or,
117
+ within a display generated by the Derivative Works, if and
118
+ wherever such third-party notices normally appear. The contents
119
+ of the NOTICE file are for informational purposes only and
120
+ do not modify the License. You may add Your own attribution
121
+ notices within Derivative Works that You distribute, alongside
122
+ or as an addendum to the NOTICE text from the Work, provided
123
+ that such additional attribution notices cannot be construed
124
+ as modifying the License.
125
+
126
+ You may add Your own copyright statement to Your modifications and
127
+ may provide additional or different license terms and conditions
128
+ for use, reproduction, or distribution of Your modifications, or
129
+ for any such Derivative Works as a whole, provided Your use,
130
+ reproduction, and distribution of the Work otherwise complies with
131
+ the conditions stated in this License.
132
+
133
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
134
+ any Contribution intentionally submitted for inclusion in the Work
135
+ by You to the Licensor shall be under the terms and conditions of
136
+ this License, without any additional terms or conditions.
137
+ Notwithstanding the above, nothing herein shall supersede or modify
138
+ the terms of any separate license agreement you may have executed
139
+ with Licensor regarding such Contributions.
140
+
141
+ 6. Trademarks. This License does not grant permission to use the trade
142
+ names, trademarks, service marks, or product names of the Licensor,
143
+ except as required for reasonable and customary use in describing the
144
+ origin of the Work and reproducing the content of the NOTICE file.
145
+
146
+ 7. Disclaimer of Warranty. Unless required by applicable law or
147
+ agreed to in writing, Licensor provides the Work (and each
148
+ Contributor provides its Contributions) on an "AS IS" BASIS,
149
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
150
+ implied, including, without limitation, any warranties or conditions
151
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
152
+ PARTICULAR PURPOSE. You are solely responsible for determining the
153
+ appropriateness of using or redistributing the Work and assume any
154
+ risks associated with Your exercise of permissions under this License.
155
+
156
+ 8. Limitation of Liability. In no event and under no legal theory,
157
+ whether in tort (including negligence), contract, or otherwise,
158
+ unless required by applicable law (such as deliberate and grossly
159
+ negligent acts) or agreed to in writing, shall any Contributor be
160
+ liable to You for damages, including any direct, indirect, special,
161
+ incidental, or consequential damages of any character arising as a
162
+ result of this License or out of the use or inability to use the
163
+ Work (including but not limited to damages for loss of goodwill,
164
+ work stoppage, computer failure or malfunction, or any and all
165
+ other commercial damages or losses), even if such Contributor
166
+ has been advised of the possibility of such damages.
167
+
168
+ 9. Accepting Warranty or Additional Liability. While redistributing
169
+ the Work or Derivative Works thereof, You may choose to offer,
170
+ and charge a fee for, acceptance of support, warranty, indemnity,
171
+ or other liability obligations and/or rights consistent with this
172
+ License. However, in accepting such obligations, You may act only
173
+ on Your own behalf and on Your sole responsibility, not on behalf
174
+ of any other Contributor, and only if You agree to indemnify,
175
+ defend, and hold each Contributor harmless for any liability
176
+ incurred by, or claims asserted against, such Contributor by reason
177
+ of your accepting any such warranty or additional liability.
178
+
179
+ END OF TERMS AND CONDITIONS
180
+
181
+ APPENDIX: How to apply the Apache License to your work.
182
+
183
+ To apply the Apache License to your work, attach the following
184
+ boilerplate notice, with the fields enclosed by brackets "[]"
185
+ replaced with your own identifying information. (Don't include
186
+ the brackets!) The text should be enclosed in the appropriate
187
+ comment syntax for the file format. We also recommend that a
188
+ file or class name and description of purpose be included on the
189
+ same "printed page" as the copyright notice for easier
190
+ identification within third-party archives.
191
+
192
+ Copyright [2021] [Bigscience]
193
+
194
+ Licensed under the Apache License, Version 2.0 (the "License");
195
+ you may not use this file except in compliance with the License.
196
+ You may obtain a copy of the License at
197
+
198
+ http://www.apache.org/licenses/LICENSE-2.0
199
+
200
+ Unless required by applicable law or agreed to in writing, software
201
+ distributed under the License is distributed on an "AS IS" BASIS,
202
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
203
+ See the License for the specific language governing permissions and
204
+ limitations under the License.
README.md CHANGED
@@ -1,8 +1,8 @@
  ---
  title: Text Data Filtering
- emoji:
- colorFrom: red
- colorTo: yellow
+ emoji: 👁
+ colorFrom: blue
+ colorTo: pink
  sdk: streamlit
  app_file: app.py
  pinned: false
@@ -10,36 +10,28 @@ pinned: false

  # Configuration

  `title`: _string_
  Display title for the Space

  `emoji`: _string_
  Space emoji (emoji-only character allowed)

  `colorFrom`: _string_
  Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray)

  `colorTo`: _string_
  Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray)

  `sdk`: _string_
- Can be either `gradio`, `streamlit`, or `static`
+ Can be either `gradio` or `streamlit`

  `sdk_version` : _string_
  Only applicable for `streamlit` SDK.
  See [doc](https://hf.co/docs/hub/spaces) for more info on supported versions.

  `app_file`: _string_
- Path to your main application file (which contains either `gradio` or `streamlit` Python code, or `static` html code).
+ Path to your main application file (which contains either `gradio` or `streamlit` Python code).
  Path is relative to the root of the repository.

- `models`: _List[string]_
- HF model IDs (like "gpt2" or "deepset/roberta-base-squad2") used in the Space.
- Will be parsed automatically from your code if not specified here.
-
- `datasets`: _List[string]_
- HF dataset IDs (like "common_voice" or "oscar-corpus/OSCAR-2109") used in the Space.
- Will be parsed automatically from your code if not specified here.
-
  `pinned`: _boolean_
  Whether the Space stays on top of your list.
app.py ADDED
@@ -0,0 +1,916 @@
1
+ # Run with: streamlit run app.py
2
+
3
+ import streamlit as st
4
+
5
+ import os
6
+
7
+ from io import StringIO
8
+ import base64
9
+ import json
10
+ import pandas as pd
11
+
12
+ pd.options.mode.chained_assignment = None
13
+
14
+ import numpy as np
15
+
16
+ import matplotlib.pyplot as plt
17
+
18
+ from filtering import LoadParameters, ModifyingDocuments, Filtering
19
+ from languages_id import langs_id
20
+
21
+
22
+ class Visualization_for_lang:
23
+ def __init__(
24
+ self,
25
+ path_data,
26
+ lang,
27
+ num_docs,
28
+ num_docs_for_words,
29
+ max_len_text_display,
30
+ lang_dataset_id,
31
+ path_fasttext_model,
32
+ path_sentencepiece_model,
33
+ path_kenlm_model,
34
+ ):
35
+ self.path_data = path_data
36
+ self.lang = lang
37
+ self.num_docs = num_docs
38
+ self.num_docs_for_words = num_docs_for_words
39
+ self.max_len_text_display = max_len_text_display
40
+
41
+ self.lang_dataset_id = lang_dataset_id
42
+ self.param = LoadParameters.load_parameters(lang_dataset_id)
43
+ self.stopwords = LoadParameters.load_stopwords(lang_dataset_id)
44
+ self.flagged_words = LoadParameters.load_flagged_words(lang_dataset_id)
45
+ self.model_lang_id = LoadParameters.load_model_lang_id(
46
+ lang_dataset_id, path_fasttext_model
47
+ )
48
+ self.sentencepiece_model = LoadParameters.load_sentencepiece_model(
49
+ lang_dataset_id, path_sentencepiece_model
50
+ )
51
+ self.sentencepiece_model_tok = (
52
+ self.sentencepiece_model if self.param["tokenization"] else None
53
+ )
54
+ self.kenlm_model = LoadParameters.load_kenlm_model(
55
+ lang_dataset_id, path_kenlm_model
56
+ )
57
+
58
+ def set_title(self):
59
+ st.title(f"Filtering visualization for {self.lang}")
60
+
61
+ def open_data(self):
62
+ with open(self.path_data) as json_file:
63
+ data = json.load(json_file)
64
+
65
+ self.num_docs = min(self.num_docs, len(data))
66
+ self.num_docs_for_words = min(self.num_docs_for_words, len(data))
67
+
68
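+ # Word-level statistics are much heavier than document-level ones, so they are only kept for the first num_docs_for_words documents.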
+ if "words" in data[0]:
69
+ words = [doc["words"] for doc in data[: self.num_docs_for_words]]
70
+ words = [word for doc in words for word in doc]
71
+ self.words = pd.DataFrame(words)
72
+ else:
73
+ self.words = None
74
+
75
+ docs = data[: self.num_docs]
76
+ for doc in docs:
77
+ if not (self.words is None):
78
+ del doc["words"]
79
+ if len(doc["text"]) > self.max_len_text_display:
80
+ doc["text"] = (
81
+ doc["text"][: self.max_len_text_display]
82
+ + " [...] [THIS LONG TEXT HAS BEEN TRUNCATED FOR DISPLAY REASONS]"
83
+ )
84
+ self.docs_checkpoint = pd.DataFrame(docs)
85
+ self.docs = self.docs_checkpoint
86
+
87
+ @staticmethod
88
+ def print_discarded_by_cond(cond):
89
+ st.caption(
90
+ f"{(len(cond) - np.sum(1*cond)) / len(cond) * 100:.2f}% of the total is discarded with this filter."
91
+ )
92
+
93
+ @staticmethod
94
+ def plot_hist(dataframe, key, num_bins=50):
95
+ checkbox = st.checkbox(
96
+ "Diplay distribution", value=True, key=f"display_distribution_{key[0]}"
97
+ )
98
+ if checkbox:
99
+ fig, ax = plt.subplots()
100
+ val = dataframe[key[0]].values
101
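+ # Drop extreme outliers (more than ~9 median absolute deviations from the median) so that a few huge values do not stretch the histogram.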
+ if np.median(val) != 0:
102
+ val = val[
103
+ abs(val - np.median(val))
104
+ < 9 * np.median(np.absolute(val - np.median(val)))
105
+ ]
106
+ ax.hist(val, bins=num_bins, density=True)
107
+ ax.set_title(" ".join(key[0].split("_")))
108
+ ax.axvline(x=key[1], color="r", linestyle="dashed")
109
+ st.pyplot(fig)
110
+
111
+ @staticmethod
112
+ def display_dataset(dataframe, cond, description, type_of_examples):
113
+ displayed_examples = dataframe.loc[cond]
114
+ st.subheader(
115
+ f"{description}: {len(displayed_examples)} {type_of_examples} ({len(displayed_examples) / len(dataframe.index) * 100:.2f}%)"
116
+ )
117
+ st.markdown(
118
+ "Click on a column to sort by it, place the cursor on the text to display it."
119
+ )
120
+ st.dataframe(displayed_examples)
121
+
122
+ def filtering_of_docs(self):
123
+ def set_sliders():
124
+ columns = list(self.docs)
125
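+ # keys collects the (statistic, cutoff, is_max_cutoff[, repetition_length]) tuples chosen with the sliders; conds maps each statistic to the boolean masks its cutoffs produce.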
+ keys = []
126
+ conds = {}
127
+
128
+ def get_cond(key, cutoff, max_cutoff):
129
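+ # max_cutoff=True means the cutoff is an upper bound (keep documents at or below it); otherwise it is a lower bound.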
+ if max_cutoff:
130
+ return self.docs[key] <= cutoff
131
+ return self.docs[key] >= cutoff
132
+
133
+ if "number_words" in columns:
134
+ with st.sidebar.expander("Number of words"):
135
+ cutoff_def = "If the number of words of a document is lower than this number, the document is removed."
136
+ max_nb_words = int(np.max(self.docs["number_words"])) + 1
137
+ cutoff_min_number_words = st.slider(
138
+ cutoff_def, 0, min(max_nb_words, 500), 0
139
+ )
140
+ new_key = ("number_words", cutoff_min_number_words, False)
141
+ keys.append(new_key)
142
+ Visualization_for_lang.plot_hist(self.docs, new_key)
143
+ cond_1 = get_cond(new_key[0], new_key[1], new_key[2])
144
+ Visualization_for_lang.print_discarded_by_cond(cond_1)
145
+
146
+ cutoff_def = "If the number of words of a document is higher than this number, the document is removed."
147
+ cutoff_max_number_words = st.slider(
148
+ cutoff_def, 0, max_nb_words, max_nb_words
149
+ )
150
+ new_key = ("number_words", cutoff_max_number_words, True)
151
+ keys.append(new_key)
152
+ cond_2 = get_cond(new_key[0], new_key[1], new_key[2])
153
+ Visualization_for_lang.print_discarded_by_cond(cond_2)
154
+
155
+ conds["number_words"] = [cond_1, cond_2]
156
+
157
+ if "character_repetition_ratio" in columns:
158
+ with st.sidebar.expander("Character repetition ratio"):
159
+ val_repetitions_lengths = list(
160
+ self.docs["character_repetition_ratio"].iloc[0].keys()
161
+ )
162
+ default_index = (
163
+ val_repetitions_lengths.index("10")
164
+ if "10" in val_repetitions_lengths
165
+ else 0
166
+ )
167
+ label_selectbox = "Length of repetitions in characters (that will influence the character repetition ratio)."
168
+ repetitions_length = st.selectbox(
169
+ label=label_selectbox,
170
+ options=val_repetitions_lengths,
171
+ index=default_index,
172
+ )
173
+ st.caption(
174
+ "Choosing a higher or lower number does not mean that the filtering "
175
+ "is stronger or weaker. Be careful, choosing a low number (below 5 for languages like English) "
176
+ "tends to associate a high character repetition ratio to very long documents (like book chapters), but with "
177
+ "few or no repetitions, simply because their length gives them more diversity, and we do "
178
+ "not want to discard such documents. It is generally better to increase this number, so that false "
179
+ "positives are very short documents (which we want to delete anyway) rather than long ones. However, "
180
+ "a low number can be useful for Chinese, where a character can designate a whole word."
181
+ )
182
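+ # The stored statistic is a dict of ratios keyed by repetition length: restore it from the checkpoint, then keep only the ratio for the selected length.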
+ self.docs["character_repetition_ratio"] = self.docs_checkpoint[
183
+ "character_repetition_ratio"
184
+ ]
185
+ for i in range(len(self.docs["character_repetition_ratio"])):
186
+ self.docs["character_repetition_ratio"].iloc[i] = self.docs[
187
+ "character_repetition_ratio"
188
+ ].iloc[i][repetitions_length]
189
+
190
+ cutoff_def = "If the character repetition ratio of a document is higher than this number, the document is removed."
191
+ cutoff_character_repetition_ratio = st.slider(
192
+ cutoff_def, 0.0, 1.0, 1.0, step=0.01
193
+ )
194
+ new_key = (
195
+ "character_repetition_ratio",
196
+ cutoff_character_repetition_ratio,
197
+ True,
198
+ repetitions_length,
199
+ )
200
+ keys.append(new_key)
201
+ Visualization_for_lang.plot_hist(self.docs, new_key)
202
+ cond = get_cond(new_key[0], new_key[1], new_key[2])
203
+ Visualization_for_lang.print_discarded_by_cond(cond)
204
+ conds["character_repetition_ratio"] = [cond]
205
+
206
+ if "word_repetition_ratio" in columns:
207
+ with st.sidebar.expander("Word repetition ratio"):
208
+ val_repetitions_lengths = list(
209
+ self.docs["word_repetition_ratio"].iloc[0].keys()
210
+ )
211
+ default_index = (
212
+ val_repetitions_lengths.index("5")
213
+ if "5" in val_repetitions_lengths
214
+ else 0
215
+ )
216
+ label_selectbox = "Length of repetitions in words (that will influence the word repetition ratio)."
217
+ repetitions_length = st.selectbox(
218
+ label=label_selectbox,
219
+ options=val_repetitions_lengths,
220
+ index=default_index,
221
+ )
222
+ st.caption(
223
+ "Choosing a higher or lower number does not mean that the filtering "
224
+ "is stronger or weaker. Be careful, choosing a low number (like 3) could "
225
+ "tend to associate a high word repetition ratio to very long documents (like book chapters), but with "
226
+ "few or no repetitions, simply because their length gives them more diversity, and we do "
227
+ "not want to discard such documents. It is generally better to increase a bit this number, so that false "
228
+ "positives are very short documents (which we want to delete anyway) rather than long ones."
229
+ )
230
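+ # Same as for characters: restore the dict of ratios from the checkpoint and keep only the ratio for the selected word repetition length.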
+ self.docs["word_repetition_ratio"] = self.docs_checkpoint[
231
+ "word_repetition_ratio"
232
+ ]
233
+ for i in range(len(self.docs["word_repetition_ratio"])):
234
+ self.docs["word_repetition_ratio"].iloc[i] = self.docs[
235
+ "word_repetition_ratio"
236
+ ].iloc[i][repetitions_length]
237
+
238
+ cutoff_def = "If the word repetition ratio of a document is higher than this number, the document is removed."
239
+ cutoff_word_repetition_ratio = st.slider(
240
+ cutoff_def, 0.0, 1.0, 1.0, step=0.01
241
+ )
242
+ new_key = (
243
+ "word_repetition_ratio",
244
+ cutoff_word_repetition_ratio,
245
+ True,
246
+ repetitions_length,
247
+ )
248
+ keys.append(new_key)
249
+ Visualization_for_lang.plot_hist(self.docs, new_key)
250
+ cond = get_cond(new_key[0], new_key[1], new_key[2])
251
+ Visualization_for_lang.print_discarded_by_cond(cond)
252
+ conds["word_repetition_ratio"] = [cond]
253
+
254
+ if "special_characters_ratio" in columns:
255
+ with st.sidebar.expander("Special characters ratio"):
256
+ cutoff_def = "If the special characters ratio of a document is higher than this number, the document is removed."
257
+ cutoff_special_characters_ratio = st.slider(
258
+ cutoff_def, 0.0, 1.0, 1.0, step=0.01
259
+ )
260
+ new_key = (
261
+ "special_characters_ratio",
262
+ cutoff_special_characters_ratio,
263
+ True,
264
+ )
265
+ keys.append(new_key)
266
+ Visualization_for_lang.plot_hist(self.docs, new_key)
267
+ cond = get_cond(new_key[0], new_key[1], new_key[2])
268
+ Visualization_for_lang.print_discarded_by_cond(cond)
269
+ conds["special_characters_ratio"] = [cond]
270
+
271
+ if "stopwords_ratio" in columns:
272
+ with st.sidebar.expander("Stop words ratio"):
273
+ stopwords_file = st.file_uploader(
274
+ "Upload your own list of stop words (one per line). If there is none, the default one is used."
275
+ )
276
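+ # If a custom stop word list is uploaded, recompute the stop words ratio of every document with it.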
+ if stopwords_file:
277
+ new_stopwords = StringIO(
278
+ stopwords_file.getvalue().decode("utf-8")
279
+ ).read()
280
+ new_stopwords = set(new_stopwords.split("\n"))
281
+ self.docs["stopwords_ratio"] = self.docs_checkpoint[
282
+ "stopwords_ratio"
283
+ ]
284
+ for i in range(len(self.docs["stopwords_ratio"])):
285
+ self.docs["stopwords_ratio"].iloc[
286
+ i
287
+ ] = Filtering.compute_stopwords_ratio(
288
+ self.docs["text"].iloc[i],
289
+ self.sentencepiece_model_tok,
290
+ self.param["strip_characters"],
291
+ self.param["cond_words_augmentation"],
292
+ self.param["words_augmentation_group_sizes"],
293
+ self.param["words_augmentation_join_char"],
294
+ new_stopwords,
295
+ )
296
+ cutoff_def = "If the stop words ratio of a document is lower than this number, the document is removed."
297
+ cutoff_stopwords_ratio = st.slider(
298
+ cutoff_def, 0.0, 1.0, 0.0, step=0.01
299
+ )
300
+ new_key = ("stopwords_ratio", cutoff_stopwords_ratio, False)
301
+ keys.append(new_key)
302
+ Visualization_for_lang.plot_hist(self.docs, new_key)
303
+ cond = get_cond(new_key[0], new_key[1], new_key[2])
304
+ Visualization_for_lang.print_discarded_by_cond(cond)
305
+ conds["stopwords_ratio"] = [cond]
306
+
307
+ if "flagged_words_ratio" in columns:
308
+ with st.sidebar.expander("Flagged words ratio"):
309
+ flagged_words_file = st.file_uploader(
310
+ "Upload your own list of flagged words (one per line). If there is none, the default one is used."
311
+ )
312
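+ # Same logic as for stop words: recompute the flagged words ratio of every document with the uploaded list.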
+ if flagged_words_file:
313
+ new_flagged_words = StringIO(
314
+ flagged_words_file.getvalue().decode("utf-8")
315
+ ).read()
316
+ new_flagged_words = set(new_flagged_words.split("\n"))
317
+ self.docs["flagged_words_ratio"] = self.docs_checkpoint[
318
+ "flagged_words_ratio"
319
+ ]
320
+ for i in range(len(self.docs["flagged_words_ratio"])):
321
+ self.docs["flagged_words_ratio"].iloc[
322
+ i
323
+ ] = Filtering.compute_flagged_words_ratio(
324
+ self.docs["text"].iloc[i],
325
+ self.sentencepiece_model_tok,
326
+ self.param["strip_characters"],
327
+ self.param["cond_words_augmentation"],
328
+ self.param["words_augmentation_group_sizes"],
329
+ self.param["words_augmentation_join_char"],
330
+ new_flagged_words,
331
+ )
332
+ cutoff_def = "If the flagged words ratio of a document is higher than this number, the document is removed."
333
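+ # Round the largest observed ratio up to 3 decimal places so the slider's upper bound still covers it.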
+ max_fwr = np.max(self.docs["flagged_words_ratio"])
334
+ max_fwr = np.ceil(max_fwr * 1000) / 1000
335
+ max_fwr = float(max_fwr)
336
+ cutoff_flagged_words_ratio = st.slider(
337
+ cutoff_def,
338
+ 0.000,
339
+ max_fwr,
340
+ max_fwr,
341
+ step=0.001,
342
+ format="%f",
343
+ )
344
+ new_key = ("flagged_words_ratio", cutoff_flagged_words_ratio, True)
345
+ keys.append(new_key)
346
+ Visualization_for_lang.plot_hist(self.docs, new_key)
347
+ cond = get_cond(new_key[0], new_key[1], new_key[2])
348
+ Visualization_for_lang.print_discarded_by_cond(cond)
349
+ conds["flagged_words_ratio"] = [cond]
350
+
351
+ if "lang_id_score" in columns:
352
+ with st.sidebar.expander("Language ID confidence score"):
353
+ cutoff_def = "If the confidence score for the language identification prediction of a document is lower than this number, the document is removed."
354
+ cutoff_lang_id_score = st.slider(
355
+ cutoff_def, 0.0, 1.0, 0.0, step=0.01
356
+ )
357
+ new_key = ("lang_id_score", cutoff_lang_id_score, False)
358
+ keys.append(new_key)
359
+ Visualization_for_lang.plot_hist(self.docs, new_key)
360
+ cond = get_cond(new_key[0], new_key[1], new_key[2])
361
+ Visualization_for_lang.print_discarded_by_cond(cond)
362
+ conds["lang_id_score"] = [cond]
363
+
364
+ if "perplexity_score" in columns:
365
+ with st.sidebar.expander("Perplexity score"):
366
+ cutoff_def = "If the perplexity score of a document is higher than this number, the document is removed."
367
+ max_pp = int(np.max(self.docs["perplexity_score"])) + 1
368
+ cutoff_perplexity_score = st.slider(cutoff_def, 0, max_pp, max_pp)
369
+ new_key = ("perplexity_score", cutoff_perplexity_score, True)
370
+ keys.append(new_key)
371
+ Visualization_for_lang.plot_hist(self.docs, new_key)
372
+ cond = get_cond(new_key[0], new_key[1], new_key[2])
373
+ Visualization_for_lang.print_discarded_by_cond(cond)
374
+ conds["perplexity_score"] = [cond]
375
+
376
+ return keys, conds
377
+
378
+ with st.expander(
379
+ f"Filtering on documents, for {self.num_docs} {self.lang} documents"
380
+ ):
381
+ st.header(
382
+ f"Filtering on documents, for {self.num_docs} {self.lang} documents"
383
+ )
384
+
385
+ if "labels" in list(self.docs):
386
+ chosen_label = st.selectbox(
387
+ label="Consider only documents that include the following label",
388
+ options=[
389
+ "All",
390
+ "NA: Narrative",
391
+ "IN: Informational Description",
392
+ "OP: Opinion",
393
+ "ID: Interactive Discussion",
394
+ "HI: How-to/Instruction",
395
+ "IP: Informational Persuasion",
396
+ "LY: Lyrical",
397
+ "SP: Spoken",
398
+ ],
399
+ )
400
+ chosen_label = chosen_label.split(":")[0]
401
+ if chosen_label != "All":
402
+ cond_label = list(
403
+ self.docs["labels"].apply(
404
+ lambda x: True if chosen_label in x else False
405
+ )
406
+ )
407
+ self.docs = self.docs[cond_label]
408
+
409
+ if self.docs.empty:
410
+ st.markdown(
411
+ "No document to display, please try to select a different label."
412
+ )
413
+ self.keys = []
414
+ self.parameters = []
415
+
416
+ else:
417
+ st.sidebar.subheader("Parameters of the filtering on documents")
418
+ self.keys, conds = set_sliders()
419
+ self.parameters = self.keys * 1
420
+
421
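+ # A document is retained only if it passes every active filter: flatten the per-filter masks and AND them element-wise.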
+ all_conds = [
422
+ subcond for cond in list(conds.values()) for subcond in cond
423
+ ]
424
+ all_conds = np.all(all_conds, axis=0)
425
+
426
+ Visualization_for_lang.display_dataset(
427
+ self.docs, np.invert(all_conds), "Discarded documents", "docs"
428
+ )
429
+
430
+ # st.subheader("Display discarded documents by filter")
431
+ display_discarded_documents_by_filter = st.checkbox(
432
+ "Display discarded documents by filter"
433
+ )
434
+
435
+ if display_discarded_documents_by_filter:
436
+ columns = list(self.docs)
437
+
438
+ if "number_words" in columns:
439
+ cond_filter = np.invert(np.all(conds["number_words"], axis=0))
440
+ Visualization_for_lang.display_dataset(
441
+ self.docs,
442
+ cond_filter,
443
+ "Discarded documents for the filter on the number of words",
444
+ "docs",
445
+ )
446
+
447
+ if "character_repetition_ratio" in columns:
448
+ cond_filter = np.invert(
449
+ np.all(conds["character_repetition_ratio"], axis=0)
450
+ )
451
+ Visualization_for_lang.display_dataset(
452
+ self.docs,
453
+ cond_filter,
454
+ "Discarded documents for the filter on the character repetition ratio",
455
+ "docs",
456
+ )
457
+
458
+ if "word_repetition_ratio" in columns:
459
+ cond_filter = np.invert(
460
+ np.all(conds["word_repetition_ratio"], axis=0)
461
+ )
462
+ Visualization_for_lang.display_dataset(
463
+ self.docs,
464
+ cond_filter,
465
+ "Discarded documents for the filter on the word repetition ratio",
466
+ "docs",
467
+ )
468
+
469
+ if "special_characters_ratio" in columns:
470
+ cond_filter = np.invert(
471
+ np.all(conds["special_characters_ratio"], axis=0)
472
+ )
473
+ Visualization_for_lang.display_dataset(
474
+ self.docs,
475
+ cond_filter,
476
+ "Discarded documents for the filter on the special characters ratio",
477
+ "docs",
478
+ )
479
+
480
+ if "stopwords_ratio" in columns:
481
+ cond_filter = np.invert(
482
+ np.all(conds["stopwords_ratio"], axis=0)
483
+ )
484
+ Visualization_for_lang.display_dataset(
485
+ self.docs,
486
+ cond_filter,
487
+ "Discarded documents for the filter on the stop words ratio",
488
+ "docs",
489
+ )
490
+
491
+ if "flagged_words_ratio" in columns:
492
+ cond_filter = np.invert(
493
+ np.all(conds["flagged_words_ratio"], axis=0)
494
+ )
495
+ Visualization_for_lang.display_dataset(
496
+ self.docs,
497
+ cond_filter,
498
+ "Discarded documents for the filter on the flagged words ratio",
499
+ "docs",
500
+ )
501
+
502
+ if "lang_id_score" in columns:
503
+ cond_filter = np.invert(np.all(conds["lang_id_score"], axis=0))
504
+ Visualization_for_lang.display_dataset(
505
+ self.docs,
506
+ cond_filter,
507
+ "Discarded documents for the filter on the language identification confidence score",
508
+ "docs",
509
+ )
510
+
511
+ if "perplexity_score" in columns:
512
+ cond_filter = np.invert(
513
+ np.all(conds["perplexity_score"], axis=0)
514
+ )
515
+ Visualization_for_lang.display_dataset(
516
+ self.docs,
517
+ cond_filter,
518
+ "Discarded documents for the filter on the perplexity score",
519
+ "docs",
520
+ )
521
+
522
+ Visualization_for_lang.display_dataset(
523
+ self.docs, all_conds, "Retained documents", "docs"
524
+ )
525
+
526
+ st.header("Download data")
527
+
528
+ with open(self.path_data) as json_file:
529
+ btn = st.download_button(
530
+ label="Download data as json",
531
+ data=json_file,
532
+ file_name="data.json",
533
+ )
534
+
535
+ def filtering_of_words(self):
536
+ if not (self.words is None):
537
+ columns = list(self.words)
538
+
539
+ st.sidebar.subheader("Parameter of the filtering on words")
540
+
541
+ conds_words = {}
542
+
543
+ if "len_word" in columns:
544
+ with st.sidebar.expander("Length of words"):
545
+ cutoff_def = "If the length of a word is higher than this number, the word is removed."
546
+ max_len_word = min(int(np.max(self.words["len_word"])) + 1, 200)
547
+ cutoff_word = st.slider(cutoff_def, 0, max_len_word, max_len_word)
548
+ new_key = ("len_word", cutoff_word, True)
549
+ self.parameters.append(new_key)
550
+ Visualization_for_lang.plot_hist(self.words, new_key)
551
+ cond_len_words = self.words["len_word"] <= cutoff_word
552
+ Visualization_for_lang.print_discarded_by_cond(cond_len_words)
553
+ conds_words["len_word"] = cond_len_words
554
+
555
+ if "incorrect_substrings" in columns:
556
+ with st.sidebar.expander("Words with incorrect substrings"):
557
+ incorrect_substrings = st.checkbox(
558
+ "Remove words with incorrect substrings."
559
+ )
560
+ self.parameters.append(
561
+ ("incorrect_substrings", incorrect_substrings)
562
+ )
563
+
564
+ checkbox = st.checkbox(
565
+ "Diplay distribution",
566
+ value=True,
567
+ key="display_distribution_incorrect_substrings",
568
+ )
569
+ if checkbox:
570
+ incor_sub = np.array(self.words["incorrect_substrings"]) * 1
571
+ with_incor_sub = np.sum(incor_sub)
572
+ without_incor_sub = len(incor_sub) - with_incor_sub
573
+ st.markdown(
574
+ f"Number of words with incorrect substrings: {with_incor_sub}"
575
+ )
576
+ st.markdown(
577
+ f"Number of words without incorrect substrings: {without_incor_sub}"
578
+ )
579
+
580
+ if incorrect_substrings:
581
+ cond_incorrect_substrings = np.invert(
582
+ self.words["incorrect_substrings"]
583
+ )
584
+ else:
585
+ cond_incorrect_substrings = np.array(
586
+ [
587
+ True
588
+ for i in range(len(self.words["incorrect_substrings"]))
589
+ ]
590
+ )
591
+ Visualization_for_lang.print_discarded_by_cond(
592
+ cond_incorrect_substrings
593
+ )
594
+ conds_words["incorrect_substrings"] = cond_incorrect_substrings
595
+
596
+ all_conds_words = np.all(list(conds_words.values()), axis=0)
597
+
598
+ with st.expander(
599
+ f"Filtering on words, for {self.num_docs_for_words} {self.lang} documents"
600
+ ):
601
+ st.header(
602
+ f"Filtering on words, for {self.num_docs_for_words} {self.lang} documents"
603
+ )
604
+
605
+ st.markdown(
606
+ f"Since the number of words is way larger than the number of documents, "
607
+ f"we consider in this section words for only {self.num_docs_for_words} documents."
608
+ )
609
+
610
+ Visualization_for_lang.display_dataset(
611
+ self.words, np.invert(all_conds_words), "Discarded words", "words"
612
+ )
613
+
614
+ # st.subheader("Display discarded words by filter")
615
+ display_discarded_words_by_filter = st.checkbox(
616
+ "Display discarded words by filter"
617
+ )
618
+
619
+ if display_discarded_words_by_filter:
620
+
621
+ if "len_word" in columns:
622
+ cond_filter = np.invert(conds_words["len_word"])
623
+ Visualization_for_lang.display_dataset(
624
+ self.words,
625
+ cond_filter,
626
+ "Discarded words for the filter on length",
627
+ "words",
628
+ )
629
+
630
+ if "incorrect_substrings" in columns:
631
+ cond_filter = np.invert(conds_words["incorrect_substrings"])
632
+ Visualization_for_lang.display_dataset(
633
+ self.words,
634
+ cond_filter,
635
+ "Discarded words for the filter on incorrect substrings",
636
+ "words",
637
+ )
638
+
639
+ Visualization_for_lang.display_dataset(
640
+ self.words, all_conds_words, "Retained words", "words"
641
+ )
642
+
643
+ def download_parameters(self):
644
+ st.sidebar.subheader("Download parameters")
645
+ btn = st.sidebar.download_button(
646
+ label="Download current parameters as json",
647
+ data=json.dumps(self.parameters),
648
+ file_name=f"parameters_{self.lang_dataset_id}.json",
649
+ )
650
+
651
+ """
652
+ def plot_zipf_law(self):
653
+ if not (self.words is None):
654
+ st.header("Zipf's Law")
655
+
656
+ display_zipf_law = st.checkbox("Display Zipf's Law")
657
+
658
+ if display_zipf_law:
659
+
660
+ freq_words = {}
661
+ for _, row in self.words.iterrows():
662
+ freq_words[row["word"]] = freq_words.get(row["word"], 0) + 1
663
+ freq_words = np.array(list(freq_words.values()))
664
+ freq_words = -np.sort(-freq_words)
665
+
666
+ fig, ax = plt.subplots()
667
+ ax.loglog(freq_words)
668
+ ax.set_title("Zipf's Law")
669
+ ax.set_xlabel("$i$-th most frequent word")
670
+ ax.set_ylabel("frequency in the documents")
671
+ st.pyplot(fig)
672
+ """
673
+
674
+ def analyse_personal_doc(self):
675
+ with st.expander("Analyse your own document"):
676
+ st.header("Analyse your own document")
677
+
678
+ personal_doc = st.text_area(
679
+ label="Paste here the document you want to analyse",
680
+ value="",
681
+ max_chars=10000,
682
+ )
683
+
684
+ is_discarded = False
685
+
686
+ def is_doc_discarded(key, score):
687
+ if key[2]: # max cutoff
688
+ return score > key[1]
689
+ else:
690
+ return score < key[1]
691
+
692
+ if personal_doc:
693
+
694
+ st.markdown("Statistics of the document:")
695
+
696
+ for key in self.keys:
697
+ if key[0] == "number_words":
698
+ words = ModifyingDocuments.get_words_from_document(
699
+ personal_doc,
700
+ self.sentencepiece_model_tok,
701
+ lower_case=False,
702
+ strip_characters=self.param["strip_characters"],
703
+ )
704
+ if key[2]:
705
+ st.markdown(f"Number of words: {len(words)}")
706
+ if is_doc_discarded(key, len(words)):
707
+ is_discarded = True
708
+
709
+ elif key[0] == "character_repetition_ratio":
710
+ character_repetition_ratio = (
711
+ Filtering.compute_character_repetition_ratio(
712
+ personal_doc, int(key[3])
713
+ )
714
+ )
715
+ character_repetition_ratio = round(
716
+ character_repetition_ratio, 3
717
+ )
718
+ st.markdown(
719
+ f"Character repetition ratio: {character_repetition_ratio}"
720
+ )
721
+ if is_doc_discarded(key, character_repetition_ratio):
722
+ is_discarded = True
723
+
724
+ elif key[0] == "word_repetition_ratio":
725
+ word_repetition_ratio = Filtering.compute_word_repetition_ratio(
726
+ personal_doc,
727
+ self.sentencepiece_model_tok,
728
+ self.param["strip_characters"],
729
+ int(key[3]),
730
+ )
731
+ word_repetition_ratio = round(word_repetition_ratio, 3)
732
+ st.markdown(f"Word repetition ratio: {word_repetition_ratio}")
733
+ if is_doc_discarded(key, word_repetition_ratio):
734
+ is_discarded = True
735
+
736
+ elif key[0] == "special_characters_ratio":
737
+ special_characters_ratio = (
738
+ Filtering.compute_special_characters_ratio(
739
+ personal_doc, self.param["special_characters"]
740
+ )
741
+ )
742
+ special_characters_ratio = round(special_characters_ratio, 3)
743
+ st.markdown(
744
+ f"Special characters ratio: {special_characters_ratio}"
745
+ )
746
+ if is_doc_discarded(key, special_characters_ratio):
747
+ is_discarded = True
748
+
749
+ elif key[0] == "stopwords_ratio":
750
+ stopwords_ratio = Filtering.compute_stopwords_ratio(
751
+ personal_doc,
752
+ self.sentencepiece_model_tok,
753
+ self.param["strip_characters"],
754
+ self.param["cond_words_augmentation"],
755
+ self.param["words_augmentation_group_sizes"],
756
+ self.param["words_augmentation_join_char"],
757
+ self.stopwords,
758
+ )
759
+ stopwords_ratio = round(stopwords_ratio, 3)
760
+ st.markdown(f"Stop words ratio: {stopwords_ratio}")
761
+ if is_doc_discarded(key, stopwords_ratio):
762
+ is_discarded = True
763
+
764
+ elif key[0] == "flagged_words_ratio":
765
+ flagged_words_ratio = Filtering.compute_flagged_words_ratio(
766
+ personal_doc,
767
+ self.sentencepiece_model_tok,
768
+ self.param["strip_characters"],
769
+ self.param["cond_words_augmentation"],
770
+ self.param["words_augmentation_group_sizes"],
771
+ self.param["words_augmentation_join_char"],
772
+ self.flagged_words,
773
+ )
774
+ flagged_words_ratio = round(flagged_words_ratio, 3)
775
+ st.markdown(f"Flagged words ratio: {flagged_words_ratio}")
776
+ if is_doc_discarded(key, flagged_words_ratio):
777
+ is_discarded = True
778
+
779
+ elif key[0] == "lang_id_score":
780
+ (
781
+ lang_pred_dataset_id,
782
+ lang_id_score,
783
+ ) = Filtering.compute_lang_id_pred_score(
784
+ personal_doc, self.model_lang_id
785
+ )
786
+ lang_id_score = round(lang_id_score, 3)
787
+ st.markdown(
788
+ f"Language identification confidence score: {lang_id_score}"
789
+ )
790
+ if is_doc_discarded(key, lang_id_score) or (
791
+ self.lang_dataset_id != lang_pred_dataset_id
792
+ ):
793
+ is_discarded = True
794
+
795
+ elif key[0] == "perplexity_score":
796
+ perplexity_score = Filtering.compute_perplexity_score(
797
+ personal_doc,
798
+ self.sentencepiece_model,
799
+ self.kenlm_model,
800
+ )
801
+ perplexity_score = round(perplexity_score, 3)
802
+ st.markdown(f"Perplexity score: {perplexity_score}")
803
+ if is_doc_discarded(key, perplexity_score):
804
+ is_discarded = True
805
+
806
+ is_discarded = "" if is_discarded else "not "
807
+ st.markdown(
808
+ f"With the current filtering parameters, this document **is {is_discarded}discarded**."
809
+ )
810
+
811
+ def visualization_for_lang(self):
812
+ self.set_title()
813
+ self.open_data()
814
+ self.filtering_of_docs()
815
+ self.filtering_of_words()
816
+ self.download_parameters()
817
+ self.analyse_personal_doc()
818
+
819
+
820
+ class Visualization:
821
+ def __init__(self, path_instructions, param_visu_langs):
822
+ self.path_instructions = path_instructions
823
+ self.param_visu_langs = param_visu_langs
824
+
825
+ def preamble(self):
826
+ def get_binary_file_downloader_html(bin_file, file_label="File"):
827
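+ # Embed the file as a base64 data URI so it can be offered as an inline download link (used here for the explanation PDF).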
+ with open(bin_file, "rb") as f:
828
+ data = f.read()
829
+ bin_str = base64.b64encode(data).decode()
830
+ href = f'<a href="data:application/octet-stream;base64,{bin_str}" download="{os.path.basename(bin_file)}">{file_label}</a>'
831
+ return href
832
+
833
+ st.markdown(
834
+ "Before diving into this demo, you might want to take a look at how the filtering pipeline looks like in more detail in this "
835
+ + get_binary_file_downloader_html(
836
+ self.path_instructions,
837
+ "pdf",
838
+ )
839
+ + ".",
840
+ unsafe_allow_html=True,
841
+ )
842
+
843
+ def warning_preamble(self):
844
+ st.markdown(
845
+ "This demo can be a little slow, and only allows you to process up to 5000 documents "
846
+ "for a decent speed. If you want to display up to three times more documents and have "
847
+ "a faster visualization, we invite you to run this "
848
+ "[code](https://github.com/bigscience-workshop/data_tooling/tree/master/ac_dc/visualization) "
849
+ "on your computer."
850
+ )
851
+
852
+ def choose_lang(self):
853
+ options = [
854
+ self.param_visu_langs[lang_dataset_id]["lang"]
855
+ for lang_dataset_id in self.param_visu_langs
856
+ ]
857
+ index = options.index("English") if ("English" in options) else 0
858
+ lang_chosen = st.selectbox(
859
+ label="Select the language for visualization",
860
+ options=options,
861
+ index=index,
862
+ )
863
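+ # Map the selected language name back to its dataset id to retrieve the matching per-language configuration.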
+ if lang_chosen != "None":
864
+ lang_chosen_dataset_id = langs_id.loc[
865
+ langs_id["lang"] == lang_chosen, "dataset_id"
866
+ ].iloc[0]
867
+ visualization_for_lang = Visualization_for_lang(
868
+ path_data=self.param_visu_langs[lang_chosen_dataset_id]["path_data"],
869
+ lang=self.param_visu_langs[lang_chosen_dataset_id]["lang"],
870
+ num_docs=self.param_visu_langs[lang_chosen_dataset_id]["num_docs"],
871
+ num_docs_for_words=self.param_visu_langs[lang_chosen_dataset_id][
872
+ "num_docs_for_words"
873
+ ],
874
+ max_len_text_display=self.param_visu_langs[lang_chosen_dataset_id][
875
+ "max_len_text_display"
876
+ ],
877
+ lang_dataset_id=self.param_visu_langs[lang_chosen_dataset_id][
878
+ "lang_dataset_id"
879
+ ],
880
+ path_fasttext_model=self.param_visu_langs[lang_chosen_dataset_id][
881
+ "path_fasttext_model"
882
+ ],
883
+ path_sentencepiece_model=self.param_visu_langs[lang_chosen_dataset_id][
884
+ "path_sentencepiece_model"
885
+ ],
886
+ path_kenlm_model=self.param_visu_langs[lang_chosen_dataset_id][
887
+ "path_kenlm_model"
888
+ ],
889
+ )
890
+ visualization_for_lang.visualization_for_lang()
891
+
892
+ def visualization(self):
893
+ self.preamble()
894
+ self.warning_preamble()
895
+ self.choose_lang()
896
+
897
+
898
+ path_instructions = "./explanation_filtering_pipeline.pdf"
899
+
900
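+ # One configuration per language: paths to the statistics file, fastText language-ID model, SentencePiece tokenizer and KenLM model, plus display limits.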
+ param_visu_langs = {
901
+ lang_dataset_id: {
902
+ "path_data": f"./{lang_dataset_id}_examples_with_stats.json",
903
+ "lang": langs_id.loc[langs_id["dataset_id"] == lang_dataset_id, "lang"].iloc[0],
904
+ "num_docs": 5000,
905
+ "num_docs_for_words": 500,
906
+ "max_len_text_display": 10000,
907
+ "lang_dataset_id": lang_dataset_id,
908
+ "path_fasttext_model": "./lid.176.bin",
909
+ "path_sentencepiece_model": f"./{lang_dataset_id}.sp.model",
910
+ "path_kenlm_model": f"./{lang_dataset_id}.arpa.bin",
911
+ }
912
+ for lang_dataset_id in ["eu", "ca", "zh", "en", "fr", "id", "es"]
913
+ }
914
+
915
+ visualization = Visualization(path_instructions, param_visu_langs)
916
+ visualization.visualization()
ca.arpa.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2ece1e503d4b44409069ea9c5c5125b74792b575143169e08cf9a27248f9a78e
3
+ size 2809368958
ca.sp.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:abc6936e2ff5dcdc86962ffaeef48ef66f567d568ef7090d28123ed6618b455c
3
+ size 946977
ca_examples_with_stats.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c4207b45aa366ece2763a06565fcb771b86e433f2a6190248017f97e7534fa4a
3
+ size 103605036
en.arpa.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:04923fccbb4e63005c40f01d66112659416de01accd80d16e366a592289ee07a
3
+ size 4444690658
en.sp.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cf8147a573770b4e6c0d4df1dcb75453baa88190706dab406be7711b84f059de
3
+ size 931348
en_examples_with_stats.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1dccf03710e9dc7ec68c676175e711be815bc29a50260f5d334156b03fe2e6d1
3
+ size 241408394
es.arpa.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:26964ff8185eb105021fc0e9eaa0a1de590c4a12f8aa3fe12112b29d42281cf3
3
+ size 3828418653
es.sp.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:aae545566a995d3374fbc8ac1d4e0c7073008da8ae32acfe7f176136a8efcf37
3
+ size 961535
es_examples_with_stats.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2d52760c4c961ebfe419a603a6d837619ca146656f563f5abbd140dec8fbe28e
3
+ size 148378888
eu.arpa.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2d04c4d1233b40044e2facc978987ecd4a6d4f84032f2af3f85f7079676fa08b
3
+ size 774011873
eu.sp.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:447cbd1714e51e6a7b4dd8ff55b7bd975fdb7f6ba873cb6f8a1fe36b5867dbb6
3
+ size 955869
eu_examples_with_stats.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:10a06ac7ed9b4c444f35fb9a3e3636a22689c198a6bdd4fd358b0eec50aa924d
3
+ size 66358003
explanation_filtering_pipeline.pdf ADDED
Binary file (218 kB).
filtering.py ADDED
@@ -0,0 +1,957 @@
1
+ import re
2
+
3
+ import numpy as np
4
+
5
+ import fasttext
6
+
7
+ import sentencepiece
8
+ import kenlm
9
+
10
+ import pathlib
11
+
12
+ from languages_id import langs_id
13
+ from parameters_filtering import parameters_filtering
14
+ from normalization import normalization
15
+ from stopwords import stopwords
16
+ from flagged_words import flagged_words
17
+
18
+
19
+ class LoadParameters:
20
+ @staticmethod
21
+ def load_parameters(lang_dataset_id):
22
+ if lang_dataset_id in parameters_filtering:
23
+ param = parameters_filtering[lang_dataset_id]
24
+ else:
25
+ param = parameters_filtering["default"]
26
+ return param
27
+
28
+ @staticmethod
29
+ def load_stopwords(lang_dataset_id):
30
+ stopwords_lang_id = langs_id.loc[
31
+ langs_id["dataset_id"] == lang_dataset_id, "stopwords_id"
32
+ ].iloc[0]
33
+ if stopwords_lang_id:
34
+ stopwords_lang = set(stopwords[stopwords_lang_id])
35
+ else:
36
+ stopwords_lang = None
37
+ return stopwords_lang
38
+
39
+ @staticmethod
40
+ def load_flagged_words(lang_dataset_id):
41
+ flagged_words_lang_id = langs_id.loc[
42
+ langs_id["dataset_id"] == lang_dataset_id, "flagged_words_id"
43
+ ].iloc[0]
44
+ if flagged_words_lang_id:
45
+ flagged_words_lang = set(flagged_words[flagged_words_lang_id])
46
+ else:
47
+ flagged_words_lang = None
48
+ return flagged_words_lang
49
+
50
+ @staticmethod
51
+ def load_model_lang_id(lang_dataset_id, path_fasttext_model):
52
+ fasttext_lang_id = langs_id.loc[
53
+ langs_id["dataset_id"] == lang_dataset_id, "fasttext_id"
54
+ ].iloc[0]
55
+ if fasttext_lang_id:
56
+ model_lang_id = fasttext.load_model(path_fasttext_model)
57
+ else:
58
+ model_lang_id = None
59
+ return model_lang_id
60
+
61
+ @staticmethod
62
+ def load_sentencepiece_model(lang_dataset_id, path_sentencepiece_model):
63
+ sentencepiece_lang_id = langs_id.loc[
64
+ langs_id["dataset_id"] == lang_dataset_id, "sentencepiece_id"
65
+ ].iloc[0]
66
+ if sentencepiece_lang_id:
67
+ sentencepiece_model = sentencepiece.SentencePieceProcessor()
68
+ sentencepiece_model.load(path_sentencepiece_model)
69
+ else:
70
+ sentencepiece_model = None
71
+ return sentencepiece_model
72
+
73
+ @staticmethod
74
+ def load_kenlm_model(lang_dataset_id, path_kenlm_model):
75
+ kenlm_lang_id = langs_id.loc[
76
+ langs_id["dataset_id"] == lang_dataset_id, "kenlm_id"
77
+ ].iloc[0]
78
+ if kenlm_lang_id:
79
+ kenlm_model = kenlm.Model(path_kenlm_model)
80
+ else:
81
+ kenlm_model = None
82
+ return kenlm_model
83
+
84
+
85
+ class ModifyingDocuments:
86
+ @staticmethod
87
+ def remove_empty_el_from_list(list_):
88
+ return [el for el in list_ if el]
89
+
90
+ @staticmethod
91
+ def remove_non_printing_characters(document, non_printing_characters_re):
92
+ return non_printing_characters_re.sub("", document)
93
+
94
+ @staticmethod
95
+ def uniform_whitespace(
96
+ document,
97
+ whitespace=[
98
+ " ",
99
+ " ",
100
+ " ",
101
+ " ",
102
+ " ",
103
+ " ",
104
+ " ",
105
+ " ",
106
+ " ",
107
+ " ",
108
+ "",
109
+ "„",
110
+ ],
111
+ ):
112
+ """There are different whitespace characters."""
113
+ whitespace = set(whitespace)
114
+ document = "".join(
115
+ [char if char not in whitespace else " " for char in document]
116
+ )
117
+ return document
118
+
119
+ @staticmethod
120
+ def replace_digits_with_zeros(document, digits_re):
121
+ return digits_re.sub("0", document)
122
+
123
+ @staticmethod
124
+ def replace_unicode_punctuation(document, unicode_punctuation):
125
+ return "".join(unicode_punctuation.get(c, c) for c in document)
126
+
127
+ @staticmethod
128
+ def normalization(
129
+ document,
130
+ remove_non_printing_characters,
131
+ strip,
132
+ lower_case,
133
+ uniform_whitespace,
134
+ replace_digits_with_zeros,
135
+ replace_unicode_punctuation,
136
+ non_printing_characters_re=normalization["non_printing_characters_re"],
137
+ digits_re=normalization["digits_re"],
138
+ unicode_punctuation=normalization["unicode_punctuation"],
139
+ ):
140
+ if remove_non_printing_characters:
141
+ document = ModifyingDocuments.remove_non_printing_characters(
142
+ document, non_printing_characters_re
143
+ )
144
+ if strip:
145
+ document = document.strip()
146
+ if not document:
147
+ return document
148
+ if lower_case:
149
+ document = document.lower()
150
+ if uniform_whitespace:
151
+ document = ModifyingDocuments.uniform_whitespace(document)
152
+ if replace_digits_with_zeros:
153
+ document = ModifyingDocuments.replace_digits_with_zeros(document, digits_re)
154
+ if replace_unicode_punctuation:
155
+ document = ModifyingDocuments.replace_unicode_punctuation(
156
+ document, unicode_punctuation
157
+ )
158
+ return document
159
+
160
+ @staticmethod
161
+ def tokenization(document, sentencepiece_model, join_on_whitespace):
162
+ document_tokenized = sentencepiece_model.encode_as_pieces(document)
163
+ if join_on_whitespace:
164
+ document_tokenized = " ".join(document_tokenized)
165
+ return document_tokenized
166
+
167
+ @staticmethod
168
+ def split_on_whitespace(
169
+ document,
170
+ new_line=False,
171
+ tab=False,
172
+ ):
173
+ """This method also removes concatenated spaces."""
174
+ sep = [" "] + new_line * ["\n"] + tab * ["\t"]
175
+ sep = "|".join(sep)
176
+ split_document = re.split(sep, document)
177
+ split_document = ModifyingDocuments.remove_empty_el_from_list(split_document)
178
+ return split_document
179
+
180
+ @staticmethod
181
+ def strip(document, strip_characters):
182
+ """Way faster than document.strip(strip_characters)
183
+ since strip_characters is now a set instead of a str,
184
+ and it contains a lot of elements (all the emojis)."""
185
+ if not document:
186
+ return document
187
+ beg_ind = 0
188
+ end_ind = len(document)
189
+ for i in range(len(document)):
190
+ if document[i] in strip_characters:
191
+ beg_ind += 1
192
+ else:
193
+ break
194
+ for i in range(1, len(document) + 1):
195
+ if document[-i] in strip_characters:
196
+ end_ind -= 1
197
+ else:
198
+ break
199
+ document_stripped = document[beg_ind:end_ind]
200
+ return document_stripped
201
+
202
+ @staticmethod
203
+ def get_words_from_document(
204
+ document, sentencepiece_model_tok, lower_case, strip_characters
205
+ ):
206
+ """Get words from a document. Non reversible since the document
207
+ is split on multiple characters, words are stripped of
208
+ special characters and characters are converted to lower case.
209
+ Useful to compute ratios, like the stopwords ratio."""
210
+ if sentencepiece_model_tok:
211
+ document_normalized = ModifyingDocuments.normalization(
212
+ document=document,
213
+ remove_non_printing_characters=True,
214
+ strip=True,
215
+ lower_case=True,
216
+ uniform_whitespace=True,
217
+ replace_digits_with_zeros=True,
218
+ replace_unicode_punctuation=True,
219
+ )
220
+ words = ModifyingDocuments.tokenization(
221
+ document_normalized, sentencepiece_model_tok, join_on_whitespace=False
222
+ )
223
+ else:
224
+ words = ModifyingDocuments.split_on_whitespace(
225
+ document, new_line=True, tab=True
226
+ )
227
+ if lower_case:
228
+ words = [word.lower() for word in words]
229
+ if strip_characters:
230
+ words = [ModifyingDocuments.strip(word, strip_characters) for word in words]
231
+ words = ModifyingDocuments.remove_empty_el_from_list(words)
232
+ return words
233
+
234
+ @staticmethod
235
+ def words_augmentation(words, group_size, join_char):
236
+ """Augment words, especially for Chinese (without a space between words)
237
+ and Vietnamese (with a space between syllables)."""
238
+ augmentation = [
239
+ join_char.join(words[i : i + group_size])
240
+ for i in range(len(words) - group_size + 1)
241
+ ]
242
+ return augmentation
243
+
244
+ @staticmethod
245
+ def split_on_newline_tab_whitespace(document):
246
+ """First split on "\n", then on "\t", then on " "."""
247
+ sentences = document.split("\n")
248
+ sentences = [sentence.split("\t") for sentence in sentences]
249
+ sentences = [
250
+ [
251
+ ModifyingDocuments.split_on_whitespace(subsentence)
252
+ for subsentence in sentence
253
+ ]
254
+ for sentence in sentences
255
+ ]
256
+ return sentences
257
+
258
+ @staticmethod
259
+ def merge_on_whitespace_tab_newline(sentences):
260
+ """Invert the method split_on_newline_tab_whitespace.
261
+ Removes concatenated separators."""
262
+ sentences = [
263
+ [" ".join(subsentence) for subsentence in sentence if subsentence]
264
+ for sentence in sentences
265
+ ]
266
+ sentences = ["\t".join(sentence) for sentence in sentences if sentence]
267
+ if not sentences:
268
+ return ""
269
+ document = "\n".join(sentences)
270
+ return document
271
+
272
+ @staticmethod
273
+ def should_keep_word_with_incorrect_substrings(
274
+ word, strip_characters, incorrect_word_substrings
275
+ ):
276
+ word = ModifyingDocuments.strip(word, strip_characters)
277
+ should_keep = all(
278
+ [(i_substr not in word) for i_substr in incorrect_word_substrings]
279
+ )
280
+ return should_keep
281
+
282
+ @staticmethod
283
+ def remove_words_with_incorrect_substrings(
284
+ document,
285
+ strip_characters,
286
+ incorrect_word_substrings,
287
+ ):
288
+ sentences = ModifyingDocuments.split_on_newline_tab_whitespace(document)
289
+ sentences = [
290
+ [
291
+ [
292
+ word
293
+ for word in subsentence
294
+ if ModifyingDocuments.should_keep_word_with_incorrect_substrings(
295
+ word, strip_characters, incorrect_word_substrings
296
+ )
297
+ ]
298
+ for subsentence in sentence
299
+ ]
300
+ for sentence in sentences
301
+ ]
302
+ document = ModifyingDocuments.merge_on_whitespace_tab_newline(sentences)
303
+ return document
304
+
305
+ @staticmethod
306
+ def should_keep_long_word(word, strip_characters, length_word_max_cutoff):
307
+ """If the word is too long but it contains only one
308
+ special character, it might be a concatenation of one word,
309
+ a punctuation, and another word, with no space between them.
310
+ In this case, we give the word a pass."""
311
+ if len(word) <= length_word_max_cutoff:
312
+ return True
313
+ word = ModifyingDocuments.strip(word, strip_characters)
314
+ if not word: # The word consisted only of strip characters
315
+ return False
316
+ if len(word) <= length_word_max_cutoff:
317
+ return True
318
+ return False
319
+
320
+ @staticmethod
+ def remove_long_words(
321
+ document,
322
+ strip_characters,
323
+ length_word_max_cutoff,
324
+ ):
325
+ sentences = ModifyingDocuments.split_on_newline_tab_whitespace(document)
326
+ sentences = [
327
+ [
328
+ [
329
+ word
330
+ for word in subsentence
331
+ if ModifyingDocuments.should_keep_long_word(
332
+ word,
333
+ strip_characters,
334
+ length_word_max_cutoff,
335
+ )
336
+ ]
337
+ for subsentence in sentence
338
+ ]
339
+ for sentence in sentences
340
+ ]
341
+ document = ModifyingDocuments.merge_on_whitespace_tab_newline(sentences)
342
+ return document
343
+
344
+ @staticmethod
345
+ def modifying_documents(
346
+ document,
347
+ cond_uniform_whitespace,
348
+ cond_replace_unicode_punctuation,
349
+ cond_remove_words_with_incorrect_substrings,
350
+ strip_characters,
351
+ incorrect_word_substrings,
352
+ cond_remove_long_words,
353
+ length_word_max_cutoff,
354
+ ):
355
+ document = ModifyingDocuments.normalization(
356
+ document=document,
357
+ remove_non_printing_characters=False,
358
+ strip=True,
359
+ lower_case=False,
360
+ uniform_whitespace=cond_uniform_whitespace,
361
+ replace_digits_with_zeros=False,
362
+ replace_unicode_punctuation=cond_replace_unicode_punctuation,
363
+ )
364
+ if cond_remove_words_with_incorrect_substrings:
365
+ document = ModifyingDocuments.remove_words_with_incorrect_substrings(
366
+ document,
367
+ strip_characters,
368
+ incorrect_word_substrings,
369
+ )
370
+ if cond_remove_long_words:
371
+ document = ModifyingDocuments.remove_long_words(
372
+ document,
373
+ strip_characters,
374
+ length_word_max_cutoff,
375
+ )
376
+ return document
377
+
378
+
379
+ class FunctionDatasetModifyingDocuments:
380
+ def __init__(self, lang_dataset_id):
381
+ self.lang_dataset_id = lang_dataset_id
382
+ self.param = LoadParameters.load_parameters(lang_dataset_id)
383
+
384
+ def __call__(self, example):
385
+ example["text"] = ModifyingDocuments.modifying_documents(
386
+ document=example["text"],
387
+ cond_uniform_whitespace=self.param["cond_uniform_whitespace"],
388
+ cond_replace_unicode_punctuation=self.param[
389
+ "cond_replace_unicode_punctuation"
390
+ ],
391
+ cond_remove_words_with_incorrect_substrings=self.param[
392
+ "cond_remove_words_with_incorrect_substrings"
393
+ ],
394
+ strip_characters=self.param["strip_characters"],
395
+ incorrect_word_substrings=self.param["incorrect_word_substrings"],
396
+ cond_remove_long_words=self.param["cond_remove_long_words"],
397
+ length_word_max_cutoff=self.param["length_word_max_cutoff"],
398
+ )
399
+ return example
400
+
401
+ def __reduce__(self):
402
+ return (self.__class__, (self.lang_dataset_id,))
403
+
404
+
405
+ class Filtering:
406
+ @staticmethod
407
+ def check_number_words(
408
+ document,
409
+ sentencepiece_model_tok,
410
+ strip_characters,
411
+ number_words_min_cutoff,
412
+ number_words_max_cutoff,
413
+ ):
414
+ words = ModifyingDocuments.get_words_from_document(
415
+ document,
416
+ sentencepiece_model_tok,
417
+ lower_case=False,
418
+ strip_characters=strip_characters,
419
+ )
420
+ cond = (len(words) >= number_words_min_cutoff) and (
421
+ len(words) <= number_words_max_cutoff
422
+ )
423
+ return cond
424
+
425
+ @staticmethod
426
+ def compute_character_repetition_ratio(document, character_repetition_length):
427
+ def get_freq_character_ngrams(document, n):
428
+ character_ngrams = [
429
+ document[i : i + n] for i in range(len(document) - n + 1)
430
+ ]
431
+ freq_character_ngrams = {}
432
+ for character_ngram in character_ngrams:
433
+ freq_character_ngrams[character_ngram] = (
434
+ freq_character_ngrams.get(character_ngram, 0) + 1
435
+ )
436
+ return freq_character_ngrams
437
+
438
+ freq_character_ngrams = get_freq_character_ngrams(
439
+ document, character_repetition_length
440
+ )
441
+ if len(freq_character_ngrams) == 0:
442
+ return 0
443
+ freq_character_ngrams = list(freq_character_ngrams.values())
444
+ freq_character_ngrams = sorted(freq_character_ngrams, reverse=True)
445
+ val_less_than_one = len([el for el in freq_character_ngrams if el > 1])
446
+ num_rep_character_ngrams = min(
447
+ int(np.sqrt(len(freq_character_ngrams))),
448
+ len(freq_character_ngrams) - val_less_than_one,
449
+ )
450
+ character_repetition_ratio = sum(
451
+ freq_character_ngrams[:num_rep_character_ngrams]
452
+ ) / sum(freq_character_ngrams)
453
+ return character_repetition_ratio
454
+
455
+ @staticmethod
456
+ def check_character_repetition_removal(
457
+ document,
458
+ character_repetition_length,
459
+ character_repetition_max_cutoff,
460
+ ):
461
+ character_repetition_ratio = Filtering.compute_character_repetition_ratio(
462
+ document, character_repetition_length
463
+ )
464
+ cond = character_repetition_ratio <= character_repetition_max_cutoff
465
+ return cond
466
+
467
+ @staticmethod
468
+ def compute_word_repetition_ratio(
469
+ document, sentencepiece_model_tok, strip_characters, word_repetition_length
470
+ ):
471
+ def get_freq_word_ngrams(
472
+ document, sentencepiece_model_tok, strip_characters, n
473
+ ):
474
+ words = ModifyingDocuments.get_words_from_document(
475
+ document,
476
+ sentencepiece_model_tok,
477
+ lower_case=True,
478
+ strip_characters=strip_characters,
479
+ )
480
+ word_ngrams = [
481
+ " ".join(words[i : i + n]) for i in range(len(words) - n + 1)
482
+ ]
483
+ freq_word_ngrams = {}
484
+ for word_ngram in word_ngrams:
485
+ freq_word_ngrams[word_ngram] = freq_word_ngrams.get(word_ngram, 0) + 1
486
+ return freq_word_ngrams
487
+
488
+ freq_word_ngrams = get_freq_word_ngrams(
489
+ document, sentencepiece_model_tok, strip_characters, word_repetition_length
490
+ )
491
+ if len(freq_word_ngrams) == 0:
492
+ return 0
493
+ freq_word_ngrams = list(freq_word_ngrams.values())
494
+ word_repetition_ratio = sum(
495
+ freq for freq in freq_word_ngrams if freq > 1
496
+ ) / sum(freq_word_ngrams)
497
+ return word_repetition_ratio
498
+
499
+ @staticmethod
500
+ def check_word_repetition_removal(
501
+ document,
502
+ sentencepiece_model_tok,
503
+ strip_characters,
504
+ word_repetition_length,
505
+ word_repetition_max_cutoff,
506
+ ):
507
+ word_repetition_ratio = Filtering.compute_word_repetition_ratio(
508
+ document, sentencepiece_model_tok, strip_characters, word_repetition_length
509
+ )
510
+ cond = word_repetition_ratio <= word_repetition_max_cutoff
511
+ return cond
512
+
513
+ @staticmethod
514
+ def compute_special_characters_ratio(document, special_characters):
515
+ if len(document) == 0:
516
+ return 0
517
+ special_characters_ratio = len(
518
+ [char for char in document if char in special_characters]
519
+ ) / len(document)
520
+ return special_characters_ratio
521
+
522
+ @staticmethod
523
+ def check_special_characters(
524
+ document,
525
+ special_characters,
526
+ special_characters_max_cutoff,
527
+ ):
528
+ special_characters_ratio = Filtering.compute_special_characters_ratio(
529
+ document, special_characters
530
+ )
531
+ cond = special_characters_ratio <= special_characters_max_cutoff
532
+ return cond
533
+
534
+ @staticmethod
535
+ def compute_stopwords_ratio(
536
+ document,
537
+ sentencepiece_model_tok,
538
+ strip_characters,
539
+ cond_words_augmentation,
540
+ words_augmentation_group_sizes,
541
+ words_augmentation_join_char,
542
+ stopwords,
543
+ ):
544
+ words = ModifyingDocuments.get_words_from_document(
545
+ document,
546
+ sentencepiece_model_tok,
547
+ lower_case=True,
548
+ strip_characters=strip_characters,
549
+ )
550
+ if not words:
551
+ return 0
552
+ augmentation = []
553
+ if cond_words_augmentation:
554
+ augmentation = [
555
+ ModifyingDocuments.words_augmentation(
556
+ words, group_size, words_augmentation_join_char
557
+ )
558
+ for group_size in words_augmentation_group_sizes
559
+ ]
560
+ augmentation = [word for augm in augmentation for word in augm]
561
+ stopwords_ratio = len(
562
+ [word for word in words + augmentation if word in stopwords]
563
+ ) / len(words)
564
+ if stopwords_ratio > 1.0:
565
+ stopwords_ratio = 1.0
566
+ return stopwords_ratio
567
+
568
+ @staticmethod
569
+ def check_stopwords(
570
+ document,
571
+ sentencepiece_model_tok,
572
+ strip_characters,
573
+ cond_words_augmentation,
574
+ words_augmentation_group_sizes,
575
+ words_augmentation_join_char,
576
+ stopwords,
577
+ stopwords_min_cutoff,
578
+ ):
579
+ cond = True
580
+ if stopwords:
581
+ stopwords_ratio = Filtering.compute_stopwords_ratio(
582
+ document,
583
+ sentencepiece_model_tok,
584
+ strip_characters,
585
+ cond_words_augmentation,
586
+ words_augmentation_group_sizes,
587
+ words_augmentation_join_char,
588
+ stopwords,
589
+ )
590
+ cond = stopwords_ratio >= stopwords_min_cutoff
591
+ return cond
592
+
593
+ @staticmethod
594
+ def compute_flagged_words_ratio(
595
+ document,
596
+ sentencepiece_model_tok,
597
+ strip_characters,
598
+ cond_words_augmentation,
599
+ words_augmentation_group_sizes,
600
+ words_augmentation_join_char,
601
+ flagged_words,
602
+ ):
603
+ words = ModifyingDocuments.get_words_from_document(
604
+ document,
605
+ sentencepiece_model_tok,
606
+ lower_case=True,
607
+ strip_characters=strip_characters,
608
+ )
609
+ if not words:
610
+ return 0
611
+ augmentation = []
612
+ if cond_words_augmentation:
613
+ augmentation = [
614
+ ModifyingDocuments.words_augmentation(
615
+ words, group_size, words_augmentation_join_char
616
+ )
617
+ for group_size in words_augmentation_group_sizes
618
+ ]
619
+ augmentation = [word for augm in augmentation for word in augm]
620
+ flagged_words_ratio = len(
621
+ [word for word in words + augmentation if word in flagged_words]
622
+ ) / len(words)
623
+ if flagged_words_ratio > 1.0:
624
+ flagged_words_ratio = 1.0
625
+ return flagged_words_ratio
626
+
627
+ @staticmethod
628
+ def check_flagged_words(
629
+ document,
630
+ sentencepiece_model_tok,
631
+ strip_characters,
632
+ cond_words_augmentation,
633
+ words_augmentation_group_sizes,
634
+ words_augmentation_join_char,
635
+ flagged_words,
636
+ flagged_words_max_cutoff,
637
+ ):
638
+ cond = True
639
+ if flagged_words:
640
+ flagged_words_ratio = Filtering.compute_flagged_words_ratio(
641
+ document,
642
+ sentencepiece_model_tok,
643
+ strip_characters,
644
+ cond_words_augmentation,
645
+ words_augmentation_group_sizes,
646
+ words_augmentation_join_char,
647
+ flagged_words,
648
+ )
649
+ cond = flagged_words_ratio <= flagged_words_max_cutoff
650
+ return cond
651
+
652
+ @staticmethod
653
+ def compute_lang_id_pred_score(document, model_lang_id):
654
+ document = document.lower().replace("\n", " ")
655
+ pred = model_lang_id.predict(document)
656
+ lang_pred_fasttext_id = pred[0][0].replace("__label__", "")
657
+ score_pred = pred[1][0]
658
+ lang_pred_dataset_id = langs_id.loc[
659
+ langs_id["fasttext_id"] == lang_pred_fasttext_id, "dataset_id"
660
+ ]
661
+ if len(lang_pred_dataset_id) > 0:
662
+ lang_pred_dataset_id = lang_pred_dataset_id.iloc[0]
663
+ else:
664
+ lang_pred_dataset_id = "unknown"
665
+ return lang_pred_dataset_id, score_pred
666
+
667
+ @staticmethod
668
+ def check_lang_id(
669
+ document,
670
+ lang_dataset_id,
671
+ model_lang_id,
672
+ lang_id_min_cutoff,
673
+ ):
674
+ cond = True
675
+ if model_lang_id:
676
+ lang_pred_dataset_id, score_pred = Filtering.compute_lang_id_pred_score(
677
+ document, model_lang_id
678
+ )
679
+ cond = (lang_pred_dataset_id == lang_dataset_id) and (
680
+ score_pred >= lang_id_min_cutoff
681
+ )
682
+ return cond
683
+
684
+ @staticmethod
685
+ def compute_perplexity_score(document, sentencepiece_model, kenlm_model):
686
+ document = ModifyingDocuments.normalization(
687
+ document=document,
688
+ remove_non_printing_characters=True,
689
+ strip=True,
690
+ lower_case=False,
691
+ uniform_whitespace=True,
692
+ replace_digits_with_zeros=True,
693
+ replace_unicode_punctuation=True,
694
+ )
695
+ document = ModifyingDocuments.tokenization(
696
+ document, sentencepiece_model, join_on_whitespace=True
697
+ )
698
+ doc_log_score, doc_length = 0, 0
699
+ for line in document.split("\n"):
700
+ log_score = kenlm_model.score(line)
701
+ length = len(line.split()) + 1
702
+ doc_log_score += log_score
703
+ doc_length += length
704
+ pp_score = 10.0 ** (-doc_log_score / doc_length)
705
+ pp_score = round(pp_score, 1)
706
+ return pp_score
707
+
708
+ @staticmethod
709
+ def check_perplexity(
710
+ document,
711
+ sentencepiece_model,
712
+ kenlm_model,
713
+ perplexity_max_cutoff,
714
+ ):
715
+ cond = True
716
+ if kenlm_model:
717
+ score = Filtering.compute_perplexity_score(
718
+ document, sentencepiece_model, kenlm_model
719
+ )
720
+ cond = score <= perplexity_max_cutoff
721
+ return cond
722
+
723
+ @staticmethod
724
+ def filtering(
725
+ document,
726
+ cond_check_number_words,
727
+ sentencepiece_model_tok,
728
+ strip_characters,
729
+ number_words_min_cutoff,
730
+ number_words_max_cutoff,
731
+ cond_check_character_repetition_removal,
732
+ character_repetition_length,
733
+ character_repetition_max_cutoff,
734
+ cond_check_word_repetition_removal,
735
+ word_repetition_length,
736
+ word_repetition_max_cutoff,
737
+ cond_check_special_characters,
738
+ special_characters,
739
+ special_characters_max_cutoff,
740
+ cond_words_augmentation,
741
+ words_augmentation_group_sizes,
742
+ words_augmentation_join_char,
743
+ cond_check_stopwords,
744
+ stopwords,
745
+ stopwords_min_cutoff,
746
+ cond_check_flagged_words,
747
+ flagged_words,
748
+ flagged_words_max_cutoff,
749
+ cond_check_lang_id,
750
+ lang_dataset_id,
751
+ model_lang_id,
752
+ lang_id_min_cutoff,
753
+ cond_check_perplexity,
754
+ sentencepiece_model,
755
+ kenlm_model,
756
+ perplexity_max_cutoff,
757
+ ):
758
+ if cond_check_number_words:
759
+ if not Filtering.check_number_words(
760
+ document,
761
+ sentencepiece_model_tok,
762
+ strip_characters,
763
+ number_words_min_cutoff,
764
+ number_words_max_cutoff,
765
+ ):
766
+ return False
767
+ if cond_check_character_repetition_removal:
768
+ if not Filtering.check_character_repetition_removal(
769
+ document,
770
+ character_repetition_length,
771
+ character_repetition_max_cutoff,
772
+ ):
773
+ return False
774
+ if cond_check_word_repetition_removal:
775
+ if not Filtering.check_word_repetition_removal(
776
+ document,
777
+ sentencepiece_model_tok,
778
+ strip_characters,
779
+ word_repetition_length,
780
+ word_repetition_max_cutoff,
781
+ ):
782
+ return False
783
+ if cond_check_special_characters:
784
+ if not Filtering.check_special_characters(
785
+ document,
786
+ special_characters,
787
+ special_characters_max_cutoff,
788
+ ):
789
+ return False
790
+ if cond_check_stopwords:
791
+ if not Filtering.check_stopwords(
792
+ document,
793
+ sentencepiece_model_tok,
794
+ strip_characters,
795
+ cond_words_augmentation,
796
+ words_augmentation_group_sizes,
797
+ words_augmentation_join_char,
798
+ stopwords,
799
+ stopwords_min_cutoff,
800
+ ):
801
+ return False
802
+ if cond_check_flagged_words:
803
+ if not Filtering.check_flagged_words(
804
+ document,
805
+ sentencepiece_model_tok,
806
+ strip_characters,
807
+ cond_words_augmentation,
808
+ words_augmentation_group_sizes,
809
+ words_augmentation_join_char,
810
+ flagged_words,
811
+ flagged_words_max_cutoff,
812
+ ):
813
+ return False
814
+ if cond_check_lang_id:
815
+ if not Filtering.check_lang_id(
816
+ document,
817
+ lang_dataset_id,
818
+ model_lang_id,
819
+ lang_id_min_cutoff,
820
+ ):
821
+ return False
822
+ if cond_check_perplexity:
823
+ if not Filtering.check_perplexity(
824
+ document,
825
+ sentencepiece_model,
826
+ kenlm_model,
827
+ perplexity_max_cutoff,
828
+ ):
829
+ return False
830
+ return True
831
+
832
+
833
+ class FunctionDatasetFiltering:
834
+ def __init__(
835
+ self,
836
+ lang_dataset_id,
837
+ path_fasttext_model,
838
+ path_sentencepiece_model,
839
+ path_kenlm_model,
840
+ ):
841
+ self.lang_dataset_id = lang_dataset_id
842
+ self.path_fasttext_model = path_fasttext_model
843
+ self.path_sentencepiece_model = path_sentencepiece_model
844
+ self.path_kenlm_model = path_kenlm_model
845
+
846
+ self.param = LoadParameters.load_parameters(lang_dataset_id)
847
+ self.stopwords = LoadParameters.load_stopwords(lang_dataset_id)
848
+ self.flagged_words = LoadParameters.load_flagged_words(lang_dataset_id)
849
+ self.model_lang_id = LoadParameters.load_model_lang_id(
850
+ lang_dataset_id, path_fasttext_model
851
+ )
852
+ self.sentencepiece_model = LoadParameters.load_sentencepiece_model(
853
+ lang_dataset_id, path_sentencepiece_model
854
+ )
855
+ self.sentencepiece_model_tok = (
856
+ self.sentencepiece_model if self.param["tokenization"] else None
857
+ )
858
+ self.kenlm_model = LoadParameters.load_kenlm_model(
859
+ lang_dataset_id, path_kenlm_model
860
+ )
861
+
862
+ def __call__(self, example):
863
+ keep_example = Filtering.filtering(
864
+ document=example["text"],
865
+ cond_check_number_words=self.param["cond_check_number_words"],
866
+ sentencepiece_model_tok=self.sentencepiece_model_tok,
867
+ strip_characters=self.param["strip_characters"],
868
+ number_words_min_cutoff=self.param["number_words_min_cutoff"],
869
+ number_words_max_cutoff=self.param["number_words_max_cutoff"],
870
+ cond_check_character_repetition_removal=self.param[
871
+ "cond_check_character_repetition_removal"
872
+ ],
873
+ character_repetition_length=self.param["character_repetition_length"],
874
+ character_repetition_max_cutoff=self.param[
875
+ "character_repetition_max_cutoff"
876
+ ],
877
+ cond_check_word_repetition_removal=self.param[
878
+ "cond_check_word_repetition_removal"
879
+ ],
880
+ word_repetition_length=self.param["word_repetition_length"],
881
+ word_repetition_max_cutoff=self.param["word_repetition_max_cutoff"],
882
+ cond_check_special_characters=self.param["cond_check_special_characters"],
883
+ special_characters=self.param["special_characters"],
884
+ special_characters_max_cutoff=self.param["special_characters_max_cutoff"],
885
+ cond_words_augmentation=self.param["cond_words_augmentation"],
886
+ words_augmentation_group_sizes=self.param["words_augmentation_group_sizes"],
887
+ words_augmentation_join_char=self.param["words_augmentation_join_char"],
888
+ cond_check_stopwords=self.param["cond_check_stopwords"],
889
+ stopwords=self.stopwords,
890
+ stopwords_min_cutoff=self.param["stopwords_min_cutoff"],
891
+ cond_check_flagged_words=self.param["cond_check_flagged_words"],
892
+ flagged_words=self.flagged_words,
893
+ flagged_words_max_cutoff=self.param["flagged_words_max_cutoff"],
894
+ cond_check_lang_id=self.param["cond_check_lang_id"],
895
+ lang_dataset_id=self.lang_dataset_id,
896
+ model_lang_id=self.model_lang_id,
897
+ lang_id_min_cutoff=self.param["lang_id_min_cutoff"],
898
+ cond_check_perplexity=self.param["cond_check_perplexity"],
899
+ sentencepiece_model=self.sentencepiece_model,
900
+ kenlm_model=self.kenlm_model,
901
+ perplexity_max_cutoff=self.param["perplexity_max_cutoff"],
902
+ )
903
+ return keep_example
904
+
905
+ def __reduce__(self):
906
+ return (
907
+ self.__class__,
908
+ (
909
+ self.lang_dataset_id,
910
+ self.path_fasttext_model,
911
+ self.path_sentencepiece_model,
912
+ self.path_kenlm_model,
913
+ ),
914
+ )
915
+
916
+
917
+ class DatasetFiltering:
918
+ def __init__(
919
+ self,
920
+ dataset,
921
+ lang_dataset_id,
922
+ path_fasttext_model,
923
+ path_sentencepiece_model,
924
+ path_kenlm_model,
925
+ num_proc,
926
+ path_dir_save_dataset,
927
+ ):
928
+ self.ds = dataset
929
+ self.lang_dataset_id = lang_dataset_id
930
+ self.path_fasttext_model = path_fasttext_model
931
+ self.path_sentencepiece_model = path_sentencepiece_model
932
+ self.path_kenlm_model = path_kenlm_model
933
+ self.num_proc = num_proc
934
+ self.path_dir_save_dataset = path_dir_save_dataset
935
+
936
+ def modifying_documents(self):
937
+ func_dataset_modifying_documents = FunctionDatasetModifyingDocuments(
938
+ self.lang_dataset_id
939
+ )
940
+ self.ds = self.ds.map(func_dataset_modifying_documents, num_proc=self.num_proc)
941
+
942
+ def filtering(self):
943
+ func_dataset_filtering = FunctionDatasetFiltering(
944
+ self.lang_dataset_id,
945
+ self.path_fasttext_model,
946
+ self.path_sentencepiece_model,
947
+ self.path_kenlm_model,
948
+ )
949
+ self.ds = self.ds.filter(func_dataset_filtering, num_proc=self.num_proc)
950
+
951
+ def save_dataset(self):
952
+ pathlib.Path(self.path_dir_save_dataset).mkdir(parents=True, exist_ok=True)
953
+ path_dir_save_dataset = pathlib.PurePath(
954
+ self.path_dir_save_dataset, self.lang_dataset_id
955
+ )
956
+ pathlib.Path(path_dir_save_dataset).mkdir(parents=True, exist_ok=True)
957
+ self.ds.save_to_disk(path_dir_save_dataset)
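
A minimal usage sketch of the pipeline above, assuming the module is importable as "filtering" (an assumed name, since the file header is not repeated here). The toy dataset and output directory are only illustrative; the model paths are the files added in this commit.

    from datasets import Dataset
    from filtering import DatasetFiltering  # assumed module name for the file above

    # Toy in-memory dataset with the "text" column the pipeline expects.
    ds = Dataset.from_dict(
        {"text": ["Ceci est un exemple de document.", "Un autre document."]}
    )

    pipeline = DatasetFiltering(
        dataset=ds,
        lang_dataset_id="fr",
        path_fasttext_model="lid.176.bin",
        path_sentencepiece_model="fr.sp.model",
        path_kenlm_model="fr.arpa.bin",
        num_proc=1,
        path_dir_save_dataset="./filtered_datasets",  # illustrative output directory
    )
    pipeline.modifying_documents()  # document-level cleaning (ModifyingDocuments)
    pipeline.filtering()            # per-document keep/drop decision (Filtering.filtering)
    pipeline.save_dataset()         # saved under ./filtered_datasets/fr
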
flagged_words.py ADDED
@@ -0,0 +1,1055 @@
1
+ # Merge
2
+ # https://github.com/zacanger/profane-words
3
+ # and
4
+ # https://github.com/thisandagain/washyourmouthoutwithsoap/blob/develop/data/build.json
5
+ # and
6
+ # https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
7
+
8
+
9
+ english_flagged_words = [
10
+ "anal",
11
+ "bareback",
12
+ "bbw",
13
+ "bdsm",
14
+ "blowjob",
15
+ "blowjobs",
16
+ "brazzers",
17
+ "bukkake",
18
+ "camgirl",
19
+ "camwhore",
20
+ "cocksucking",
21
+ "cougar",
22
+ "creampie",
23
+ "cuckold",
24
+ "cum",
25
+ "cumming",
26
+ "cums",
27
+ "cumshot",
28
+ "cumshots",
29
+ "cumslut",
30
+ "cunnilingus",
31
+ "deepthroat",
32
+ "deepthroating",
33
+ "dildo",
34
+ "dildos",
35
+ "dogging",
36
+ "doggystyle",
37
+ "dominatrix",
38
+ "erotic",
39
+ "fellatio",
40
+ "femdom",
41
+ "fingering",
42
+ "fisting",
43
+ "footjob",
44
+ "gangbang",
45
+ "handjob",
46
+ "hentai",
47
+ "horney",
48
+ "horniest",
49
+ "horny",
50
+ "jism",
51
+ "jizz",
52
+ "masterbating",
53
+ "masturbate",
54
+ "masturbating",
55
+ "masturbation",
56
+ "milf",
57
+ "orgies",
58
+ "orgy",
59
+ "pegging",
60
+ "porn",
61
+ "pornhub",
62
+ "porno",
63
+ "pornos",
64
+ "pornstar",
65
+ "pornstars",
66
+ "redtube",
67
+ "rimming",
68
+ "slutty",
69
+ "squirting",
70
+ "strapon",
71
+ "threesome",
72
+ "vibrator",
73
+ "xhamster",
74
+ "xnxx",
75
+ "xvideos",
76
+ "xxx",
77
+ "youporn",
78
+ ]
79
+
80
+
81
+ flagged_words = {
82
+ "ar": english_flagged_words
83
+ + [
84
+ "إباحي",
85
+ "احتلام",
86
+ "است",
87
+ "استمناء",
88
+ "اغتصاب",
89
+ "أورغازم",
90
+ "إيروتيك",
91
+ "أير",
92
+ "بز",
93
+ "بزاز",
94
+ "بظر",
95
+ "بورن",
96
+ "بيضان",
97
+ "مص",
98
+ "ثدي",
99
+ "جماع",
100
+ "جنس",
101
+ "حلمة",
102
+ "خلاعة",
103
+ "خنثي",
104
+ "خول",
105
+ "دعارة",
106
+ "زب",
107
+ "سحاق",
108
+ "سحاقية",
109
+ "سكس",
110
+ "سيكس",
111
+ "شاذ",
112
+ "شبق",
113
+ "شرج",
114
+ "شرموطة",
115
+ "شهواني",
116
+ "شهوة",
117
+ "طيز",
118
+ "عادة السرية",
119
+ "عاهرة",
120
+ "عرص",
121
+ "فاسقة",
122
+ "فرج",
123
+ "قحبة",
124
+ "قضيب",
125
+ "كس",
126
+ "لحس",
127
+ "لعق",
128
+ "لواط",
129
+ "لوطي",
130
+ "مؤخرة",
131
+ "متناك",
132
+ "متناكة",
133
+ "مومس",
134
+ "مثير",
135
+ "مص",
136
+ "مضاجعة",
137
+ "مفلقسة",
138
+ "مني",
139
+ "مهتاج",
140
+ "نشوة",
141
+ "نكاح",
142
+ "نيك",
143
+ ],
144
+ "ca": english_flagged_words
145
+ + [
146
+ "cagarro",
147
+ "cagarros",
148
+ "cipote",
149
+ "cipotes",
150
+ "collons",
151
+ "colló",
152
+ "consolador",
153
+ "consoladors",
154
+ "cony",
155
+ "conys",
156
+ "corre's",
157
+ "corre't",
158
+ "corregut",
159
+ "cunillingus",
160
+ "córrer-se",
161
+ "escorreguda",
162
+ "escorregudes",
163
+ "escorregut",
164
+ "escrot",
165
+ "escrots",
166
+ "escórre's",
167
+ "escórre't",
168
+ "escórrer-se",
169
+ "mamada",
170
+ "mamadera",
171
+ "mamaderes",
172
+ "mamades",
173
+ "masturba",
174
+ "masturbacions",
175
+ "masturbació",
176
+ "masturbant",
177
+ "masturbar",
178
+ "masturbar-se",
179
+ "masturbat",
180
+ "masturbats",
181
+ "masturbes",
182
+ "orgasme",
183
+ "orgasmes",
184
+ "ou",
185
+ "ous",
186
+ "palla",
187
+ "palles",
188
+ "pornografia",
189
+ "semen",
190
+ "semens",
191
+ "verga",
192
+ "vergues",
193
+ "xxx",
194
+ ],
195
+ "en": english_flagged_words,
196
+ "es": english_flagged_words
197
+ + [
198
+ "chupar el coño",
199
+ "chupar la concha",
200
+ "chupar la polla",
201
+ "chupar la verga",
202
+ "comer el coño",
203
+ "comer la concha",
204
+ "comer la polla",
205
+ "comer la verga",
206
+ "coprofagía",
207
+ "correrse",
208
+ "cunillingus",
209
+ "fagging",
210
+ "felación",
211
+ "felching",
212
+ "follada",
213
+ "follador de culo",
214
+ "folladores",
215
+ "fudge packer",
216
+ "hacer una paja",
217
+ "hacerse una paja",
218
+ "hore",
219
+ "kock",
220
+ "macizorra",
221
+ "madre folladora",
222
+ "mamada",
223
+ "perro follador",
224
+ "pisser",
225
+ "pornografía",
226
+ "sado",
227
+ "sadomasoquismo",
228
+ "sadomasoquista",
229
+ "sexo anal",
230
+ "skank",
231
+ "smegma",
232
+ "x clasificado",
233
+ ],
234
+ "eu": english_flagged_words + [],
235
+ "fr": english_flagged_words
236
+ + [
237
+ "baiseurs",
238
+ "baiseur",
239
+ "baiseuse",
240
+ "baiseuses",
241
+ "branlette",
242
+ "branlettes",
243
+ "branleuse",
244
+ "branleuses",
245
+ "cunillingus",
246
+ "cunilingus",
247
+ "enculée",
248
+ "enculées",
249
+ "enculation",
250
+ "enculations",
251
+ "enculement",
252
+ "enculements",
253
+ "fellation",
254
+ "fellations",
255
+ "porno",
256
+ "pornos",
257
+ "pornographie",
258
+ "pornographique",
259
+ "pornographiques",
260
+ "salope",
261
+ "salopes",
262
+ "suceuse",
263
+ "suceuses",
264
+ "xxx",
265
+ ],
266
+ "ha": english_flagged_words
267
+ + [
268
+ "bf",
269
+ "bura",
270
+ "burar",
271
+ "cin gindi",
272
+ "duri",
273
+ "durin",
274
+ "gindi",
275
+ "gindin",
276
+ "guntsu",
277
+ "guntsun",
278
+ "karuwa",
279
+ "karuwai",
280
+ "karuwar",
281
+ "maniyyi",
282
+ "maɗigo",
283
+ "maɗugo",
284
+ "nonuwa",
285
+ "shan gindi",
286
+ "tsuliya",
287
+ "tsuliyariskanci",
288
+ "ɗuwaiwai",
289
+ "ɗuwaiwan",
290
+ "ɗuwawu",
291
+ "ɗuwawun",
292
+ ],
293
+ "hi": english_flagged_words
294
+ + [
295
+ "अंडकोश की थैली",
296
+ "एक्स रेटेड",
297
+ "ओगाज़्म",
298
+ "कामोद्दीपक चित्र",
299
+ "कालीन का चूरा",
300
+ "कून",
301
+ "कॉक",
302
+ "गेंद का थैला",
303
+ "चाकलेट का रंग",
304
+ "चूची",
305
+ "चूतड़",
306
+ "झटका बंद",
307
+ "ठगना पैकर",
308
+ "डिल्डो",
309
+ "नितंब",
310
+ "पिछाड़ी",
311
+ "पीड़न कामुक",
312
+ "पॉर्न",
313
+ "फटना",
314
+ "फूहड़",
315
+ "बट",
316
+ "बहुत मदहोश",
317
+ "बेल अंत",
318
+ "भगवान-शापित",
319
+ "भगशेफ",
320
+ "माँ कमीने",
321
+ "मुखमैथुन",
322
+ "मुर्गा चूसने वाला",
323
+ "रक्तरंजित",
324
+ "लेबिया",
325
+ "वहशी",
326
+ "वहशीता",
327
+ "वैंग",
328
+ "शिश्नमल",
329
+ "संभोग सुख",
330
+ "सह शॉट",
331
+ "सींग का बना हुआ",
332
+ "होर",
333
+ "घपा घप",
334
+ "चुदाई",
335
+ "चुदक्कड़",
336
+ ],
337
+ "id": english_flagged_words
338
+ + [
339
+ "bokep",
340
+ "coli",
341
+ "colmek",
342
+ "grepe",
343
+ "horni",
344
+ "janda",
345
+ "jembut",
346
+ "jilat memek",
347
+ "jilmek",
348
+ "kontol",
349
+ "masturbasi",
350
+ "memek",
351
+ "ngentot",
352
+ "ngewe",
353
+ "peju",
354
+ "pepek",
355
+ "pornografi",
356
+ "sange",
357
+ "sepong",
358
+ "tusbol",
359
+ ],
360
+ "kn": english_flagged_words
361
+ + [
362
+ "ಗರ್ಭಪಾತ",
363
+ "ಗುದ",
364
+ "ಗುದದ್ವಾರ",
365
+ "ಕತ್ತೆ",
366
+ "ಆಶ್-ಫಕರ್",
367
+ "ಅಸ್ಹೋಲ್",
368
+ "ಅಸೋಲೆಸ್",
369
+ "ಬಾಲ್ಬಾಗ್",
370
+ "ಚೆಂಡುಗಳು",
371
+ "ಬಾಸ್ಟರ್ಡ್",
372
+ "ಬೆಲೆಂಡ್",
373
+ "ಮೃದ್ವಂಗಿ",
374
+ "ಪ್ರಾಣಿಜನ್ಯತೆ",
375
+ "ಬಿಚ್",
376
+ "ಬಿಟ್ಚಿಸ್",
377
+ "ಬೆಚಿಂಗ್",
378
+ "ರಕ್ತಸಿಕ್ತ",
379
+ "ಬ್ಲೋಜಾಬ್",
380
+ "ಬೊಲ್ಲೊಕ್",
381
+ "ಕುರುಚಲು ಗಿಡ",
382
+ "ಬೂಬಿಗಳು",
383
+ "ಸ್ತನಗಳನ್ನು",
384
+ "ಬುಕೆಟಾ",
385
+ "ತಿಕ",
386
+ "ಬಟ್",
387
+ "ಕಾರ್ಪೆಟ್ ಮಂಚರ್",
388
+ "ಚಿಂಕ್",
389
+ "ಸಿಪಾ",
390
+ "ಚಂದ್ರನಾಡಿ",
391
+ "ಕೋಳಿ",
392
+ "ಕೋಳಿ ಸಕ್ಕರ್",
393
+ "ಕಾಕ್ಸ್",
394
+ "ಕೂನ್",
395
+ "ಅಮೇಧ್ಯ",
396
+ "ಕಮ್",
397
+ "ಕಮ್ಶಾಟ್",
398
+ "ಕುನಿಲ್ಲಸ್",
399
+ "ಕಂಟ್",
400
+ "ಡ್ಯಾಮ್",
401
+ "ಡಿಕ್",
402
+ "ದ್ವಿಧ್ರುವಿ",
403
+ "dildos",
404
+ "ಡಿಂಕ್",
405
+ "ನಾಯಿ-ಫಕರ್",
406
+ "ಡಚೆ",
407
+ "ಡೈಕ್",
408
+ "ಹೊರಹೊಮ್ಮಿಸು",
409
+ "ಸ್ಫೂರ್ತಿ",
410
+ "ಎಜಾಕ್ಯುಲೇಟ್ಸ್",
411
+ "ಇಜಲಲೇಟಿಂಗ್",
412
+ "ಉದ್ಗಾರ",
413
+ "ತಮಾಷೆ",
414
+ "ಮಂದಗತಿ",
415
+ "ಮಬ್ಬು",
416
+ "fagots",
417
+ "ಫ್ಯಾನಿ",
418
+ "ಹೊಡೆತ",
419
+ "ಪತನ",
420
+ "ಚಾಚುಪಟ್ಟಿ",
421
+ "ಫಕ್",
422
+ "ನಾಶವಾಗಿದ್ದನು",
423
+ "ಫಕರ್",
424
+ "fuckers",
425
+ "ಫಕಿಂಗ್",
426
+ "ಫಕಿಂಗ್ಸ್",
427
+ "ಇಷ್ಟಪಡುತ್ತಾನೆ",
428
+ "ಮಿಠಾಯಿ ಪ್ಯಾಕರ್",
429
+ "ದೇವರನ್ನು ಹಾನಿಗೊಳಗಾಯಿತು",
430
+ "ಗಾಡ್ಡಮ್",
431
+ "ನರಕ",
432
+ "ಹೋರ್",
433
+ "ಮೊನಚಾದ",
434
+ "ಜರ್ಕ್-ಆಫ್",
435
+ "ಕೋಕ್",
436
+ "ಯೋನಿಯ",
437
+ "ಕಾಮ",
438
+ "ಕಾಮುಕ",
439
+ "ಮಾಸೋಚಿಸ್ಟ್",
440
+ "ಹಸ್ತಮೈಥುನ ಮಾಡು",
441
+ "ತಾಯಿ ಫಕರ್",
442
+ "ನಾಜಿ",
443
+ "ನಿಗರ್",
444
+ "ನಿಗ್ಗರ್ಗಳು",
445
+ "ಒರಾಸಿಮ್",
446
+ "ಪರಾಕಾಷ್ಠೆ",
447
+ "ಪರಾಕಾಷ್ಠೆಗಳನ್ನು",
448
+ "ಪೆಕರ್",
449
+ "ಶಿಶ್ನ",
450
+ "ಮೂತ್ರ ವಿಸರ್ಜಿಸು",
451
+ "ನಿರುತ್ಸಾಹಗೊಂಡಿದೆ",
452
+ "ಪಿಸರ್",
453
+ "ಮೂತ್ರಪಿಂಡಗಳು",
454
+ "pissing",
455
+ "ಪಿಸ್ಸಾಫ್",
456
+ "ಪೂಪ್",
457
+ "ಅಶ್ಲೀಲತೆ",
458
+ "ಅಶ್ಲೀಲ",
459
+ "ಚುಚ್ಚು",
460
+ "ಪ್ರಿಕ್ಸ್",
461
+ "ಪಬ್",
462
+ "ಪುಸಿಗಳು",
463
+ "ಪುಸಿ",
464
+ "ಅತ್ಯಾಚಾರ",
465
+ "ಅತ್ಯಾಚಾರಿ",
466
+ "ಗುದನಾಳದ",
467
+ "ರಿಟಾರ್ಡ್",
468
+ "ಹಚ್ಚುವುದು",
469
+ "ದುಃಖಗಾರ",
470
+ "ತಿರುಗಿಸುವುದು",
471
+ "ಸ್ಕ್ರೋಟಮ್",
472
+ "ವೀರ್ಯ",
473
+ "ಲೈಂಗಿಕತೆ",
474
+ "ಶಾಗ್",
475
+ "ಶಾಗ್ಗಿಂಗ್",
476
+ "ಶೆಮೇಲ್",
477
+ "ಶಿಟ್",
478
+ "ಷೈಟ್",
479
+ "ಶಿಟ್ಸ್",
480
+ "shitted",
481
+ "ಅಲುಗಾಡುವಿಕೆ",
482
+ "ಅಸಹ್ಯ",
483
+ "ಸ್ಕಾಂಕ್",
484
+ "ಸೂಳೆ",
485
+ "ಸ್ಲಟ್ಗಳು",
486
+ "ಸ್ಮೆಗ್ಮಾ",
487
+ "ಕೊಳೆತ",
488
+ "ಸ್ನ್ಯಾಚ್",
489
+ "ಮಗ-ಆಫ್-ಬಿಚ್",
490
+ "spac",
491
+ "ಉಬ್ಬು",
492
+ "ವೃಷಣ",
493
+ "ಟಿಟ್",
494
+ "ಚೇಕಡಿ ಹಕ್ಕಿಗಳು",
495
+ "turd",
496
+ "ಯೋನಿ",
497
+ "ವಯಾಗ್ರ",
498
+ "ವಾಂಗ್",
499
+ "ಮುಷ್ಕರ",
500
+ "x ರೇಟೆಡ್",
501
+ "xxx",
502
+ ],
503
+ "ml": english_flagged_words
504
+ + [
505
+ "ഗർഭഛിദ്രം",
506
+ "വിശപ്പ്",
507
+ "മലദ്വാരം",
508
+ "കഴുത",
509
+ "അസി ഫക്കർ",
510
+ "കഴുതകളെ",
511
+ "ആസ്ഹോൾ",
512
+ "അശ്ളീലങ്ങൾ",
513
+ "ബോൾബാഗ്",
514
+ "പന്തുകൾ",
515
+ "തന്തയില്ലാത്തവൻ",
516
+ "ബെല്ലെൻഡ്",
517
+ "മൃഗീയമായ",
518
+ "മൃഗീയത",
519
+ "ബിച്ച്",
520
+ "ബിച്ചുകൾ",
521
+ "ബിപിഡിംഗ്",
522
+ "രക്തരൂക്ഷിതമായ",
523
+ "ആശ്വാസം",
524
+ "ബലോക്ക്",
525
+ "ബോബ്",
526
+ "പൂക്കൾ",
527
+ "സ്തനങ്ങൾ",
528
+ "ബ്യൂട്ടാ",
529
+ "ബം",
530
+ "മയക്കുമരുന്ന്",
531
+ "പരവതാനി മാൻച്ചർ",
532
+ "ചുംബ്",
533
+ "സിപാ",
534
+ "ക്ലോറിസിസ്",
535
+ "കോക്ക്",
536
+ "കോക്ക് സക്കർ",
537
+ "കോക്സ്",
538
+ "കോൺ",
539
+ "ക്രാപ്പ്",
540
+ "ശുക്ലം",
541
+ "പുരുഷാരം",
542
+ "സി",
543
+ "മുഷിഞ്ഞ",
544
+ "കഷ്ടം",
545
+ "ഡിക്ക്",
546
+ "ഡിൽഡോ",
547
+ "dildos",
548
+ "ഡൈൻ",
549
+ "നായ-ഫക്കർ",
550
+ "ഡച്ച്",
551
+ "ഡൈകെ",
552
+ "ശമിപ്പിക്കുക",
553
+ "മോഷ്ടിച്ചു",
554
+ "വികാരങ്ങൾ",
555
+ "വിരസത",
556
+ "മടി",
557
+ "ക്ഷീണിപ്പിക്കുക",
558
+ "fagot",
559
+ "വഞ്ചന",
560
+ "ഫാനി",
561
+ "വേദന",
562
+ "flange",
563
+ "ഊമ്പി",
564
+ "സംഭോഗം ചെയ്യുക",
565
+ "ഫക്കർ",
566
+ "നർമ്മം",
567
+ "ഫഡ്ജ് പാക്കർ",
568
+ "ദൈവം-കൊള്ളിത",
569
+ "ഗോഡ്ഡം",
570
+ "നരകം",
571
+ "വയ്ക്കുക",
572
+ "വൃത്തികെട്ട",
573
+ "ജെർക് ഓഫ്",
574
+ "കിക്ക്",
575
+ "ലാബിയ",
576
+ "മോഹം",
577
+ "മോഹഭംഗം",
578
+ "മാസോച്ചിസ്റ്റ്",
579
+ "സ്വയംഭോഗം ചെയ്യുക",
580
+ "അമ്മ ഫക്കർ",
581
+ "നാസി",
582
+ "നിഗർ",
583
+ "മ��ക്കുമരുന്നുകൾ",
584
+ "രതിമൂർച്ഛ",
585
+ "പെക്കർ",
586
+ "ലിംഗം",
587
+ "മൂത്രമൊഴിക്കുക",
588
+ "കുഴഞ്ഞുവീഴുന്നു",
589
+ "പിസ്സർ",
590
+ "പിസ്സകൾ",
591
+ "pissing",
592
+ "പിസ്സോഫ്",
593
+ "poop",
594
+ "അശ്ലീലം",
595
+ "അശ്ലീലത",
596
+ "പ്രാവി",
597
+ "വിസർജ്യങ്ങൾ",
598
+ "പ്യൂബ്",
599
+ "pussies",
600
+ "pussy",
601
+ "ബലാൽസംഗം",
602
+ "ബലാത്സംഗം",
603
+ "മലാശയം",
604
+ "തുടരുക",
605
+ "റിമ്മിംഗ്",
606
+ "സചിസ്റ്റ്",
607
+ "വഞ്ചി",
608
+ "പുല്ല്",
609
+ "ബീജം",
610
+ "ശവം",
611
+ "ഷാഗിംഗ്",
612
+ "അവൾ",
613
+ "ഷീറ്റ്",
614
+ "ഷെയ്റ്റ്",
615
+ "shits",
616
+ "തിന്നിട്ടില്ല",
617
+ "ഷോർട്ട്",
618
+ "ഷൈറ്റി",
619
+ "സ്കാൻ",
620
+ "മന്ദഹസരം",
621
+ "സ്നെഗമാ",
622
+ "പുഞ്ചിരി",
623
+ "പിടിക്കുക",
624
+ "വെറുക്കപ്പെട്ടയാൾ",
625
+ "സ്പെയ്ക്",
626
+ "തുളച്ച്",
627
+ "വൃഷണം",
628
+ "പേ",
629
+ "ടിത്ത്",
630
+ "കുഴപ്പമില്ല",
631
+ "യോനി",
632
+ "വരാഗ്ര",
633
+ "വാൽവ",
634
+ "വാങ്",
635
+ "വാൻ",
636
+ "വേശ്യ",
637
+ "x റേറ്റുചെയ്തു",
638
+ "xxx",
639
+ ],
640
+ "mr": english_flagged_words
641
+ + [
642
+ "गर्भपात",
643
+ "गुदा",
644
+ "गाढव",
645
+ "गांडुळ",
646
+ "asses",
647
+ "asshole",
648
+ "assholes",
649
+ "ballbag",
650
+ "चेंडू",
651
+ "बॅस्टर्ड",
652
+ "बेलेंड",
653
+ "बेस्टियल",
654
+ "प्राण्यांबरोबर",
655
+ "कुत्री",
656
+ "बिट्स",
657
+ "खूनी",
658
+ "blowjob",
659
+ "बोलोक",
660
+ "बोब",
661
+ "स्तन",
662
+ "बसीटा",
663
+ "बम",
664
+ "बट",
665
+ "कार्पेट मुन्चर",
666
+ "चिंक",
667
+ "सिपा",
668
+ "क्लिटोरिस",
669
+ "मुर्ख",
670
+ "मांसाहारी",
671
+ "कॉक्स",
672
+ "कॉनन",
673
+ "बकवास",
674
+ "सह",
675
+ "cumshot",
676
+ "कनिलिंगस",
677
+ "कांट",
678
+ "धिक्कार",
679
+ "डिक",
680
+ "dildo",
681
+ "डिल्डो",
682
+ "डंक",
683
+ "duche",
684
+ "डाईक",
685
+ "उद्गार",
686
+ "उत्साही",
687
+ "ejaculates",
688
+ "उत्सुकता",
689
+ "स्खलन",
690
+ "फॅग",
691
+ "फॅगिंग",
692
+ "फॅगॉट",
693
+ "फॅगॉट्स",
694
+ "फॅनी",
695
+ "फेलिंग",
696
+ "फॅलेटीओ",
697
+ "निकला",
698
+ "fucked",
699
+ "गुप्तचर",
700
+ "fuckers",
701
+ "fucking",
702
+ "fuckings",
703
+ "fucks",
704
+ "फडगे पॅकर",
705
+ "देव-शापित",
706
+ "देव",
707
+ "नरक",
708
+ "होरे",
709
+ "शिंग",
710
+ "झटका बंद",
711
+ "कॉक",
712
+ "लॅबिया",
713
+ "वासना",
714
+ "मासोचिस्ट",
715
+ "हस्तमैथुन करा",
716
+ "आई माकड",
717
+ "नाझी",
718
+ "निगर",
719
+ "निगार",
720
+ "ऑर्गॅसिम",
721
+ "संभोग",
722
+ "orgasms",
723
+ "चापटी",
724
+ "पुरुषाचे जननेंद्रिय",
725
+ "पेशी",
726
+ "pissed",
727
+ "पिसर",
728
+ "pisses",
729
+ "पिसिंग",
730
+ "पिसोफ",
731
+ "घाट",
732
+ "अश्लील",
733
+ "पोर्नोग्राफी",
734
+ "मुरुम",
735
+ "प्रिक्स",
736
+ "प्यूब",
737
+ "pussies",
738
+ "मांजर",
739
+ "बलात्कार",
740
+ "गुदाशय",
741
+ "मंद",
742
+ "rimming",
743
+ "दुःखी",
744
+ "screwing",
745
+ "स्क्रोटम",
746
+ "वीर्य",
747
+ "लिंग",
748
+ "शेग",
749
+ "shagging",
750
+ "शेमले",
751
+ "विचित्र",
752
+ "shite",
753
+ "shits",
754
+ "shitted",
755
+ "shitting",
756
+ "shitty",
757
+ "घाणेरडा",
758
+ "फट",
759
+ "sluts",
760
+ "सुगंध",
761
+ "स्मट",
762
+ "छेडछाड",
763
+ "मुलगा-एक-कुत्री",
764
+ "spac",
765
+ "तिरस्कार",
766
+ "परीक्षक",
767
+ "शीर्षक",
768
+ "टिट",
769
+ "टर्ड",
770
+ "योनी",
771
+ "वियाग्रा",
772
+ "वल्वा",
773
+ "वांग",
774
+ "विंक",
775
+ "वेश्या",
776
+ "एक्स रेट केले",
777
+ "xxx",
778
+ ],
779
+ "pt": english_flagged_words
780
+ + [
781
+ "balalao",
782
+ "bate uma",
783
+ "beijo grego",
784
+ "boceta",
785
+ "boquete",
786
+ "buceta",
787
+ "caralho",
788
+ "chochota",
789
+ "coito",
790
+ "cona",
791
+ "consolo",
792
+ "corno",
793
+ "cu",
794
+ "dar a bunda",
795
+ "dar o rabo",
796
+ "dildo",
797
+ "dildos",
798
+ "esporrar",
799
+ "estrovenga",
800
+ "felação",
801
+ "filho da puta",
802
+ "filhos da puta",
803
+ "gozada",
804
+ "jeba",
805
+ "perereca",
806
+ "pica",
807
+ "piru",
808
+ "porno",
809
+ "pornografia",
810
+ "pornô",
811
+ "porra",
812
+ "prostituta",
813
+ "pube",
814
+ "punheta",
815
+ "punheteiro",
816
+ "putaria",
817
+ "queca",
818
+ "sexo",
819
+ "siririca",
820
+ "tesão",
821
+ "trepada",
822
+ "verga",
823
+ "vibrador",
824
+ "xana",
825
+ "xochota",
826
+ "xoxota",
827
+ ],
828
+ "ta": english_flagged_words
829
+ + [
830
+ "ஓதா",
831
+ "ஒத்தா",
832
+ "புண்டை",
833
+ "ஒம்மாளே",
834
+ "பக்கி",
835
+ "கூமுட்டை",
836
+ "கருமம்",
837
+ "சனியன்",
838
+ "கஸ்மாலம்",
839
+ "சூத்து",
840
+ ],
841
+ "te": english_flagged_words
842
+ + [
843
+ "గర్భస్రావం",
844
+ "అంగ",
845
+ "పాయువు",
846
+ "గాడిద",
847
+ "గాడిద-fucker",
848
+ "asses",
849
+ "assholes",
850
+ "బాల్బ్యాగ్",
851
+ "బంతుల్లో",
852
+ "బాస్టర్డ్",
853
+ "బెల్లెండ్",
854
+ "మృగ",
855
+ "బెస్టియాలిటీ",
856
+ "బిచ్",
857
+ "bitches",
858
+ "బిట్చింగ్",
859
+ "బ్లడీ",
860
+ "blowjob",
861
+ "బోల్లక",
862
+ "బూబ్",
863
+ "వక్షోజాలను",
864
+ "ఛాతీ",
865
+ "buceta",
866
+ "బం",
867
+ "బట్",
868
+ "కార్పెట్ ముంచర్",
869
+ "చింక్",
870
+ "cipa",
871
+ "స్త్రీగుహ్యాంకురము",
872
+ "ఆత్మవిశ్వాసం",
873
+ "కాక్-సక్కర్",
874
+ "కాక్స్",
875
+ "కూన్",
876
+ "చెత్త",
877
+ "కం",
878
+ "cumshot",
879
+ "క్యునిల్లింగస్",
880
+ "కంట్",
881
+ "తిట్టు",
882
+ "డిక్",
883
+ "లైంగిక సంతృప్తి కోసం స్త్రీలు ఉపయోగించే పురుషాంగము వంటి పరికరము",
884
+ "డిల్డోస్",
885
+ "dink",
886
+ "కుక్క-fucker",
887
+ "డూష్",
888
+ "డైక్",
889
+ "స్ఖలించు",
890
+ "ఎజాక్యులేటెడ్",
891
+ "ఎజాక్యులేట్స్",
892
+ "ఎరాక్యులేటింగ్",
893
+ "స్ఖలనం",
894
+ "నవుకరు",
895
+ "ఫాగ్గింగ్",
896
+ "ఫాగాట్",
897
+ "ఫగాట్స్",
898
+ "fanny",
899
+ "ఫెల్చింగ్",
900
+ "కుడుచుట",
901
+ "అచ్చు",
902
+ "ఫక్",
903
+ "ఇబ్బంది పెట్టాడు",
904
+ "fucker",
905
+ "ఫకర్స్",
906
+ "ఫకింగ్",
907
+ "ఫకింగ్స్",
908
+ "ఫక్స్",
909
+ "ఫడ్జ్ ప్యాకర్",
910
+ "దేవతలా మంచిది",
911
+ "గాడ్డామ్",
912
+ "నరకం",
913
+ "హోర్",
914
+ "horny",
915
+ "జెర్క్-ఆఫ్",
916
+ "కాక్",
917
+ "పెదవి",
918
+ "కామం",
919
+ "మనసు పడ్డట్లు చిత్రించారు",
920
+ "masochist",
921
+ "హస్తప్రయోగం",
922
+ "తల్లి ఫెకర్",
923
+ "నాజీ",
924
+ "నిగ్గర్",
925
+ "నిగ్గర్స్",
926
+ "ఆర్గాసిమ్",
927
+ "స్కలనం",
928
+ "orgasms",
929
+ "pecker",
930
+ "పురుషాంగం",
931
+ "విసర్జన",
932
+ "pissed",
933
+ "పిస్సర్",
934
+ "పిస్సీస్",
935
+ "పిస్సింగ్",
936
+ "పిస్సాఫ్",
937
+ "poop",
938
+ "శృంగార",
939
+ "పోర్నో",
940
+ "అశ్లీల",
941
+ "బుడతడు",
942
+ "ప్రిక్స్",
943
+ "ప్యూబ్",
944
+ "pussies",
945
+ "పుస్సీ",
946
+ "రేప్",
947
+ "ఉన్నప్పటికీ బలాత్కారం",
948
+ "పురీషనాళం",
949
+ "రిటార్డ్",
950
+ "రిమ్మింగ్",
951
+ "పీడన కాముకత",
952
+ "screwing",
953
+ "స్క్రోటమ్",
954
+ "వీర్యం",
955
+ "సెక్స్",
956
+ "బొచ్చు",
957
+ "షగ్గింగ్",
958
+ "షీమేల్",
959
+ "ఒంటి",
960
+ "షైట్",
961
+ "షిట్స్",
962
+ "షిట్టెడ్",
963
+ "షిట్టింగ్",
964
+ "shitty",
965
+ "స్కాన్క్",
966
+ "నీతి",
967
+ "స్లట్స్",
968
+ "శిశ్న",
969
+ "స్మట్",
970
+ "స్నాచ్",
971
+ "ఒక బిచ్ కుమారుడు ఆఫ్",
972
+ "spac",
973
+ "స్పంక్",
974
+ "వృషణాలు",
975
+ "తునక",
976
+ "టిట్స్",
977
+ "టిట్",
978
+ "turd",
979
+ "యోని",
980
+ "వయాగ్రా",
981
+ "జననాంగం",
982
+ "వాంగ్",
983
+ "వ్యాంక్",
984
+ "వేశ్య",
985
+ "x రేట్",
986
+ "xxx",
987
+ ],
988
+ "vi": english_flagged_words
989
+ + [
990
+ "cặc",
991
+ "lồn",
992
+ "địt",
993
+ "buồi",
994
+ "đụ",
995
+ "nứng",
996
+ "phò",
997
+ "bú lol",
998
+ "bú lon",
999
+ "bú cac",
1000
+ "bú cak",
1001
+ "bú cax",
1002
+ "bú buoi",
1003
+ "bú mút",
1004
+ "húp lol",
1005
+ "húp lon",
1006
+ "húp sò",
1007
+ "nung lol",
1008
+ "nung lon",
1009
+ "nug lon",
1010
+ "nung cac",
1011
+ "nung buoi",
1012
+ "thèm dit",
1013
+ "thèm lon",
1014
+ "thèm lol",
1015
+ "thèm buoi",
1016
+ "thèm dit",
1017
+ "thèm chich",
1018
+ "thèm chịch",
1019
+ "thèm cax",
1020
+ "thèm cak",
1021
+ "móc lol",
1022
+ "móc lon",
1023
+ "buscu",
1024
+ "lol non",
1025
+ "bướm non",
1026
+ "vú to",
1027
+ "vú bự",
1028
+ "lon to",
1029
+ "phim sex",
1030
+ "phim xex",
1031
+ "phim xxx",
1032
+ ],
1033
+ "zh": english_flagged_words
1034
+ + [
1035
+ "三级片",
1036
+ "乳交",
1037
+ "卖B",
1038
+ "卖比",
1039
+ "卖淫",
1040
+ "口交",
1041
+ "姣西",
1042
+ "打炮",
1043
+ "打飞机",
1044
+ "抽插",
1045
+ "招妓",
1046
+ "插你",
1047
+ "插死你",
1048
+ "淫水",
1049
+ "肉壁",
1050
+ "肉棍子",
1051
+ "肉缝",
1052
+ "肏",
1053
+ "肛交",
1054
+ ],
1055
+ }
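
A minimal sketch of how these word lists are consumed, again assuming the filtering module above is importable as "filtering": the flagged-word ratio of a document is computed over its lower-cased, stripped words and compared to flagged_words_max_cutoff by check_flagged_words.

    from filtering import Filtering, LoadParameters  # assumed module name

    param = LoadParameters.load_parameters("en")
    flagged = LoadParameters.load_flagged_words("en")

    ratio = Filtering.compute_flagged_words_ratio(
        document="An ordinary, perfectly clean sentence.",
        sentencepiece_model_tok=None,  # fall back to whitespace tokenization
        strip_characters=param["strip_characters"],
        cond_words_augmentation=False,
        words_augmentation_group_sizes=[],
        words_augmentation_join_char="",
        flagged_words=flagged,
    )
    print(ratio)  # 0.0 here; documents above the cutoff are dropped
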
fr.arpa.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:301c82d52a8e34f63937afc12970794c8783244c8c0b085a8bbfb0d54dcb9374
3
+ size 2829042764
fr.sp.model ADDED
@@ -0,0 +1,3 @@
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b1b70d5e6556ad245e02ac76919a714ad0b7d288955df65ecd3831a42950b653
3
+ size 942639
fr_examples_with_stats.json ADDED
@@ -0,0 +1,3 @@
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8dd605b140e7a4c20a00e06c8c70d90333d2559434acd9c182de054d6b53b13b
3
+ size 140859096
id.arpa.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6e099b6216a558d6c6f6108895e2e13fbc6ffd00b59791d16d6a5f85103ac0be
3
+ size 1847280248
id.sp.model ADDED
@@ -0,0 +1,3 @@
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b217615a7b185e5e0c967ea5b7156fe149145221e32a54b96dfed15d98b3c807
3
+ size 926624
id_examples_with_stats.json ADDED
@@ -0,0 +1,3 @@
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d1c05dfc6f847bccf2e79cdb90c0dbb05a7266ae77673cd9f6c3cb811dace8e8
3
+ size 89435039
languages_id.py ADDED
@@ -0,0 +1,222 @@
1
+ import pandas as pd
2
+
3
+
4
+ langs_id = [
5
+ {
6
+ "lang": "Afrikaans",
7
+ "dataset_id": "af",
8
+ "stopwords_id": "af",
9
+ "flagged_words_id": None,
10
+ "fasttext_id": "af",
11
+ "sentencepiece_id": "af",
12
+ "kenlm_id": "af",
13
+ },
14
+ {
15
+ "lang": "Arabic",
16
+ "dataset_id": "ar",
17
+ "stopwords_id": "ar",
18
+ "flagged_words_id": "ar",
19
+ "fasttext_id": "ar",
20
+ "sentencepiece_id": "ar",
21
+ "kenlm_id": "ar",
22
+ },
23
+ {
24
+ "lang": "Egyptian Arabic",
25
+ "dataset_id": "arz",
26
+ "stopwords_id": None,
27
+ "flagged_words_id": None,
28
+ "fasttext_id": "arz",
29
+ "sentencepiece_id": "arz",
30
+ "kenlm_id": "arz",
31
+ },
32
+ {
33
+ "lang": "Assamese",
34
+ "dataset_id": "as",
35
+ "stopwords_id": None,
36
+ "flagged_words_id": None,
37
+ "fasttext_id": "as",
38
+ "sentencepiece_id": "as",
39
+ "kenlm_id": "as",
40
+ },
41
+ {
42
+ "lang": "Bengali",
43
+ "dataset_id": "bn",
44
+ "stopwords_id": "bn",
45
+ "flagged_words_id": None,
46
+ "fasttext_id": "bn",
47
+ "sentencepiece_id": "bn",
48
+ "kenlm_id": "bn",
49
+ },
50
+ {
51
+ "lang": "Catalan",
52
+ "dataset_id": "ca",
53
+ "stopwords_id": "ca",
54
+ "flagged_words_id": "ca",
55
+ "fasttext_id": "ca",
56
+ "sentencepiece_id": "ca",
57
+ "kenlm_id": "ca",
58
+ },
59
+ {
60
+ "lang": "English",
61
+ "dataset_id": "en",
62
+ "stopwords_id": "en",
63
+ "flagged_words_id": "en",
64
+ "fasttext_id": "en",
65
+ "sentencepiece_id": "en",
66
+ "kenlm_id": "en",
67
+ },
68
+ {
69
+ "lang": "Spanish",
70
+ "dataset_id": "es",
71
+ "stopwords_id": "es",
72
+ "flagged_words_id": "es",
73
+ "fasttext_id": "es",
74
+ "sentencepiece_id": "es",
75
+ "kenlm_id": "es",
76
+ },
77
+ {
78
+ "lang": "Basque",
79
+ "dataset_id": "eu",
80
+ "stopwords_id": "eu",
81
+ "flagged_words_id": "eu",
82
+ "fasttext_id": "eu",
83
+ "sentencepiece_id": "eu",
84
+ "kenlm_id": "eu",
85
+ },
86
+ {
87
+ "lang": "French",
88
+ "dataset_id": "fr",
89
+ "stopwords_id": "fr",
90
+ "flagged_words_id": "fr",
91
+ "fasttext_id": "fr",
92
+ "sentencepiece_id": "fr",
93
+ "kenlm_id": "fr",
94
+ },
95
+ {
96
+ "lang": "Gujarati",
97
+ "dataset_id": "gu",
98
+ "stopwords_id": None,
99
+ "flagged_words_id": None,
100
+ "fasttext_id": "gu",
101
+ "sentencepiece_id": "gu",
102
+ "kenlm_id": "gu",
103
+ },
104
+ {
105
+ "lang": "Hindi",
106
+ "dataset_id": "hi",
107
+ "stopwords_id": "hi",
108
+ "flagged_words_id": "hi",
109
+ "fasttext_id": "hi",
110
+ "sentencepiece_id": "hi",
111
+ "kenlm_id": "hi",
112
+ },
113
+ {
114
+ "lang": "Indonesian",
115
+ "dataset_id": "id",
116
+ "stopwords_id": "id",
117
+ "flagged_words_id": "id",
118
+ "fasttext_id": "id",
119
+ "sentencepiece_id": "id",
120
+ "kenlm_id": "id",
121
+ },
122
+ {
123
+ "lang": "Kannada",
124
+ "dataset_id": "kn",
125
+ "stopwords_id": None,
126
+ "flagged_words_id": "kn",
127
+ "fasttext_id": "kn",
128
+ "sentencepiece_id": "kn",
129
+ "kenlm_id": "kn",
130
+ },
131
+ {
132
+ "lang": "Malayalam",
133
+ "dataset_id": "ml",
134
+ "stopwords_id": None,
135
+ "flagged_words_id": "ml",
136
+ "fasttext_id": "ml",
137
+ "sentencepiece_id": "ml",
138
+ "kenlm_id": "ml",
139
+ },
140
+ {
141
+ "lang": "Marathi",
142
+ "dataset_id": "mr",
143
+ "stopwords_id": "mr",
144
+ "flagged_words_id": "mr",
145
+ "fasttext_id": "mr",
146
+ "sentencepiece_id": "mr",
147
+ "kenlm_id": "mr",
148
+ },
149
+ {
150
+ "lang": "Portuguese",
151
+ "dataset_id": "pt",
152
+ "stopwords_id": "pt",
153
+ "flagged_words_id": "pt",
154
+ "fasttext_id": "pt",
155
+ "sentencepiece_id": "pt",
156
+ "kenlm_id": "pt",
157
+ },
158
+ {
159
+ "lang": "Swahili",
160
+ "dataset_id": "sw",
161
+ "stopwords_id": "sw",
162
+ "flagged_words_id": None,
163
+ "fasttext_id": "sw",
164
+ "sentencepiece_id": "sw",
165
+ "kenlm_id": "sw",
166
+ },
167
+ {
168
+ "lang": "Tamil",
169
+ "dataset_id": "ta",
170
+ "stopwords_id": None,
171
+ "flagged_words_id": "ta",
172
+ "fasttext_id": "ta",
173
+ "sentencepiece_id": "ta",
174
+ "kenlm_id": "ta",
175
+ },
176
+ {
177
+ "lang": "Telugu",
178
+ "dataset_id": "te",
179
+ "stopwords_id": None,
180
+ "flagged_words_id": "te",
181
+ "fasttext_id": "te",
182
+ "sentencepiece_id": "te",
183
+ "kenlm_id": "te",
184
+ },
185
+ {
186
+ "lang": "Urdu",
187
+ "dataset_id": "ur",
188
+ "stopwords_id": "ur",
189
+ "flagged_words_id": None,
190
+ "fasttext_id": "ur",
191
+ "sentencepiece_id": "ur",
192
+ "kenlm_id": "ur",
193
+ },
194
+ {
195
+ "lang": "Vietnamese",
196
+ "dataset_id": "vi",
197
+ "stopwords_id": "vi",
198
+ "flagged_words_id": "vi",
199
+ "fasttext_id": "vi",
200
+ "sentencepiece_id": "vi",
201
+ "kenlm_id": "vi",
202
+ },
203
+ {
204
+ "lang": "Yoruba",
205
+ "dataset_id": "yo",
206
+ "stopwords_id": "yo",
207
+ "flagged_words_id": None,
208
+ "fasttext_id": "yo",
209
+ "sentencepiece_id": "yo",
210
+ "kenlm_id": "yo",
211
+ },
212
+ {
213
+ "lang": "Chinese",
214
+ "dataset_id": "zh",
215
+ "stopwords_id": "zh",
216
+ "flagged_words_id": "zh",
217
+ "fasttext_id": "zh",
218
+ "sentencepiece_id": "zh",
219
+ "kenlm_id": "zh",
220
+ },
221
+ ]
222
+ langs_id = pd.DataFrame(langs_id)
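
A minimal sketch of how this table is queried, mirroring LoadParameters in the filtering module above: each component looks up its language-specific identifier by dataset_id, and a None entry disables that resource for the language.

    from languages_id import langs_id

    row = langs_id.loc[langs_id["dataset_id"] == "fr"].iloc[0]
    print(row["fasttext_id"], row["sentencepiece_id"], row["kenlm_id"])  # fr fr fr
    print(row["flagged_words_id"])  # fr

    # For Afrikaans, flagged_words_id is None, so the flagged-words check is skipped.
    print(langs_id.loc[langs_id["dataset_id"] == "af", "flagged_words_id"].iloc[0])  # None
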
lid.176.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7e69ec5451bc261cc7844e49e4792a85d7f09c06789ec800fc4a44aec362764e
3
+ size 131266198
normalization.py ADDED
@@ -0,0 +1,52 @@
1
+ import re
2
+ from typing import Dict
3
+
4
+
5
+ non_printing_characters_re = re.compile(
6
+ f"[{''.join(map(chr, list(range(0,32)) + list(range(127,160))))}]"
7
+ )
8
+
9
+ digits_re: re.Pattern = re.compile(r"\d")
10
+
11
+ unicode_punctuation: Dict[str, str] = {
12
+ ",": ",",
13
+ "。": ".",
14
+ "、": ",",
15
+ "„": '"',
16
+ "”": '"',
17
+ "“": '"',
18
+ "«": '"',
19
+ "»": '"',
20
+ "1": '"',
21
+ "」": '"',
22
+ "「": '"',
23
+ "《": '"',
24
+ "》": '"',
25
+ "´": "'",
26
+ "∶": ":",
27
+ ":": ":",
28
+ "?": "?",
29
+ "!": "!",
30
+ "(": "(",
31
+ ")": ")",
32
+ ";": ";",
33
+ "–": "-",
34
+ "—": " - ",
35
+ ".": ". ",
36
+ "~": "~",
37
+ "’": "'",
38
+ "…": "...",
39
+ "━": "-",
40
+ "〈": "<",
41
+ "〉": ">",
42
+ "【": "[",
43
+ "】": "]",
44
+ "%": "%",
45
+ "►": "-",
46
+ }
47
+
48
+ normalization = {
49
+ "non_printing_characters_re": non_printing_characters_re,
50
+ "digits_re": digits_re,
51
+ "unicode_punctuation": unicode_punctuation,
52
+ }
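
A minimal sketch of how the exported normalization dict is applied, mirroring ModifyingDocuments.normalization in the filtering module above; the sample string is only illustrative.

    from normalization import normalization

    text = "Voici\x07 «un» exemple, n° 42…"
    text = normalization["non_printing_characters_re"].sub("", text)   # drop control characters
    text = normalization["digits_re"].sub("0", text)                   # replace digits with zeros
    text = "".join(
        normalization["unicode_punctuation"].get(c, c) for c in text   # ASCII-fold punctuation
    )
    print(text)  # Voici "un" exemple, n° 00...
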
parameters_filtering.py ADDED
@@ -0,0 +1,895 @@
1
+ import string
2
+ import emoji
3
+
4
+
5
+ main_special_characters = string.punctuation + string.digits + string.whitespace
6
+ other_special_characters = (
7
+ "’ “— ™ – •‘œ    ˜ ‚ƒ„’“”–ー一▬…✦�­£​•€«»°·═"
8
+ "×士^˘⇓↓↑←→()§″′´¿−±∈¢ø‚„½¼¾¹²³―⁃,ˌ¸‹›ʺˈʻ¦‐⠀‰……‑≤≥‖"
9
+ "◆●■►▼▲▴∆▻¡★☆✱ːº。¯˜¥ɪ≈†上ン:∼⁄・♡✓⊕․.⋅÷1‟;،、¨ाাी्े◦˚"
10
+ "゜ʼ≖ʼ¤ッツシ℃√!【】‿∞➤~πه۩☛₨➩☻๑٪♥ıॽ《‘©﴿٬?▷Г♫∟™ª₪®「—❖"
11
+ "」﴾》"
12
+ )
13
+ emoji = list(emoji.UNICODE_EMOJI["en"].keys())
14
+
15
+ special_characters_default = set(main_special_characters + other_special_characters)
16
+ special_characters_default.update(emoji)
17
+
18
+
19
+ parameters_filtering_default = {
20
+ "cond_uniform_whitespace": True,
21
+ "cond_replace_unicode_punctuation": False,
22
+ "cond_remove_words_with_incorrect_substrings": False,
23
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
24
+ "cond_remove_long_words": False,
25
+ "length_word_max_cutoff": 50,
26
+ "cond_check_number_words": True,
27
+ "tokenization": False,
28
+ "strip_characters": special_characters_default,
29
+ "number_words_min_cutoff": 1,
30
+ "number_words_max_cutoff": 100000,
31
+ "cond_check_character_repetition_removal": True,
32
+ "character_repetition_length": 10,
33
+ "character_repetition_max_cutoff": 0.106,
34
+ "cond_check_word_repetition_removal": True,
35
+ "word_repetition_length": 5,
36
+ "word_repetition_max_cutoff": 0.19,
37
+ "cond_check_special_characters": True,
38
+ "special_characters": special_characters_default,
39
+ "special_characters_max_cutoff": 0.4,
40
+ "cond_words_augmentation": False,
41
+ "words_augmentation_group_sizes": [],
42
+ "words_augmentation_join_char": "",
43
+ "cond_check_stopwords": False,
44
+ "stopwords_min_cutoff": 0,
45
+ "cond_check_flagged_words": False,
46
+ "flagged_words_max_cutoff": 0.2,
47
+ "cond_check_lang_id": True,
48
+ "lang_id_min_cutoff": 0.70,
49
+ "cond_check_perplexity": False,
50
+ "perplexity_max_cutoff": 3000000,
51
+ }
52
+
53
+ parameters_filtering_af = {
54
+ "cond_uniform_whitespace": True,
55
+ "cond_replace_unicode_punctuation": False,
56
+ "cond_remove_words_with_incorrect_substrings": False,
57
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
58
+ "cond_remove_long_words": True,
59
+ "length_word_max_cutoff": 25,
60
+ "cond_check_number_words": True,
61
+ "tokenization": False,
62
+ "strip_characters": special_characters_default,
63
+ "number_words_min_cutoff": 1,
64
+ "number_words_max_cutoff": 100000,
65
+ "cond_check_character_repetition_removal": True,
66
+ "character_repetition_length": 10,
67
+ "character_repetition_max_cutoff": 0.106,
68
+ "cond_check_word_repetition_removal": True,
69
+ "word_repetition_length": 5,
70
+ "word_repetition_max_cutoff": 0.19,
71
+ "cond_check_special_characters": True,
72
+ "special_characters": special_characters_default,
73
+ "special_characters_max_cutoff": 0.3,
74
+ "cond_words_augmentation": False,
75
+ "words_augmentation_group_sizes": [],
76
+ "words_augmentation_join_char": "",
77
+ "cond_check_stopwords": True,
78
+ "stopwords_min_cutoff": 0,
79
+ "cond_check_flagged_words": False,
80
+ "flagged_words_max_cutoff": 0.2,
81
+ "cond_check_lang_id": True,
82
+ "lang_id_min_cutoff": 0.6,
83
+ "cond_check_perplexity": True,
84
+ "perplexity_max_cutoff": 3000000,
85
+ }
86
+
87
+ parameters_filtering_ar = {
88
+ "cond_uniform_whitespace": True,
89
+ "cond_replace_unicode_punctuation": False,
90
+ "cond_remove_words_with_incorrect_substrings": False,
91
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
92
+ "cond_remove_long_words": True,
93
+ "length_word_max_cutoff": 25,
94
+ "cond_check_number_words": True,
95
+ "tokenization": False,
96
+ "strip_characters": special_characters_default,
97
+ "number_words_min_cutoff": 1,
98
+ "number_words_max_cutoff": 100000,
99
+ "cond_check_character_repetition_removal": True,
100
+ "character_repetition_length": 10,
101
+ "character_repetition_max_cutoff": 0.106,
102
+ "cond_check_word_repetition_removal": True,
103
+ "word_repetition_length": 5,
104
+ "word_repetition_max_cutoff": 0.19,
105
+ "cond_check_special_characters": True,
106
+ "special_characters": special_characters_default,
107
+ "special_characters_max_cutoff": 0.45,
108
+ "cond_words_augmentation": False,
109
+ "words_augmentation_group_sizes": [],
110
+ "words_augmentation_join_char": "",
111
+ "cond_check_stopwords": True,
112
+ "stopwords_min_cutoff": 0,
113
+ "cond_check_flagged_words": False,
114
+ "flagged_words_max_cutoff": 0.2,
115
+ "cond_check_lang_id": True,
116
+ "lang_id_min_cutoff": 0.75,
117
+ "cond_check_perplexity": True,
118
+ "perplexity_max_cutoff": 1000000,
119
+ }
120
+
121
+ parameters_filtering_arz = {
122
+ "cond_uniform_whitespace": True,
123
+ "cond_replace_unicode_punctuation": False,
124
+ "cond_remove_words_with_incorrect_substrings": False,
125
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
126
+ "cond_remove_long_words": True,
127
+ "length_word_max_cutoff": 25,
128
+ "cond_check_number_words": True,
129
+ "tokenization": False,
130
+ "strip_characters": special_characters_default,
131
+ "number_words_min_cutoff": 1,
132
+ "number_words_max_cutoff": 100000,
133
+ "cond_check_character_repetition_removal": True,
134
+ "character_repetition_length": 10,
135
+ "character_repetition_max_cutoff": 0.106,
136
+ "cond_check_word_repetition_removal": True,
137
+ "word_repetition_length": 5,
138
+ "word_repetition_max_cutoff": 0.19,
139
+ "cond_check_special_characters": True,
140
+ "special_characters": special_characters_default,
141
+ "special_characters_max_cutoff": 0.5,
142
+ "cond_words_augmentation": False,
143
+ "words_augmentation_group_sizes": [],
144
+ "words_augmentation_join_char": "",
145
+ "cond_check_stopwords": True,
146
+ "stopwords_min_cutoff": 0,
147
+ "cond_check_flagged_words": False,
148
+ "flagged_words_max_cutoff": 0.2,
149
+ "cond_check_lang_id": True,
150
+ "lang_id_min_cutoff": 0.75,
151
+ "cond_check_perplexity": False,
152
+ "perplexity_max_cutoff": 3000000,
153
+ }
154
+
155
+ parameters_filtering_as = {
156
+ "cond_uniform_whitespace": True,
157
+ "cond_replace_unicode_punctuation": False,
158
+ "cond_remove_words_with_incorrect_substrings": False,
159
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
160
+ "cond_remove_long_words": True,
161
+ "length_word_max_cutoff": 25,
162
+ "cond_check_number_words": True,
163
+ "tokenization": False,
164
+ "strip_characters": special_characters_default,
165
+ "number_words_min_cutoff": 1,
166
+ "number_words_max_cutoff": 100000,
167
+ "cond_check_character_repetition_removal": True,
168
+ "character_repetition_length": 10,
169
+ "character_repetition_max_cutoff": 0.106,
170
+ "cond_check_word_repetition_removal": True,
171
+ "word_repetition_length": 5,
172
+ "word_repetition_max_cutoff": 0.19,
173
+ "cond_check_special_characters": True,
174
+ "special_characters": special_characters_default,
175
+ "special_characters_max_cutoff": 0.25,
176
+ "cond_words_augmentation": False,
177
+ "words_augmentation_group_sizes": [],
178
+ "words_augmentation_join_char": "",
179
+ "cond_check_stopwords": True,
180
+ "stopwords_min_cutoff": 0,
181
+ "cond_check_flagged_words": False,
182
+ "flagged_words_max_cutoff": 0.2,
183
+ "cond_check_lang_id": True,
184
+ "lang_id_min_cutoff": 0.75,
185
+ "cond_check_perplexity": False,
186
+ "perplexity_max_cutoff": 3000000,
187
+ }
188
+
189
+ parameters_filtering_bn = {
190
+ "cond_uniform_whitespace": True,
191
+ "cond_replace_unicode_punctuation": False,
192
+ "cond_remove_words_with_incorrect_substrings": False,
193
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
194
+ "cond_remove_long_words": True,
195
+ "length_word_max_cutoff": 30,
196
+ "cond_check_number_words": True,
197
+ "tokenization": False,
198
+ "strip_characters": special_characters_default,
199
+ "number_words_min_cutoff": 1,
200
+ "number_words_max_cutoff": 100000,
201
+ "cond_check_character_repetition_removal": True,
202
+ "character_repetition_length": 10,
203
+ "character_repetition_max_cutoff": 0.106,
204
+ "cond_check_word_repetition_removal": True,
205
+ "word_repetition_length": 5,
206
+ "word_repetition_max_cutoff": 0.19,
207
+ "cond_check_special_characters": True,
208
+ "special_characters": special_characters_default,
209
+ "special_characters_max_cutoff": 0.275,
210
+ "cond_words_augmentation": False,
211
+ "words_augmentation_group_sizes": [],
212
+ "words_augmentation_join_char": "",
213
+ "cond_check_stopwords": True,
214
+ "stopwords_min_cutoff": 0.05,
215
+ "cond_check_flagged_words": False,
216
+ "flagged_words_max_cutoff": 0.2,
217
+ "cond_check_lang_id": True,
218
+ "lang_id_min_cutoff": 0.75,
219
+ "cond_check_perplexity": False,
220
+ "perplexity_max_cutoff": 575000,
221
+ }
222
+
223
+ parameters_filtering_ca = {
224
+ "cond_uniform_whitespace": True,
225
+ "cond_replace_unicode_punctuation": False,
226
+ "cond_remove_words_with_incorrect_substrings": False,
227
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
228
+ "cond_remove_long_words": True,
229
+ "length_word_max_cutoff": 30,
230
+ "cond_check_number_words": True,
231
+ "tokenization": False,
232
+ "strip_characters": special_characters_default,
233
+ "number_words_min_cutoff": 1,
234
+ "number_words_max_cutoff": 100000,
235
+ "cond_check_character_repetition_removal": True,
236
+ "character_repetition_length": 10,
237
+ "character_repetition_max_cutoff": 0.106,
238
+ "cond_check_word_repetition_removal": True,
239
+ "word_repetition_length": 5,
240
+ "word_repetition_max_cutoff": 0.19,
241
+ "cond_check_special_characters": True,
242
+ "special_characters": special_characters_default,
243
+ "special_characters_max_cutoff": 0.35,
244
+ "cond_words_augmentation": False,
245
+ "words_augmentation_group_sizes": [],
246
+ "words_augmentation_join_char": "",
247
+ "cond_check_stopwords": True,
248
+ "stopwords_min_cutoff": 0,
249
+ "cond_check_flagged_words": False,
250
+ "flagged_words_max_cutoff": 0.2,
251
+ "cond_check_lang_id": True,
252
+ "lang_id_min_cutoff": 0.75,
253
+ "cond_check_perplexity": True,
254
+ "perplexity_max_cutoff": 1750000,
255
+ }
256
+
257
+ parameters_filtering_en = {
258
+ "cond_uniform_whitespace": True,
259
+ "cond_replace_unicode_punctuation": False,
260
+ "cond_remove_words_with_incorrect_substrings": True,
261
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
262
+ "cond_remove_long_words": True,
263
+ "length_word_max_cutoff": 25,
264
+ "cond_check_number_words": True,
265
+ "tokenization": False,
266
+ "strip_characters": special_characters_default,
267
+ "number_words_min_cutoff": 20,
268
+ "number_words_max_cutoff": 100000,
269
+ "cond_check_character_repetition_removal": True,
270
+ "character_repetition_length": 10,
271
+ "character_repetition_max_cutoff": 0.106,
272
+ "cond_check_word_repetition_removal": True,
273
+ "word_repetition_length": 5,
274
+ "word_repetition_max_cutoff": 0.19,
275
+ "cond_check_special_characters": True,
276
+ "special_characters": special_characters_default,
277
+ "special_characters_max_cutoff": 0.4,
278
+ "cond_words_augmentation": False,
279
+ "words_augmentation_group_sizes": [],
280
+ "words_augmentation_join_char": "",
281
+ "cond_check_stopwords": True,
282
+ "stopwords_min_cutoff": 0.3,
283
+ "cond_check_flagged_words": True,
284
+ "flagged_words_max_cutoff": 0.045,
285
+ "cond_check_lang_id": True,
286
+ "lang_id_min_cutoff": 0.80,
287
+ "cond_check_perplexity": True,
288
+ "perplexity_max_cutoff": 2500,
289
+ }
290
+
291
+ parameters_filtering_es = {
292
+ "cond_uniform_whitespace": True,
293
+ "cond_replace_unicode_punctuation": False,
294
+ "cond_remove_words_with_incorrect_substrings": False,
295
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
296
+ "cond_remove_long_words": True,
297
+ "length_word_max_cutoff": 30,
298
+ "cond_check_number_words": True,
299
+ "tokenization": False,
300
+ "strip_characters": special_characters_default,
301
+ "number_words_min_cutoff": 1,
302
+ "number_words_max_cutoff": 100000,
303
+ "cond_check_character_repetition_removal": True,
304
+ "character_repetition_length": 10,
305
+ "character_repetition_max_cutoff": 0.106,
306
+ "cond_check_word_repetition_removal": True,
307
+ "word_repetition_length": 5,
308
+ "word_repetition_max_cutoff": 0.19,
309
+ "cond_check_special_characters": True,
310
+ "special_characters": special_characters_default,
311
+ "special_characters_max_cutoff": 0.3,
312
+ "cond_words_augmentation": False,
313
+ "words_augmentation_group_sizes": [],
314
+ "words_augmentation_join_char": "",
315
+ "cond_check_stopwords": True,
316
+ "stopwords_min_cutoff": 0.2,
317
+ "cond_check_flagged_words": False,
318
+ "flagged_words_max_cutoff": 0.2,
319
+ "cond_check_lang_id": True,
320
+ "lang_id_min_cutoff": 0.75,
321
+ "cond_check_perplexity": True,
322
+ "perplexity_max_cutoff": 2500000,
323
+ }
324
+
325
+ parameters_filtering_eu = {
326
+ "cond_uniform_whitespace": True,
327
+ "cond_replace_unicode_punctuation": False,
328
+ "cond_remove_words_with_incorrect_substrings": False,
329
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
330
+ "cond_remove_long_words": True,
331
+ "length_word_max_cutoff": 35,
332
+ "cond_check_number_words": True,
333
+ "tokenization": False,
334
+ "strip_characters": special_characters_default,
335
+ "number_words_min_cutoff": 1,
336
+ "number_words_max_cutoff": 100000,
337
+ "cond_check_character_repetition_removal": True,
338
+ "character_repetition_length": 10,
339
+ "character_repetition_max_cutoff": 0.106,
340
+ "cond_check_word_repetition_removal": True,
341
+ "word_repetition_length": 5,
342
+ "word_repetition_max_cutoff": 0.19,
343
+ "cond_check_special_characters": True,
344
+ "special_characters": special_characters_default,
345
+ "special_characters_max_cutoff": 0.3,
346
+ "cond_words_augmentation": False,
347
+ "words_augmentation_group_sizes": [],
348
+ "words_augmentation_join_char": "",
349
+ "cond_check_stopwords": True,
350
+ "stopwords_min_cutoff": 0,
351
+ "cond_check_flagged_words": False,
352
+ "flagged_words_max_cutoff": 0.2,
353
+ "cond_check_lang_id": True,
354
+ "lang_id_min_cutoff": 0.75,
355
+ "cond_check_perplexity": False,
356
+ "perplexity_max_cutoff": 3000000,
357
+ }
358
+
359
+ parameters_filtering_fr = {
360
+ "cond_uniform_whitespace": True,
361
+ "cond_replace_unicode_punctuation": False,
362
+ "cond_remove_words_with_incorrect_substrings": False,
363
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
364
+ "cond_remove_long_words": True,
365
+ "length_word_max_cutoff": 30,
366
+ "cond_check_number_words": True,
367
+ "tokenization": False,
368
+ "strip_characters": special_characters_default,
369
+ "number_words_min_cutoff": 1,
370
+ "number_words_max_cutoff": 100000,
371
+ "cond_check_character_repetition_removal": True,
372
+ "character_repetition_length": 10,
373
+ "character_repetition_max_cutoff": 0.106,
374
+ "cond_check_word_repetition_removal": True,
375
+ "word_repetition_length": 5,
376
+ "word_repetition_max_cutoff": 0.19,
377
+ "cond_check_special_characters": True,
378
+ "special_characters": special_characters_default,
379
+ "special_characters_max_cutoff": 0.35,
380
+ "cond_words_augmentation": False,
381
+ "words_augmentation_group_sizes": [],
382
+ "words_augmentation_join_char": "",
383
+ "cond_check_stopwords": True,
384
+ "stopwords_min_cutoff": 0.15,
385
+ "cond_check_flagged_words": False,
386
+ "flagged_words_max_cutoff": 0.2,
387
+ "cond_check_lang_id": True,
388
+ "lang_id_min_cutoff": 0.75,
389
+ "cond_check_perplexity": True,
390
+ "perplexity_max_cutoff": 3000000,
391
+ }
392
+
393
+ parameters_filtering_gu = {
394
+ "cond_uniform_whitespace": True,
395
+ "cond_replace_unicode_punctuation": False,
396
+ "cond_remove_words_with_incorrect_substrings": False,
397
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
398
+ "cond_remove_long_words": True,
399
+ "length_word_max_cutoff": 30,
400
+ "cond_check_number_words": True,
401
+ "tokenization": False,
402
+ "strip_characters": special_characters_default,
403
+ "number_words_min_cutoff": 1,
404
+ "number_words_max_cutoff": 100000,
405
+ "cond_check_character_repetition_removal": True,
406
+ "character_repetition_length": 10,
407
+ "character_repetition_max_cutoff": 0.106,
408
+ "cond_check_word_repetition_removal": True,
409
+ "word_repetition_length": 5,
410
+ "word_repetition_max_cutoff": 0.19,
411
+ "cond_check_special_characters": True,
412
+ "special_characters": special_characters_default,
413
+ "special_characters_max_cutoff": 0.3,
414
+ "cond_words_augmentation": False,
415
+ "words_augmentation_group_sizes": [],
416
+ "words_augmentation_join_char": "",
417
+ "cond_check_stopwords": True,
418
+ "stopwords_min_cutoff": 0,
419
+ "cond_check_flagged_words": False,
420
+ "flagged_words_max_cutoff": 0.2,
421
+ "cond_check_lang_id": True,
422
+ "lang_id_min_cutoff": 0.75,
423
+ "cond_check_perplexity": True,
424
+ "perplexity_max_cutoff": 250000,
425
+ }
426
+
427
+ parameters_filtering_hi = {
428
+ "cond_uniform_whitespace": True,
429
+ "cond_replace_unicode_punctuation": False,
430
+ "cond_remove_words_with_incorrect_substrings": False,
431
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
432
+ "cond_remove_long_words": True,
433
+ "length_word_max_cutoff": 25,
434
+ "cond_check_number_words": True,
435
+ "tokenization": False,
436
+ "strip_characters": special_characters_default,
437
+ "number_words_min_cutoff": 1,
438
+ "number_words_max_cutoff": 100000,
439
+ "cond_check_character_repetition_removal": True,
440
+ "character_repetition_length": 10,
441
+ "character_repetition_max_cutoff": 0.106,
442
+ "cond_check_word_repetition_removal": True,
443
+ "word_repetition_length": 5,
444
+ "word_repetition_max_cutoff": 0.19,
445
+ "cond_check_special_characters": True,
446
+ "special_characters": special_characters_default,
447
+ "special_characters_max_cutoff": 0.35,
448
+ "cond_words_augmentation": False,
449
+ "words_augmentation_group_sizes": [],
450
+ "words_augmentation_join_char": "",
451
+ "cond_check_stopwords": True,
452
+ "stopwords_min_cutoff": 0,
453
+ "cond_check_flagged_words": False,
454
+ "flagged_words_max_cutoff": 0.2,
455
+ "cond_check_lang_id": True,
456
+ "lang_id_min_cutoff": 0.75,
457
+ "cond_check_perplexity": True,
458
+ "perplexity_max_cutoff": 600000,
459
+ }
460
+
461
+ parameters_filtering_id = {
462
+ "cond_uniform_whitespace": True,
463
+ "cond_replace_unicode_punctuation": False,
464
+ "cond_remove_words_with_incorrect_substrings": False,
465
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
466
+ "cond_remove_long_words": True,
467
+ "length_word_max_cutoff": 30,
468
+ "cond_check_number_words": True,
469
+ "tokenization": False,
470
+ "strip_characters": special_characters_default,
471
+ "number_words_min_cutoff": 1,
472
+ "number_words_max_cutoff": 100000,
473
+ "cond_check_character_repetition_removal": True,
474
+ "character_repetition_length": 10,
475
+ "character_repetition_max_cutoff": 0.106,
476
+ "cond_check_word_repetition_removal": True,
477
+ "word_repetition_length": 5,
478
+ "word_repetition_max_cutoff": 0.19,
479
+ "cond_check_special_characters": True,
480
+ "special_characters": special_characters_default,
481
+ "special_characters_max_cutoff": 0.25,
482
+ "cond_words_augmentation": False,
483
+ "words_augmentation_group_sizes": [],
484
+ "words_augmentation_join_char": "",
485
+ "cond_check_stopwords": True,
486
+ "stopwords_min_cutoff": 0.25,
487
+ "cond_check_flagged_words": False,
488
+ "flagged_words_max_cutoff": 0.2,
489
+ "cond_check_lang_id": True,
490
+ "lang_id_min_cutoff": 0.75,
491
+ "cond_check_perplexity": True,
492
+ "perplexity_max_cutoff": 2500000,
493
+ }
494
+
495
+ parameters_filtering_kn = {
496
+ "cond_uniform_whitespace": True,
497
+ "cond_replace_unicode_punctuation": False,
498
+ "cond_remove_words_with_incorrect_substrings": False,
499
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
500
+ "cond_remove_long_words": True,
501
+ "length_word_max_cutoff": 50,
502
+ "cond_check_number_words": True,
503
+ "tokenization": False,
504
+ "strip_characters": special_characters_default,
505
+ "number_words_min_cutoff": 1,
506
+ "number_words_max_cutoff": 100000,
507
+ "cond_check_character_repetition_removal": True,
508
+ "character_repetition_length": 10,
509
+ "character_repetition_max_cutoff": 0.106,
510
+ "cond_check_word_repetition_removal": True,
511
+ "word_repetition_length": 5,
512
+ "word_repetition_max_cutoff": 0.19,
513
+ "cond_check_special_characters": True,
514
+ "special_characters": special_characters_default,
515
+ "special_characters_max_cutoff": 0.25,
516
+ "cond_words_augmentation": False,
517
+ "words_augmentation_group_sizes": [],
518
+ "words_augmentation_join_char": "",
519
+ "cond_check_stopwords": True,
520
+ "stopwords_min_cutoff": 0,
521
+ "cond_check_flagged_words": False,
522
+ "flagged_words_max_cutoff": 0.2,
523
+ "cond_check_lang_id": True,
524
+ "lang_id_min_cutoff": 0.75,
525
+ "cond_check_perplexity": True,
526
+ "perplexity_max_cutoff": 400000,
527
+ }
528
+
529
+ parameters_filtering_ml = {
530
+ "cond_uniform_whitespace": True,
531
+ "cond_replace_unicode_punctuation": False,
532
+ "cond_remove_words_with_incorrect_substrings": False,
533
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
534
+ "cond_remove_long_words": True,
535
+ "length_word_max_cutoff": 50,
536
+ "cond_check_number_words": True,
537
+ "tokenization": False,
538
+ "strip_characters": special_characters_default,
539
+ "number_words_min_cutoff": 1,
540
+ "number_words_max_cutoff": 100000,
541
+ "cond_check_character_repetition_removal": True,
542
+ "character_repetition_length": 10,
543
+ "character_repetition_max_cutoff": 0.106,
544
+ "cond_check_word_repetition_removal": True,
545
+ "word_repetition_length": 5,
546
+ "word_repetition_max_cutoff": 0.19,
547
+ "cond_check_special_characters": True,
548
+ "special_characters": special_characters_default,
549
+ "special_characters_max_cutoff": 0.2,
550
+ "cond_words_augmentation": False,
551
+ "words_augmentation_group_sizes": [],
552
+ "words_augmentation_join_char": "",
553
+ "cond_check_stopwords": True,
554
+ "stopwords_min_cutoff": 0,
555
+ "cond_check_flagged_words": False,
556
+ "flagged_words_max_cutoff": 0.2,
557
+ "cond_check_lang_id": True,
558
+ "lang_id_min_cutoff": 0.75,
559
+ "cond_check_perplexity": True,
560
+ "perplexity_max_cutoff": 1600000,
561
+ }
562
+
563
+ parameters_filtering_mr = {
564
+ "cond_uniform_whitespace": True,
565
+ "cond_replace_unicode_punctuation": False,
566
+ "cond_remove_words_with_incorrect_substrings": False,
567
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
568
+ "cond_remove_long_words": True,
569
+ "length_word_max_cutoff": 30,
570
+ "cond_check_number_words": True,
571
+ "tokenization": False,
572
+ "strip_characters": special_characters_default,
573
+ "number_words_min_cutoff": 1,
574
+ "number_words_max_cutoff": 100000,
575
+ "cond_check_character_repetition_removal": True,
576
+ "character_repetition_length": 10,
577
+ "character_repetition_max_cutoff": 0.106,
578
+ "cond_check_word_repetition_removal": True,
579
+ "word_repetition_length": 5,
580
+ "word_repetition_max_cutoff": 0.19,
581
+ "cond_check_special_characters": True,
582
+ "special_characters": special_characters_default,
583
+ "special_characters_max_cutoff": 0.25,
584
+ "cond_words_augmentation": False,
585
+ "words_augmentation_group_sizes": [],
586
+ "words_augmentation_join_char": "",
587
+ "cond_check_stopwords": True,
588
+ "stopwords_min_cutoff": 0,
589
+ "cond_check_flagged_words": False,
590
+ "flagged_words_max_cutoff": 0.2,
591
+ "cond_check_lang_id": True,
592
+ "lang_id_min_cutoff": 0.75,
593
+ "cond_check_perplexity": True,
594
+ "perplexity_max_cutoff": 425000,
595
+ }
596
+
597
+ parameters_filtering_pt = {
598
+ "cond_uniform_whitespace": True,
599
+ "cond_replace_unicode_punctuation": False,
600
+ "cond_remove_words_with_incorrect_substrings": False,
601
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
602
+ "cond_remove_long_words": True,
603
+ "length_word_max_cutoff": 30,
604
+ "cond_check_number_words": True,
605
+ "tokenization": False,
606
+ "strip_characters": special_characters_default,
607
+ "number_words_min_cutoff": 1,
608
+ "number_words_max_cutoff": 100000,
609
+ "cond_check_character_repetition_removal": True,
610
+ "character_repetition_length": 10,
611
+ "character_repetition_max_cutoff": 0.106,
612
+ "cond_check_word_repetition_removal": True,
613
+ "word_repetition_length": 5,
614
+ "word_repetition_max_cutoff": 0.19,
615
+ "cond_check_special_characters": True,
616
+ "special_characters": special_characters_default,
617
+ "special_characters_max_cutoff": 0.3,
618
+ "cond_words_augmentation": False,
619
+ "words_augmentation_group_sizes": [],
620
+ "words_augmentation_join_char": "",
621
+ "cond_check_stopwords": True,
622
+ "stopwords_min_cutoff": 0.15,
623
+ "cond_check_flagged_words": False,
624
+ "flagged_words_max_cutoff": 0.2,
625
+ "cond_check_lang_id": True,
626
+ "lang_id_min_cutoff": 0.75,
627
+ "cond_check_perplexity": True,
628
+ "perplexity_max_cutoff": 3000000,
629
+ }
630
+
631
+ parameters_filtering_sw = {
632
+ "cond_uniform_whitespace": True,
633
+ "cond_replace_unicode_punctuation": False,
634
+ "cond_remove_words_with_incorrect_substrings": False,
635
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
636
+ "cond_remove_long_words": True,
637
+ "length_word_max_cutoff": 30,
638
+ "cond_check_number_words": True,
639
+ "tokenization": False,
640
+ "strip_characters": special_characters_default,
641
+ "number_words_min_cutoff": 1,
642
+ "number_words_max_cutoff": 100000,
643
+ "cond_check_character_repetition_removal": True,
644
+ "character_repetition_length": 10,
645
+ "character_repetition_max_cutoff": 0.106,
646
+ "cond_check_word_repetition_removal": True,
647
+ "word_repetition_length": 5,
648
+ "word_repetition_max_cutoff": 0.19,
649
+ "cond_check_special_characters": True,
650
+ "special_characters": special_characters_default,
651
+ "special_characters_max_cutoff": 0.275,
652
+ "cond_words_augmentation": False,
653
+ "words_augmentation_group_sizes": [],
654
+ "words_augmentation_join_char": "",
655
+ "cond_check_stopwords": True,
656
+ "stopwords_min_cutoff": 0,
657
+ "cond_check_flagged_words": False,
658
+ "flagged_words_max_cutoff": 0.2,
659
+ "cond_check_lang_id": True,
660
+ "lang_id_min_cutoff": 0.75,
661
+ "cond_check_perplexity": False,
662
+ "perplexity_max_cutoff": 3000000,
663
+ }
664
+
665
+ parameters_filtering_ta = {
666
+ "cond_uniform_whitespace": True,
667
+ "cond_replace_unicode_punctuation": False,
668
+ "cond_remove_words_with_incorrect_substrings": False,
669
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
670
+ "cond_remove_long_words": True,
671
+ "length_word_max_cutoff": 50,
672
+ "cond_check_number_words": True,
673
+ "tokenization": False,
674
+ "strip_characters": special_characters_default,
675
+ "number_words_min_cutoff": 1,
676
+ "number_words_max_cutoff": 100000,
677
+ "cond_check_character_repetition_removal": True,
678
+ "character_repetition_length": 10,
679
+ "character_repetition_max_cutoff": 0.106,
680
+ "cond_check_word_repetition_removal": True,
681
+ "word_repetition_length": 5,
682
+ "word_repetition_max_cutoff": 0.19,
683
+ "cond_check_special_characters": True,
684
+ "special_characters": special_characters_default,
685
+ "special_characters_max_cutoff": 0.25,
686
+ "cond_words_augmentation": False,
687
+ "words_augmentation_group_sizes": [],
688
+ "words_augmentation_join_char": "",
689
+ "cond_check_stopwords": True,
690
+ "stopwords_min_cutoff": 0,
691
+ "cond_check_flagged_words": False,
692
+ "flagged_words_max_cutoff": 0.2,
693
+ "cond_check_lang_id": True,
694
+ "lang_id_min_cutoff": 0.75,
695
+ "cond_check_perplexity": False,
696
+ "perplexity_max_cutoff": 3000000,
697
+ }
698
+
699
+ parameters_filtering_te = {
700
+ "cond_uniform_whitespace": True,
701
+ "cond_replace_unicode_punctuation": False,
702
+ "cond_remove_words_with_incorrect_substrings": False,
703
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
704
+ "cond_remove_long_words": True,
705
+ "length_word_max_cutoff": 35,
706
+ "cond_check_number_words": True,
707
+ "tokenization": False,
708
+ "strip_characters": special_characters_default,
709
+ "number_words_min_cutoff": 1,
710
+ "number_words_max_cutoff": 100000,
711
+ "cond_check_character_repetition_removal": True,
712
+ "character_repetition_length": 10,
713
+ "character_repetition_max_cutoff": 0.106,
714
+ "cond_check_word_repetition_removal": True,
715
+ "word_repetition_length": 5,
716
+ "word_repetition_max_cutoff": 0.19,
717
+ "cond_check_special_characters": True,
718
+ "special_characters": special_characters_default,
719
+ "special_characters_max_cutoff": 0.25,
720
+ "cond_words_augmentation": False,
721
+ "words_augmentation_group_sizes": [],
722
+ "words_augmentation_join_char": "",
723
+ "cond_check_stopwords": True,
724
+ "stopwords_min_cutoff": 0,
725
+ "cond_check_flagged_words": False,
726
+ "flagged_words_max_cutoff": 0.2,
727
+ "cond_check_lang_id": True,
728
+ "lang_id_min_cutoff": 0.75,
729
+ "cond_check_perplexity": False,
730
+ "perplexity_max_cutoff": 3000000,
731
+ }
732
+
733
+ parameters_filtering_ur = {
734
+ "cond_uniform_whitespace": True,
735
+ "cond_replace_unicode_punctuation": False,
736
+ "cond_remove_words_with_incorrect_substrings": False,
737
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
738
+ "cond_remove_long_words": True,
739
+ "length_word_max_cutoff": 30,
740
+ "cond_check_number_words": True,
741
+ "tokenization": False,
742
+ "strip_characters": special_characters_default,
743
+ "number_words_min_cutoff": 1,
744
+ "number_words_max_cutoff": 100000,
745
+ "cond_check_character_repetition_removal": True,
746
+ "character_repetition_length": 10,
747
+ "character_repetition_max_cutoff": 0.106,
748
+ "cond_check_word_repetition_removal": True,
749
+ "word_repetition_length": 5,
750
+ "word_repetition_max_cutoff": 0.19,
751
+ "cond_check_special_characters": True,
752
+ "special_characters": special_characters_default,
753
+ "special_characters_max_cutoff": 0.4,
754
+ "cond_words_augmentation": False,
755
+ "words_augmentation_group_sizes": [],
756
+ "words_augmentation_join_char": "",
757
+ "cond_check_stopwords": True,
758
+ "stopwords_min_cutoff": 0,
759
+ "cond_check_flagged_words": False,
760
+ "flagged_words_max_cutoff": 0.2,
761
+ "cond_check_lang_id": True,
762
+ "lang_id_min_cutoff": 0.75,
763
+ "cond_check_perplexity": False,
764
+ "perplexity_max_cutoff": 3000000,
765
+ }
766
+
767
+ parameters_filtering_vi = {
768
+ "cond_uniform_whitespace": True,
769
+ "cond_replace_unicode_punctuation": False,
770
+ "cond_remove_words_with_incorrect_substrings": False,
771
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
772
+ "cond_remove_long_words": True,
773
+ "length_word_max_cutoff": 30,
774
+ "cond_check_number_words": True,
775
+ "tokenization": False,
776
+ "strip_characters": special_characters_default,
777
+ "number_words_min_cutoff": 1,
778
+ "number_words_max_cutoff": 100000,
779
+ "cond_check_character_repetition_removal": True,
780
+ "character_repetition_length": 10,
781
+ "character_repetition_max_cutoff": 0.106,
782
+ "cond_check_word_repetition_removal": True,
783
+ "word_repetition_length": 5,
784
+ "word_repetition_max_cutoff": 0.19,
785
+ "cond_check_special_characters": True,
786
+ "special_characters": special_characters_default,
787
+ "special_characters_max_cutoff": 0.35,
788
+ "cond_words_augmentation": True,
789
+ "words_augmentation_group_sizes": [2],
790
+ "words_augmentation_join_char": " ",
791
+ "cond_check_stopwords": True,
792
+ "stopwords_min_cutoff": 0,
793
+ "cond_check_flagged_words": False,
794
+ "flagged_words_max_cutoff": 0.2,
795
+ "cond_check_lang_id": True,
796
+ "lang_id_min_cutoff": 0.75,
797
+ "cond_check_perplexity": False,
798
+ "perplexity_max_cutoff": 3000000,
799
+ }
800
+
801
+ parameters_filtering_yo = {
802
+ "cond_uniform_whitespace": True,
803
+ "cond_replace_unicode_punctuation": False,
804
+ "cond_remove_words_with_incorrect_substrings": False,
805
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
806
+ "cond_remove_long_words": True,
807
+ "length_word_max_cutoff": 30,
808
+ "cond_check_number_words": True,
809
+ "tokenization": False,
810
+ "strip_characters": special_characters_default,
811
+ "number_words_min_cutoff": 1,
812
+ "number_words_max_cutoff": 100000,
813
+ "cond_check_character_repetition_removal": True,
814
+ "character_repetition_length": 10,
815
+ "character_repetition_max_cutoff": 0.106,
816
+ "cond_check_word_repetition_removal": True,
817
+ "word_repetition_length": 5,
818
+ "word_repetition_max_cutoff": 0.19,
819
+ "cond_check_special_characters": True,
820
+ "special_characters": special_characters_default,
821
+ "special_characters_max_cutoff": 0.3,
822
+ "cond_words_augmentation": False,
823
+ "words_augmentation_group_sizes": [],
824
+ "words_augmentation_join_char": "",
825
+ "cond_check_stopwords": True,
826
+ "stopwords_min_cutoff": 0,
827
+ "cond_check_flagged_words": False,
828
+ "flagged_words_max_cutoff": 0.2,
829
+ "cond_check_lang_id": True,
830
+ "lang_id_min_cutoff": 0.75,
831
+ "cond_check_perplexity": False,
832
+ "perplexity_max_cutoff": 3000000,
833
+ }
834
+
835
+ parameters_filtering_zh = {
836
+ "cond_uniform_whitespace": True,
837
+ "cond_replace_unicode_punctuation": False,
838
+ "cond_remove_words_with_incorrect_substrings": False,
839
+ "incorrect_word_substrings": ["http", "www", ".com", "href", "//"],
840
+ "cond_remove_long_words": False,
841
+ "length_word_max_cutoff": 1000,
842
+ "cond_check_number_words": True,
843
+ "tokenization": True,
844
+ "strip_characters": special_characters_default,
845
+ "number_words_min_cutoff": 1,
846
+ "number_words_max_cutoff": 100000,
847
+ "cond_check_character_repetition_removal": True,
848
+ "character_repetition_length": 10,
849
+ "character_repetition_max_cutoff": 0.106,
850
+ "cond_check_word_repetition_removal": True,
851
+ "word_repetition_length": 5,
852
+ "word_repetition_max_cutoff": 0.19,
853
+ "cond_check_special_characters": True,
854
+ "special_characters": special_characters_default,
855
+ "special_characters_max_cutoff": 0.4,
856
+ "cond_words_augmentation": True,
857
+ "words_augmentation_group_sizes": [2],
858
+ "words_augmentation_join_char": "",
859
+ "cond_check_stopwords": False,
860
+ "stopwords_min_cutoff": 0,
861
+ "cond_check_flagged_words": False,
862
+ "flagged_words_max_cutoff": 0.2,
863
+ "cond_check_lang_id": True,
864
+ "lang_id_min_cutoff": 0.75,
865
+ "cond_check_perplexity": False,
866
+ "perplexity_max_cutoff": 3000000,
867
+ }
868
+
869
+ parameters_filtering = {
870
+ "default": parameters_filtering_default,
871
+ "af": parameters_filtering_af,
872
+ "ar": parameters_filtering_ar,
873
+ "arz": parameters_filtering_arz,
874
+ "as": parameters_filtering_as,
875
+ "bn": parameters_filtering_bn,
876
+ "ca": parameters_filtering_ca,
877
+ "en": parameters_filtering_en,
878
+ "es": parameters_filtering_es,
879
+ "eu": parameters_filtering_eu,
880
+ "fr": parameters_filtering_fr,
881
+ "gu": parameters_filtering_gu,
882
+ "hi": parameters_filtering_hi,
883
+ "id": parameters_filtering_id,
884
+ "kn": parameters_filtering_kn,
885
+ "ml": parameters_filtering_ml,
886
+ "mr": parameters_filtering_mr,
887
+ "pt": parameters_filtering_pt,
888
+ "sw": parameters_filtering_sw,
889
+ "ta": parameters_filtering_ta,
890
+ "te": parameters_filtering_te,
891
+ "ur": parameters_filtering_ur,
892
+ "vi": parameters_filtering_vi,
893
+ "yo": parameters_filtering_yo,
894
+ "zh": parameters_filtering_zh,
895
+ }
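parameters_filtering maps a language's dataset_id to its tuned thresholds, with the "default" entry covering every other language. A short sketch of the lookup; get_filtering_parameters is an illustrative name, not a function defined in this commit:

    def get_filtering_parameters(lang_dataset_id: str) -> dict:
        """Per-language thresholds if tuned, otherwise the default set."""
        return parameters_filtering.get(lang_dataset_id, parameters_filtering["default"])

    # get_filtering_parameters("en")["perplexity_max_cutoff"] == 2500, while an
    # untuned code such as "de" falls back to the default thresholds.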
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ fasttext
+ sentencepiece
+ https://github.com/kpu/kenlm/archive/master.zip
+ emoji
stopwords.py ADDED
The diff for this file is too large to render. See raw diff
zh.arpa.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:240f156d70a4b04cb078b4f127ae0103378454143a77442c18e5e24b93404e56
+ size 3635106545
zh.sp.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ff2189b2cc84a513a76d24f9a0154e52f0afaf3010dc5fd1034ed37c9d2b5970
+ size 876286
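zh.arpa.bin (a KenLM language model) and zh.sp.model (its SentencePiece tokenizer) work as a pair: text is tokenized with SentencePiece before being scored by KenLM, and the resulting perplexity is compared against perplexity_max_cutoff. A minimal sketch, assuming kenlm and sentencepiece are installed and both files are local; the exact scoring used by the filtering code may differ in detail:

    import kenlm
    import sentencepiece

    sp = sentencepiece.SentencePieceProcessor()
    sp.load("zh.sp.model")
    lm = kenlm.Model("zh.arpa.bin")

    def perplexity(document: str) -> float:
        tokenized = " ".join(sp.encode_as_pieces(document))
        log10_prob = lm.score(tokenized)          # total log10 probability of the sequence
        n_tokens = len(tokenized.split()) + 1     # + 1 for the end-of-sentence token
        return 10.0 ** (-log10_prob / n_tokens)

    # Documents whose perplexity exceeds the language's perplexity_max_cutoff
    # would be dropped when cond_check_perplexity is enabled for that language.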
zh_examples_with_stats.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:318cf4641a46c9c7c16fc77171f28475cb8e96935201d3541d493b5231e8d53a
+ size 63524762