Update README.md
README.md
CHANGED
@@ -1,12 +1,9 @@
 ## CC_FILTER
-this is ja cc filter for reference from ja wiki vs random ja
+this is ja cc filter for reference from ja wiki vs random ja common crawl, and build with following procedure.
 1. get ja wiki dump file, and extract the all url inside, get about 4M urls
 2. crawl 300K of 4M webpages from the urls
 3. get pure text and remove content len less than 1k,
-4. use langdetect to tell the lang of the pages,
-
-
-
-7. tokenize all text with "cl-tohoku/bert-base-japanese"
-8. feed lang_all.txt to fasttext to get model_all.bin
-9. feed lang_ja.txt to fasttext to get model_ja.bin
+4. use langdetect to tell the lang of the pages, we finally get total 101K ja pages
+5. random sample from commoncrawl 202303 and use langdetect to find 101k ja pages
+6. tokenize all text with "cl-tohoku/bert-base-japanese"
+7. feed tokens to fasttext to get model.bin
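
Steps 3 to 7 of the updated procedure could be sketched roughly as below. This is a minimal sketch, not the author's actual script. It assumes that "1k" means 1,000 characters, that fastText is trained in supervised mode, and that the two page sources are distinguished by the labels `wiki` and `cc`. The README confirms none of these. All function names and the `train.txt` path are hypothetical. The language detector and tokenizer are passed in as callables, so the filtering logic can be checked without the third-party packages installed.

```python
from typing import Callable, Iterable, List

MIN_CHARS = 1000  # assumption: "content len less than 1k" means < 1,000 characters


def keep_page(text: str, detect_lang: Callable[[str], str],
              min_chars: int = MIN_CHARS) -> bool:
    """Steps 3-4: drop short pages, then keep only pages detected as Japanese."""
    return len(text) >= min_chars and detect_lang(text) == "ja"


def to_fasttext_line(label: str, tokens: Iterable[str]) -> str:
    """Format one example in fastText's supervised input format."""
    return f"__label__{label} " + " ".join(tokens)


def build_training_lines(wiki_pages: Iterable[str], cc_pages: Iterable[str],
                         tokenize: Callable[[str], List[str]],
                         detect_lang: Callable[[str], str]) -> List[str]:
    """Steps 3-6: filter both sources and emit one labeled, tokenized line per page."""
    lines = []
    # The labels "wiki" and "cc" are hypothetical names chosen for this sketch.
    for label, pages in (("wiki", wiki_pages), ("cc", cc_pages)):
        for text in pages:
            if keep_page(text, detect_lang):
                lines.append(to_fasttext_line(label, tokenize(text)))
    return lines


def load_real_components():
    """Lazily import the libraries named in the README (third-party packages)."""
    from langdetect import detect
    from langdetect.lang_detect_exception import LangDetectException
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")

    def detect_lang(text: str) -> str:
        try:
            return detect(text)
        except LangDetectException:
            return "unknown"  # pages with no detectable features are rejected

    return detect_lang, tokenizer.tokenize


def train(lines: List[str], train_path: str = "train.txt",
          model_path: str = "model.bin"):
    """Step 7: write the lines to disk and train a fastText classifier."""
    import fasttext  # third-party; imported lazily

    with open(train_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    model = fasttext.train_supervised(input=train_path)
    model.save_model(model_path)
    return model
```

With the real components, usage would look like `detect_lang, tokenize = load_real_components()`, followed by `train(build_training_lines(wiki_pages, cc_pages, tokenize, detect_lang))`. Here `wiki_pages` and `cc_pages` are the plain-text pages from steps 2 and 5.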