lfsm/ja_cc_filter

lfsm committed on Jun 30, 2023

Commit 7aafbf6 • 1 Parent(s): 3071277

Update README.md

Files changed (1)

README.md CHANGED

@@ -1,12 +1,9 @@
 ## CC_FILTER
-this is ja cc filter for reference from ja wiki vs random ja mc4, and build with following procedure.
+this is ja cc filter for reference from ja wiki vs random ja common crawl, and build with following procedure.
 1. get ja wiki dump file, and extract the all url inside, get about 4M urls
 2. crawl 300K of 4M webpages from the urls
 3. get pure text and remove content len less than 1k,
-4. use langdetect to tell the lang of the pages,
-we finally get total 160K pages : 101K ja pages, 47K en pages, and 12K other lang pages
-5. random sample 16K from ja mc4, concat with all 16k pages to get lang_all.txt data
-6. random sample 10K from ja mc4, concat with ja 10k pages to get lang_ja.txt data
-7. tokenize all text with "cl-tohoku/bert-base-japanese"
-8. feed lang_all.txt to fasttext to get model_all.bin
-9. feed lang_ja.txt to fasttext to get model_ja.bin
+4. use langdetect to tell the lang of the pages, we finally get total 101K ja pages
+5. random sample from commoncrawl 202303 and use langdetect to find 101k ja pages
+6. tokenize all text with "cl-tohoku/bert-base-japanese"
+7. feed tokens to fasttext to get model.bin
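
Steps 3 to 7 of the updated procedure could be sketched roughly as below. This is a minimal illustration, not the author's actual script. It assumes that "1k" means 1,000 characters, that the fasttext model is a supervised classifier separating wiki-referenced pages from random Common Crawl pages, and that the labels `wiki` and `cc` and the file name `train.txt` are acceptable stand-ins; none of these is stated in the README. The helper names are hypothetical.

```python
from typing import Callable, Iterable, List

MIN_CHARS = 1000  # assumption: "1k" in step 3 means 1,000 characters


def keep_japanese(pages: Iterable[str],
                  detect_lang: Callable[[str], str],
                  min_chars: int = MIN_CHARS) -> List[str]:
    """Steps 3-4: drop short pages, then keep pages detected as Japanese."""
    kept = []
    for text in pages:
        if len(text) < min_chars:
            continue
        try:
            if detect_lang(text) == "ja":
                kept.append(text)
        except Exception:
            # langdetect raises when it cannot extract features; skip the page
            continue
    return kept


def to_fasttext_line(label: str, tokens: List[str]) -> str:
    """Format one example in fasttext's supervised input format."""
    return f"__label__{label} " + " ".join(tokens)


def build_training_file(wiki_pages: Iterable[str],
                        cc_pages: Iterable[str],
                        tokenize: Callable[[str], List[str]],
                        path: str) -> None:
    """Step 6: tokenize both sets and write one labeled line per page."""
    with open(path, "w", encoding="utf-8") as f:
        # hypothetical label names: "wiki" = wiki-referenced, "cc" = random crawl
        for label, pages in (("wiki", wiki_pages), ("cc", cc_pages)):
            for text in pages:
                f.write(to_fasttext_line(label, tokenize(text)) + "\n")


def train(wiki_raw: Iterable[str], cc_raw: Iterable[str],
          train_path: str = "train.txt", model_path: str = "model.bin") -> None:
    """Steps 4-7 end to end; heavy dependencies are imported lazily."""
    from langdetect import detect
    from transformers import AutoTokenizer
    import fasttext

    # this tokenizer also needs the fugashi and ipadic packages installed
    tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
    wiki_ja = keep_japanese(wiki_raw, detect)
    cc_ja = keep_japanese(cc_raw, detect)
    build_training_file(wiki_ja, cc_ja, tokenizer.tokenize, train_path)

    # assumption: supervised training; the README only says "feed tokens to fasttext"
    model = fasttext.train_supervised(input=train_path)
    model.save_model(model_path)
```

Passing the language detector and tokenizer in as callables keeps the filtering and formatting logic easy to check with small stand-in functions before running the full crawl through it.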