CC_FILTER
This is a Japanese (ja) Common Crawl filter that separates pages referenced from ja Wikipedia from random ja Common Crawl pages. It was built with the following procedure:
- Download the ja Wikipedia dump file and extract all URLs in it (about 4M URLs).
- Crawl 300K of the 4M URLs.
- Extract the plain text and drop pages whose content length is less than 1k.
- Use langdetect to identify the language of each page; this leaves 101K ja pages in total.
- Randomly sample pages from Common Crawl 202303 and use langdetect to collect 101K ja pages.
- Tokenize all text with "rinna/japanese-roberta-base".
- Feed the tokens to fastText to produce model.bin.
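
The filtering, tokenizing, and training steps above could be sketched roughly as follows. This is a minimal illustration, not the script actually used. It assumes that "1k" means 1,000 characters, that fastText is trained in supervised mode, and that the two classes are labeled `wiki` and `cc`. The helper names (`keep_japanese`, `to_fasttext_line`, `build_training_lines`, `train_filter`) and the `train.txt` file are hypothetical. The tokenizer is loaded with `T5Tokenizer`, as the rinna model card suggests.

```python
import random

MIN_CHARS = 1000  # assumption: "1k" is read as 1,000 characters


def keep_japanese(texts, detect_lang, min_chars=MIN_CHARS):
    """Keep texts that are long enough and detected as Japanese."""
    kept = []
    for text in texts:
        if len(text) < min_chars:
            continue
        try:
            lang = detect_lang(text)
        except Exception:
            # langdetect raises when it finds no usable features; skip the page.
            continue
        if lang == "ja":
            kept.append(text)
    return kept


def to_fasttext_line(label, tokens):
    """Format one example in fastText's supervised format: __label__X tok tok ..."""
    return f"__label__{label} " + " ".join(t for t in tokens if t.strip())


def build_training_lines(wiki_texts, cc_texts, tokenize, seed=0):
    """Label both corpora (hypothetical labels 'wiki' and 'cc') and shuffle."""
    lines = [to_fasttext_line("wiki", tokenize(t)) for t in wiki_texts]
    lines += [to_fasttext_line("cc", tokenize(t)) for t in cc_texts]
    random.Random(seed).shuffle(lines)
    return lines


def train_filter(wiki_texts, cc_texts, out_path="model.bin"):
    """Wire the helpers to langdetect, the rinna tokenizer, and fastText."""
    # Third-party imports stay local so the helpers above work without them.
    import fasttext
    from langdetect import detect
    from transformers import T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
    lines = build_training_lines(
        keep_japanese(wiki_texts, detect),
        keep_japanese(cc_texts, detect),
        tokenizer.tokenize,
    )
    with open("train.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    # Assumption: supervised training with default hyperparameters.
    model = fasttext.train_supervised(input="train.txt")
    model.save_model(out_path)
```

Calling `train_filter(wiki_pages, cc_pages)` with two lists of page texts would write `model.bin`. The language detector and tokenizer are passed in as plain callables so the filtering and formatting logic can be checked without downloading any model.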