lfsm committed on
Commit
7aafbf6
1 Parent(s): 3071277

Update README.md

Files changed (1)
  1. README.md +5 -8
README.md CHANGED
@@ -1,12 +1,9 @@
  ## CC_FILTER
- This is a Japanese Common Crawl filter that distinguishes pages referenced from Japanese Wikipedia from random Japanese mC4 pages. It is built with the following procedure.
  1. Get the Japanese Wikipedia dump file and extract all URLs in it (about 4M URLs).
  2. Crawl 300K of the 4M web pages from those URLs.
  3. Extract the plain text and remove pages whose content length is less than 1k.
- 4. Use langdetect to detect the language of each page.
- This leaves 160K pages in total: 101K Japanese pages, 47K English pages, and 12K pages in other languages.
- 5. Randomly sample 16K pages from Japanese mC4 and concatenate them with all 16k pages to get the lang_all.txt data.
- 6. Randomly sample 10K pages from Japanese mC4 and concatenate them with the 10k Japanese pages to get the lang_ja.txt data.
- 7. Tokenize all text with "cl-tohoku/bert-base-japanese".
- 8. Feed lang_all.txt to fastText to get model_all.bin.
- 9. Feed lang_ja.txt to fastText to get model_ja.bin.
 
  ## CC_FILTER
+ This is a Japanese Common Crawl filter that distinguishes pages referenced from Japanese Wikipedia from random Japanese Common Crawl pages. It is built with the following procedure.
  1. Get the Japanese Wikipedia dump file and extract all URLs in it (about 4M URLs).
  2. Crawl 300K of the 4M web pages from those URLs.
  3. Extract the plain text and remove pages whose content length is less than 1k.
+ 4. Use langdetect to detect the language of each page, which leaves 101K Japanese pages in total.
+ 5. Randomly sample pages from Common Crawl 202303 and use langdetect to find 101k Japanese pages.
+ 6. Tokenize all text with "cl-tohoku/bert-base-japanese".
+ 7. Feed the tokens to fastText to get model.bin.
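
Steps 3 to 7 of the updated procedure could be sketched roughly as below. This is only an illustration, not the code used for this repository. The function names, the `wiki` and `cc` label names, the reading of "1k" as 1,000 characters, and the choice of fastText's supervised label format are all assumptions. The language detector and tokenizer are passed in as plain callables so the sketch runs without any third-party packages.

```python
import random
from typing import Callable, Iterable, List

MIN_LEN = 1000  # assumption: "1k" in step 3 is read as 1,000 characters


def keep_japanese(pages: Iterable[str],
                  detect_lang: Callable[[str], str],
                  min_len: int = MIN_LEN) -> List[str]:
    """Steps 3-5: drop short pages, keep those the detector labels 'ja'."""
    kept = []
    for text in pages:
        if len(text) < min_len:
            continue
        try:
            lang = detect_lang(text)
        except Exception:
            # Detectors may fail on text without usable features; skip it.
            continue
        if lang == "ja":
            kept.append(text)
    return kept


def to_fasttext_lines(pages: Iterable[str], label: str,
                      tokenize: Callable[[str], List[str]]) -> List[str]:
    """Step 6: tokenize each page and emit one '__label__X tok tok ...' line."""
    lines = []
    for text in pages:
        tokens = tokenize(text.replace("\n", " "))  # one example per line
        lines.append(f"__label__{label} " + " ".join(tokens))
    return lines


def build_training_lines(wiki_pages: Iterable[str], cc_pages: Iterable[str],
                         detect_lang: Callable[[str], str],
                         tokenize: Callable[[str], List[str]],
                         seed: int = 0) -> List[str]:
    """Balance the two sources (the README uses 101K each) and shuffle."""
    pos = keep_japanese(wiki_pages, detect_lang)
    neg = keep_japanese(cc_pages, detect_lang)
    n = min(len(pos), len(neg))
    rng = random.Random(seed)
    lines = (to_fasttext_lines(rng.sample(pos, n), "wiki", tokenize)
             + to_fasttext_lines(rng.sample(neg, n), "cc", tokenize))
    rng.shuffle(lines)
    return lines
```

With the real libraries, `detect_lang` could be `langdetect.detect` and `tokenize` could be the `tokenize` method of `transformers.AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")`. The resulting lines, written to a text file, could then be passed to `fasttext.train_supervised(input=...)`, and the trained model stored with `save_model("model.bin")` for step 7. Whether the original model was trained in supervised mode is not stated in the README.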