Yuta Hayashibe committed on
Commit
3ce8a0e
1 Parent(s): b7e22bb

Update README.md

  1. README.md +8 -2
README.md CHANGED
@@ -17,15 +17,21 @@ datasets:
  [megagonlabs/t5-base-japanese-web](https://huggingface.co/megagonlabs/t5-base-japanese-web) is a T5 (Text-to-Text Transfer Transformer) model pre-trained on Japanese web texts.
  Training codes are [available on GitHub](https://github.com/megagonlabs/t5-japanese).
 
- ### Corpus
+ ### Corpora
+
+ We used following corpora for pre-training.
 
  - Japanese in [mC4/3.0.1](https://huggingface.co/datasets/mc4) (We used [Tensorflow native format](https://github.com/allenai/allennlp/discussions/5056))
+ - 87,425,304 pages
+ - 782 GB in TFRecord format
  - [Japanese](https://www.tensorflow.org/datasets/catalog/wiki40b#wiki40bja) in [wiki40b/1.3.0](https://www.tensorflow.org/datasets/catalog/wiki40b)
+ - 828,236 articles (2,073,584 examples)
+ - 2 GB in TFRecord format
 
 
  ### Tokenizer
 
- SentencePiece trained on Japanese Wikipedia
+ We used Japanese Wikipedia to train [SentencePiece](https://github.com/google/sentencepiece).
 
  - Vocabulary size: 32,000
  - [Byte-fallback](https://github.com/google/sentencepiece/releases/tag/v0.1.9): Enabled
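
The tokenizer settings in the diff above (vocabulary size 32,000, byte-fallback enabled) could be passed to the SentencePiece Python trainer roughly as follows. This is a minimal sketch, not the authors' training script: the corpus path `jawiki.txt`, the output prefix `spm_ja`, and the helper names `build_trainer_kwargs` and `train_tokenizer` are hypothetical, and every option the README does not mention is left at the SentencePiece default.

```python
from pathlib import Path


def build_trainer_kwargs(corpus_path: str, model_prefix: str) -> dict:
    """Collect SentencePiece training options that mirror the README settings."""
    return {
        "input": corpus_path,          # hypothetical: plain-text corpus, one sentence per line
        "model_prefix": model_prefix,  # hypothetical: prefix for the .model/.vocab outputs
        "vocab_size": 32000,           # README: vocabulary size 32,000
        "byte_fallback": True,         # README: byte-fallback enabled
    }


def train_tokenizer(corpus_path: str, model_prefix: str) -> Path:
    """Train a SentencePiece model and return the path of the resulting .model file."""
    # Imported lazily so the option helper above works without the package installed.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**build_trainer_kwargs(corpus_path, model_prefix))
    return Path(f"{model_prefix}.model")


if __name__ == "__main__":
    # Only prints the options; call train_tokenizer(...) once a corpus file exists.
    print(build_trainer_kwargs("jawiki.txt", "spm_ja"))
```

With byte-fallback enabled, characters outside the learned vocabulary are decomposed into byte pieces instead of being mapped to the unknown token. The model type and character coverage used for the released tokenizer are not stated in this diff, so the sketch does not set them.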