Update README.md
README.md
CHANGED
@@ -21,7 +21,7 @@ GPT-J-6B based on the Mesh Transformer JAX codebase by EleutherAI
 - We used T5Tokenizer and SentencePiece instead of the GPT-2/3 tokenizer. Normalization done by SentencePiece is a must for Japanese tokenization, as common symbols have many more variations than in Western languages.
 - The tokenizer has a vocabulary of 52,500 tokens and was trained on a Japanese Wikipedia dump as of 01 Aug 2021.
 - The model fits within 16GB-VRAM GPUs such as the P100 for inference up to a context length of 1688. Full 2048-context-length output requires 20GB VRAM or more (e.g. RTX 3090/A5000).
-- The model was trained on a TPUv3-128 generously provided by Google TRC for about 4 weeks.
+- The model was trained on a TPUv3-128 generously provided by Google TRC for about 4 weeks. We are currently formatting additional datasets and preparing for more training time.
 
 ## Specifications
 
@@ -50,7 +50,7 @@ Lack of a quality Japanese corpus was one of the major challenges when we trained
 
 The dataset is normalized and sanitized against leading and trailing spaces and excessive CR/LF repetitions.
 
-The whole dataset is about 400GB and 106B tokens (compared to 825GB/300B tokens for The Pile).
+The whole dataset is about 400GB (as of October 2021) and 106B tokens (compared to 825GB/300B tokens for The Pile).
 
 ** Common Crawl
 - Jan-Dec 2018 72GB CC100-Japanese (https://metatext.io/datasets/cc100-japanese)
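The tokenizer bullet above can be illustrated with a minimal sketch of loading a SentencePiece-backed T5Tokenizer through Hugging Face transformers. The model ID below is a placeholder, not the project's actual repository name.

```python
# Minimal sketch: SentencePiece-backed T5Tokenizer (placeholder model ID).
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("your-org/japanese-gpt-j-6b")  # hypothetical ID

# Full-width punctuation and symbols are common in Japanese text; SentencePiece's
# default NFKC-based normalization folds such variants before subword segmentation.
text = "ＧＰＴ－Ｊ（日本語）でテキストを生成する。"
ids = tokenizer.encode(text)
print(tokenizer.convert_ids_to_tokens(ids))
```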
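If the weights are published in a Hugging Face-compatible format, the VRAM note above (16GB for inference up to a 1688-token context) would roughly correspond to running in half precision with a capped generation length. The sketch below assumes a placeholder model ID and the standard causal-LM classes; it is not the project's documented usage.

```python
# Hypothetical inference sketch: half precision, generation capped below the
# full 2048-token context. The model ID is a placeholder, not the real repo name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/japanese-gpt-j-6b"  # hypothetical ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("日本語の文章を続けてください:", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_length=1688, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```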
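The sanitization described in the dataset section (trimming leading/trailing spaces, collapsing excessive CR/LF repetitions) could look roughly like the following. This is an illustrative sketch, not the project's actual preprocessing code.

```python
import re

def sanitize(text: str) -> str:
    """Illustrative cleanup: strip leading/trailing spaces per line and
    collapse excessive CR/LF repetitions. Not the project's actual pipeline."""
    # Normalize line endings, then trim each line.
    lines = [line.strip() for line in text.replace("\r\n", "\n").split("\n")]
    cleaned = "\n".join(lines)
    # Collapse runs of 3+ newlines (2+ blank lines) down to a single blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip()

print(sanitize("  こんにちは。 \r\n\r\n\r\n\r\n 次の行。  "))
```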