naclbit committed on
Commit
aabbf7b
1 Parent(s): 9d34f40

Update README.md

Files changed (1)
  1. README.md +39 -2
README.md CHANGED
@@ -10,7 +10,7 @@ tags:
  license: apache-2.0
  ---

- The pre-trained model is work in progress!
+ The pre-trained model is a work in progress! Model weight downloads will be made available in the future.

  A 6.8 billion parameter pre-trained model for the Japanese language, based on EleutherAI's Mesh Transformer JAX, with a model structure similar to their GPT-J-6B pre-trained model.

@@ -34,4 +34,41 @@ EleutherAIによるMesh Transformer JAXをコードベースとした、GPT-J-6B
  | n_ctx | 2,048 |
  | n_vocab | 52,512 |
  | position encoding | [Rotary position encodings (RoPE)](https://arxiv.org/abs/2104.09864) |
- | RoPE dimensions | 64 |
+ | RoPE dimensions | 64 |
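For reference, below is a minimal NumPy sketch of how rotary position encodings could be applied to the first 64 dimensions of each attention head, in the interleaved GPT-J style described in the linked paper; it is illustrative only, and the function name and array shapes are assumptions rather than the Mesh Transformer JAX implementation.

```python
import numpy as np

def apply_rotary(x, rotary_dim=64, base=10000):
    """Apply RoPE to the first `rotary_dim` features of each head.

    x: array of shape (seq_len, n_heads, head_dim); returns the same shape.
    """
    seq_len = x.shape[0]
    rot, keep = x[..., :rotary_dim], x[..., rotary_dim:]
    # One frequency per rotated pair of dimensions.
    inv_freq = 1.0 / (base ** (np.arange(0, rotary_dim, 2) / rotary_dim))
    angles = np.outer(np.arange(seq_len), inv_freq)            # (seq_len, rotary_dim // 2)
    cos, sin = np.cos(angles)[:, None, :], np.sin(angles)[:, None, :]
    x1, x2 = rot[..., 0::2], rot[..., 1::2]                    # interleaved pairs
    rotated = np.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], axis=-1).reshape(rot.shape)
    return np.concatenate([rotated, keep], axis=-1)

# Example: queries for a 2,048-token context with 64 rotary dimensions.
q = np.random.randn(2048, 16, 256)   # (seq_len, n_heads, head_dim) -- shapes are assumptions
q_rot = apply_rotary(q, rotary_dim=64)
```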
+
+ ## Instructions
+
+ We recommend using finetuneanon's transformers codebase for inference, as its split checkpoints load much faster than the monolithic checkpoint supported by the Hugging Face Transformers repository.
+
+ The tokenizer still uses 50256 as the <|endoftext|> substitute, so token ID 50256 should be excluded during inference.
+
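The following sketch illustrates the two notes above using the stock Hugging Face Transformers generation API rather than finetuneanon's fork, whose loading path is not shown here; the local checkpoint path is a placeholder, and token ID 50256 is kept out of sampling via `bad_words_ids`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path to a converted checkpoint; the actual location and loading
# mechanism for the split checkpoint are assumptions, not part of this README.
checkpoint = "./gpt-j-japanese-6.8b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "吾輩は猫である。"
inputs = tokenizer(prompt, return_tensors="pt")

# The tokenizer uses 50256 as the <|endoftext|> substitute, so exclude it here.
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.8,
    bad_words_ids=[[50256]],
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With finetuneanon's codebase the loading call will differ, but the 50256 exclusion applies either way.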
+ ## Datasets
+
+ The lack of a high-quality Japanese corpus was one of the major challenges when training the model. We aimed to compile well-formatted corpora beyond Common Crawl.
+
+ The dataset is normalized and sanitized by stripping leading and trailing spaces and removing excessive CR/LF repetitions.
+
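As a rough illustration of the normalization described above, the sketch below strips leading and trailing spaces and collapses excessive CR/LF runs; it is an assumption about the cleaning step, not the project's actual pipeline.

```python
import re

def sanitize(text: str) -> str:
    # Normalize CR/LF to LF, strip leading/trailing spaces on each line,
    # then collapse runs of three or more newlines down to two.
    lines = [line.strip() for line in text.replace("\r\n", "\n").split("\n")]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()
```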
+ The whole dataset is about 400GB and 106B tokens (compared to 825GB/300B tokens for The Pile).
+
+ ** Common Crawl
+ - Jan-Dec 2018 72GB CC100-Japanese (https://metatext.io/datasets/cc100-japanese)
+ - November 2018 106GB OSCAR-Japanese (https://oscar-corpus.com)
+ - 75GB converted from the 860GB Google C4 Multilingual Japanese (re-formatted)
+
+ ** Books
+ - 140GB web fiction, non-fiction and blog corpus
+ - 5GB books and Aozora Bunko corpus (weighted 2x)
+
+ ** News
+ - 1GB scientific news, medical news and web news corpus
+
+ ** Wikipedia
+ - Aug 2021 3GB assorted and deduplicated Japanese Wikipedia (weighted 2x)
+ - Aug 2021 Wikibooks, Wikinews, Wikiquote, Wikisource, Wiktionary, Wikiversity and Wikivoyage
+
+ ** Other Corpora
+ - 2018 OpenSubtitles (https://opus.nlpl.eu/OpenSubtitles-v2018.php)
+ - 1980s-1990s BBS logs
+ - Assorted blogs crawl
+ - QED-ja
+ - TED 2020-ja