Update README.md
README.md CHANGED

@@ -10,7 +10,7 @@ tags:
license: apache-2.0
---

-The pre-trained model is work in progress!
+The pre-trained model is work in progress! Model weight download will be available in the future.

A 6.8 billion parameter pre-trained model for the Japanese language, based on EleutherAI's Mesh Transformer JAX, with a model structure similar to their GPT-J-6B pre-trained model.

@@ -34,4 +34,41 @@ EleutherAIによるMesh Transformer JAXをコードベースとした、GPT-J-6B
| n_ctx | 2,048 |
| n_vocab | 52,512 |
| position encoding | [Rotary position encodings (RoPE)](https://arxiv.org/abs/2104.09864) |
| RoPE dimensions | 64 |
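If the model follows the Mesh Transformer JAX config format, the hyperparameters above would map roughly as in the sketch below. The key names are taken from the published GPT-J-6B example config; the dimensions not listed in the table are placeholder assumptions, not released specifications.

```python
# Hypothetical Mesh Transformer JAX-style config for this model.
# Only values taken from the table above are grounded; the entries
# marked "assumed" are placeholders for unpublished dimensions.
config = {
    "n_vocab": 52512,      # from the table
    "seq": 2048,           # n_ctx, from the table
    "pe": "rotary",        # position encoding, from the table
    "pe_rotary_dims": 64,  # RoPE dimensions, from the table
    "layers": 28,          # assumed; not stated for this 6.8B model
    "d_model": 4096,       # assumed; not stated for this 6.8B model
    "n_heads": 16,         # assumed; not stated for this 6.8B model
}
```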

## Instructions

We recommend using finetuneanon's Transformers codebase for inference, since the split checkpoint loads much faster than the monolithic checkpoint supported by the Hugging Face Transformers repository.

The tokenizer still uses 50256 as the <|endoftext|> substitute, so token ID 50256 should be excluded during inference.
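
As a minimal sketch, assuming the split checkpoint has been downloaded to a local directory (the path below is a placeholder, since the weights are not yet released), inference with token 50256 banned could look like this. The `bad_words_ids` argument is the upstream Transformers way to forbid token-id sequences during generation, and the recommended fork is assumed to share that API.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder path: model weights are not yet available for download.
model_dir = "path/to/japanese-gpt-j-split-checkpoint"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

prompt = "吾輩は猫である。"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# The tokenizer keeps 50256 as the <|endoftext|> stand-in,
# so exclude it from every sampled continuation.
output = model.generate(
    input_ids,
    max_length=100,
    do_sample=True,
    temperature=0.8,
    bad_words_ids=[[50256]],
)
print(tokenizer.decode(output[0]))
```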

## Datasets

The lack of a high-quality Japanese corpus was one of the major challenges when training the model. We aimed to compile well-formatted corpora beyond Common Crawl.

The dataset is normalized and sanitized: leading and trailing spaces are stripped, and excessive CR/LF repetitions are collapsed.
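
A minimal sketch of that cleanup, with the exact rules assumed rather than documented:

```python
import re

def sanitize(text: str) -> str:
    """Strip per-line whitespace and collapse excessive blank lines."""
    # Normalize CR/LF variants to plain LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Strip leading and trailing spaces on every line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Collapse runs of three or more newlines down to one blank line.
    return re.sub(r"\n{3,}", "\n\n", text).strip()

print(sanitize("  こんにちは \r\n\r\n\r\n\r\n 世界  "))  # -> こんにちは\n\n世界
```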

The whole dataset is about 400GB and 106B tokens (compared to 825GB/300B tokens for The Pile).

**Common Crawl**
- Jan-Dec 2018: 72GB CC100-Japanese (https://metatext.io/datasets/cc100-japanese)
- November 2018: 106GB OSCAR-Japanese (https://oscar-corpus.com)
- 75GB of Japanese text converted from the 860GB Google C4 Multilingual dataset (re-formatted)

**Books**
- 140GB web fiction, non-fiction and blog corpus
- 5GB books and Aozora Bunko corpus (weighted 2x)

**News**
- 1GB scientific news, medical news and web news corpus

**Wikipedia**
- Aug 2021: 3GB assorted, deduplicated Japanese Wikipedia (weighted 2x)
- Aug 2021: Wikibooks, Wikinews, Wikiquote, Wikisource, Wiktionary, Wikiversity and Wikivoyage

**Other Corpora**
- 2018 OpenSubtitles (https://opus.nlpl.eu/OpenSubtitles-v2018.php)
- 1980s-90s BBS logs
- Assorted blog crawl
- QED-ja
- TED 2020-ja