naclbit committed on
Commit
aabbf7b
1 Parent(s): 9d34f40

Update README.md

Files changed (1)
  1. README.md +39 -2
README.md CHANGED
@@ -10,7 +10,7 @@ tags:
  license: apache-2.0
  ---

- The pre-trained model is work in progress!
+ The pre-trained model is a work in progress! Model weight downloads will be made available in the future.

  A 6.8 billion parameter pre-trained model for the Japanese language, based on EleutherAI's Mesh Transformer JAX, with a model structure similar to their GPT-J-6B pre-trained model.

@@ -34,4 +34,41 @@ EleutherAIによるMesh Transformer JAXをコードベースとした、GPT-J-6B
  | n_ctx | 2,048 |
  | n_vocab | 52,512 |
  | position encoding | [Rotary position encodings (RoPE)](https://arxiv.org/abs/2104.09864) |
- | RoPE dimensions | 64 |
+ | RoPE dimensions | 64 |
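For reference, below is a minimal NumPy sketch of how rotary position encodings could be applied to the first 64 dimensions of each attention head, in the interleaved GPT-J style described in the linked paper; it is illustrative only, and the function name and array shapes are assumptions rather than the Mesh Transformer JAX implementation.

```python
import numpy as np

def apply_rotary(x, rotary_dim=64, base=10000):
    """Apply RoPE to the first `rotary_dim` features of each head.

    x: array of shape (seq_len, n_heads, head_dim); returns the same shape.
    """
    seq_len = x.shape[0]
    rot, keep = x[..., :rotary_dim], x[..., rotary_dim:]
    # One frequency per rotated pair of dimensions.
    inv_freq = 1.0 / (base ** (np.arange(0, rotary_dim, 2) / rotary_dim))
    angles = np.outer(np.arange(seq_len), inv_freq)            # (seq_len, rotary_dim // 2)
    cos, sin = np.cos(angles)[:, None, :], np.sin(angles)[:, None, :]
    x1, x2 = rot[..., 0::2], rot[..., 1::2]                    # interleaved pairs
    rotated = np.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], axis=-1).reshape(rot.shape)
    return np.concatenate([rotated, keep], axis=-1)

# Example: queries for a 2,048-token context with 64 rotary dimensions.
q = np.random.randn(2048, 16, 256)   # (seq_len, n_heads, head_dim) -- shapes are assumptions
q_rot = apply_rotary(q, rotary_dim=64)
```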
+
+ ## Instructions
+
+ We recommend using finetuneanon's transformers codebase for inference, as its split checkpoints load much faster than the monolithic checkpoint supported by the Hugging Face Transformers repository.
+
+ The tokenizer still uses 50256 as the <|endoftext|> substitute, so token ID 50256 should be excluded during inference.
+
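The following sketch illustrates the two notes above using the stock Hugging Face Transformers generation API rather than finetuneanon's fork, whose loading path is not shown here; the local checkpoint path is a placeholder, and token ID 50256 is kept out of sampling via `bad_words_ids`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path to a converted checkpoint; the actual location and loading
# mechanism for the split checkpoint are assumptions, not part of this README.
checkpoint = "./gpt-j-japanese-6.8b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "吾輩は猫である。"
inputs = tokenizer(prompt, return_tensors="pt")

# The tokenizer uses 50256 as the <|endoftext|> substitute, so exclude it here.
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.8,
    bad_words_ids=[[50256]],
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With finetuneanon's codebase the loading call will differ, but the 50256 exclusion applies either way.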
+ ## Datasets
+
+ The lack of a high-quality Japanese corpus was one of the major challenges when training the model. We aimed to compile well-formatted corpora beyond Common Crawl.
+
+ The dataset is normalized and sanitized by stripping leading and trailing spaces and removing excessive CR/LF repetitions.
+
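As a rough illustration of the normalization described above, the sketch below strips leading and trailing spaces and collapses excessive CR/LF runs; it is an assumption about the cleaning step, not the project's actual pipeline.

```python
import re

def sanitize(text: str) -> str:
    # Normalize CR/LF to LF, strip leading/trailing spaces on each line,
    # then collapse runs of three or more newlines down to two.
    lines = [line.strip() for line in text.replace("\r\n", "\n").split("\n")]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()
```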
+ The whole dataset is about 400GB and 106B tokens (compared to 825GB/300B tokens for The Pile).
+
+ ** Common Crawl
+ - Jan-Dec 2018 72GB CC100-Japanese (https://metatext.io/datasets/cc100-japanese)
+ - November 2018 106GB OSCAR-Japanese (https://oscar-corpus.com)
+ - 75GB converted from the 860GB Google C4 Multilingual Japanese (re-formatted)
+
+ ** Books
+ - 140GB web fiction, non-fiction and blog corpus
+ - 5GB books and Aozora Bunko corpus (weighted 2x)
+
+ ** News
+ - 1GB scientific news, medical news and web news corpus
+
+ ** Wikipedia
+ - Aug 2021 3GB assorted and deduplicated Japanese Wikipedia (weighted 2x)
+ - Aug 2021 Wikibooks, Wikinews, Wikiquote, Wikisource, Wiktionary, Wikiversity and Wikivoyage
+
+ ** Other Corpora
+ - 2018 OpenSubtitles (https://opus.nlpl.eu/OpenSubtitles-v2018.php)
+ - 1980s-1990s BBS logs
+ - Assorted blogs crawl
+ - QED-ja
+ - TED 2020-ja