Update README.md

README.md (CHANGED)

@@ -38,13 +38,13 @@ GPT-J-6B, based on EleutherAI's Mesh Transformer JAX codebase

## Instructions

-We recommend to use finetuneanon's transformer codebase for inferencing as split checkpoint loads up a lot faster than monolithic checkpoint supported by HuggingFace Transformers repository.
+We recommend using finetuneanon's forked transformer codebase for inference, as the split checkpoint loads much faster than the monolithic checkpoint supported by the HuggingFace Transformers repository.
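
As a minimal sketch of what loading might look like with the fork installed in place of the stock `transformers` package (the local checkpoint path and the use of `GPTNeoForCausalLM` are assumptions for illustration, not this repository's documented API):

```python
# Minimal sketch, assuming finetuneanon's transformers fork is installed in
# place of the stock package and the split checkpoint has been downloaded
# to a local directory (hypothetical path).
from transformers import AutoTokenizer, GPTNeoForCausalLM

model = GPTNeoForCausalLM.from_pretrained("./gptj-japanese-split").eval()
tokenizer = AutoTokenizer.from_pretrained("./gptj-japanese-split")
```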
The tokenizer still uses 50256 as the <|endoftext|> substitute. Therefore, 50256 should be excluded during inference.
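
One way to do this with the standard `generate()` API is to ban the token ID outright (a sketch only; the prompt and sampling settings are placeholders):

```python
# Exclude token 50256 (the <|endoftext|> substitute) while sampling.
input_ids = tokenizer("こんにちは、", return_tensors="pt").input_ids
output = model.generate(
    input_ids,
    do_sample=True,
    max_length=60,
    bad_words_ids=[[50256]],  # never emit the <|endoftext|> substitute
)
print(tokenizer.decode(output[0]))
```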
## Datasets

-Lack of quality Japanese corpus
+The lack of a quality Japanese corpus was one of the major challenges when we trained the model. We aimed to compile well-formatted corpora outside of Common Crawl.
The dataset is normalized and sanitized against leading and trailing spaces and excessive CR/LF repetitions.
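
A rough sketch of this kind of sanitization (assumed logic for illustration, not the actual preprocessing script used to build the dataset):

```python
import re

def sanitize(text: str) -> str:
    # Normalize CRLF/CR to LF, then strip leading/trailing spaces per line.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = [line.strip() for line in text.splitlines()]
    # Collapse runs of three or more newlines (excessive blank lines) to two.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
```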