A 24-layer, 2048-hidden-size transformer-based language model.
# Training

The model was trained on [Japanese C4](https://huggingface.co/datasets/allenai/c4), [Japanese CC-100](http://data.statmt.org/cc-100/ja.txt.xz) and [Japanese Wikipedia](https://dumps.wikimedia.org/other/cirrussearch) to optimize a traditional language modelling objective. It reaches around 14 perplexity on a validation set held out from the same data.
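The reported perplexity is a direct transformation of the language-modelling loss: it is the exponential of the mean per-token negative log-likelihood. A minimal sketch in plain Python (not the actual evaluation code, and the token losses below are illustrative):

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp of the mean per-token negative log-likelihood (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A perplexity of ~14 corresponds to a mean loss of ln(14) ≈ 2.64 nats/token.
loss = math.log(14)
print(round(perplexity([loss, loss, loss]), 2))  # → 14.0
```
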
# Tokenization

The model uses a [sentencepiece](https://github.com/google/sentencepiece)-based tokenizer. The vocabulary was first trained on a selected subset of the training data using the official sentencepiece training script, and then augmented with emojis and symbols.
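The augmentation step can be pictured as appending any emoji or symbol tokens the trained vocabulary lacks, assigning them fresh ids after the existing ones. A minimal sketch in pure Python with hypothetical token lists (the actual augmentation procedure is not published here):

```python
def augment_vocab(base_vocab, extra_tokens):
    """Extend a trained vocabulary (token -> id) with extra tokens,
    skipping tokens already present and assigning fresh ids at the end."""
    vocab = dict(base_vocab)
    for tok in extra_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

# Hypothetical base vocabulary from sentencepiece training.
base = {"<unk>": 0, "▁": 1, "日本": 2, "語": 3}
augmented = augment_vocab(base, ["😀", "♪", "日本"])
print(augmented)  # "日本" is already present; 😀 and ♪ get new ids 4 and 5
```
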
# How to cite

~~~
@misc{rinna-japanese-gpt-1b,
    title = {rinna/japanese-gpt-1b},
    author = {Zhao, Tianyu and Sawada, Kei},
    url = {https://huggingface.co/rinna/japanese-gpt-1b},
}

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    url = {https://arxiv.org/abs/2404.01657},
}
~~~
# License

[The MIT license](https://opensource.org/licenses/MIT)