---
thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
license: gemma
datasets:
- mc4
- wikipedia
- EleutherAI/pile
- oscar-corpus/colossal-oscar-1.0
- cc100
language:
- ja
- en
tags:
- gemma2
inference: false
base_model: google/gemma-2-2b
---

# `Gemma 2 Baku 2B (rinna/gemma-2-baku-2b)`

![rinna-icon](./rinna.png)

# Overview

We conduct continual pre-training of [google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b) on **80B** tokens from a mixture of Japanese and English datasets. The continual pre-training improves the model's performance on Japanese tasks.

The name `baku` comes from the Japanese word [`獏/ばく/Baku`](https://ja.wikipedia.org/wiki/獏), which is a kind of Japanese mythical creature ([`妖怪/ようかい/Youkai`](https://ja.wikipedia.org/wiki/%E5%A6%96%E6%80%AA)).

| Size | Continual Pre-Training | Instruction-Tuning |
| :- | :- | :- |
| 2B | Gemma 2 Baku 2B [[HF]](https://huggingface.co/rinna/gemma-2-baku-2b) | Gemma 2 Baku 2B Instruct [[HF]](https://huggingface.co/rinna/gemma-2-baku-2b-instruct) |

* **Library**

    The model was trained using code based on [Lightning-AI/litgpt](https://github.com/Lightning-AI/litgpt).

* **Model architecture**

    A 26-layer, 2304-hidden-size transformer-based language model. Please refer to the [Gemma 2 Model Card](https://www.kaggle.com/models/google/gemma-2/) for detailed information on the model's architecture.

* **Training**

    The model was initialized with the [google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b) model and continually trained on around **80B** tokens from a mixture of the following corpora:
    - [Japanese CC-100](https://huggingface.co/datasets/cc100)
    - [Japanese C4](https://huggingface.co/datasets/mc4)
    - [Japanese OSCAR](https://huggingface.co/datasets/oscar-corpus/colossal-oscar-1.0)
    - [The Pile](https://huggingface.co/datasets/EleutherAI/pile)
    - [Wikipedia](https://dumps.wikimedia.org/other/cirrussearch)
    - rinna curated Japanese dataset

* **Contributors**

    - [Toshiaki Wakatsuki](https://huggingface.co/t-w)
    - [Xinqi Chen](https://huggingface.co/Keely0419)
    - [Kei Sawada](https://huggingface.co/keisawada)

---

# Benchmarking

Please refer to [rinna's LM benchmark page](https://rinnakk.github.io/research/benchmarks/lm/index.html).

---

# How to use the model

~~~python
import transformers
import torch

model_id = "rinna/gemma-2-baku-2b"

# Build a text-generation pipeline; weights are loaded in bfloat16 and
# placed on the available device(s) automatically.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto"
)

# Continue a Japanese prompt with sampling enabled, up to 256 new tokens.
output = pipeline(
    "西田幾多郎は、",
    max_new_tokens=256,
    do_sample=True
)
print(output[0]["generated_text"])
~~~

---

# Tokenization

The model uses the original [google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b) tokenizer.

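
Since the tokenizer is stated to be unchanged, the same text should map to the same token IDs under both repositories. The following is a minimal sketch of how one might check this, not an official utility: the helper names `load_tokenizer` and `same_token_ids` are hypothetical, and it assumes that `transformers` is installed and that your account can download `google/gemma-2-2b`, which may require accepting the Gemma terms on Hugging Face.

```python
def load_tokenizer(model_id):
    """Load a tokenizer from the Hugging Face Hub (requires network access)."""
    # Imported lazily so the comparison helper below can be used on its own.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_id)


def same_token_ids(tokenizer_a, tokenizer_b, texts):
    """Return True if both tokenizers yield identical input IDs for every text."""
    return all(
        tokenizer_a(text)["input_ids"] == tokenizer_b(text)["input_ids"]
        for text in texts
    )
```

A possible call is `same_token_ids(load_tokenizer("rinna/gemma-2-baku-2b"), load_tokenizer("google/gemma-2-2b"), ["西田幾多郎は、", "Hello, world."])`, which should return `True` if the two tokenizers match on those inputs.
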

---

# How to cite

```bibtex
@misc{rinna-gemma-2-baku-2b,
    title = {rinna/gemma-2-baku-2b},
    author = {Wakatsuki, Toshiaki and Chen, Xinqi and Sawada, Kei},
    url = {https://huggingface.co/rinna/gemma-2-baku-2b}
}

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}
```

---

# References

```bibtex
@article{gemma-2-2024,
    title = {Gemma 2},
    url = {https://www.kaggle.com/models/google/gemma-2},
    publisher = {Kaggle},
    author = {Gemma Team},
    year = {2024}
}

@misc{litgpt-2023,
    author = {Lightning AI},
    title = {LitGPT},
    howpublished = {\url{https://github.com/Lightning-AI/litgpt}},
    year = {2023}
}
```

---

# License

[Gemma Terms of Use](https://ai.google.dev/gemma/terms)