---
thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
license: gemma
datasets:
- mc4
- wikipedia
- EleutherAI/pile
- oscar-corpus/colossal-oscar-1.0
- cc100
language:
- ja
- en
tags:
- gemma2
inference: false
base_model: google/gemma-2-2b
pipeline_tag: text-generation
library_name: transformers
---
# `Gemma 2 Baku 2B (rinna/gemma-2-baku-2b)`
![rinna-icon](./rinna.png)
# Overview
We conduct continual pre-training of [google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b) on **80B** tokens from a mixture of Japanese and English datasets. The continual pre-training improves the model's performance on Japanese tasks.
The name `baku` comes from the Japanese word [`獏/ばく/Baku`](https://ja.wikipedia.org/wiki/獏), which is a kind of Japanese mythical creature ([`妖怪/ようかい/Youkai`](https://ja.wikipedia.org/wiki/%E5%A6%96%E6%80%AA)).
| Size | Continual Pre-Training | Instruction-Tuning |
| :- | :- | :- |
| 2B | Gemma 2 Baku 2B [[HF]](https://huggingface.co/rinna/gemma-2-baku-2b) | Gemma 2 Baku 2B Instruct [[HF]](https://huggingface.co/rinna/gemma-2-baku-2b-it) |
* **Library**
The model was trained using code based on [Lightning-AI/litgpt](https://github.com/Lightning-AI/litgpt).
* **Model architecture**
    A 26-layer, 2304-hidden-size transformer-based language model. Please refer to the [Gemma 2 Model Card](https://www.kaggle.com/models/google/gemma-2/) for detailed information on the model's architecture; a quick way to verify these values from the published configuration is sketched after this list.
* **Training**
    The model was initialized with the [google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b) model and continually trained on around **80B** tokens from a mixture of the following corpora:
- [Japanese CC-100](https://huggingface.co/datasets/cc100)
- [Japanese C4](https://huggingface.co/datasets/mc4)
- [Japanese OSCAR](https://huggingface.co/datasets/oscar-corpus/colossal-oscar-1.0)
- [The Pile](https://huggingface.co/datasets/EleutherAI/pile)
- [Wikipedia](https://dumps.wikimedia.org/other/cirrussearch)
- rinna curated Japanese dataset
* **Contributors**
- [Toshiaki Wakatsuki](https://huggingface.co/t-w)
- [Xinqi Chen](https://huggingface.co/Keely0419)
- [Kei Sawada](https://huggingface.co/keisawada)
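
The architecture details above can be checked directly from the model's configuration. The snippet below is a minimal sketch assuming the standard `transformers` configuration fields for Gemma 2 (`num_hidden_layers`, `hidden_size`); it downloads only the configuration file, not the weights.

~~~python
from transformers import AutoConfig

# Fetch only the configuration to inspect the architecture
# without downloading the model weights.
config = AutoConfig.from_pretrained("rinna/gemma-2-baku-2b")

print(config.model_type)         # expected: "gemma2"
print(config.num_hidden_layers)  # expected: 26
print(config.hidden_size)        # expected: 2304
~~~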
---
# Benchmarking
Please refer to [rinna's LM benchmark page](https://rinnakk.github.io/research/benchmarks/lm/index.html).
---
# How to use the model
~~~python
import torch
import transformers

model_id = "rinna/gemma-2-baku-2b"

# Load the model in bfloat16 with eager attention (see the note below on
# why eager attention is recommended for batched bfloat16 inference).
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16, "attn_implementation": "eager"},
    device_map="auto",
)

# Generate a continuation of the Japanese prompt "西田幾多郎は、" ("Kitaro Nishida is ...").
output = pipeline(
    "西田幾多郎は、",
    max_new_tokens=256,
    do_sample=True,
)
print(output[0]["generated_text"])
~~~
It is recommended to use eager attention when conducting batch inference under bfloat16 precision.
Currently, Gemma 2 yields NaN values for input sequences with padding when the default attention mechanism (torch.scaled_dot_product_attention) is employed in conjunction with bfloat16.
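
As a concrete illustration, the following sketch runs batched generation over padded bfloat16 inputs with eager attention. The prompts, generation settings, and left-padding choice are illustrative assumptions rather than part of the original card.

~~~python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rinna/gemma-2-baku-2b"

# Left padding is the usual choice for batched generation with decoder-only models.
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # avoids the NaNs that SDPA can produce with padded bfloat16 batches
    device_map="auto",
)

# Two prompts of different lengths, so the shorter one is padded.
prompts = ["西田幾多郎は、", "夏目漱石の代表作は、"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True)

for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
~~~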
---
# Tokenization
The model uses the original [google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b) tokenizer.
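
For reference, the tokenizer can be loaded from this repository in the usual way; since it is unchanged from the base model, Japanese text is segmented exactly as with google/gemma-2-2b. The example sentence below is illustrative.

~~~python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rinna/gemma-2-baku-2b")

# Tokenize a Japanese sentence and inspect the resulting pieces.
text = "西田幾多郎は、善の研究を著した。"
ids = tokenizer.encode(text)
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))
~~~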
---
# How to cite
```bibtex
@misc{rinna-gemma-2-baku-2b,
    title = {rinna/gemma-2-baku-2b},
    author = {Wakatsuki, Toshiaki and Chen, Xinqi and Sawada, Kei},
    url = {https://huggingface.co/rinna/gemma-2-baku-2b}
}

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}
```
---
# References
```bibtex
@article{gemma-2-2024,
    title = {Gemma 2},
    url = {https://www.kaggle.com/models/google/gemma-2},
    publisher = {Kaggle},
    author = {Gemma Team},
    year = {2024}
}

@misc{litgpt-2023,
    author = {Lightning AI},
    title = {LitGPT},
    howpublished = {\url{https://github.com/Lightning-AI/litgpt}},
    year = {2023}
}
```
---
# License
[Gemma Terms of Use](https://ai.google.dev/gemma/terms)