Gemma 2 Baku 2B Instruct (rinna/gemma-2-baku-2b-it)
Overview
The model is an instruction-tuned variant of rinna/gemma-2-baku-2b. It was built by first merging in a Chat Vector and then fine-tuning with Odds Ratio Preference Optimization (ORPO). It follows the Gemma 2 chat format.
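To show what the Gemma 2 chat format looks like, the hypothetical helper below renders a conversation by hand. It follows our understanding of the Gemma 2 chat template: the prompt starts with `<bos>`, each turn is wrapped in `<start_of_turn>`/`<end_of_turn>` markers, the `assistant` role is renamed to `model`, and a trailing `<start_of_turn>model\n` is appended as the generation prompt. This is only a sketch for understanding. In practice, call `tokenizer.apply_chat_template` so that the template shipped with the tokenizer is the one actually used.

```python
from typing import Dict, List


def render_gemma2_chat(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Hypothetical re-implementation of the Gemma 2 chat format (illustration only)."""
    parts = ["<bos>"]
    for message in messages:
        role = message["role"]
        if role == "system":
            # This sketch rejects system messages; we assume the official template does too.
            raise ValueError("System role not supported")
        if role == "assistant":
            # Gemma 2 names the assistant turn "model".
            role = "model"
        parts.append(f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so that generation continues as the assistant.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


print(render_gemma2_chat([{"role": "user", "content": "Hello!"}]))
```

Because the rendered prompt already begins with `<bos>`, the usage example further down encodes it with `add_special_tokens=False` so the BOS token is not added twice.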
| Size | Continual Pre-Training | Instruction-Tuning |
|---|---|---|
| 2B | Gemma 2 Baku 2B (rinna/gemma-2-baku-2b) | Gemma 2 Baku 2B Instruct (rinna/gemma-2-baku-2b-it) |
Model architecture
A 26-layer transformer-based language model with a hidden size of 2304. For further details on the architecture, please refer to the Gemma 2 Model Card.
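The sketch below counts the weights of a Gemma 2 style decoder to show how these dimensions add up to roughly 2B parameters. Only the layer count (26) and hidden size (2304) come from this card. The remaining values are our assumptions, based on the published Gemma 2 2B configuration, and should be checked against the model's `config.json`:

- a vocabulary of 256,000 tokens
- an MLP intermediate size of 9,216
- 8 query heads and 4 key/value heads, each of dimension 256
- four RMSNorm weight vectors per layer
- no bias terms
- an output head tied to the embedding matrix

```python
from typing import Dict


def gemma2_param_count(
    num_layers: int = 26,       # from this card
    hidden: int = 2304,         # from this card
    vocab: int = 256_000,       # assumption: tokenizer vocabulary size
    intermediate: int = 9_216,  # assumption: gated-MLP intermediate size
    num_heads: int = 8,         # assumption: query heads
    num_kv_heads: int = 4,      # assumption: key/value heads (grouped-query attention)
    head_dim: int = 256,        # assumption: per-head dimension
) -> Dict[str, int]:
    # Token embedding; the output head is assumed tied to it, so it is counted once.
    embedding = vocab * hidden
    # q_proj and o_proj map hidden <-> num_heads * head_dim; k_proj and v_proj map hidden -> num_kv_heads * head_dim.
    attention = 2 * hidden * num_heads * head_dim + 2 * hidden * num_kv_heads * head_dim
    # Gated MLP: gate, up, and down projections.
    mlp = 3 * hidden * intermediate
    # Assumption: four RMSNorm weight vectors per layer, no biases anywhere.
    norms = 4 * hidden
    per_layer = attention + mlp + norms
    # Add the final RMSNorm after the last layer.
    total = embedding + num_layers * per_layer + hidden
    return {"embedding": embedding, "per_layer": per_layer, "total": total}


counts = gemma2_param_count()
print(f"{counts['total']:,}")  # prints 2,614,341,888 under the assumptions above
```

Under these assumptions, the embedding matrix alone holds about 590M parameters, a sizable share of the total. This is also the part of the network that the chat vector merge leaves unchanged.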
Training
Model merging. Instruction-following capabilities were added to the base model by chat vector addition. The chat vector was obtained by subtracting the parameters of google/gemma-2-2b from those of google/gemma-2-2b-it. It was then added to rinna/gemma-2-baku-2b as follows:
rinna/gemma-2-baku-2b + 1.0 * (google/gemma-2-2b-it - google/gemma-2-2b)
The embedding layer was excluded from both the subtraction and the addition.
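The merge formula above can be sketched as a per-parameter operation over state dicts. This is a minimal illustration under stated assumptions, not rinna's merging script:

- The helper name `apply_chat_vector` is hypothetical.
- The three inputs are assumed to share parameter names and shapes. This is plausible here, since all three models use the Gemma 2 2B architecture and the same tokenizer.
- The embedding is identified by the substring `embed_tokens`, assuming Hugging Face Gemma 2 parameter names.
- `lm_head.weight` is also left untouched, on the assumption that it is tied to the embedding matrix.

```python
from typing import Callable, Dict, Mapping, TypeVar

# Any value supporting +, -, and multiplication by a float (e.g. torch.Tensor or float).
T = TypeVar("T")


def is_embedding(name: str) -> bool:
    # Assumption: Hugging Face Gemma 2 parameter names; lm_head is assumed tied to embed_tokens.
    return "embed_tokens" in name or name == "lm_head.weight"


def apply_chat_vector(
    base: Mapping[str, T],
    pretrained: Mapping[str, T],
    instruct: Mapping[str, T],
    scale: float = 1.0,
    skip: Callable[[str], bool] = is_embedding,
) -> Dict[str, T]:
    """Return base + scale * (instruct - pretrained) for every parameter not skipped."""
    merged: Dict[str, T] = {}
    for name, base_param in base.items():
        if skip(name):
            # Excluded parameters (the embedding layer) keep the base model's values.
            merged[name] = base_param
        else:
            # The chat vector for this parameter: instruct minus pretrained weights.
            chat_vector = instruct[name] - pretrained[name]
            merged[name] = base_param + scale * chat_vector
    return merged
```

With PyTorch models, one would presumably pass each model's `state_dict()` (all in a common dtype, computed under `torch.no_grad()`), with `scale=1.0` to match the formula above. The result would then be loaded back into the base model with `load_state_dict`.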
The merged model was then further refined with ORPO, using a subset of the following dataset:
- rinna's internal dataset
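The ORPO objective (Hong et al., 2024, listed under References) can be sketched per preference pair in plain Python for readers unfamiliar with it. The inputs are the token log-probabilities of a preferred (chosen) and a dispreferred (rejected) response. ORPO combines two terms:

- the usual mean negative log-likelihood on the chosen response, and
- a term that raises the log odds ratio between the chosen and rejected responses. Here odds(y|x) = P/(1 - P), and P is the exponentiated mean token log-probability.

The helper names and the weight `lam=0.1` are illustrative assumptions. This card does not describe rinna's actual hyperparameters or training code.

```python
import math
from typing import Sequence


def log_odds(mean_logp: float) -> float:
    """log(P / (1 - P)) for P = exp(mean_logp); requires mean_logp < 0."""
    # log1p(-exp(x)) computes log(1 - P) accurately when P is close to 0.
    return mean_logp - math.log1p(-math.exp(mean_logp))


def orpo_loss(chosen_logps: Sequence[float], rejected_logps: Sequence[float], lam: float = 0.1) -> float:
    """Sketch of the ORPO objective for one preference pair (per-token log-probs as input)."""
    mean_chosen = sum(chosen_logps) / len(chosen_logps)
    mean_rejected = sum(rejected_logps) / len(rejected_logps)
    # Supervised fine-tuning term: mean negative log-likelihood of the chosen response.
    sft = -mean_chosen
    # Odds-ratio term: -log(sigmoid(z)) == log(1 + exp(-z)), with z the log odds ratio.
    z = log_odds(mean_chosen) - log_odds(mean_rejected)
    odds_ratio_term = math.log1p(math.exp(-z))
    return sft + lam * odds_ratio_term
```

When the chosen response is more likely than the rejected one, z is positive and the odds-ratio term falls below log 2. When the rejected response is more likely, the term rises above log 2, penalizing the model.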
Contributors
Release date
October 3, 2024
Benchmarking
Please refer to rinna's LM benchmark page (Sheet 20241003).
How to use the model
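```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "rinna/gemma-2-baku-2b-it"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=dtype,
    # Eager attention avoids NaNs with padded bfloat16 batches (see the note below).
    attn_implementation="eager",
)

chat = [
    # "What kind of person was Kitaro Nishida?"
    { "role": "user", "content": "西田幾多郎とはどんな人物ですか?" },
]
# Render the conversation in the Gemma 2 chat format and open the model's turn.
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# The rendered prompt already starts with <bos>, so do not add special tokens again.
input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
)
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```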
We recommend eager attention (`attn_implementation="eager"`, as in the example above) for batch inference in bfloat16. At the time of writing, Gemma 2 produces NaN values for padded input sequences when the default attention implementation (`torch.scaled_dot_product_attention`) is used with bfloat16.
Tokenization
The model uses the original google/gemma-2-2b-it tokenizer.
How to cite
```bibtex
@misc{rinna-gemma-2-baku-2b-it,
    title = {rinna/gemma-2-baku-2b-it},
    author = {Chen, Xinqi and Wakatsuki, Toshiaki and Sawada, Kei},
    url = {https://huggingface.co/rinna/gemma-2-baku-2b-it}
}
@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}
```
References
```bibtex
@article{gemma-2-2024,
    title = {Gemma 2},
    url = {https://www.kaggle.com/models/google/gemma-2},
    publisher = {Kaggle},
    author = {Gemma Team},
    year = {2024}
}
@article{huang2023chat,
    title = {Chat Vector: A Simple Approach to Equip LLMs with Instruction Following and Model Alignment in New Languages},
    author = {Huang, Shih-Cheng and Li, Pin-Zu and Hsu, Yu-Chi and Chen, Kuang-Ming and Lin, Yu Tung and Hsiao, Shih-Kai and Tsai, Richard Tzong-Han and Lee, Hung-yi},
    year = {2023},
    url = {https://arxiv.org/abs/2310.04799}
}
@article{hong2024orpo,
    title = {ORPO: Monolithic Preference Optimization without Reference Model},
    author = {Hong, Jiwoo and Lee, Noah and Thorne, James},
    year = {2024},
    url = {https://arxiv.org/abs/2403.07691}
}
```
License