rinna/nekomata-7b
Overview
We conduct continual pre-training of qwen-7b on 30B tokens from a mixture of Japanese and English datasets. The continual pre-training significantly improves the model's performance on Japanese tasks. It also enjoys the following great features provided by the original Qwen model.
- The inclusive Qwen vocabulary (vocab size > 150k) enables the model to processs Japanese texts much more efficiently than the previously released youri series.
- The model supports a maximum sequence length of 32768.
The name nekomata
comes from the Japanese word 猫又/ねこまた/Nekomata
, which is a kind of Japanese mythical creature (妖怪/ようかい/Youkai
).
Library
The model was trained using code based on EleutherAI/gpt-neox.
Model architecture
A 32-layer, 4096-hidden-size transformer-based language model. Please refer to the Qwen paper for architecture details.
Continual pre-training
The model was initialized with the qwen-7b model and continually trained on around 30B tokens from a mixture of the following corpora
- Japanese CC-100
- Japanese C4
- Japanese OSCAR
- The Pile
- Wikipedia
- rinna curated Japanese dataset
Contributors
Release date
December 21, 2023
Benchmarking
Please refer to rinna's LM benchmark page (Sheet 20231221).
How to use the model
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("rinna/nekomata-7b", trust_remote_code=True)
# Use GPU with bf16
# model = AutoModelForCausalLM.from_pretrained("rinna/nekomata-7b", device_map="auto", trust_remote_code=True, bf16=True)
# Use GPU with fp16
# model = AutoModelForCausalLM.from_pretrained("rinna/nekomata-7b", device_map="auto", trust_remote_code=True, fp16=True)
# Use CPU
# model = AutoModelForCausalLM.from_pretrained("rinna/nekomata-7b", device_map="cpu", trust_remote_code=True)
# Automatically select device and precision
model = AutoModelForCausalLM.from_pretrained("rinna/nekomata-7b", device_map="auto", trust_remote_code=True)
text = "西田幾多郎は、"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")
with torch.no_grad():
output_ids = model.generate(
token_ids.to(model.device),
max_new_tokens=200,
min_new_tokens=200,
do_sample=True,
temperature=1.0,
top_p=0.95,
pad_token_id=tokenizer.pad_token_id,
bos_token_id=tokenizer.bos_token_id,
eos_token_id=tokenizer.eos_token_id
)
output = tokenizer.decode(output_ids.tolist()[0])
print(output)
Tokenization
The model uses the original Qwen tokenizer. It augments the cl100k
tiktoken tokenizer and has a vocabulary size of 151,936. The inclusive vocabulary helps the model to reach a better tokenization efficiency, especially for Japanese texts.
We compared the Qwen
tokenizer (as used in nekomata
) and the llama-2
tokenizer (as used in youri
) on different text collections and found that the Qwen tokenizer achieves a much better byte2token rate (i.e. the average number of tokens produced from 1 byte of text) as following. A lower byte2token rate indicates a better tokenization efficiency.
Tokenizer | Japanese | English | Multilingual |
---|---|---|---|
Qwen | 0.24 | 0.27 | 0.27 |
llama-2 | 0.40 | 0.29 | 0.36 |
How to cite
@misc{rinna-nekomata-7b,
title = {rinna/nekomata-7b},
author = {Zhao, Tianyu and Kaga, Akio and Sawada, Kei},
url = {https://huggingface.co/rinna/nekomata-7b}
}
@inproceedings{sawada2024release,
title = {Release of Pre-Trained Models for the {J}apanese Language},
author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
month = {5},
year = {2024},
pages = {13898--13905},
url = {https://aclanthology.org/2024.lrec-main.1213},
note = {\url{https://arxiv.org/abs/2404.01657}}
}
References
@software{gpt-neox-library,
title = {{GPT}-{N}eo{X}: Large Scale Autoregressive Language Modeling in {P}y{T}orch},
author = {Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Purohit, Shivanshu and Songz, Tri and Phil, Wang and Weinbach, Samuel},
doi = {10.5281/zenodo.5879544},
month = {8},
year = {2021},
version = {0.0.1},
url = {https://www.github.com/eleutherai/gpt-neox}
}
License
- Downloads last month
- 576