Moriyasu_Qwen2_JP_7B

Model Description

Moriyasu_Qwen2_JP_7B is a large language model trained by Moriyasu. Based on Qwen/Qwen2-7B, it has been enhanced for Japanese usage through additional pre-training and instruction tuning.

Model Performance

JGLUE tasks

We used the lm-evaluation-harness repo to evaluate across 8 tasks, and the results are as follows:

Model	JCommonsenseQA	JNLI	JMARC	JSQuAD	JAQKET-V2	XL-SUM	XWINOGRAD	MGSM	JA AVG (8 tasks)
	3-shot	3-shot	0-shot	2-shot	1-shot	1-shot	0-shot	5-shot
	Acc.	Balanced Acc.	Balanced Acc.	Char-F1	Char-F1	ROUGE-2	Acc.	Acc.
Moriyasu_Qwen2_JP_7B (ours)	0.9491	0.9111	0.9550	0.8748	0.8924	0.1966	0.8238	0.5560	0.7699
Qwen2-7B-Instruct	0.9080	0.7807	0.9329	0.9290	0.8334	0.1905	0.7216	0.6120	0.7385
SakanaAI/EvoLLM-JP-v1-7B	0.8919	0.6602	0.9555	0.9210	0.8641	0.2331	0.8165	0.4760	0.7273
Llama-3-ELYZA-JP-8B	0.9240	0.6485	0.9567	0.9204	0.8743	0.2135	0.7821	0.4920	0.7264
Llama-3-Swallow-8B-Instruct-v0.1	0.9249	0.6212	0.9427	0.9373	0.9083	0.1961	0.7404	0.5000	0.7214
Tanuki-8B-dpo-v1.0	0.7918	0.4305	0.9226	0.8229	0.7799	0.1168	0.7039	0.4360	0.6256
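
The JA AVG column appears to be the unweighted mean of the eight task scores. The sketch below recomputes it for two rows of the table above; the `scores` dictionary and the `ja_avg` helper are illustrative names and are not part of lm-evaluation-harness.

```python
from statistics import mean

# Per-task scores copied from the table above, in column order:
# JCommonsenseQA, JNLI, JMARC, JSQuAD, JAQKET-V2, XL-SUM, XWINOGRAD, MGSM
scores = {
    "Moriyasu_Qwen2_JP_7B": [0.9491, 0.9111, 0.9550, 0.8748, 0.8924, 0.1966, 0.8238, 0.5560],
    "Qwen2-7B-Instruct": [0.9080, 0.7807, 0.9329, 0.9290, 0.8334, 0.1905, 0.7216, 0.6120],
}


def ja_avg(task_scores):
    """Unweighted mean over tasks (assumed to be how JA AVG is computed)."""
    return mean(task_scores)


for name, task_scores in scores.items():
    print(f"{name}: {ja_avg(task_scores):.4f}")
```

Both recomputed means agree with the reported JA AVG values (0.7699 and 0.7385) to within rounding of the fourth decimal place.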

Japanese tasks

For this evaluation, we used the swallow-evaluation repo to evaluate our model. Results for the other models are taken from the Llama-3.1-Swallow-8B-Instruct-v0.2 report.

Model	JCom.	JEMHopQA	NIILC	JSQuAD	XL-Sum	MGSM	WMT20-en-ja	WMT20-ja-en	JMMLU	JHumanEval	Ja Avg
	4-shot	4-shot	4-shot	4-shot	1-shot	4-shot	4-shot	4-shot	5-shot	0-shot
	EM acc	Char-F1	Char-F1	Char-F1	ROUGE-2	EM acc	BLEU	BLEU	EM acc	pass@1
Moriyasu_Qwen2_JP_7B (ours)	0.9321	0.4823	0.6046	0.9201	0.1382	0.5560	0.2636	0.1892	0.5273	0.2976	0.4911
RakutenAI-7B-chat	0.9035	0.2600	0.4619	0.8647	0.1339	0.2120	0.2667	0.1966	0.4504	0.2299	0.3980
Qwen2-7B-Instruct	0.8856	0.3902	0.3859	0.8967	0.1277	0.5720	0.2041	0.1909	0.5713	0.5683	0.4793
Qwen2.5-7B-Instruct	0.9151	0.4293	0.3910	0.8908	0.1676	0.6240	0.2108	0.1916	0.6252	0.5305	0.4976
Tanuki-8B-dpo-v1.0	0.2770	0.2937	0.3710	0.6669	0.1016	0.4280	0.2385	0.1820	0.3078	0.2555	0.3122
Llama 3 8B Instruct	0.8785	0.3812	0.3936	0.8955	0.1273	0.4160	0.2143	0.2035	0.4719	0.2872	0.4269
Llama 3.1 8B Instruct	0.8829	0.4272	0.4112	0.8856	0.1481	0.5280	0.2174	0.1990	0.5086	0.4976	0.4706
Llama 3 Youko 8B Instruct	0.9196	0.4850	0.5178	0.9001	0.2085	0.4680	0.2559	0.1906	0.4691	0.2695	0.4684
Llama-3-ELYZA-JP-8B	0.9017	0.5124	0.5016	0.9113	0.1677	0.4600	0.2509	0.1846	0.4829	0.3811	0.4754
Llama 3 heron brain 8B v0.3	0.9231	0.4933	0.5694	0.9056	0.2178	0.4560	0.2771	0.2168	0.4993	0.3177	0.4876
Llama 3 Swallow 8B Instruct	0.9178	0.4963	0.5168	0.9088	0.1296	0.4880	0.2522	0.2254	0.4835	0.3927	0.4811
Llama 3.1 Swallow 8B Instruct v0.1	0.9240	0.5874	0.5736	0.9170	0.1380	0.5080	0.2820	0.2282	0.5301	0.3665	0.5055
Llama 3.1 Swallow 8B Instruct v0.2	0.9294	0.5601	0.5988	0.9148	0.1372	0.5280	0.2878	0.2270	0.5504	0.4079	0.5141

Japanese MTBench

For this evaluation, we used FastChat, with gpt-4o-2024-08-06 as the judge and as the source of reference answers.

Due to limited computational resources, we conducted evaluations on only a select number of models.

Model	coding	extraction	humanities	math	reasoning	roleplay	stem	writing	JMTAvg
Moriyasu_Qwen2_JP_7B (ours)	0.515	0.710	0.845	0.685	0.585	0.815	0.710	0.765	0.704
Llama-3-ELYZA-JP-8B	0.365	0.720	0.730	0.400	0.555	0.670	0.580	0.785	0.601
Llama 3.1 Swallow 8B Instruct v0.1	0.480	0.680	0.705	0.475	0.425	0.710	0.620	0.645	0.592

Elyza task 100

For this benchmark, we used the Elyza task 100 dataset and Elyza's GPT-4o scoring prompt. The prompt is taken from this blog.

Model	Score
Moriyasu_Qwen2_JP_7B (ours)	3.37
Llama-3-ELYZA-JP-8B	3.66
Llama 3.1 Swallow 8B Instruct v0.1	3.32

Nejumi leaderboard 3

We will contact Nejumi soon to have the model evaluated on this benchmark.

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer from the Hugging Face Hub.
path = "AIJapanese/Moriyasu_Qwen2_JP_7B"
model = AutoModelForCausalLM.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    use_cache=True,
)
tokenizer = AutoTokenizer.from_pretrained(path)

# System prompt: "You are a sincere and excellent Japanese assistant.
# Always try to provide the most helpful answer possible."
system_prompt = "あなたは誠実で優秀な日本人アシスタントです。常に可能な限り最も役立つ回答を提供するように努めてください。"
# User prompt: "What is the highest mountain in Japan?"
prompt = "日本で一番高い山は何ですか "
conversation = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt},
]
# Render the conversation with the model's chat template.
text = tokenizer.apply_chat_template(
    conversation,
    tokenize=False,
    add_generation_prompt=True,
)

model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,  # passes both input_ids and attention_mask
    max_new_tokens=2048,
    do_sample=True,  # sampling must be enabled for temperature to take effect
    temperature=0.2,
    # top_p=0.95,
    # top_k=40,
)
# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
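
For multi-turn use, earlier turns can be kept in the message list that is passed to the chat template. The following is a minimal sketch, assuming the chat template accepts the usual system, user, and assistant roles; `build_conversation` is a hypothetical helper, and the assistant reply in the example history is a placeholder rather than real model output.

```python
def build_conversation(system_prompt, turns, next_user_message):
    """Build a chat-template message list from prior (user, assistant) turns."""
    conversation = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in turns:
        conversation.append({"role": "user", "content": user_text})
        conversation.append({"role": "assistant", "content": assistant_text})
    # The new user message always comes last, so the model answers it next.
    conversation.append({"role": "user", "content": next_user_message})
    return conversation


# Example history; the assistant reply is a placeholder, not real model output.
history = [("日本で一番高い山は何ですか", "富士山です。")]
conversation = build_conversation(
    "あなたは誠実で優秀な日本人アシスタントです。",
    history,
    "その標高を教えてください",  # "Please tell me its elevation."
)
for message in conversation:
    print(message["role"], ":", message["content"])
```

The resulting list can be passed to `tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)` in the same way as the single-turn conversation above.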

Training Datasets

Pre-training dataset

The model is continually pre-trained from Qwen2-7B on a mix of 80% Japanese and 20% English data, so that it gains Japanese ability while maintaining the base model's English ability. We use about 120 billion tokens sampled from Japanese and English Wikipedia articles, Japanese CC-100, Japanese C4, Japanese OSCAR, The Pile, Webfined, Japanese websites, book data, mathematics, and code, among others.
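
As a rough illustration of the data mix, the sketch below converts the stated percentages into approximate token counts. It assumes the 80/20 split is measured in tokens, which the description does not state explicitly; the constant and function names are illustrative.

```python
# Approximate figures from the description; the split is assumed to be by token count.
TOTAL_TOKENS_B = 120  # about 120 billion tokens in total
JAPANESE_PCT = 80
ENGLISH_PCT = 20


def token_budget(total_b, pct):
    """Return the share of total_b (in billions of tokens) for a given percentage."""
    return total_b * pct / 100


japanese_b = token_budget(TOTAL_TOKENS_B, JAPANESE_PCT)
english_b = token_budget(TOTAL_TOKENS_B, ENGLISH_PCT)
print(f"Japanese: ~{japanese_b:.0f}B tokens, English: ~{english_b:.0f}B tokens")
# prints "Japanese: ~96B tokens, English: ~24B tokens"
```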

Instruction Tuning

We built about 1 million instruction examples using several methods, including synthetic generation, translation, and manual annotation by humans.

Contact

If you have any questions, please contact me at: hajimemoriyasu3@gmail.com
