Shisa 7B

Shisa 7B (shisa-7b-v1) is a bilingual Japanese and English (JA/EN) general-purpose chat model that aims to achieve strong Japanese language performance while retaining robust English capabilities, using a synthetic-data driven approach.

This model is based on Mistral 7B with a custom JA-optimized extended tokenizer that is >2X more efficient in Japanese than Mistral's original tokenizer. The base model was pre-trained for an additional 8B primarily Japanese tokens. It was then fine-tuned with an expanded, machine-translated version of airoboros-3.1, a set of the highest-scoring items from ultrafeedback_binarized, and additional fresh airoboros data generated directly in the target languages.
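One common way to quantify a claim like ">2X more efficient" is to compare how many characters each tokenizer covers per token on the same text. The sketch below is only an illustration of that measurement: the helper name `chars_per_token` and the two toy encoders are hypothetical stand-ins, not the tokenizers or the methodology actually used for this model.

```python
from typing import Callable, Iterable, List


def chars_per_token(encode: Callable[[str], List], texts: Iterable[str]) -> float:
    """Average number of characters covered by one token over a corpus."""
    total_chars = 0
    total_tokens = 0
    for text in texts:
        total_chars += len(text)
        total_tokens += len(encode(text))
    return total_chars / total_tokens


# Toy stand-ins for real tokenizers (hypothetical, for illustration only).
def byte_level_encode(text: str) -> List[int]:
    # One token per UTF-8 byte: kana and kanji cost 3 tokens each.
    return list(text.encode("utf-8"))


def char_level_encode(text: str) -> List[str]:
    # One token per character.
    return list(text)


sample_texts = ["こんにちは", "ありがとう"]
baseline = chars_per_token(byte_level_encode, sample_texts)
extended = chars_per_token(char_level_encode, sample_texts)
print(f"efficiency gain: {extended / baseline:.2f}x")
```

To compare real tokenizers, one could pass something like `lambda t: tok.encode(t, add_special_tokens=False)` for each Hugging Face tokenizer and run both over a representative Japanese corpus.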

We also release our base model, datasets, and pipeline code under a permissive Apache 2.0 license which can be used for any purpose, commercial or otherwise:

shisa-base-7b-v1 - our base model w/ an extended tokenizer and additional JA pre-training
shisa-pretrain-en-ja-v1 - our pre-training data set
ultra-orca-boros-en-ja - a synthetically generated, machine-translated, programmatically validated JA/EN fine-tuning dataset
shisa-en-ja-dpo-v1 - Small subset of DPO pairs from ultrafeedback, along with JA DPO pairs using GPT-4 generated items as the chosen value, and outputs from our preliminary 7b model as the rejected values
Shisa repository - this includes our translation, dataset generation, training, and evaluation code

Moreover, we are in the process of publishing extended writeups and more details of our process, including ablation results, testing methodology, and key findings on our project wiki that may be of interest to fellow researchers.

Fine-Tuning

Our original intuition was to see if we could create a stronger Japanese model by incorporating the best existing public JA training sets. After initial review and testing, however, we decided that focusing solely on translation/generation of our own synthetic datasets could yield superior results with less training.

We compared multiple translation tools and, via manual review, judged that while gpt-4 almost always delivered the highest quality translations, Google's text-bison-32k was a good balance of quality, cost and throughput. Over various iterations, we refined our translation approach to include some additional algorithms for flagging and filtering invalid translations, re-translating and backfilling as necessary.
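The exact flagging and filtering algorithms are not described here, so the following is only a hypothetical sketch of the kind of programmatic check such a pipeline might run on an EN-to-JA translation. The function names, the script-ratio heuristic, and all thresholds are assumptions made for illustration.

```python
def japanese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are kana or common kanji."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    ja = sum(
        1
        for c in chars
        if "\u3040" <= c <= "\u30ff" or "\u4e00" <= c <= "\u9fff"
    )
    return ja / len(chars)


def flag_translation(
    source: str,
    translation: str,
    min_ja_ratio: float = 0.5,   # assumed threshold
    min_len_ratio: float = 0.2,  # assumed threshold
    max_len_ratio: float = 3.0,  # assumed threshold
) -> list:
    """Return a list of reasons the translation looks invalid (empty if OK)."""
    if not translation.strip():
        return ["empty"]
    reasons = []
    if japanese_ratio(translation) < min_ja_ratio:
        reasons.append("too_little_japanese")
    length_ratio = len(translation) / max(len(source), 1)
    if not (min_len_ratio <= length_ratio <= max_len_ratio):
        reasons.append("suspicious_length")
    return reasons


print(flag_translation("Hello, how are you?", "こんにちは、お元気ですか？"))
print(flag_translation("Hello, how are you?", "Hello, how are you?"))
```

Items that come back with a non-empty list of reasons would be candidates for re-translation or backfilling.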

We also took this project as an opportunity to apply some newer techniques such as incorporating NEFTune and DPO training.

For our v1 release, we picked from our release candidates based on a significant amount of human preference testing (thousands of generations and multiple rounds of pairwise comparisons). We analyzed our results with both win/loss/draw and BTL modeling (iLSR, using choix).
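To give a feel for what a Bradley-Terry-style analysis of pairwise preferences does, here is a minimal, dependency-free sketch. Note that it uses the classic minorization-maximization (MM) update rather than the iLSR algorithm from choix, it ignores draws, and the comparison counts are made up purely for illustration.

```python
from collections import defaultdict


def fit_bradley_terry(n_items, comparisons, iters=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs via MM updates."""
    wins = [0] * n_items
    pair_counts = defaultdict(int)  # number of comparisons per unordered pair
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[(min(winner, loser), max(winner, loser))] += 1

    strengths = [1.0] * n_items
    for _ in range(iters):
        updated = []
        for i in range(n_items):
            denom = sum(
                count / (strengths[a] + strengths[b])
                for (a, b), count in pair_counts.items()
                if i in (a, b)
            )
            updated.append(wins[i] / denom if denom > 0 else strengths[i])
        # Normalize so the strengths sum to n_items (they are scale-free).
        total = sum(updated)
        strengths = [s * n_items / total for s in updated]
    return strengths


# Hypothetical results for three candidate models (indices 0, 1, 2).
comparisons = (
    [(0, 1)] * 7 + [(1, 0)] * 3
    + [(1, 2)] * 6 + [(2, 1)] * 4
    + [(0, 2)] * 8 + [(2, 0)] * 2
)
strengths = fit_bradley_terry(3, comparisons)
print([round(s, 3) for s in strengths])
```

A higher strength means the model is preferred more often; the implied probability that model i beats model j is `strengths[i] / (strengths[i] + strengths[j])`.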

The best candidate model was fine-tuned in a 3-step process:

First, the model was fine-tuned on ultra-orca-boros-en-ja and SlimOrca (WandB Log)
Next, we performed one additional epoch using only a subset of Japanese ultra-orca-boros-en-ja items to enhance JA performance (as SlimOrca from the first step is mostly EN) (WandB Log)
Finally, the model was tuned using a DPOTrainer on a small subset of ultrafeedback (EN) and our own JA DPO dataset which uses gpt-4 outputs as the chosen values and outputs from stage 1's prelim model as rejected values. (WandB Log)
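The third step can be made more concrete with a small sketch of how a preference pair is laid out and what the standard DPO objective computes for it. This is not the project's training code: the `prompt`/`chosen`/`rejected` field names follow the common preference-dataset convention and are an assumption here, as are the `beta` value and the example log-probabilities.

```python
import math


def make_dpo_pair(prompt: str, strong_output: str, prelim_output: str) -> dict:
    """Pair a stronger model's answer (chosen) with a weaker model's answer (rejected)."""
    return {"prompt": prompt, "chosen": strong_output, "rejected": prelim_output}


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, for illustration
) -> float:
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    # How much more the policy prefers "chosen" over "rejected" than the
    # frozen reference model does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # -log(sigmoid(beta * margin)) written in a numerically friendlier form.
    return math.log1p(math.exp(-beta * margin))


pair = make_dpo_pair("日本の首都は？", "東京です。", "Tokyo is capital.")
print(pair)
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # policy == reference
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))   # policy favors "chosen" more
```

The loss falls as the policy shifts probability toward the chosen response relative to the reference model, which is what pushes the tuned model toward the stronger outputs.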

During our training process, we also gained some key insights on why some existing Japanese models seem to underperform even versus models that have no additional JA training, and we hope that sharing this analysis will be useful to other teams developing Japanese language models.

While we need to explore this further, as an experimental validation, we applied a version of our fine-tuning set onto an existing base model ("Gamma 7B") and the initial JA MT-Bench results suggest that we can drastically increase functional performance with our tuning approach:

Model	Score
shisa-gamma-7b-allsources-v0.4	5.65
ja-stablelm-instruct-gamma-7b*	4.01

Performance

Throughout our training, we did extensive human evaluation for each model to cross-validate our model performance, and we are currently conducting ongoing larger scale manual head-to-head testing between models. Our intention is to open up and scale this data collection as we further develop our tools. For more information and updates, please see our project wiki.

While we believe llm-jp-eval is a useful metric for our base model, and it was extremely useful during our tuning process for initial validations, as our fine-tune training includes a percentage of the benchmark train splits, we provide these llm-jp-eval results primarily as a point of interest:

AVR	MC	NLI	QA	RC
0.7480	0.8900	0.8040	0.4153	0.8825

(We run a slightly modified llm-jp-eval to support testing of Qwen and to emit a bos_token if available)
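As a quick sanity check on the table above, AVR appears to be the unweighted mean of the four category scores. That reading is an inference from the numbers rather than something stated here, and the arithmetic can be reproduced as follows:

```python
# Category scores copied from the llm-jp-eval table above.
scores = {"MC": 0.8900, "NLI": 0.8040, "QA": 0.4153, "RC": 0.8825}

# Unweighted mean across categories (assumed definition of AVR).
average = sum(scores.values()) / len(scores)
print(f"mean of categories = {average:.5f}")
```

The mean works out to 0.74795, which is consistent with the reported 0.7480 after rounding to four decimal places.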

For our final model, since it's customary to include benchmarks, we've used Stability AI Japan's Japanese MT-Bench as a more representative test of our model's capabilities. For our JA MT-Bench testing we use a Japanese prompt ("あなたは役立つアシスタントです。") as well as --num-choices 4 in an effort to reduce sampling variability. However, we've still observed regular swings of 0.5+ points (and sometimes even greater) between generations, as well as issues with default prompts and parameters when testing. So again, we'd urge caution against over-interpreting these scores and suggest treating them as a probabilistic directional indicator rather than a definitive score or ranking:

Benchmark	Score
JA MT-Bench	5.23
MT-Bench	5.71

There is an MT-Bench Leaderboard, but as JA MT-Bench is still under development, for convenience, here is a comparison of the JA MT-Bench scores of some other models (our scores were rated by gpt-4-0613):

Model	Score
gpt-4-0613	9.40
gpt-4-1106-preview	9.17
gpt-3.5-turbo*	8.41
Qwen-14B-Chat	7.47
shisa-7b-v1	5.23
ELYZA-japanese-Llama-2-7b-fast-instruct*	4.86
ja-stablelm-instruct-gamma-7b*	4.01
japanese-stablelm-instruct-alpha-7b*	2.74
Mistral-7B-OpenOrca-ja*	2.23
youri-7b-chat*	2.00
Mistral-7B-Instruct-v0.1*	1.78
llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0*	1.31

(JA MT-Bench results marked with * in this section are sourced from shi3z)

Limitations

Although our model demonstrates a reasonably high level of Japanese fluency, as a 7B parameter model, it is prone to higher hallucination rates and less effective instruction following and reasoning than larger-class models. Also, it still does not have complete mastery of the Japanese language and a native speaker will spot occasional mistakes like some non-idiomatic/awkward phrasing, improper tenses/speech levels, etc.

We've also noticed a small amount of language leakage, likely largely attributable to our tokenizer expansion. This may be fixable with sampler settings (like Min P) or additional targeted training, and we plan on doing additional work on automated detection/sampler sweeps in the future. One interesting observation from our data collection is that, as we iterated, the DPO process significantly exacerbated this issue, but our DPO models still had significantly higher human preference rates, so there was a bit of a trade-off in our choice of final tune.
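For readers unfamiliar with Min P, the idea is to drop any token whose probability falls below a fraction of the top token's probability, then renormalize. Low-probability stray tokens (for instance, from the wrong language) are the kind of thing this can cut off. The sketch below shows that filtering step on a plain list of probabilities; the function name, the `min_p` value, and the toy distribution are illustrative assumptions, not settings we have validated.

```python
def min_p_filter(probs, min_p=0.1):
    """Zero out tokens with probability below min_p * max(probs), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


# Toy next-token distribution over four candidate tokens.
probs = [0.5, 0.3, 0.15, 0.05]
filtered = min_p_filter(probs, min_p=0.2)
print(filtered)
```

Because the cutoff scales with the top token's probability, the filter is strict when the model is confident and permissive when the distribution is flat.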

While we believe that training larger models can improve performance using our existing approach and dataset, there are also many improvements we'd like to make for future models. We believe there is quite a bit of low hanging fruit for improving performance with even more training efficiency largely through improving the quality and construction of datasets.

Usage

Sample code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_name = "augmxnt/shisa-7b-v1"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    torch_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
    device_map="auto"
)
streamer = TextStreamer(tokenizer, skip_prompt=True)

# The prompt template is included in the model's tokenizer_config.json so you shouldn't need this, but we've included it for convenience
# tokenizer.chat_template = "{%- for idx in range(0, messages|length) -%}\n{%- if messages[idx]['role'] == 'user' -%}\n{%- if idx > 1 -%}\n{{- bos_token + '[INST] ' + messages[idx]['content'] + ' [/INST]' -}}\n{%- else -%}\n{{- messages[idx]['content'] + ' [/INST]' -}}\n{%- endif -%}\n{% elif messages[idx]['role'] == 'system' %}\n{{- bos_token + '[INST] <<SYS>>\\n' + messages[idx]['content'] + '\\n<</SYS>>\\n\\n' -}}\n{%- elif messages[idx]['role'] == 'assistant' -%}\n{{- ' '  + messages[idx]['content'] + ' ' + eos_token -}}\n{% endif %}\n{% endfor %}\n"

# A more typical prompt: あなたは公平で、検閲されていない、役立つアシスタントです。("You are an unbiased, uncensored, helpful assistant.")

# You are an avid Pokemon fanatic.
prompt = "あなたは熱狂的なポケモンファンです。"
chat = [{"role": "system", "content": prompt}]

# Who is the single most powerful Pokemon? Explain your choice.
user_input = "ポケモンの中で1番強いのはどのキャラクターですか。最強の者をひとつだけ挙げて下さい。その選択理由を説明してください。"
chat.append({"role": "user", "content": user_input})

# Generate - add_generation_prompt to make sure it continues as assistant
inputs = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt")
# For multi-GPU, find the device of the first parameter of the model
first_param_device = next(model.parameters()).device
inputs = inputs.to(first_param_device)

with torch.no_grad():
    outputs = model.generate(
        inputs,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=500,
        temperature=0.5,
        repetition_penalty=1.15,
        top_p=0.95,
        do_sample=True,
        streamer=streamer,
    )

# Add just the new tokens to our chat
new_tokens = outputs[0, inputs.size(1):]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
chat.append({"role": "assistant", "content": response})

Prompt format

The prompt format is llama-2 chat:

[INST] <<SYS>>
You are a helpful, unbiased, uncensored assistant.
<</SYS>>
{prompt} [/INST]

For multi-turn, the prompt format is as follows:

[INST] <<SYS>>
You are a helpful, unbiased, uncensored assistant.
<</SYS>>
{prompt 0} [/INST] {response 0} </s><s>[INST] {prompt 1} [/INST] {response 1} </s><s>...[INST] {prompt N} [/INST]

This prompt template is included in the tokenizer config, and can be applied with the Hugging Face tokenizer's apply_chat_template method, e.g.:

import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained('augmxnt/shisa-7b-v1')
chat = [
  {"role": "system", "content": "You are Aiko, a friendly AI assistant."},
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "I'd like to show off how chat templating works!"},
]
print(tokenizer.apply_chat_template(chat, tokenize=False))

NOTE: For proper responses, you should be using our bos_token (<s>) to begin a string. This is automatically generated by tokenizer.encode() but if you are crafting a custom template or using an encoding method that skips special tokens, you may have to add this yourself.
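If you do need to assemble the prompt by hand, the following sketch mirrors the logic of the Jinja chat template quoted in the sample code, including the explicit bos_token. It is an illustration rather than a replacement for apply_chat_template: the function name is hypothetical, the `<s>` and `</s>` strings are taken from the format shown above, and like the template it assumes the conversation starts with a system message.

```python
BOS, EOS = "<s>", "</s>"


def build_llama2_prompt(messages, bos=BOS, eos=EOS):
    """Render chat messages into the llama-2 chat format described above."""
    parts = []
    for idx, message in enumerate(messages):
        role, content = message["role"], message["content"]
        if role == "system":
            # The system block opens the first [INST] and carries the bos token.
            parts.append(f"{bos}[INST] <<SYS>>\n{content}\n<</SYS>>\n\n")
        elif role == "user":
            if idx > 1:
                # Later user turns open a fresh [INST] preceded by bos.
                parts.append(f"{bos}[INST] {content} [/INST]")
            else:
                # The first user turn continues the [INST] opened by the system block.
                parts.append(f"{content} [/INST]")
        elif role == "assistant":
            parts.append(f" {content} {eos}")
    return "".join(parts)


chat = [
    {"role": "system", "content": "You are Aiko, a friendly AI assistant."},
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]
print(build_llama2_prompt(chat))
```

Note that the template emits a blank line after `<</SYS>>`, so the sketch does too.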

Acknowledgements

Team: Leonard Lin, Jon Durbin, Mariko Sato, and Florian von Bock

Compute for this model was generously sponsored by AKA Virtual (Tokyo, Japan).

Thanks to the LLM-jp, Stability AI Japan, and LMSYS teams for their work on llm-jp-eval, Japanese MT-Bench, and MT-Bench.

Also, thanks to all the volunteers that provided invaluable human preference testing!

We are actively looking for additional compute as we train better and larger models for this project. Please drop us a line at: compute at augmxnt dot com

(GPT-4によって非常に軽微な編集を加えて翻訳されました）

シーサー7B

シーサー7B（shisa-7b-v1）は、合成データ駆動のアプローチを用いて、優れた日本語と英語能力を両立することを目指すバイリンガル（日本語/英語）汎用チャットモデルです。

このモデルは、Mistral 7Bを基に、Mistralのオリジナルのトークナイザーよりも日本語において2倍以上効率的な、日本語最適化拡張トークナイザーをカスタムして作成されました。ベースモデルは、主に日本語のトークンを追加で80億ものトレーニングを行いました。そして、その後、airoboros-3.1の拡張された機械翻訳版、ultrafeedback_binarizedからの最高得点項目のセット、そして新たに生成されたairoborosのデータを直接目標言語で微調整しています。

商用を含むあらゆる目的で使用可能な寛容なApache 2.0ライセンスの下で、ベースモデル、データセット、およびパイプラインコードも公開しています：

shisa-base-7b-v1 - 拡張トークナイザーと追加の日本語プレトレーニングを備えた当方のベースモデル
shisa-pretrain-en-ja-v1 - 当方のプレトレーニングデータセット
ultra-orca-boros-en-ja - 合成生成、機械翻訳、プログラムによる検証によるJA/EN微調整データセット
shisa-en-ja-dpo-v1 - ultrafeedbackからのDPOペアの小さなサブセットと、選択された値としてGPT-4生成項目を使用した日本語のDPOペア、そして初期の7ビリオンモデルの出力を却下した値
シーサーリポジトリ - 翻訳、データセットの生成、トレーニング、評価コードなどが含まれています

さらに、アブレーション結果、テスト方法論、主要な調査結果など、プロセスの詳細や拡張ライトアップを公開する過程にあります。これは当プロジェクトwikiで研究者に興味深い情報として提供されています。

微調整

最初の直感は、最良の既存の公開日本語トレーニングセットを使用して、それらを組み入れることでより強力な日本語モデルを作成できるかどうかを見ることでした。しかし、初期の検討とテストの後、自らの合成データセットの翻訳/生成にだけ焦点を当てることで、短期間のトレーニングで優れた結果を得ることができると結論付けました。

私たちは複数の翻訳ツールを比較し、手動でレビューを行った結果、gpt-4がほぼ常に最高品質の翻訳を提供しながら、Googleの text-bison-32kは品質、コスト、スループットのバランスが良いと判断しました。複数の繰り返しを経て、無効な翻訳のフラグ付けとフィルタリング、必要に応じた再翻訳とバックフィルのための追加のアルゴリズムを含むように、翻訳アプローチを洗練させました。

また、このプロジェクトをNEFTuneとDPOトレーニングを取り入れるなど、新しい技術を適用する機会ともなりました。

v1リリースのために、私たちは大量の人間の嗜好テスト（数千の生成と複数ラウンドのペアワイズ比較）に基づいてリリース候補から選択しました。私たちは、勝ち/負け/引き分けと、BTLモデル（iLSR）を使用してchoixで結果を分析しました。

最良の候補モデルは、3ステップのプロセスで微調整されました：

最初に、モデルはultra-orca-boros-en-jaとSlimOrca (WandB Log)で微調整されました。
次に、日本語のパフォーマンスを向上させるためにultra-orca-boros-en-jaの一部を使用して1回追加のエポックを追加しました（最初の段階のSlimOrcaは主に英語）(WandB Log)。
最後に、モデルは小規模のultrafeedback（英語）と自身のJA DPOデータセットに対してDPOTrainerを使用して調整されました。ここで使用したJA DPOデータセットはgpt-4の出力を選出された値とし、ステージ1の予備モデルの出力を却下した値とします。(WandB Log)

私たちのトレーニングプロセス中に、何故一部の既存の日本語モデルが、追加の日本語トレーニングがないモデルに対してもパフォーマンスが低いのか、といういくつかの重要な洞察を得ることができました。この分析結果を共有すれば、他のチームが日本語モデルを開発する際の参考になると思います。

さらに探求する必要はありますが、実験的な検証として、微調整セットのバージョンを既存のベースモデル（"Gamma 7B"）に適用し、初期のJA MT-Bench結果が示すように、私たちのチューニングアプローチで機能性のパフォーマンスを劇的に向上させることができました：

モデル	スコア
shisa-gamma-7b-allsources-v0.4	5.65
ja-stablelm-instruct-gamma-7b*	4.01

パフォーマンス

トレーニング全体を通じて、各モデルについて人間による評価を行い、モデルのパフォーマンスを相互に検証しました。現在、モデル間の手動での比較テストを大規模に行っています。私たちの目指すところは、ツールをさらに発展させることでこのデータ収集を公開して拡張することです。詳細と更新情報については、プロジェクトwiki をご覧ください。

我々は、llm-jp-evalは、私たちの基本モデルの有用な指標であり、初期の検証のための微調整プロセス中に非常に役立つと考えていますが、微調整トレーニングにはベンチマークのトレイン分割の一部が含まれているため、私たちが提供するllm-jp-evalの結果は主に興味深いポイントとして提供しています：

AVR	MC	NLI	QA	RC
0.7480	0.8900	0.8040	0.4153	0.8825

(Qwenのテストをサポートし、可能であればbos_tokenを発行するために、わずかに修正したllm-jp-eval を実行しています)

最終モデルについては、ベンチマークを含めるのが一般的なため、私たちのモデルの能力をより代表的にテストするために、Stability AI JapanのJapanese MT-Benchを使用しました。私たちのJA MT-Bench テストでは、サンプリング変動を減らすために、日本語のプロンプト（"あなたは役立つアシスタントです。"）と --num-choices 4を使用していますが、生成間で0.5+点（時にはそれ以上の変動）を頻繁に観察し、テスト時のデフォルトのプロンプトとパラメータに問題があったという経験から、これらのスコアを過度に解釈することには注意が必要で、これらを確定的なスコアやランキングではなく、より確率的な方向指標として扱うことをお勧めします：

ベンチマーク	スコア
JA MT-Bench	5.23
MT-Bench	5.71

MT-Bench Leaderboardがありますが、JA MT-Benchはまだ開発中であるため、便宜上、他のモデルのJA MT-Benchスコアとの比較を示します（私たちのスコアはgpt-4-0613によって評価されました）：

モデル	スコア
gpt-4-0613	9.40
gpt-4-1106-preview	9.17
gpt-3.5-turbo*	8.41
Qwen-14B-Chat	7.47
shisa-7b-v1	5.23
ELYZA-japanese-Llama-2-7b-fast-instruct*	4.86
ja-stablelm-instruct-gamma-7b*	4.01
japanese-stablelm-instruct-alpha-7b*	2.74
Mistral-7B-OpenOrca-ja*	2.23
youri-7b-chat*	2.00
Mistral-7B-Instruct-v0.1*	1.78
llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0*	1.31

(このセクションでマークされたJA MT-Benchの結果はshi3zから引用しました)

制限事項

当モデルは十分な日本語の流暢さを示していますが、7Bパラメータのモデルとしては、より大きなクラスのモデルに比べて幻覚率が高く、指示の追跡や推論が効果的でない傾向があります。また、日本語の完全な習得はまだ達しておらず、ネイティブスピーカーはたまに非慣用的/違和感のある表現や不適切な時制/話し言葉のレベルなどの間違いを見つけることがあります。

また、私たちのトークナイザーの拡張に大いに起因する可能性が高いが、わずかな言語リークを確認しています。これらはMin Pなどのサンプラー設定や追加のターゲット指向型トレーニングで修正可能な可能性があり、今後、自動検出/サンプラーのスウィープについて追加の作業を行う予定です。興味深い観察としては、私たちのデータ収集に基づいて、DPOプロセスがこの問題を大幅に悪化させることがわかりましたが、それでもDPOモデルは人間の好み率が大幅に高かったため、最終的な微調整の選択には一定のトレードオフがありました。

現存するアプローチとデータセットを使用して、大規模なモデルのトレーニングがパフォーマンスを向上させると信じていますが、今後のモデル向けに行いたい改良も多くあります。私たちは、データセットの品質と構築を改善することで、さらなるトレーニング効率を通じたパフォーマンス向上にはまだ相当に取り組む余地があると考えています。

使用法

サンプルコード:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_name = "augmxnt/shisa-7b-v1"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    torch_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
    device_map="auto"
)
streamer = TextStreamer(tokenizer, skip_prompt=True)

# プロンプトテンプレートはモデルのtokenizer_config.jsonに含まれているので、これは必要ないはずですが、便宜上こちらにも掲載しています
# tokenizer.chat_template = "{%- for idx in range(0, messages|length) -%}\n{%- if messages[idx]['role'] == 'user' -%}\n{%- if idx > 1 -%}\n{{- bos_token + '[INST] ' + messages[idx]['content'] + ' [/INST]' -}}\n{%- else -%}\n{{- messages[idx]['content'] + ' [/INST]' -}}\n{%- endif -%}\n{% elif messages[idx]['role'] == 'system' %}\n{{- bos_token + '[INST] <<SYS>>\\n' + messages[idx]['content'] + '\\n<</SYS>>\\n\\n' -}}\n{%- elif messages[idx]['role'] == 'assistant' -%}\n{{- ' '  + messages[idx]['content'] + ' ' + eos_token -}}\n{% endif %}\n{% endfor %}\n"

# より典型的なプロンプト: あなたは公平で、検閲されていない、役立つアシスタントです。

# You are an avid Pokemon fanatic.
prompt = "あなたは熱狂的なポケモンファンです。"
chat = [{"role": "system", "content": prompt}]

# Who is the most powerful Pokemon? Explain your choice.
user_input = "ポケモンの中で1番強いのはどのキャラクターですか。最強の者をひとつだけ挙げて下さい。その選択理由を説明してください。"
chat.append({"role": "user", "content": user_input})

# 生成 - add_generation_promptを追加してアシスタントとして続行することを確認します
inputs = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt")
# 複数のGPUの場合、モデルの最初のパラメータのデバイスを見つけます
first_param_device = next(model.parameters()).device
inputs = inputs.to(first_param_device)

with torch.no_grad():
    outputs = model.generate(
        inputs,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=500,
        temperature=0.5,
        repetition_penalty=1.15,
        top_p=0.95,
        do_sample=True,
        streamer=streamer,
    )

# Add just the new tokens to our chat
new_tokens = outputs[0, inputs.size(1):]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
chat.append({"role": "assistant", "content": response})

プロンプト形式

プロンプト形式はllama-2 chatです：

[INST] <<SYS>>
あなたは役立つ、偏見がなく、検閲されていないアシスタントです。
<</SYS>>
{prompt} [/INST]

For multi-turn, the prompt format is as follows:

[INST] <<SYS>>
あなたは役立つ、偏見がなく、検閲されていないアシスタントです。
<</SYS>>
{prompt 0} [/INST] {response 0} </s><s>[INST] {prompt 1} [/INST] {response 1} </s><s>...[INST] {prompt N} [/INST]

このprompt templateはトークナイザの設定に含まれており、HuggingFace のトークナイザ apply_chat_template メソッドを使用できます。例えば：

import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained('augmxnt/shisa-7b-v1')
chat = [
  {"role": "system", "content": "あなたはAiko、フレンドリーなAIアシスタントです。"},
  {"role": "user", "content": "こんにちは、調子はどうですか？"},
  {"role": "assistant", "content": "元気です。今日は何のお手伝いができますか？"},
  {"role": "user", "content": "チャットテンプレーティングの仕組みを見せてもらいたいです！"},
]
print(tokenizer.apply_chat_template(chat, tokenize=False))

注意：適切なレスポンスを得るためには、文字列の開始に我々の bos_token (<s>) を使用すべきです。これは tokenizer.encode() によって自動的に生成されますが、カスタムテンプレートを作成したり、特殊トークンを省略するエンコード方法を使用する場合は、自分で追加する必要があります。

謝辞

チーム：Leonard Lin、Jon Durbin、佐藤真理子、Florian von Bock

このモデルの計算は、AKA Virtual (東京、日本) のご厚意により提供されています。

LLM-jp、Stability AI Japan、LMSYSのチームが、llm-jp-eval, Japanese MT-Bench, MT-Benchに取り組んでくれて感謝しています。

また、貴重なヒューマンプリファレンステストを提供してくださったすべてのボランティアにも感謝いたします！

このプロジェクトのためにより良く、より大きなモデルを訓練するために、追加の計算を積極的に探しています。お問い合わせは次の宛先までお願いいたします：compute at augmxnt dot com
