Update README.md

963e93f verified about 1 month ago

No virus

3.59 kB

	---
	license: llama3
	datasets:
	- augmxnt/ultra-orca-boros-en-ja-v1
	language:
	- ja
	- en
	base_model: meta-llama/Meta-Llama-3-8B-Instruct
	---
	shisa-v2 Base Model ablation

	Using a [fork](https://github.com/shisa-ai/shaberi) of [Lightblue's Shaberi benchmark framework](https://github.com/lightblue-tech/japanese_llm_eval):

	\| Model \| Average \| ELYZA-tasks-100 \| MT-Bench \| Rakuda \| Tengu-Bench \|
	\|----------------------------------------\|---------\|-----------------\|----------\|--------\|-------------\|
	\| gpt-4-turbo-2024-04-09 \| 8.75 \| 8.78 \| 8.74 \| 9.18 \| 8.31 \|
	\| CohereForAI/c4ai-command-r-plus \| 7.69 \| 7.50 \| 7.43 \| 9.05 \| 6.79 \|
	\| gpt-3.5-turbo-0125 \| 7.17 \| 7.24 \| 6.98 \| 7.64 \| 6.82 \|
	\| shisa-ai/shisa-v1-llama3-70b \| 7.17\| 7.16 \| 7.45 \| 7.98 \| 6.09 \|
	\| karakuri-ai/karakuri-lm-70b-chat-v0.1 \| 6.84 \| 6.86 \| 6.43 \| 7.85 \| 6.23 \|
	\| lightblue/ao-karasu-72B \| 6.81 \| 7.19 \| 6.54 \| 7.25 \| 6.27 \|
	\| shisa-ai/shisa-v1-llama3-8b^ \| 6.29\| 6.62 \| 6.41 \| 7.05\|5.07 \|
	\| shisa-ai/shisa-swallowmx-13a47b-v1 \| 6.17 \| 6.48 \| 6.07 \| 7.11 \| 5.03 \|
	\| shisa-ai/shisa-v1-llama3-8b \| 6.10\| 6.52 \| 6.20 \| 6.37\|5.33 \|
	\| Rakuten/RakutenAI-7B-chat \| 5.58 \| 5.92 \| 4.60 \| 6.58 \| 5.24 \|
	\| shisa-ai/shisa-v1-gemma-8b \| 5.64 \| 6.50 \| 5.42 \| 5.10 \| 5.55 \|
	\| augmxnt/shisa-gamma-7b-v1 \| 5.56 \| 5.84 \| 4.00 \| 6.73 \| 5.68 \|
	\| lightblue/qarasu-14B-chat-plus-unleashed \| 5.20 \| 5.58 \| 4.74 \| 5.46 \| 5.01 \|
	\| cyberagent/calm2-7b-chat \| 4.76 \| 4.90 \| 3.58 \| 5.75 \| 4.81 \|
	\| mistralai/Mistral-7B-Instruct-v0.2 \| 4.69 \| 5.78 \| 4.65 \| 3.80 \| 4.53 \|
	\| shisa-ai/shisa-v1-yi1.5-9b \| 4.63\| 5.98 \| 4.28 \| 3.26\|5.00 \|

	^ Shaberi uses `temperature=0.0`, no sampling, for all generations by default. This is actually different from [JA MT-Bench's default settings](https://github.com/Stability-AI/FastChat/blob/jp-stable/fastchat/llm_judge/common.py#L37) which has different temperature per category.
	This means that Shaberi's results can't be compared to other JA MT-Bench results (like [my comparison chart](https://github.com/AUGMXNT/shisa/wiki/Evals-:-JA-MT%E2%80%90Bench) or the [Nejumi Leaderboard](https://wandb.ai/wandb-japan/llm-leaderboard/reports/Nejumi-LLM-Leaderboard-Evaluating-Japanese-Language-Proficiency--Vmlldzo2MzU3NzIy)).
	Like some other models, if you look at the results you'll notice repetition loops. For Llama models, you usually want something like a `repetition_penalty` of 1.15/1.18 to get rid of repetition loops.
	Because Shaberi uses the vLLM's OpenAI API server, it doesn't support repetition penalty, doing a `frequency_penalty` sweep (0.0, 0.5, 0.8) I found 0.5 to remove repetitions and improve output in general. There is no decay/window so for long generations, this may not be optimal.
	For the improved generations, I used the following sampler settings: `temperature 0.2, min_p 0.1, frequency_penalty 0.5` (OpenAI doesn't support min_p, but vLLM adds it and it's [basically always the superior sampler](https://github.com/huggingface/transformers/issues/27670)).

	---
	license: llama3
	datasets:
	- augmxnt/ultra-orca-boros-en-ja-v1
	language:
	- ja
	- en
	base_model: meta-llama/Meta-Llama-3-8B-Instruct
	---
	shisa-v2 Base Model ablation

	Using a [fork](https://github.com/shisa-ai/shaberi) of [Lightblue's Shaberi benchmark framework](https://github.com/lightblue-tech/japanese_llm_eval):

	\| Model \| Average \| ELYZA-tasks-100 \| MT-Bench \| Rakuda \| Tengu-Bench \|
	\|----------------------------------------\|---------\|-----------------\|----------\|--------\|-------------\|
	\| gpt-4-turbo-2024-04-09 \| 8.75 \| 8.78 \| 8.74 \| 9.18 \| 8.31 \|
	\| CohereForAI/c4ai-command-r-plus \| 7.69 \| 7.50 \| 7.43 \| 9.05 \| 6.79 \|
	\| gpt-3.5-turbo-0125 \| 7.17 \| 7.24 \| 6.98 \| 7.64 \| 6.82 \|
	\| shisa-ai/shisa-v1-llama3-70b \| 7.17\| 7.16 \| 7.45 \| 7.98 \| 6.09 \|
	\| karakuri-ai/karakuri-lm-70b-chat-v0.1 \| 6.84 \| 6.86 \| 6.43 \| 7.85 \| 6.23 \|
	\| lightblue/ao-karasu-72B \| 6.81 \| 7.19 \| 6.54 \| 7.25 \| 6.27 \|
	\| shisa-ai/shisa-v1-llama3-8b^ \| 6.29\| 6.62 \| 6.41 \| 7.05\|5.07 \|
	\| shisa-ai/shisa-swallowmx-13a47b-v1 \| 6.17 \| 6.48 \| 6.07 \| 7.11 \| 5.03 \|
	\| shisa-ai/shisa-v1-llama3-8b \| 6.10\| 6.52 \| 6.20 \| 6.37\|5.33 \|
	\| Rakuten/RakutenAI-7B-chat \| 5.58 \| 5.92 \| 4.60 \| 6.58 \| 5.24 \|
	\| shisa-ai/shisa-v1-gemma-8b \| 5.64 \| 6.50 \| 5.42 \| 5.10 \| 5.55 \|
	\| augmxnt/shisa-gamma-7b-v1 \| 5.56 \| 5.84 \| 4.00 \| 6.73 \| 5.68 \|
	\| lightblue/qarasu-14B-chat-plus-unleashed \| 5.20 \| 5.58 \| 4.74 \| 5.46 \| 5.01 \|
	\| cyberagent/calm2-7b-chat \| 4.76 \| 4.90 \| 3.58 \| 5.75 \| 4.81 \|
	\| mistralai/Mistral-7B-Instruct-v0.2 \| 4.69 \| 5.78 \| 4.65 \| 3.80 \| 4.53 \|
	\| shisa-ai/shisa-v1-yi1.5-9b \| 4.63\| 5.98 \| 4.28 \| 3.26\|5.00 \|

	^ Shaberi uses `temperature=0.0`, no sampling, for all generations by default. This is actually different from [JA MT-Bench's default settings](https://github.com/Stability-AI/FastChat/blob/jp-stable/fastchat/llm_judge/common.py#L37) which has different temperature per category.
	This means that Shaberi's results can't be compared to other JA MT-Bench results (like [my comparison chart](https://github.com/AUGMXNT/shisa/wiki/Evals-:-JA-MT%E2%80%90Bench) or the [Nejumi Leaderboard](https://wandb.ai/wandb-japan/llm-leaderboard/reports/Nejumi-LLM-Leaderboard-Evaluating-Japanese-Language-Proficiency--Vmlldzo2MzU3NzIy)).
	Like some other models, if you look at the results you'll notice repetition loops. For Llama models, you usually want something like a `repetition_penalty` of 1.15/1.18 to get rid of repetition loops.
	Because Shaberi uses the vLLM's OpenAI API server, it doesn't support repetition penalty, doing a `frequency_penalty` sweep (0.0, 0.5, 0.8) I found 0.5 to remove repetitions and improve output in general. There is no decay/window so for long generations, this may not be optimal.
	For the improved generations, I used the following sampler settings: `temperature 0.2, min_p 0.1, frequency_penalty 0.5` (OpenAI doesn't support min_p, but vLLM adds it and it's [basically always the superior sampler](https://github.com/huggingface/transformers/issues/27670)).