shisa-ai/shisa-v1-llama3-8b.2e5

shisa-v2 Base Model ablation

Using a fork of Lightblue's Shaberi benchmark framework:

Model	Average	ELYZA-tasks-100	MT-Bench	Rakuda	Tengu-Bench
gpt-4-turbo-2024-04-09	8.75	8.78	8.74	9.18	8.31
CohereForAI/c4ai-command-r-plus	7.69	7.50	7.43	9.05	6.79
gpt-3.5-turbo-0125	7.17	7.24	6.98	7.64	6.82
shisa-ai/shisa-v1-llama3-70b	7.17	7.16	7.45	7.98	6.09
karakuri-ai/karakuri-lm-70b-chat-v0.1	6.84	6.86	6.43	7.85	6.23
lightblue/ao-karasu-72B	6.81	7.19	6.54	7.25	6.27
shisa-ai/shisa-v1-llama3-8b^	6.29	6.62	6.41	7.05	5.07
shisa-ai/shisa-swallowmx-13a47b-v1	6.17	6.48	6.07	7.11	5.03
shisa-ai/shisa-v1-llama3-8b	6.10	6.52	6.20	6.37	5.33
Rakuten/RakutenAI-7B-chat	5.58	5.92	4.60	6.58	5.24
shisa-ai/shisa-v1-gemma-8b	5.64	6.50	5.42	5.10	5.55
augmxnt/shisa-gamma-7b-v1	5.56	5.84	4.00	6.73	5.68
lightblue/qarasu-14B-chat-plus-unleashed	5.20	5.58	4.74	5.46	5.01
cyberagent/calm2-7b-chat	4.76	4.90	3.58	5.75	4.81
mistralai/Mistral-7B-Instruct-v0.2	4.69	5.78	4.65	3.80	4.53
shisa-ai/shisa-v1-yi1.5-9b	4.63	5.98	4.28	3.26	5.00

^ Shaberi uses temperature=0.0, no sampling, for all generations by default. This is actually different from JA MT-Bench's default settings which has different temperature per category. This means that Shaberi's results can't be compared to other JA MT-Bench results (like my comparison chart or the Nejumi Leaderboard). Like some other models, if you look at the results you'll notice repetition loops. For Llama models, you usually want something like a repetition_penalty of 1.15/1.18 to get rid of repetition loops. Because Shaberi uses the vLLM's OpenAI API server, it doesn't support repetition penalty, doing a frequency_penalty sweep (0.0, 0.5, 0.8) I found 0.5 to remove repetitions and improve output in general. There is no decay/window so for long generations, this may not be optimal. For the improved generations, I used the following sampler settings: temperature 0.2, min_p 0.1, frequency_penalty 0.5 (OpenAI doesn't support min_p, but vLLM adds it and it's basically always the superior sampler).

shisa-ai
/

shisa-v1-llama3-8b.2e5

Finetuned from

Dataset used to train shisa-ai/shisa-v1-llama3-8b.2e5

Space using shisa-ai/shisa-v1-llama3-8b.2e5 1

Finetuned from meta-llama/Meta-Llama-3-8B-Instruct

Dataset used to train shisa-ai/shisa-v1-llama3-8b.2e5

Space using shisa-ai/shisa-v1-llama3-8b.2e5 1

Finetuned from