shisa-ai/shisa-v1-llama3-8b.2e5

@@ -17,14 +17,18 @@ Using a [fork](https://github.com/shisa-ai/shaberi) of [Lightblue's Shaberi benc
 | CohereForAI/c4ai-command-r-plus        | 7.69    | 7.50            | 7.43     | 9.05   | 6.79        |
 | karakuri-ai/karakuri-lm-70b-chat-v0.1  | 6.84    | 6.86            | 6.43     | 7.85   | 6.23        |
 | lightblue/ao-karasu-72B                | 6.81    | 7.19            | 6.54     | 7.25   | 6.27        |
+| **shisa-ai/shisa-llama3-8b-v1^**       | **6.29**| **6.62**        | **6.41** | **7.05**|**5.07**    |
 | **shisa-ai/shisa-llama3-8b-v1**        | **6.10**| **6.52**        | **6.20** | **6.37**|**5.33**    |
 | Rakuten/RakutenAI-7B-chat              | 5.58    | 5.92            | 4.60     | 6.58   | 5.24        |
 | shisa-ai/shisa-gemma-7b-v1             | 5.64    | 6.50            | 5.42     | 5.10   | 5.55        |
 | augmxnt/shisa-gamma-7b-v1              | 5.56    | 5.84            | 4.00     | 6.73   | 5.68        |
-| lightblue/qarasu-14B-chat-plus-unleashed | 5.20 | 5.58            | 4.74     | 5.46   | 5.01        |
+| lightblue/qarasu-14B-chat-plus-unleashed | 5.20  | 5.58            | 4.74     | 5.46   | 5.01        |
 | cyberagent/calm2-7b-chat               | 4.76    | 4.90            | 3.58     | 5.75   | 4.81        |
 | mistralai/Mistral-7B-Instruct-v0.2     | 4.69    | 5.78            | 4.65     | 3.80   | 4.53        |
 | shisa-ai/shisa-yi1.5-9b-v1             | 4.63    | 5.98            | 4.28     | 3.26   | 5.00        |
+^ By default, Shaberi uses `temperature=0.0` (greedy decoding, no sampling) for all generations. This differs from [JA MT-Bench's default settings](https://github.com/Stability-AI/FastChat/blob/jp-stable/fastchat/llm_judge/common.py#L37), which use a different temperature for each category.
+As a result, Shaberi's scores can't be directly compared to other JA MT-Bench results (such as [my comparison chart](https://github.com/AUGMXNT/shisa/wiki/Evals-:-JA-MT%E2%80%90Bench) or the [Nejumi Leaderboard](https://wandb.ai/wandb-japan/llm-leaderboard/reports/Nejumi-LLM-Leaderboard-Evaluating-Japanese-Language-Proficiency--Vmlldzo2MzU3NzIy)).
+As with some other models, the generated outputs contain repetition loops. For Llama models, a `repetition_penalty` of around 1.15 to 1.18 usually gets rid of them.
+Because Shaberi uses vLLM's OpenAI-compatible API server, it doesn't support a repetition penalty. Instead, I swept `frequency_penalty` over 0.0, 0.5, and 0.8, and found that 0.5 removed the repetitions and improved output in general. However, `frequency_penalty` has no decay or window, so it may not be optimal for long generations.
+For the improved generations (the ^ row), I used the following sampler settings: `temperature 0.2, min_p 0.1, frequency_penalty 0.5`. (The OpenAI API doesn't support `min_p`, but vLLM adds it, and it's [basically always the superior sampler](https://github.com/huggingface/transformers/issues/27670).)
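To make the sampler settings above more concrete, here is a minimal, framework-free Python sketch of one decoding step. It is an illustration under stated assumptions, not Shaberi's or vLLM's code. The function names are hypothetical. The repetition-penalty formula is modeled on Hugging Face's `RepetitionPenaltyLogitsProcessor`, the frequency penalty follows OpenAI's documented "subtract penalty × count" form, and the order (penalty, then temperature, then `min_p`) is an assumption about how vLLM's sampler applies them.

```python
import math
import random
from collections import Counter


def apply_repetition_penalty(logits, seen_tokens, penalty):
    """Multiplicative (CTRL-style) penalty, modeled on Hugging Face's
    RepetitionPenaltyLogitsProcessor: every token seen at least once gets its
    positive logit divided by `penalty` and its negative logit multiplied by it.
    How often the token was seen does not matter."""
    out = list(logits)
    for tok in set(seen_tokens):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def apply_frequency_penalty(logits, generated_tokens, frequency_penalty):
    """Additive penalty in the style of OpenAI's `frequency_penalty`: subtract
    frequency_penalty * (times the token has been generated so far).
    The count spans the entire generation, with no decay or sliding window."""
    out = list(logits)
    for tok, count in Counter(generated_tokens).items():
        out[tok] -= frequency_penalty * count
    return out


def softmax(logits, temperature):
    """Temperature-scaled softmax (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p):
    """Drop tokens whose probability is below min_p * (top token's probability),
    then renormalize the survivors."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


def sample_next(logits, generated_tokens, temperature=0.2, min_p=0.1,
                frequency_penalty=0.5, rng=random):
    """Pick the next token id. Defaults mirror the
    `temperature 0.2, min_p 0.1, frequency_penalty 0.5` settings;
    the penalty -> temperature -> min_p order is an assumption."""
    logits = apply_frequency_penalty(logits, generated_tokens, frequency_penalty)
    if temperature == 0.0:
        # temperature=0.0 means greedy decoding: always take the argmax.
        return max(range(len(logits)), key=logits.__getitem__)
    probs = min_p_filter(softmax(logits, temperature), min_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Token 0 has been emitted three times in a row (a repetition loop).
logits = [2.0, 1.9, 0.5, -1.0]
history = [0, 0, 0]
print(sample_next(logits, history, temperature=0.0, frequency_penalty=0.0))  # 0: greedy keeps looping
print(sample_next(logits, history, temperature=0.0, frequency_penalty=0.5))  # 1: loop broken
```

`apply_repetition_penalty` is included only for contrast. With `penalty=1.18`, token 0's logit drops to about 1.69 no matter how many times it has repeated, whereas the frequency penalty keeps growing with the count. That growth is what breaks loops quickly, and also why it may over-penalize common tokens in long generations.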