shisa-ai/shisa-v1-llama3-70b

@@ -10,17 +10,25 @@ datasets:
 - augmxnt/ultra-orca-boros-en-ja-v1
 ---
-shisa-v2 Base Model ablation
-This model uses a LR of 8e-6 that slightly improves performance vs the original 2e-5
-It also uses NEFTune, although the expected impact may be neglible for this dataset.
-(this appears to validate the Llama 3 8B LR ablations for predicting improved LR hyperparameter)
-While the last model matched gpt-3.5-turbo, I think it's fair to say that this model allows us to farily say that it "beats" it.
-Using a [fork](https://github.com/shisa-ai/shaberi) of [Lightblue's Shaberi benchmark framework](https://github.com/lightblue-tech/japanese_llm_eval):
 | Model                                  | Average | ELYZA-tasks-100 | MT-Bench | Rakuda | Tengu-Bench |
 |----------------------------------------|---------|-----------------|----------|--------|-------------|
@@ -29,6 +37,7 @@ Using a [fork](https://github.com/shisa-ai/shaberi) of [Lightblue's Shaberi benc
 | **shisa-ai/shisa-v1-llama3-70b**       | **7.30**| **7.34**        | **7.67** | **8.15** | **6.04**  |
 | gpt-3.5-turbo-0125                     | 7.17    | 7.24            | 6.98     | 7.64   | 6.82        |
 | **shisa-ai/shisa-v1-llama3-70b**       | **7.17**| **7.16**        | **7.45** | **7.98** | **6.09**  |
 | karakuri-ai/karakuri-lm-70b-chat-v0.1  | 6.84    | 6.86            | 6.43     | 7.85   | 6.23        |
 | lightblue/ao-karasu-72B                | 6.81    | 7.19            | 6.54     | 7.25   | 6.27        |
 | **shisa-ai/shisa-v1-llama3-8b^**       | **6.29**| **6.62**        | **6.41** | **7.05**|**5.07**    |

 - augmxnt/ultra-orca-boros-en-ja-v1
 ---
+# shisa-v2 Base Model ablation
+This is a fine-tune of Llama 3 70B Instruct on the primary `shisa-v1` dataset to improve Japanese language capabilities.
+This model uses a LR of 8e-6 that slightly improves performance vs the original 2e-5 tune (based on, and validating the predictive power of, the results of the Llama 3 8B LR ablations).
+It also uses NEFTune, although the expected impact is negligible for this dataset.
+While the 2e-5 model matched gpt-3.5-turbo performance, this 8e-6 version consistently edges it out, so I think it's fair to say that this model "beats" it.
+There is a selection of GGUF quants here: https://huggingface.co/shisa-ai/shisa-v1-llama3-70b-gguf
+While this is merely a test ablation on the road to `shisa-v2`, as the strongest commercially usable open JA model I've tested so far, this model may be of general interest.
+## Performance
+Measured using a [fork](https://github.com/shisa-ai/shaberi) of [Lightblue's Shaberi benchmark framework](https://github.com/lightblue-tech/japanese_llm_eval):
 | Model                                  | Average | ELYZA-tasks-100 | MT-Bench | Rakuda | Tengu-Bench |
 |----------------------------------------|---------|-----------------|----------|--------|-------------|
 | **shisa-ai/shisa-v1-llama3-70b**       | **7.30**| **7.34**        | **7.67** | **8.15** | **6.04**  |
 | gpt-3.5-turbo-0125                     | 7.17    | 7.24            | 6.98     | 7.64   | 6.82        |
 | **shisa-ai/shisa-v1-llama3-70b**       | **7.17**| **7.16**        | **7.45** | **7.98** | **6.09**  |
+| karakuri-ai/karakuri-lm-8x7b-chat-v0.1 | 7.00    | 7.18            | 6.30     | 7.98   | 6.55        |
 | karakuri-ai/karakuri-lm-70b-chat-v0.1  | 6.84    | 6.86            | 6.43     | 7.85   | 6.23        |
 | lightblue/ao-karasu-72B                | 6.81    | 7.19            | 6.54     | 7.25   | 6.27        |
 | **shisa-ai/shisa-v1-llama3-8b^**       | **6.29**| **6.62**        | **6.41** | **7.05**|**5.07**    |
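
The card does not say how the Average column is derived. It appears to be the unweighted mean of the four benchmark scores, and the sketch below checks that assumption against the rows in the table. The `ROWS` list and the `check_averages` helper are illustrative names, not part of Shaberi, and the tolerance only allows for two-decimal rounding in the published numbers.

```python
# Rows copied from the table above:
# (model, reported average, [ELYZA-tasks-100, MT-Bench, Rakuda, Tengu-Bench]).
ROWS = [
    ("shisa-ai/shisa-v1-llama3-70b", 7.30, [7.34, 7.67, 8.15, 6.04]),
    ("gpt-3.5-turbo-0125", 7.17, [7.24, 6.98, 7.64, 6.82]),
    ("shisa-ai/shisa-v1-llama3-70b", 7.17, [7.16, 7.45, 7.98, 6.09]),
    ("karakuri-ai/karakuri-lm-8x7b-chat-v0.1", 7.00, [7.18, 6.30, 7.98, 6.55]),
    ("karakuri-ai/karakuri-lm-70b-chat-v0.1", 6.84, [6.86, 6.43, 7.85, 6.23]),
    ("lightblue/ao-karasu-72B", 6.81, [7.19, 6.54, 7.25, 6.27]),
    ("shisa-ai/shisa-v1-llama3-8b^", 6.29, [6.62, 6.41, 7.05, 5.07]),
]


def check_averages(rows, tolerance=0.0051):
    """Compare each reported average with the plain mean of its scores.

    Assumption: Average = unweighted mean of the four benchmarks. The
    tolerance is half of the last printed digit plus a little float slack.
    Returns a list of (model, reported, computed, matches) tuples.
    """
    results = []
    for model, reported, scores in rows:
        computed = sum(scores) / len(scores)
        matches = abs(computed - reported) < tolerance
        results.append((model, reported, computed, matches))
    return results


if __name__ == "__main__":
    for model, reported, computed, matches in check_averages(ROWS):
        status = "ok" if matches else "MISMATCH"
        print(f"{model:42s} reported={reported:.2f} mean={computed:.4f} {status}")
```

If a row were ever reported as a mismatch, that would suggest either a transcription error in the table or a different (for example weighted) averaging scheme.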