DavidGF committed
Commit
0a243a0
1 Parent(s): 87cf835

Update README.md

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -79,7 +79,7 @@ Detailed information on how the new training strategy works and the advantages i
 
 
 ### Prompt Template:
-We trained on vicuna prompt template. Please add the following stopping string to your client: </s>,</p> (we did not add the special tokens to the training config)
+We trained on the Vicuna prompt template. Please add the following stopping strings to your client: '</s>', '</p>' (we did not add the special tokens to the training config)
 ```
 You are a helpful AI Assistant.
 
@@ -91,17 +91,17 @@ ASSISTANT:
 ## Evaluation
 
 **Open LLM Leaderboard:**
-* benchmarks were done with the newest version of lm-evaluation-harness on a batch-size of 1:
+
 
 | Metric                | Value     |
 |-----------------------|-----------|
-| Avg.                  | **68.92** |
+| Avg.                  | **67.83** |
 | ARC (25-shot)         | 59.98     |
-| HellaSwag (10-shot)   | 82.28     |
-| MMLU (5-shot)         | 63.53     |
-| TruthfulQA (0-shot)   | 61.2      |
-| Winogrande (5-shot)   | 80.27     |
-| GSM8K (5-shot)        | 66.26     |
+| HellaSwag (10-shot)   | 81.91     |
+| MMLU (5-shot)         | 63.76     |
+| TruthfulQA (0-shot)   | 61        |
+| Winogrande (5-shot)   | 76.64     |
+| GSM8K (5-shot)        | 63.68     |
 
 Despite the fact that we achieved great results on the Open LLM Leaderboard benchmarks, the model subjectively does not feel as smart as comparable Mistral finetunes. Most of its answers are coherent, but we observed that the model sometimes gives lazy or odd answers.
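
Because '</s>' and '</p>' were not registered as special tokens, the README asks clients to stop on them as raw text. Below is a minimal sketch of doing that with Hugging Face transformers; the model id is a placeholder and the exact USER/ASSISTANT turn layout is an assumption, since the diff only shows part of the template.

```python
# Hedged sketch (not part of the commit): applying the Vicuna-style
# template and cutting generation at the two stop strings client-side.
# MODEL_ID and the exact USER/ASSISTANT layout are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-org/your-model"  # placeholder id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

prompt = (
    "You are a helpful AI Assistant.\n\n"
    "USER: Name three rivers in Europe.\n"
    "ASSISTANT:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
completion = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:])

# '</s>' and '</p>' were not trained as special tokens, so they can show
# up as literal text; truncate at whichever stop string appears first.
for stop in ("</s>", "</p>"):
    completion = completion.split(stop, 1)[0]
print(completion.strip())
```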
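
The earlier revision noted that the scores came from the newest lm-evaluation-harness at batch size 1. As a rough pointer, a single leaderboard-style task could be run through the harness's Python API along these lines; this is a sketch assuming a v0.4-style lm_eval install, with a placeholder model id, and the leaderboard itself pins its own harness revision and few-shot settings.

```python
# Hedged sketch of one leaderboard-style run with the
# lm-evaluation-harness Python API (v0.4-style). Each benchmark in the
# table above uses its own few-shot count.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/your-model",  # placeholder id
    tasks=["arc_challenge"],  # ARC is reported 25-shot above
    num_fewshot=25,
    batch_size=1,             # matches the batch size noted in the README
)
print(results["results"]["arc_challenge"])
```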