cstr committed
Commit: 885447c (1 parent: 4fc4f67)

Update README.md

Files changed (1): README.md (+88, −3)

README.md CHANGED

There is also a Q4_K_M quantized [GGUF](https://huggingface.co/cstr/Spaetzle-v69-7b-GGUF).

It should work sufficiently well with the ChatML prompt template, since all merged models should have seen ChatML prompts at least in the DPO stage.
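
For illustration, ChatML wraps each turn in `<|im_start|>`/`<|im_end|>` markers; a minimal sketch of building such a prompt by hand (the system message below is a made-up example, not shipped with the model):

```python
def chatml_prompt(messages):
    """Render a list of {role, content} dicts into a ChatML prompt string."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    # Leave the assistant turn open so the model continues from here.
    return "".join(parts) + "<|im_start|>assistant\n"

prompt = chatml_prompt([
    {"role": "system", "content": "Du bist ein hilfreicher Assistent."},  # example only
    {"role": "user", "content": "Wie heißt die Hauptstadt von Österreich?"},
])
print(prompt)
```

With `transformers`, `tokenizer.apply_chat_template(messages, add_generation_prompt=True)` should produce an equivalent string whenever the tokenizer ships a ChatML chat template.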

## Evaluation

Benchmark scores are not the achievable optimum, as the model attempts a compromise between a number of criteria, such as German-language performance, instruction following, reasoning capabilities, robustness (so far, I have not encountered inserted tokens, for example), model licensing, and others.
Nevertheless, they are not too bad:

Running quantized, it achieves:
- German EQ Bench: Score (v2_de): 62.59 (Parseable: 171.0).
- English EQ Bench: Score (v2): 76.43 (Parseable: 171.0).

| Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|--------------------------------------------------------------|------:|------:|---------:|-------:|------:|
|[Spaetzle-v69-7b](https://huggingface.co/cstr/Spaetzle-v69-7b)| 44.48| 75.84| 66.15| 46.59| 58.27|

### AGIEval
| Task |Version| Metric |Value| |Stderr|
|------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat | 0|acc |25.98|± | 2.76|
| | |acc_norm|23.62|± | 2.67|
|agieval_logiqa_en | 0|acc |39.78|± | 1.92|
| | |acc_norm|39.48|± | 1.92|
|agieval_lsat_ar | 0|acc |23.48|± | 2.80|
| | |acc_norm|23.91|± | 2.82|
|agieval_lsat_lr | 0|acc |50.00|± | 2.22|
| | |acc_norm|51.76|± | 2.21|
|agieval_lsat_rc | 0|acc |63.94|± | 2.93|
| | |acc_norm|64.31|± | 2.93|
|agieval_sat_en | 0|acc |76.70|± | 2.95|
| | |acc_norm|77.67|± | 2.91|
|agieval_sat_en_without_passage| 0|acc |46.12|± | 3.48|
| | |acc_norm|44.17|± | 3.47|
|agieval_sat_math | 0|acc |34.09|± | 3.20|
| | |acc_norm|30.91|± | 3.12|

Average: 44.48%

### GPT4All
| Task |Version| Metric |Value| |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge| 0|acc |63.23|± | 1.41|
| | |acc_norm|64.16|± | 1.40|
|arc_easy | 0|acc |85.90|± | 0.71|
| | |acc_norm|82.49|± | 0.78|
|boolq | 1|acc |87.80|± | 0.57|
|hellaswag | 0|acc |67.05|± | 0.47|
| | |acc_norm|85.19|± | 0.35|
|openbookqa | 0|acc |38.40|± | 2.18|
| | |acc_norm|48.40|± | 2.24|
|piqa | 0|acc |82.75|± | 0.88|
| | |acc_norm|84.28|± | 0.85|
|winogrande | 0|acc |78.53|± | 1.15|

Average: 75.84%

### TruthfulQA
| Task |Version|Metric|Value| |Stderr|
|-------------|------:|------|----:|---|-----:|
|truthfulqa_mc| 1|mc1 |50.67|± | 1.75|
| | |mc2 |66.15|± | 1.48|

Average: 66.15%

### Bigbench
| Task |Version| Metric |Value| |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement | 0|multiple_choice_grade|56.84|± | 3.60|
|bigbench_date_understanding | 0|multiple_choice_grade|66.67|± | 2.46|
|bigbench_disambiguation_qa | 0|multiple_choice_grade|40.70|± | 3.06|
|bigbench_geometric_shapes | 0|multiple_choice_grade|24.79|± | 2.28|
| | |exact_str_match |10.58|± | 1.63|
|bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|31.00|± | 2.07|
|bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|23.00|± | 1.59|
|bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|58.00|± | 2.85|
|bigbench_movie_recommendation | 0|multiple_choice_grade|45.80|± | 2.23|
|bigbench_navigate | 0|multiple_choice_grade|52.10|± | 1.58|
|bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|69.55|± | 1.03|
|bigbench_ruin_names | 0|multiple_choice_grade|48.88|± | 2.36|
|bigbench_salient_translation_error_detection | 0|multiple_choice_grade|30.96|± | 1.46|
|bigbench_snarks | 0|multiple_choice_grade|73.48|± | 3.29|
|bigbench_sports_understanding | 0|multiple_choice_grade|74.14|± | 1.40|
|bigbench_temporal_sequences | 0|multiple_choice_grade|42.70|± | 1.56|
|bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|23.60|± | 1.20|
|bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|18.40|± | 0.93|
|bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|58.00|± | 2.85|

Average: 46.59%

Average score: 58.27%

## 🧩 Merge Configuration

Spaetzle-v69-7b is a merge of the following models using [LazyMergekit](https://colab.research.google.com/drive/1obulZ1ROXHjYLn6PPZJwRR6GzgQogxxb?usp=sharing):
* [abideen/AlphaMonarch-dora](https://huggingface.co/abideen/AlphaMonarch-dora)

[...]

- [FelixChao/WestSeverus-7B-DPO-v2](https://huggingface.co/FelixChao/WestSeverus-7B-DPO-v2)
- [cognitivecomputations/openchat-3.5-0106-laser](https://huggingface.co/cognitivecomputations/openchat-3.5-0106-laser)

For this last merge, the following configuration was used:

```yaml
models: