macadeliccc committed
Commit 2a9557b · 1 parent: fa411bf
Update README.md

README.md CHANGED
## Eval

evaluation [colab](https://colab.research.google.com/drive/1FpwgsGzCR4tORTxAwUxpN3PcP22En2xk?usp=sharing)
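
The task names in the tables below follow EleutherAI's lm-evaluation-harness conventions, which suggests the colab runs that harness. As a hedged sketch only (the harness version, precision, batch size, and few-shot settings actually used are not stated here and are assumptions), a comparable run via the harness's Python API might look like:

```python
# Hypothetical reproduction sketch using lm-evaluation-harness (pip install lm-eval).
# The model repo is real; dtype, batch_size, and the task selection are illustrative assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=macadeliccc/laser-dolphin-mixtral-2x7b-dpo,dtype=bfloat16",
    tasks=["arc_challenge", "arc_easy", "boolq", "hellaswag",
           "openbookqa", "piqa", "winogrande"],  # the GPT4All suite, as one example
    batch_size=8,
)
# Per-task metric dictionaries, e.g. {"acc": ..., "acc_norm": ...}
for task, metrics in results["results"].items():
    print(task, metrics)
```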

## Summary of previous evaluation

| Model                                                                                             |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|---------------------------------------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
|[laser-dolphin-mixtral-2x7b-dpo](https://huggingface.co/macadeliccc/laser-dolphin-mixtral-2x7b-dpo)|  41.31|  73.67|     61.69|   42.79|  54.87|

## Detailed current evaluation

| Model                                                                                             |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|---------------------------------------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
|[laser-dolphin-mixtral-2x7b-dpo](https://huggingface.co/macadeliccc/laser-dolphin-mixtral-2x7b-dpo)|  42.25|  73.45|     63.44|   43.96|  55.77|
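
Relative to the previous run, the current evaluation gains 0.90 points on the overall average (55.77 vs. 54.87), improving on AGIEval, TruthfulQA, and Bigbench while dipping slightly on GPT4All (73.45 vs. 73.67).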

### AGIEval
| Task                          |Version| Metric |Value|   |Stderr|
|------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat              |      0|acc     |21.26|±  |  2.57|
|                              |       |acc_norm|21.65|±  |  2.59|
|agieval_logiqa_en             |      0|acc     |34.72|±  |  1.87|
|                              |       |acc_norm|35.64|±  |  1.88|
|agieval_lsat_ar               |      0|acc     |26.96|±  |  2.93|
|                              |       |acc_norm|26.96|±  |  2.93|
|agieval_lsat_lr               |      0|acc     |45.88|±  |  2.21|
|                              |       |acc_norm|46.08|±  |  2.21|
|agieval_lsat_rc               |      0|acc     |59.48|±  |  3.00|
|                              |       |acc_norm|59.48|±  |  3.00|
|agieval_sat_en                |      0|acc     |73.79|±  |  3.07|
|                              |       |acc_norm|73.79|±  |  3.07|
|agieval_sat_en_without_passage|      0|acc     |42.23|±  |  3.45|
|                              |       |acc_norm|41.26|±  |  3.44|
|agieval_sat_math              |      0|acc     |37.27|±  |  3.27|
|                              |       |acc_norm|33.18|±  |  3.18|

Average: 42.25%
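
The suite averages appear to be unweighted means over tasks, taking acc_norm where it is reported and acc otherwise (mc2 for TruthfulQA, multiple_choice_grade for Bigbench). A minimal check against the AGIEval column above:

```python
# Unweighted mean of the acc_norm values from the AGIEval table above.
agieval_acc_norm = [21.65, 35.64, 26.96, 46.08, 59.48, 73.79, 41.26, 33.18]
print(f"{sum(agieval_acc_norm) / len(agieval_acc_norm):.3f}")  # 42.255, matching the reported 42.25
```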

### GPT4All
| Task        |Version| Metric |Value|   |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge|      0|acc     |58.36|±  |  1.44|
|             |       |acc_norm|58.02|±  |  1.44|
|arc_easy     |      0|acc     |82.20|±  |  0.78|
|             |       |acc_norm|77.40|±  |  0.86|
|boolq        |      1|acc     |87.52|±  |  0.58|
|hellaswag    |      0|acc     |67.50|±  |  0.47|
|             |       |acc_norm|84.43|±  |  0.36|
|openbookqa   |      0|acc     |34.40|±  |  2.13|
|             |       |acc_norm|47.00|±  |  2.23|
|piqa         |      0|acc     |81.61|±  |  0.90|
|             |       |acc_norm|82.59|±  |  0.88|
|winogrande   |      0|acc     |77.19|±  |  1.18|

Average: 73.45%

### TruthfulQA
| Task        |Version|Metric|Value|   |Stderr|
|-------------|------:|------|----:|---|-----:|
|truthfulqa_mc|      1|mc1   |45.90|±  |  1.74|
|             |       |mc2   |63.44|±  |  1.56|

Average: 63.44%

### Bigbench
| Task                                            |Version|        Metric       |Value|   |Stderr|
|-------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement                        |      0|multiple_choice_grade|58.42|±  |  3.59|
|bigbench_date_understanding                      |      0|multiple_choice_grade|60.70|±  |  2.55|
|bigbench_disambiguation_qa                       |      0|multiple_choice_grade|38.37|±  |  3.03|
|bigbench_geometric_shapes                        |      0|multiple_choice_grade|21.73|±  |  2.18|
|                                                 |       |exact_str_match      | 0.00|±  |  0.00|
|bigbench_logical_deduction_five_objects          |      0|multiple_choice_grade|35.00|±  |  2.14|
|bigbench_logical_deduction_seven_objects         |      0|multiple_choice_grade|23.57|±  |  1.61|
|bigbench_logical_deduction_three_objects         |      0|multiple_choice_grade|50.33|±  |  2.89|
|bigbench_movie_recommendation                    |      0|multiple_choice_grade|45.00|±  |  2.23|
|bigbench_navigate                                |      0|multiple_choice_grade|50.00|±  |  1.58|
|bigbench_reasoning_about_colored_objects         |      0|multiple_choice_grade|60.35|±  |  1.09|
|bigbench_ruin_names                              |      0|multiple_choice_grade|51.12|±  |  2.36|
|bigbench_salient_translation_error_detection     |      0|multiple_choice_grade|32.26|±  |  1.48|
|bigbench_snarks                                  |      0|multiple_choice_grade|67.96|±  |  3.48|
|bigbench_sports_understanding                    |      0|multiple_choice_grade|70.59|±  |  1.45|
|bigbench_temporal_sequences                      |      0|multiple_choice_grade|35.80|±  |  1.52|
|bigbench_tracking_shuffled_objects_five_objects  |      0|multiple_choice_grade|22.56|±  |  1.18|
|bigbench_tracking_shuffled_objects_seven_objects |      0|multiple_choice_grade|17.20|±  |  0.90|
|bigbench_tracking_shuffled_objects_three_objects |      0|multiple_choice_grade|50.33|±  |  2.89|

Average: 43.96%

Average score: 55.77%
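
This is consistent with the unweighted mean of the four suite averages: (42.25 + 73.45 + 63.44 + 43.96) / 4 = 55.775, reported as 55.77.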

Elapsed time: 02:43:45

## Citations

Fernando Fernandes Neto and Eric Hartford. "Optimizing Large Language Models Using Layer-Selective Rank Reduction and Random Matrix Theory." 2024.